
Feature Request: MoE loads only the activated expert(s) to the GPU while the remaining unused experts are not loaded (into CPU/GPU memory), for DeepSeek-R1 inference on consumer GPUs #11532

marvin-0042 opened this issue Jan 30, 2025 · 13 comments
Labels
enhancement New feature or request

Comments

@marvin-0042

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Running DeepSeek-R1 or V3 inference needs 8xH100 80GB due to the huge memory footprint, and it is very challenging to do R1 or V3 inference with 685B MoE parameters on a single consumer GPU (e.g. a 24GB 4090) plus limited CPU memory (say 32GB), even with low-bit quantization.

But since V3 and R1 have only 37B activated parameters (37B weights at INT4 is about 18.5GB), is it possible for the MoE inference to load only the weights of the activated expert(s) into GPU memory, keep some of the non-activated experts' weights in CPU memory (e.g. 32GB), leave the majority of the unused experts' weights on disk (since CPU memory is also limited), and only load/unload these experts when they are actually used?

DeepSeek-R1's MoE has 61 layers. The current llama.cpp implementation loads n-gpu-layers of the model into GPU memory (say 7 layers on a 24GB 4090) while the weights of the remaining layers all stay in CPU memory; but when the CPU has only limited memory (say 32GB), there is heavy swapping and inference becomes extremely slow. If we only load the "activated" experts (37B), they can fit on a single 24GB 4090 and there is no need for expensive swapping of unused experts between CPU memory and disk, so I would expect 685B DeepSeek-R1 inference performance to be close to that of a 37B LLM.
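
To make this more concrete, here is a rough host-side sketch of the kind of per-expert cache I have in mind (purely illustrative; none of these types or functions exist in llama.cpp):

#include <cstddef>
#include <cstdint>
#include <list>
#include <map>
#include <utility>
#include <vector>

// Hypothetical on-demand expert cache: only experts that the router actually
// selects are read from disk and kept resident, up to a fixed VRAM budget.
struct expert_cache {
    using expert_key = std::pair<int, int>;               // (layer, expert id)

    size_t budget_bytes;                                   // e.g. ~18.5GB for 37B INT4 weights
    size_t used_bytes = 0;
    std::list<expert_key> lru;                             // most recently used at the front
    std::map<expert_key, std::vector<uint8_t>> resident;   // stand-in for GPU buffers

    explicit expert_cache(size_t budget) : budget_bytes(budget) {}

    // called right after the router has picked the top-k experts of a layer
    const std::vector<uint8_t> & get(int layer, int expert, size_t nbytes) {
        expert_key key{layer, expert};
        auto it = resident.find(key);
        if (it == resident.end()) {
            evict_until_room_for(nbytes);
            // placeholder for "read this expert's tensors from the GGUF on disk
            // and upload them to the GPU"
            it = resident.emplace(key, std::vector<uint8_t>(nbytes)).first;
            used_bytes += nbytes;
        }
        lru.remove(key);
        lru.push_front(key);
        return it->second;
    }

    void evict_until_room_for(size_t need) {
        while (!lru.empty() && used_bytes + need > budget_bytes) {
            expert_key victim = lru.back();
            lru.pop_back();
            used_bytes -= resident[victim].size();
            resident.erase(victim);                        // i.e. free that expert's GPU buffer
        }
    }
};

Per token, the router would call get() for the 8 selected experts of each MoE layer; everything else stays on disk, and frequently reused ("hot") experts naturally stay resident.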

I'm wondering whether a similar feature is available or in progress in llama.cpp or any other popular inference framework?

Really appreciate your help!

Motivation

This would help llama.cpp users run DeepSeek-R1, the best reasoning LLM by far, which has 685B MoE parameters but only 37B activated parameters, on a consumer GPU (24GB 4090) plus a consumer CPU (i7/i9 with 32GB memory), with low-bit quantization, at an acceptable inference speed.

Possible Implementation

No response

marvin-0042 added the enhancement (New feature or request) label on Jan 30, 2025
@ggerganov
Member

ggerganov commented Jan 31, 2025

If we only load the "activated" experts (37B), they can fit on a single 24GB 4090 and there is no need for expensive swapping of unused experts between CPU memory and disk, so I would expect 685B DeepSeek-R1 inference performance to be close to that of a 37B LLM.

The problem is that you don't know in advance which experts will be activated for a layer until you reach that point. So when you reach it, you can't start computing immediately; you would first have to move the data to the GPU, which would introduce a lot of latency.
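
For a rough sense of how large that latency could be, here is a back-of-the-envelope sketch (the bandwidth number is an assumed effective PCIe 4.0 x16 rate, not a measurement):

#include <cstdio>

int main() {
    // worst case: none of the activated experts are already on the GPU, so
    // roughly the full 37B activated parameters cross the PCIe bus per token
    const double activated_bytes  = 18.5e9; // ~37B params at ~4 bits/param (from the issue text)
    const double pcie_bytes_per_s = 25.0e9; // assumed effective PCIe 4.0 x16 bandwidth
    const double sec_per_token    = activated_bytes / pcie_bytes_per_s;
    printf("worst-case transfer time: %.2f s/token (%.2f tok/s)\n",
           sec_per_token, 1.0 / sec_per_token);
    return 0;
}

So unless most of the selected experts are already resident from previous tokens, the transfer traffic alone would cap generation at only a token or two per second.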

Nevertheless, I think this approach would be interesting to explore on Apple Silicon devices thanks to their unified memory and the ability of the OS to dynamically reclaim unused GPU memory (see #11427). From what we learned recently, by default memory buffers on the GPU are "garbage collected" (i.e. unwired) after ~1 second of not being used by the process (#10119). So if we distribute the expert tensors in each layer into separate buffers, we would auto-magically get a sort of load-balancing of the experts for free thanks to this feature (i.e. the hot/active experts would remain resident). The other problem that has to be solved is to trick Metal into allowing us to map more data than its maximum working set size. I believe this can be solved by using multiple processes running on the same machine and communicating over the RPC backend (some initial experiments that I did already suggest this would work). Still, this is a very hand-wavy explanation and I could be missing some important detail, but it's something I think would be fun to play with in the future.

The MoE architecture is interesting and likely presents a lot of opportunities for clever optimizations.

@Dampfinchen

I wonder if there is a possibility to predict which expert is likely to get used next via an algorithm or perhaps a small ML model, so that expert can be loaded beforehand.

@adamritter

adamritter commented Feb 1, 2025

I was playing a bit with the DeepSeek 1.5 bit model on my 128GB Mac.

My basic result is that it can go up to 7 tok/s from 2 tok/s if only half of the experts are used (128 instead of 256):

I used this command: build/bin/llama-cli --model ~/Downloads/deepseek/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf --cache-type-k q4_0 --threads 8 --prio 2 --temp 0.6 --ctx-size 8192 --seed 3407 --n-gpu-layers 30 -no-cnv --prompt "<|User|>.....<|Assistant|>" --ctx-size 2048 --override-kv deepseek2.expert_used_count=int:8

(I know that the context size is quite small here)

I modified llm_build_moe_ffn to try it out by adding this to the expert selection:

selection_probs = ggml_view_2d(
    ctx,
    selection_probs,
    128,                       // keep only the first 128 of the 256 routed experts
    n_tokens,
    selection_probs->nb[1],    // row stride of the parent tensor (one full row of probs per token)
    0                          // offset in bytes into the parent tensor
);

(It's the first time I'm using ggml, and I just generated this code with the help of ChatGPT, so it may be totally stupid.)

Of course the quality of the result goes down with this overly simple method, but adding some logic to make the experts a bit more sticky (for example, loading at most 1 new expert per token and not using the ones that we want to kick out of the cache) would maybe help.

(I was using the unsloth fork of llama.cpp)
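
To sketch what the "sticky" selection above could look like on the host side (hypothetical code, not ggml): keep a per-layer resident set and admit at most one non-resident expert per token.

#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

// Toy "sticky" top-k: prefer experts that are already resident on the GPU and
// admit at most one new (non-resident) expert per token for this layer.
std::vector<int> sticky_top_k(const std::vector<float> & probs,    // router probs, size n_expert
                              std::set<int>            & resident, // experts currently on the GPU
                              int                        k) {
    std::vector<int> order(probs.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = (int) i;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return probs[a] > probs[b]; });

    std::vector<int> picked;
    int new_experts = 0;
    for (int e : order) {
        if ((int) picked.size() == k) break;
        const bool is_resident = resident.count(e) > 0;
        if (!is_resident && new_experts >= 1) continue; // stay sticky: skip further cold experts
        picked.push_back(e);
        new_experts += is_resident ? 0 : 1;
    }
    // on a cold start this picks fewer than k experts; a real version would
    // warm up with the full top-k and only then start being sticky
    for (int e : picked) resident.insert(e);
    return picked;
}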

@ghostplant

How do you set the arguments to completely skip all 256 non-shared experts?

@jukofyork
Contributor

I wonder if there is a possibility to predict which expert is likely to get used next via an algorithm or perhaps a small ML model, so that expert can be loaded beforehand.

You can probably use the expert gating tensors to do this, as it's quite likely that the hidden state vector's direction at layer n is similar to the direction at layer n+1 (and n+2).
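
Roughly like this, for example (toy sketch; the router is just a plain matrix-vector product here, no real llama.cpp types):

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Toy speculative prefetch: while layer n is still computing, score the router
// (gating) weights of layer n+1 with the *current* hidden state and start
// loading the experts it would most likely pick.
std::vector<int> predict_next_layer_experts(
        const std::vector<float>              & hidden,    // hidden state after layer n, size n_embd
        const std::vector<std::vector<float>> & gate_next, // layer n+1 gate rows, n_expert x n_embd
        int                                     k) {        // assumes k <= n_expert
    std::vector<float> score(gate_next.size());
    for (size_t e = 0; e < gate_next.size(); ++e) {
        score[e] = std::inner_product(hidden.begin(), hidden.end(),
                                      gate_next[e].begin(), 0.0f);
    }
    std::vector<int> idx(score.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return score[a] > score[b]; });
    idx.resize(k);
    return idx; // candidate experts to prefetch for layer n+1
}

Mis-predicted prefetches only cost bandwidth; the correct experts are still loaded on demand once layer n+1's real router runs.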

@BrickBee

BrickBee commented Feb 2, 2025

It might also improve performance if a setting were added to pin the router/weighting network to either VRAM or RAM (selectable). This would reduce the paging-in-from-disk time for each new token. It cannot be predicted which experts will be used for a token, but it is perfectly predictable that the expert selector will be used for every token.

@ipfgao

ipfgao commented Feb 6, 2025

I believe that the architecture of MoE models still holds potential for further optimization. For edge users, a single query is unlikely to span multiple domains simultaneously. By further specializing experts and introducing multi-layered routing or weighting networks to control expert selection during training, while organizing experts into hierarchical categories based on industries or vertical domains, we can achieve more stable activation of relevant experts for each query. Such finer specialization not only reduces the size of individual expert models but also enhances dynamic adjustment efficiency and enables flexible combinations of model capabilities. Moreover, this approach not only supports inference of large models on personal computers or edge devices but also potentially enables fine-tuning or training of large models directly on personal computers or edge devices.

@wingenlit

wingenlit commented Feb 6, 2025

Would it actually be possible to first compute the differences between the MoE experts, ship the diff files into VRAM, and then dynamically reconstruct the target MoE experts from a base expert in parallel (in VRAM) at runtime?
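
If I read the idea right, the runtime arithmetic would be something like this (toy sketch, assuming the per-expert deltas against a shared base expert were computed offline):

#include <cstddef>
#include <vector>

// Toy reconstruction of one expert from a shared base plus a (hopefully small,
// sparse or low-bit) per-expert delta that has already been shipped to VRAM.
std::vector<float> reconstruct_expert(const std::vector<float> & base,
                                      const std::vector<float> & delta) {
    std::vector<float> w(base.size());
    for (size_t i = 0; i < base.size(); ++i) {
        w[i] = base[i] + delta[i]; // expert_e = base + delta_e
    }
    return w;
}

Whether this saves anything depends entirely on how compressible the deltas are; if the experts are not actually close to each other, each delta is as large as the expert itself.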

@1994

1994 commented Feb 10, 2025

https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/DeepseekR1_V3_tutorial.md

@Readon

Readon commented Feb 13, 2025

Does llama.cpp support offloading experts to the GPU individually? If so, we could offload all 8 shared experts to the GPU as far as possible and leave the dynamically selected experts in CPU memory. I guess that only one expert at a time could easily be handled by the CPU.

@inksong

inksong commented Feb 14, 2025

@adamritter I have been following this optimization approach for quite some time, and I'm currently stuck on how to correctly collect the ffn_moe_probs, ffn_moe_topk, and ffn_moe_weights tensors for each layer of the MoE model during inference. Due to the design of the computation graph and the GGML mechanism, my attempts can only run after GGML_STATUS_SUCCESS is returned; however, because of GGML's reuse mechanism, many nodes in the computation graph are reused by then. I am urgently and sincerely seeking help! All I need is to capture a callback during the inference process.

@ggerganov thx!
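
For reference, the imatrix example collects tensor data during evaluation via the scheduler eval callback (cb_eval in llama_context_params); a rough, untested sketch of that approach for the MoE routing tensors:

#include <cstdint>
#include <cstring>
#include <vector>

#include "ggml.h"
#include "ggml-backend.h"
#include "llama.h"

// The callback is invoked twice per observed tensor: once with ask == true
// ("do you want this tensor?") and once with ask == false after it has been
// computed, at which point its data can be copied off the backend.
static bool collect_moe_cb(struct ggml_tensor * t, bool ask, void * /*user_data*/) {
    const bool is_moe =
        strstr(t->name, "ffn_moe_probs")   != nullptr ||
        strstr(t->name, "ffn_moe_topk")    != nullptr ||
        strstr(t->name, "ffn_moe_weights") != nullptr;

    if (ask) {
        return is_moe;   // only request data for the MoE routing tensors
    }
    if (!is_moe) {
        return true;     // not ours, keep evaluating
    }
    std::vector<uint8_t> buf(ggml_nbytes(t));
    ggml_backend_tensor_get(t, buf.data(), 0, buf.size());
    // ... record buf per layer / per token here ...
    return true;         // true = keep running the graph
}

// when creating the context:
//   llama_context_params cparams = llama_context_default_params();
//   cparams.cb_eval           = collect_moe_cb;
//   cparams.cb_eval_user_data = nullptr;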

@lingster

lingster commented Feb 18, 2025

looking at ktransformers, it seems like they have figured out which layers to load to GPU for improved performance:

https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml

Looks like this PR will allow you to offload specific tensors: #11397 (comment)

Will see if I can get this working on my rig in due course.
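
Assuming the override option from that PR ends up as --override-tensor (-ot) taking a "regex=buffer type" argument, keeping everything except the routed experts on the GPU might look roughly like this (untested):

build/bin/llama-cli --model DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --n-gpu-layers 99 \
    --override-tensor "ffn_(up|down|gate)_exps=CPU"

The intent being that attention, the dense layers and the shared expert stay in VRAM, while the large routed-expert tensors (ffn_up_exps / ffn_gate_exps / ffn_down_exps) stay in system RAM.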

@Readon

Readon commented Feb 21, 2025

looking at ktransformers, it seems like they have figured out which layers to load to GPU for improved performance:

https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml

Looks like this PR will allow you to offload specific tensors: #11397 (comment)

Will see if I can get this working on my rig in due course.

I could use it on my server with dual E5 v2 CPUs + a 2080 Ti.
The improvement is significant: I get about 3.6 tok/s while generating.
