Feature Request: MoE loads only the activated expert(s) to the GPU while the remaining unused experts are not loaded (to CPU/GPU), for DeepSeek-R1 inference on a consumer GPU #11532
Comments
The problem is that you don't know in advance which experts will be activated for a layer until you reach that point. So when you reach it, you can't start computing immediately; instead you would first have to move the data to the GPU, which seems like it would introduce a lot of latency.

Nevertheless, I think this approach would be interesting to explore on Apple Silicon devices thanks to their unified memory and the ability of the OS to dynamically collect unused GPU memory (see #11427). From what we learned recently, by default memory buffers on the GPU are "garbage collected" (i.e. unwired) after ~1 second of not being used by the process (#10119). So if we distribute the expert tensors in each layer into separate buffers, we would auto-magically get a sort of load-balancing of the experts for free thanks to this feature (i.e. the hot/active experts would remain resident).

The other problem that has to be solved is to trick Metal into allowing us to map more data than its maximum working set size. I believe this can be solved by using multiple processes running on the same machine and communicating over the RPC backend (some initial experiments that I did already suggest that this would work).

Still, this is a very hand-wavy explanation and I could be missing some important detail, but it's something I think would be fun to play with in the future. The MoE architecture is interesting and likely presents a lot of opportunities for clever optimizations.
I wonder if there is a possibility to predict which expert is likely to be used next via an algorithm or perhaps a small ML model, so that the expert can be loaded beforehand.
I was playing a bit with the DeepSeek 1.5-bit model on my 128GB Mac. My basic result is that it can go from 2 tok/s up to 7 tok/s if only half of the experts are used (128 instead of 256). I used this command:

build/bin/llama-cli --model ~/Downloads/deepseek/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf --cache-type-k q4_0 --threads 8 --prio 2 --temp 0.6 --ctx-size 8192 --seed 3407 --n-gpu-layers 30 -no-cnv --prompt "<|User|>.....<|Assistant|>" --ctx-size 2048 --override-kv deepseek2.expert_used_count=int:8

(I know that the context size is quite small here.) I modified llm_build_moe_ffn to try it out by adding this to the expert selection:
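A sketch of that kind of change (an illustrative reconstruction, not the original patch): it assumes the upstream llm_build_moe_ffn layout where `probs` is the `[n_expert, n_tokens]` gating tensor produced by `ggml_soft_max` and the selected expert indices feed `ggml_mul_mat_id` downstream.

```cpp
// Illustrative reconstruction: restrict routing to the first half of the experts
// so that only 128 of the 256 routed experts are ever selected.
const int64_t n_expert_half = n_expert / 2;
const int64_t n_tokens      = probs->ne[1];

// view only the first n_expert/2 gating probabilities of each token
ggml_tensor * probs_half = ggml_view_2d(ctx, probs,
        n_expert_half, n_tokens,
        probs->nb[1], /*offset =*/ 0);
probs_half = ggml_cont(ctx, probs_half); // top_k/argsort wants contiguous data

// the selected indices now fall in [0, n_expert/2), so the upper half of the
// expert tensors is never touched (and never needs to be paged in)
ggml_tensor * selected_experts = ggml_top_k(ctx, probs_half, n_expert_used); // [n_expert_used, n_tokens]
```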
(It's the first time I'm using ggml, and I just generated this code with the help of ChatGPT, so it may be totally stupid.) Of course the quality of the results goes down with this overly simple method, but adding some logic to make the experts a bit more sticky (for example, load at most 1 new expert per token and avoid using the ones we are about to evict from the cache) would maybe help. (I was using the unsloth fork of llama.cpp.)
How do I set arguments to completely skip all 256 non-shared experts?
You can probably use the expert gating tensors to do this, as it's quite likely that the hidden-state vector direction at layer n is similar to the direction at layer n+1 (and n+2).
It might also improve performance if a setting is added to pin the router/weighting network to either VRAM or RAM (selectable). This can reduce the paging-in-from-disk time for each new token. It cannot be predicted which experts will be used for a token, but it is perfectly predictable that the expert selector will be used for every token.
I believe that the architecture of MoE models still holds potential for further optimization. For edge users, a single query is unlikely to span multiple domains simultaneously. By further specializing experts and introducing multi-layered routing or weighting networks to control expert selection during training, while organizing experts into hierarchical categories based on industries or vertical domains, we can achieve more stable activation of the relevant experts for each query. Such finer specialization not only reduces the size of individual expert models but also enhances dynamic adjustment efficiency and enables flexible combinations of model capabilities. Moreover, this approach not only supports inference of large models on personal computers or edge devices, but could also enable fine-tuning or training them directly on such hardware.
Is it actually possible to calculate a difference for each MoE expert first, ship the diff files into VRAM, and dynamically reconstruct the target MoE experts from a base expert in parallel (in VRAM) at runtime?
Does llama.cpp support offloading experts to the GPU standalone? If so, we could offload all 8 shared experts to the GPU where possible and leave the dynamically selected experts to the CPU in memory. I guess that only 1 expert could easily be handled by the CPU.
@adamritter I have been following this optimization approach for quite some time now, and I'm currently stuck on how to correctly collect the ffn_moe_probs, ffn_moe_topk, and ffn_moe_weights tensors for each layer of the MoE model during inference. Due to issues with the computation graph and the design of the GGML mechanism, my attempts can only be made after GGML_STATUS_SUCCESS. However, due to GGML's reuse mechanism, many nodes in the computation graph are being reused. I am urgently and sincerely seeking help! All I need is a way to hook a callback during the inference process. @ggerganov thx!
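One mechanism that may fit here is the scheduler eval callback exposed through llama_context_params.cb_eval (the same hook used by the eval-callback and imatrix examples), which hands each computed node to a user callback before the graph memory gets reused. A rough sketch follows; the "ffn_moe_" prefix matches the tensor names mentioned above, everything else is an assumption rather than a drop-in patch:

```cpp
#include <cstring>
#include <cstdint>
#include <vector>

#include "llama.h"
#include "ggml.h"
#include "ggml-backend.h"

// Called twice per node: once with ask=true (do we want its data?) and once
// with ask=false after the node has been computed.
static bool moe_trace_cb(struct ggml_tensor * t, bool ask, void * user_data) {
    const bool is_moe = strncmp(t->name, "ffn_moe_", 8) == 0;

    if (ask) {
        return is_moe; // request the data of the MoE routing nodes
    }

    if (is_moe) {
        // copy the tensor out of whatever backend buffer it lives in (GPU or CPU)
        // before the scheduler reuses that memory for later nodes
        std::vector<uint8_t> data(ggml_nbytes(t));
        ggml_backend_tensor_get(t, data.data(), 0, data.size());
        // ... decode per-layer probs / top-k indices / weights here ...
    }

    return true; // keep executing the graph
}

// when creating the context:
// llama_context_params cparams = llama_context_default_params();
// cparams.cb_eval           = moe_trace_cb;
// cparams.cb_eval_user_data = nullptr;
```

Because the copy happens right after each node is computed, this sidesteps the node-reuse problem that shows up when inspecting tensors only after GGML_STATUS_SUCCESS.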
Looking at ktransformers, it seems like they have figured out which layers/tensors to load to the GPU for improved performance. Looks like this PR will allow you to offload specific tensors: #11397 (comment). Will see if I can get this working on my rig in due course.
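If that lands, keeping the routed expert tensors in system memory while the attention layers, shared expert and router go to the GPU might look roughly like the following; the flag name and tensor-name pattern are assumptions based on the linked PR discussion, so double-check the merged syntax:

```
./build/bin/llama-cli \
    --model DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --n-gpu-layers 99 \
    --override-tensor "ffn_.*_exps=CPU"
```

The pattern targets only the routed expert tensors (ffn_up_exps / ffn_gate_exps / ffn_down_exps), so the dense parts and the shared expert would still be offloaded normally.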
I could use it on my server with dual E5 v2 CPUs + a 2080 Ti.
Prerequisites
Feature Description
Running DeepSeek-R1 or V3 inference needs 8x H100 80GB due to the huge memory footprint, and it's very challenging to do R1 or V3 inference on a single consumer GPU (e.g. a 24GB 4090) plus limited CPU memory (say 32GB) with 685B MoE params, even with low-bit quantization.
But since V3 and R1 have only 37B activated params (37B weights at INT4, i.e. roughly 0.5 bytes per weight, is about 18.5GB), is it possible for the MoE inference to load only the weights of the ~37B activated expert(s) into GPU memory, keep some of the non-activated experts' weights in CPU memory (e.g. 32GB), and leave the majority of the unused experts' weights on disk (since CPU memory is also limited), loading/unloading experts only when they are actually used?
DeepSeek-R1's MoE has 61 layers. The current llama.cpp implementation loads n-gpu-layers of the model into GPU memory (say 7 layers on a 24GB 4090) while the weights of the remaining layers all stay in CPU memory; but when the CPU has only limited memory (say 32GB), there is heavy swapping that makes inference extremely slow. If we only load the "activated" experts (37B), they can fit into a single 24GB 4090 and there is no need for expensive swapping of unused experts between CPU memory and disk, so I expect the 685B DeepSeek-R1 inference performance could be close to that of a 37B LLM.
I'm wondering whether a similar feature is available or WIP inside llama.cpp or any other popular inference framework?
Really appreciate your help!
Motivation
This will help llama.cpp users run DeepSeek-R1, the best reasoning LLM by far, which has 685B MoE params but only 37B activated params, on a consumer GPU (24GB 4090) plus a consumer CPU (i7/i9 with 32GB memory), with low-bit quantization, at an acceptable inference speed.
Possible Implementation
No response