
Feature Request: MoE loads only the activated expert(s) to the GPU while the remaining unused experts are not loaded (into CPU/GPU memory), for DeepSeek-R1 inference on consumer GPUs #11532

marvin-0042 opened this issue Jan 30, 2025 · 13 comments
Labels
enhancement New feature or request

Comments

@marvin-0042

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Running DeepSeek-R1 or V3 inference needs 8xH100 80GB due to the huge memory footprint, and it is very challenging to do R1 or V3 inference with 685B MoE parameters on a single consumer GPU (e.g. a 24GB 4090) plus limited CPU memory (say 32GB), even with low-bit quantization.

But since V3 and R1 have only 37B activated parameters (37B weights at INT4 is about 18.5GB), is it possible for the MoE inference to load only the weights of the activated expert(s) into GPU memory, keep some of the non-activated experts' weights in CPU memory (e.g. 32GB), leave the majority of the unused experts' weights on disk (since CPU memory is also limited), and only load/unload these experts when they are actually used?

DeepSeek-R1's MoE has 61 layers. The current llama.cpp implementation loads n-gpu-layers of the model into GPU memory (say 7 layers on a 24GB 4090) while the weights of the remaining layers all stay in CPU memory; but when the CPU has only limited memory (say 32GB), there is heavy swapping and inference becomes extremely slow. If we only load the "activated" experts (37B), they can fit on a single 24GB 4090 and there is no need for expensive swapping of unused experts between CPU memory and disk, so I would expect 685B DeepSeek-R1 inference performance to be close to that of a 37B LLM.
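
To make this more concrete, here is a rough host-side sketch of the kind of per-expert cache I have in mind (purely illustrative; none of these types or functions exist in llama.cpp):

#include <cstddef>
#include <cstdint>
#include <list>
#include <map>
#include <utility>
#include <vector>

// Hypothetical on-demand expert cache: only experts that the router actually
// selects are read from disk and kept resident, up to a fixed VRAM budget.
struct expert_cache {
    using expert_key = std::pair<int, int>;               // (layer, expert id)

    size_t budget_bytes;                                   // e.g. ~18.5GB for 37B INT4 weights
    size_t used_bytes = 0;
    std::list<expert_key> lru;                             // most recently used at the front
    std::map<expert_key, std::vector<uint8_t>> resident;   // stand-in for GPU buffers

    explicit expert_cache(size_t budget) : budget_bytes(budget) {}

    // called right after the router has picked the top-k experts of a layer
    const std::vector<uint8_t> & get(int layer, int expert, size_t nbytes) {
        expert_key key{layer, expert};
        auto it = resident.find(key);
        if (it == resident.end()) {
            evict_until_room_for(nbytes);
            // placeholder for "read this expert's tensors from the GGUF on disk
            // and upload them to the GPU"
            it = resident.emplace(key, std::vector<uint8_t>(nbytes)).first;
            used_bytes += nbytes;
        }
        lru.remove(key);
        lru.push_front(key);
        return it->second;
    }

    void evict_until_room_for(size_t need) {
        while (!lru.empty() && used_bytes + need > budget_bytes) {
            expert_key victim = lru.back();
            lru.pop_back();
            used_bytes -= resident[victim].size();
            resident.erase(victim);                        // i.e. free that expert's GPU buffer
        }
    }
};

Per token, the router would call get() for the 8 selected experts of each MoE layer; everything else stays on disk, and frequently reused ("hot") experts naturally stay resident.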

I'm wondering whether a similar feature is available or in progress in llama.cpp or any other popular inference framework?

Really appreciate your help!

Motivation

This would help llama.cpp users run DeepSeek-R1, the best reasoning LLM by far, which has 685B MoE parameters but only 37B activated parameters, on a consumer GPU (24GB 4090) plus a consumer CPU (i7/i9 with 32GB memory), with low-bit quantization, at an acceptable inference speed.

Possible Implementation

No response

marvin-0042 added the enhancement (New feature or request) label on Jan 30, 2025
@ggerganov
Member

ggerganov commented Jan 31, 2025

If we only load the "activated" experts (37B), they can fit on a single 24GB 4090 and there is no need for expensive swapping of unused experts between CPU memory and disk, so I would expect 685B DeepSeek-R1 inference performance to be close to that of a 37B LLM.

The problem is that you don't know in advance which experts will be activated for a layer until you reach that point. So when you reach it, you can't start computing immediately; you would first have to move the data to the GPU, which would introduce a lot of latency.
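
For a rough sense of how large that latency could be, here is a back-of-the-envelope sketch (the bandwidth number is an assumed effective PCIe 4.0 x16 rate, not a measurement):

#include <cstdio>

int main() {
    // worst case: none of the activated experts are already on the GPU, so
    // roughly the full 37B activated parameters cross the PCIe bus per token
    const double activated_bytes  = 18.5e9; // ~37B params at ~4 bits/param (from the issue text)
    const double pcie_bytes_per_s = 25.0e9; // assumed effective PCIe 4.0 x16 bandwidth
    const double sec_per_token    = activated_bytes / pcie_bytes_per_s;
    printf("worst-case transfer time: %.2f s/token (%.2f tok/s)\n",
           sec_per_token, 1.0 / sec_per_token);
    return 0;
}

So unless most of the selected experts are already resident from previous tokens, the transfer traffic alone would cap generation at only a token or two per second.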

Nevertheless, I think this approach would be interesting to explore on Apple Silicon devices thanks to their unified memory and the ability of the OS to dynamically reclaim unused GPU memory (see #11427). From what we learned recently, by default memory buffers on the GPU are "garbage collected" (i.e. unwired) after ~1 second of not being used by the process (#10119). So if we distribute the expert tensors in each layer into separate buffers, we would auto-magically get a sort of load-balancing of the experts for free thanks to this feature (i.e. the hot/active experts would remain resident). The other problem that has to be solved is to trick Metal into allowing us to map more data than its maximum working set size. I believe this can be solved by using multiple processes running on the same machine and communicating over the RPC backend (some initial experiments that I did already suggest this would work). Still, this is a very hand-wavy explanation and I could be missing some important detail, but it's something I think would be fun to play with in the future.

The MoE architecture is interesting and likely presents a lot of opportunities for clever optimizations.

@Dampfinchen

I wonder if there is a possibility to predict which expert is likely to get used next via an algorithm or perhaps a small ML model, so that expert can be loaded beforehand.

@adamritter

adamritter commented Feb 1, 2025

I was playing a bit with the DeepSeek 1.5 bit model on my 128GB Mac.

My basic result is that it can go up to 7 tok/s from 2 tok/s if only half of the experts are used (128 instead of 256):

I used this command: build/bin/llama-cli --model ~/Downloads/deepseek/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf --cache-type-k q4_0 --threads 8 --prio 2 --temp 0.6 --ctx-size 8192 --seed 3407 --n-gpu-layers 30 -no-cnv --prompt "<|User|>.....<|Assistant|>" --ctx-size 2048 --override-kv deepseek2.expert_used_count=int:8

(I know that the context size is quite small here)

I modified llm_build_moe_ffn to try it out by adding this to the expert selection:

selection_probs = ggml_view_2d(
    ctx,
    selection_probs,
    128,                       // keep only the first 128 of the 256 routed experts
    n_tokens,
    selection_probs->nb[1],    // row stride of the parent tensor (one full row of probs per token)
    0                          // offset in bytes into the parent tensor
);

(It's the first time I'm using ggml, and I just generated this code with the help of ChatGPT, so it may be totally stupid.)

Of course the quality of the result goes down with this overly simple method, but adding some logic to make the experts a bit more sticky (for example, loading at most 1 new expert per token and not using the ones that we want to kick out of the cache) would maybe help.

(I was using the unsloth fork of llama.cpp)
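
To sketch what the "sticky" selection above could look like on the host side (hypothetical code, not ggml): keep a per-layer resident set and admit at most one non-resident expert per token.

#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

// Toy "sticky" top-k: prefer experts that are already resident on the GPU and
// admit at most one new (non-resident) expert per token for this layer.
std::vector<int> sticky_top_k(const std::vector<float> & probs,    // router probs, size n_expert
                              std::set<int>            & resident, // experts currently on the GPU
                              int                        k) {
    std::vector<int> order(probs.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = (int) i;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return probs[a] > probs[b]; });

    std::vector<int> picked;
    int new_experts = 0;
    for (int e : order) {
        if ((int) picked.size() == k) break;
        const bool is_resident = resident.count(e) > 0;
        if (!is_resident && new_experts >= 1) continue; // stay sticky: skip further cold experts
        picked.push_back(e);
        new_experts += is_resident ? 0 : 1;
    }
    // on a cold start this picks fewer than k experts; a real version would
    // warm up with the full top-k and only then start being sticky
    for (int e : picked) resident.insert(e);
    return picked;
}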

@ghostplant

How do you set the arguments to completely skip all 256 non-shared experts?

@jukofyork
Contributor

I wonder if there is a possibility to predict which expert is likely to get used next via an algorithm or perhaps a small ML model, so that expert can be loaded beforehand.

You can probably use the expert gating tensors to do this, as it's quite likely that the hidden state vector's direction at layer n is similar to the direction at layer n+1 (and n+2).
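
Roughly like this, for example (toy sketch; the router is just a plain matrix-vector product here, no real llama.cpp types):

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Toy speculative prefetch: while layer n is still computing, score the router
// (gating) weights of layer n+1 with the *current* hidden state and start
// loading the experts it would most likely pick.
std::vector<int> predict_next_layer_experts(
        const std::vector<float>              & hidden,    // hidden state after layer n, size n_embd
        const std::vector<std::vector<float>> & gate_next, // layer n+1 gate rows, n_expert x n_embd
        int                                     k) {        // assumes k <= n_expert
    std::vector<float> score(gate_next.size());
    for (size_t e = 0; e < gate_next.size(); ++e) {
        score[e] = std::inner_product(hidden.begin(), hidden.end(),
                                      gate_next[e].begin(), 0.0f);
    }
    std::vector<int> idx(score.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return score[a] > score[b]; });
    idx.resize(k);
    return idx; // candidate experts to prefetch for layer n+1
}

Mis-predicted prefetches only cost bandwidth; the correct experts are still loaded on demand once layer n+1's real router runs.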

@BrickBee

BrickBee commented Feb 2, 2025

It might also improve performance if a setting were added to pin the router/weighting network to either VRAM or RAM (selectable). This would reduce the paging-in-from-disk time for each new token. It cannot be predicted which experts will be used for a token, but it is perfectly predictable that the expert selector will be used for every token.

@ipfgao

ipfgao commented Feb 6, 2025

I believe that the architecture of MoE models still holds potential for further optimization. For edge users, a single query is unlikely to span multiple domains simultaneously. By further specializing experts and introducing multi-layered routing or weighting networks to control expert selection during training, while organizing experts into hierarchical categories based on industries or vertical domains, we can achieve more stable activation of relevant experts for each query. Such finer specialization not only reduces the size of individual expert models but also enhances dynamic adjustment efficiency and enables flexible combinations of model capabilities. Moreover, this approach not only supports inference of large models on personal computers or edge devices but also potentially enables fine-tuning or training of large models directly on personal computers or edge devices.

@wingenlit

wingenlit commented Feb 6, 2025

Would it actually be possible to first compute the differences between the MoE experts, ship the diff files into VRAM, and then dynamically reconstruct the target MoE experts from a base expert in parallel (in VRAM) at runtime?
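
If I read the idea right, the runtime arithmetic would be something like this (toy sketch, assuming the per-expert deltas against a shared base expert were computed offline):

#include <cstddef>
#include <vector>

// Toy reconstruction of one expert from a shared base plus a (hopefully small,
// sparse or low-bit) per-expert delta that has already been shipped to VRAM.
std::vector<float> reconstruct_expert(const std::vector<float> & base,
                                      const std::vector<float> & delta) {
    std::vector<float> w(base.size());
    for (size_t i = 0; i < base.size(); ++i) {
        w[i] = base[i] + delta[i]; // expert_e = base + delta_e
    }
    return w;
}

Whether this saves anything depends entirely on how compressible the deltas are; if the experts are not actually close to each other, each delta is as large as the expert itself.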

@1994

1994 commented Feb 10, 2025

https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/DeepseekR1_V3_tutorial.md

@Readon

Readon commented Feb 13, 2025

Does llama.cpp support offloading experts to the GPU individually? If so, we could offload all 8 shared experts to the GPU as far as possible and leave the dynamically selected experts in CPU memory. I guess that only one expert at a time could easily be handled by the CPU.

@inksong

inksong commented Feb 14, 2025

@adamritter I have been following this optimization approach for quite some time, and I'm currently stuck on how to correctly collect the ffn_moe_probs, ffn_moe_topk, and ffn_moe_weights tensors for each layer of the MoE model during inference. Due to the design of the computation graph and the GGML mechanism, my attempts can only run after GGML_STATUS_SUCCESS is returned; however, because of GGML's reuse mechanism, many nodes in the computation graph are reused by then. I am urgently and sincerely seeking help! All I need is to capture a callback during the inference process.

@ggerganov thx!
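
For reference, the imatrix example collects tensor data during evaluation via the scheduler eval callback (cb_eval in llama_context_params); a rough, untested sketch of that approach for the MoE routing tensors:

#include <cstdint>
#include <cstring>
#include <vector>

#include "ggml.h"
#include "ggml-backend.h"
#include "llama.h"

// The callback is invoked twice per observed tensor: once with ask == true
// ("do you want this tensor?") and once with ask == false after it has been
// computed, at which point its data can be copied off the backend.
static bool collect_moe_cb(struct ggml_tensor * t, bool ask, void * /*user_data*/) {
    const bool is_moe =
        strstr(t->name, "ffn_moe_probs")   != nullptr ||
        strstr(t->name, "ffn_moe_topk")    != nullptr ||
        strstr(t->name, "ffn_moe_weights") != nullptr;

    if (ask) {
        return is_moe;   // only request data for the MoE routing tensors
    }
    if (!is_moe) {
        return true;     // not ours, keep evaluating
    }
    std::vector<uint8_t> buf(ggml_nbytes(t));
    ggml_backend_tensor_get(t, buf.data(), 0, buf.size());
    // ... record buf per layer / per token here ...
    return true;         // true = keep running the graph
}

// when creating the context:
//   llama_context_params cparams = llama_context_default_params();
//   cparams.cb_eval           = collect_moe_cb;
//   cparams.cb_eval_user_data = nullptr;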

@lingster

lingster commented Feb 18, 2025

looking at ktransformers, it seems like they have figured out which layers to load to GPU for improved performance:

https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml

Looks like this PR will allow you to offload specific tensors: #11397 (comment)

Will see if I can get this working on my rig in due course.
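
Assuming the override option from that PR ends up as --override-tensor (-ot) taking a "regex=buffer type" argument, keeping everything except the routed experts on the GPU might look roughly like this (untested):

build/bin/llama-cli --model DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --n-gpu-layers 99 \
    --override-tensor "ffn_(up|down|gate)_exps=CPU"

The intent being that attention, the dense layers and the shared expert stay in VRAM, while the large routed-expert tensors (ffn_up_exps / ffn_gate_exps / ffn_down_exps) stay in system RAM.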

@Readon

Readon commented Feb 21, 2025

looking at ktransformers, it seems like they have figured out which layers to load to GPU for improved performance:

https://github.com/kvcache-ai/ktransformers/blob/main/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml

Looks like this PR will allow you to offload specific tensors: #11397 (comment)

Will see if I can get this working on my rig in due course.

I could use it on my server with dual E5 v2 CPUs + a 2080 Ti.
The improvement is significant: I get about 3.6 tok/s while generating.
