How to use pure CPU for main model but CuBLAS for drafting? #1349
Comments
If you don't want it to use BLAS for prompt processing, try setting the BLAS batch size to -1.
I think I tried changing the BLAS batch size to "don't use" from the interface. Do you think I can enable CuBLAS, set the batch size to -1 and the offloaded layers to 0 to get "not using GPU" (no CUDA, no GPU memory)? UPD: for the simplest trivial test case this seems to be true, the GPU is not used when there is no batching; investigating further…
Strange. Firstly, I cannot get a 100% reliable benchmark, since I feel the model swaps differently each time… Anyway, what does "BLAS Batch Size" actually do in "Use CPU" mode? When you say "select CUDA with no batching", I assume it would at best work just like "Use CPU with no batching" – but is that different from CPU with normal batching (256-512 tokens)? I see it prints by batches in the console, but again, for DeepSeek I cannot tell whether the total speed is really different or not (meaning, "no batching" proceeds in increments of 16 tokens, while batching processes everything at once – but does that actually end up faster?). Also, should "CuBLAS with 0 layers + batching" and "CPU + batching" generate yellow text at the same speed? Or is the GPU also used for generation, and not only for prompt processing? Currently, CuBLAS with batching in fact has the worst speed! My one-shot benchmarks, made in dedicated mode in the GUI with 256 context:
CPU:
CuBLAS no batch:
CuBLAS with batch:
As you can see, CUDA batching (with 0 layers) is 5 times slower than CPU or "no batching".
You can still have batching when using CPU mode, it will just use the tinyblas sgemm for it. If you are using the cublas backend selection, then it will be using GPU for batching if the batch size is large enough. Setting blasbatchsize to -1 will prevent that, making it basically the same as running on CPU for single token generation.
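For reference, here is a minimal sketch (not an official script) of the two configurations being compared in this exchange, assuming recent KoboldCpp flag names such as --usecublas, --gpulayers and --blasbatchsize; check `python koboldcpp.py --help` on your build, since options change between versions, and the model path is just a placeholder.

```python
import subprocess

MODEL = "DeepSeek-R1-UD-Q2_K_XL.gguf"  # hypothetical path, adjust to your file

# 1) CuBLAS backend selected, but zero layers offloaded and BLAS batching
#    disabled: a batch size of -1 keeps prompt processing off the GPU, so this
#    should behave essentially like CPU-only single-token generation.
cublas_no_batch = [
    "python", "koboldcpp.py",
    "--model", MODEL,
    "--usecublas",
    "--gpulayers", "0",
    "--blasbatchsize", "-1",
    "--threads", "8",
    "--contextsize", "4096",
]

# 2) Plain CPU mode with normal batching: prompt processing uses the built-in
#    tinyblas sgemm instead of the GPU.
cpu_with_batch = [
    "python", "koboldcpp.py",
    "--model", MODEL,
    "--gpulayers", "0",
    "--blasbatchsize", "512",
    "--threads", "8",
    "--contextsize", "4096",
]

subprocess.run(cublas_no_batch, check=True)  # launch one configuration per run
```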
I took 4096 context and loaded CuBLAS with zero layers and a 512 batch:
Then I set the BLAS batch size to 32 there (the smallest positive value available). But, umm…
Then I enabled MMAP and tried the 512 batch again:
More or less the same…
Seemingly better, but! Then I tried Vulkan (without batching), and again, and again, and again:
The fluctuations are too significant. What is happening? Is that because of MoE swapping? What do you think is a more reliable way to freeze the prompt between runs?
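One way to freeze the prompt between runs is to script the benchmark against the KoboldAI-compatible HTTP API that KoboldCpp serves, with greedy sampling so only the backend configuration changes. A minimal sketch, assuming the default port 5001, the /api/v1/generate endpoint with its usual results[0].text response shape, and a hypothetical frozen_story.txt holding the saved prompt:

```python
import json
import time
import urllib.request

URL = "http://localhost:5001/api/v1/generate"  # default KoboldCpp API port
prompt = open("frozen_story.txt", encoding="utf-8").read()  # the saved story

payload = {
    "prompt": prompt,
    "max_length": 500,   # generate the same number of tokens every run
    "temperature": 0.01,
    "top_k": 1,          # greedy: removes sampling noise from the comparison
}

start = time.time()
req = urllib.request.Request(
    URL, json.dumps(payload).encode(), {"Content-Type": "application/json"}
)
with urllib.request.urlopen(req) as resp:
    out = json.load(resp)
elapsed = time.time() - start

print(f"{elapsed:.1f}s total, ~{payload['max_length'] / elapsed:.2f} tok/s")
print(out["results"][0]["text"][:200])  # sanity-check the output is identical
```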
I believe you are probably running out of memory and hitting disk swap somehow. Either that, or it's needing more GPU memory than you have and triggering the sysmem fallback.
I definitely run out of RAM (128 GB, with a 1 TB swapfile on a separate NVMe, the fastest my motherboard supports), and of course I have System Fallback always enabled. GPU memory is 12 GB dedicated, plus shared memory of half the RAM, which is 64 GB, but I did not see (in Task Manager) that it is being used here.
I ran the model at zero temperature from an empty prompt to make it generate a bunch of text. Then I cut it at 1500 tokens and composed a saved story to generate 500 tokens more, with 4k context. Here are the results of different runs (with re-launch) of that exact story (greedily) on Vulkan without batching:
I think these can be considered pretty close, so it will be possible to compare different strategies, okay. Meanwhile, I thought of another question: is it theoretically possible to use llama.cpp to load only part of the layers from a GGUF on one machine and the other layers on a second machine – running the model in shards and transmitting intermediate results? Is that possible, given the structure of llama.cpp? (Meaning, how hard is it to hook in between layers asynchronously and substitute their outputs with externally computed data?)
"Shared" GPU memory is not real vram btw. it's just regular RAM which needs to copy data to and fro. So it will be very very slow. And when RAM runs out, it will hit disk swap which will make it even slower. For the other thing you are talking about, it's kind of possible with llama RPC (network inference), however that is not implemented in KoboldCpp. https://github.com/ggerganov/llama.cpp/blob/master/examples/rpc/README.md |
In the Task Manager, I don't see the "shared" GPU memory being used. It shows 0.
It is not "very" slow; it is still faster than using the CPU only: when generating images with Stable Diffusion, using shared memory is 10x slower than dedicated VRAM, but using the CPU with RAM is 10x slower than shared. Thus, pure CPU versus full GPU is around 100x slower (as of when I made those benchmarks; specifically, I have not benchmarked quantized versions of SD and how they perform on CPU). But here that is not the case, because when koboldcpp tries to process DeepSeek on my GPU, it stalls badly, probably because its layers are way too big.
There is no way I can add more than 128 GB of RAM, so swapping is unavoidable.
I want to maximize the speed as much as possible for DeepSeek. I see several aspects:
My benchmarking continues… Of course, there are other options available:
After finding out that CLBlast can use a 512 batch at more or less the same speed as pure CPU, I decided to try benchmarking drafting with CLBlast rather than CuBLAS without batches. …Turns out, Distill-Qwen-7B is AWFUL at predicting!! Most of the time it fails right away, occasionally guessing just 1 token correctly. Also, I feel that 16 BLAS threads work better for me than 8, but for normal generation 8 threads is still the best (my physical core count).
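A sketch of this kind of CLBlast + drafting launch, with separate thread counts for generation and for BLAS. The flag names (--useclblast taking a platform id and a device id, --blasthreads, --draftmodel) are assumed from recent KoboldCpp builds and the paths are placeholders; confirm with --help on your version:

```python
import subprocess

subprocess.run([
    "python", "koboldcpp.py",
    "--model", "DeepSeek-R1-UD-Q2_K_XL.gguf",  # hypothetical path
    "--useclblast", "0", "0",   # OpenCL platform 0, device 0
    "--gpulayers", "0",
    "--blasbatchsize", "512",   # CLBlast batching ran about as fast as pure CPU here
    "--threads", "8",           # generation: physical core count
    "--blasthreads", "16",      # BLAS: more threads seemed to help
    "--draftmodel", "DeepSeek-R1-Distill-Qwen-7B-Q5_K_M.gguf",
    "--contextsize", "4096",
])
```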
I wanna try benchmarking on Linux, possibly with a bare-minimum console installation (only koboldcpp, with the browser running on another machine over the LAN). Which distro is better to use? Like, "what Linux is the best for kobold"? I believe CUDA is not needed here, which simplifies things.
Can I use Alpine Linux to run the pre-compiled koboldcpp binary? The dynamic linker returns an error about missing libraries. Am I out of luck here, or would those libraries somehow get installed if I did a full Alpine installation?
It looks like your environment does not have GLIBC installed. KoboldCpp needs GLIBC 2.8+, so make sure you have that. I'm not familiar with Alpine, but I think a stripped-down Debian/Ubuntu would probably be better. Also, for completeness, make sure you are on the correct architecture; you can check that from the command line.
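One quick way to check both points (libc flavour/version and CPU architecture) from Python's standard library; on musl-based systems such as Alpine, libc_ver() typically reports no glibc version at all:

```python
import platform

print(platform.libc_ver())  # e.g. ('glibc', '2.35') on a glibc-based distro
print(platform.machine())   # e.g. 'x86_64' or 'aarch64'
```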
Thanks, the problem was indeed with GLIBC! After a lot of googling, I resolved it. After that, I made another 192 GB partition on the same NVMe where I keep the pagefile for Windows, added it as swap to Alpine (booted from USB), and mounted my main NTFS partition with the DeepSeek model. Koboldcpp was copied over, and with that setup I got my Alpine benchmark numbers:
Then another idea hit me: run Windows in Safe Mode to lower its RAM usage!
Task Manager shows 236 GB of virtual memory and 50-70% CPU during BLAS (also, the fans were spinning harder than I ever heard them with Alpine), then 30-50% CPU when generating. Umm… I do not understand. Somehow, BLAS on Windows is 3.8x faster than on Alpine, but at the same time, generation on Alpine is 2.6x faster than on Windows!? UPD: I noticed that the output text differs between them, even though I have top_k=1 and temp=0.01. Strange.
Could possibly be affected by your thread count |
@aleksusklim It's because musl, the C library Alpine uses, favors correctness and a small footprint over speed and performance. GLIBC, on the other hand, though a huge beast by comparison, is typically much more performant. Almost all of Alpine is built against musl, not GLIBC, so even though your KCPP might be built against GLIBC, many of your other supporting libraries are bound to musl.
I am trying to run the large DeepSeek-R1-UD-Q2_K_XL model with "Use CPU" while having a big pagefile. The model loads fine (even after using 100% of RAM) and runs slowly but usably, putting a heavy load on the CPU and a decent load on the NVMe. If I switch the BLAS backend to CuBLAS or Vulkan, even with 0 offloaded layers, prompt processing takes forever, with almost 0% load on both CPU and GPU and only occasional 5% spikes! It seems unable to use the computation power effectively, due to the model being very huge.
I tried to run the small DeepSeek-R1-Distill-Qwen-7B-Q5_K_M with CuBLAS and all layers offloaded. It works perfectly, very fast, fully in VRAM.
Now I want to use them both for drafting: the large model with "Use CPU" and the small model with CuBLAS fully offloaded. Can I do that without forcing "CuBLAS with 0 offloading" on the main model, since for DeepSeek that hurts performance so much?
I also tried to play with the "Quantize KV cache" option, but for the main DeepSeek model it throws an error about incompatible hidden-dimension sizes, which prevents the required FlashAttention from being applied. Is this because of the MoE architecture?
I can collect the required logs if needed.
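For what it's worth, a sketch of the configuration this issue is asking about, under the assumption that the draft-model flags (--draftmodel, --draftgpulayers) exist and behave as in recent KoboldCpp releases: the big MoE model stays on the CPU (zero offloaded layers, BLAS batch disabled) while the small draft model is fully offloaded. Whether the two models can truly use different backends in one process is exactly the open question here, so this is the configuration to test, not a confirmed answer:

```python
import subprocess

subprocess.run([
    "python", "koboldcpp.py",
    "--model", "DeepSeek-R1-UD-Q2_K_XL.gguf",  # big main model, kept on CPU
    "--usecublas",
    "--gpulayers", "0",        # nothing from the main model on the GPU
    "--blasbatchsize", "-1",   # keep prompt processing off the GPU as well
    "--draftmodel", "DeepSeek-R1-Distill-Qwen-7B-Q5_K_M.gguf",
    "--draftgpulayers", "99",  # assumed flag: fully offload the draft model
    "--threads", "8",
    "--contextsize", "4096",
])
```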