How to use pure CPU for main model but CuBLAS for drafting? #1349
Comments
If you don't want it to use BLAS for prompt processing, try setting the BLAS batch size to -1.
I think I tried changing the BLAS batch size to "don't use" from the interface. Do you think I can enable CuBLAS, set the batch size to -1 and the offloaded layers to 0 to get "not using GPU" (no CUDA, no GPU memory)? UPD: for the simplest trivial test case this seems to be true, the GPU is not used when there is no batching; investigating further…
Strange. Firstly, I cannot get a 100% reliable benchmark, since I feel the model swaps differently each time… Anyway, what does "BLAS Batch Size" actually do in "Use CPU" mode? When you say "select CUDA with no batching", I assume it would at best work just like "Use CPU with no batching" – but is that different from CPU with normal batching (256-512 tokens)? I see it prints by batches in the console, but again, for DeepSeek I cannot tell whether the total speed is really different or not (meaning, "no batching" proceeds in increments of 16 tokens, while batching processes everything at once – but does that actually end up faster?). Also, should "CuBLAS with 0 layers + batching" and "CPU + batching" generate yellow text at the same speed? Or is the GPU also used for generation, and not only for prompt processing? Currently, CuBLAS with batching in fact has the worst speed! My one-shot benchmarks, made in dedicated mode in the GUI with 256 context:
CPU:
CuBLAS no batch:
CuBLAS with batch:
As you can see, CUDA batching (with 0 layers) is 5 times slower than CPU or "no batching".
You can still have batching when using CPU mode, it will just use the tinyblas sgemm for it. If you are using the cublas backend selection, then it will be using GPU for batching if the batch size is large enough. Setting blasbatchsize to -1 will prevent that, making it basically the same as running on CPU for single token generation.
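For reference, here is a minimal sketch (not an official script) of the two configurations being compared in this exchange, assuming recent KoboldCpp flag names such as --usecublas, --gpulayers and --blasbatchsize; check `python koboldcpp.py --help` on your build, since options change between versions, and the model path is just a placeholder.

```python
import subprocess

MODEL = "DeepSeek-R1-UD-Q2_K_XL.gguf"  # hypothetical path, adjust to your file

# 1) CuBLAS backend selected, but zero layers offloaded and BLAS batching
#    disabled: a batch size of -1 keeps prompt processing off the GPU, so this
#    should behave essentially like CPU-only single-token generation.
cublas_no_batch = [
    "python", "koboldcpp.py",
    "--model", MODEL,
    "--usecublas",
    "--gpulayers", "0",
    "--blasbatchsize", "-1",
    "--threads", "8",
    "--contextsize", "4096",
]

# 2) Plain CPU mode with normal batching: prompt processing uses the built-in
#    tinyblas sgemm instead of the GPU.
cpu_with_batch = [
    "python", "koboldcpp.py",
    "--model", MODEL,
    "--gpulayers", "0",
    "--blasbatchsize", "512",
    "--threads", "8",
    "--contextsize", "4096",
]

subprocess.run(cublas_no_batch, check=True)  # launch one configuration per run
```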
I took 4096 context and loaded CuBLAS with zero layers and a 512 batch:
Then I set the BLAS batch size to 32 there (the smallest positive value available). But, umm…
Then I enabled MMAP and tried the 512 batch again:
More or less the same…
Seemingly better, but! Then I tried Vulkan (without batching), and again, and again, and again:
The fluctuations are too significant. What is happening? Is that because of MoE swapping? What do you think is a more reliable way to freeze the prompt between runs?
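One way to freeze the prompt between runs is to script the benchmark against the KoboldAI-compatible HTTP API that KoboldCpp serves, with greedy sampling so only the backend configuration changes. A minimal sketch, assuming the default port 5001, the /api/v1/generate endpoint with its usual results[0].text response shape, and a hypothetical frozen_story.txt holding the saved prompt:

```python
import json
import time
import urllib.request

URL = "http://localhost:5001/api/v1/generate"  # default KoboldCpp API port
prompt = open("frozen_story.txt", encoding="utf-8").read()  # the saved story

payload = {
    "prompt": prompt,
    "max_length": 500,   # generate the same number of tokens every run
    "temperature": 0.01,
    "top_k": 1,          # greedy: removes sampling noise from the comparison
}

start = time.time()
req = urllib.request.Request(
    URL, json.dumps(payload).encode(), {"Content-Type": "application/json"}
)
with urllib.request.urlopen(req) as resp:
    out = json.load(resp)
elapsed = time.time() - start

print(f"{elapsed:.1f}s total, ~{payload['max_length'] / elapsed:.2f} tok/s")
print(out["results"][0]["text"][:200])  # sanity-check the output is identical
```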
I believe you are probably running out of memory and hitting disk swap somehow. Either that, or it's needing more GPU memory than you have and triggering the sysmem fallback.
I definitely run out of RAM (128 GB, with a 1 TB swapfile on a separate NVMe, the fastest my motherboard supports), and of course I have System Fallback always enabled. GPU memory is 12 GB dedicated, plus shared memory of half the RAM, which is 64 GB, but I did not see (in Task Manager) that it is being used here.
I ran the model at zero temperature from an empty prompt to make it generate a bunch of text. Then I cut it at 1500 tokens and composed a saved story to generate 500 tokens more, with 4k context. Here are the results of different runs (with re-launch) of that exact story (greedily) on Vulkan without batching:
I think these can be considered pretty close, so it will be possible to compare different strategies, okay. Meanwhile, I thought of another question: is it theoretically possible to use llama.cpp to load only part of the layers from a GGUF on one machine and the other layers on a second machine – running the model in shards and transmitting intermediate results? Is that possible, given the structure of llama.cpp? (Meaning, how hard is it to hook in between layers asynchronously and substitute their outputs with externally computed data?)
"Shared" GPU memory is not real vram btw. it's just regular RAM which needs to copy data to and fro. So it will be very very slow. And when RAM runs out, it will hit disk swap which will make it even slower. For the other thing you are talking about, it's kind of possible with llama RPC (network inference), however that is not implemented in KoboldCpp. https://github.com/ggerganov/llama.cpp/blob/master/examples/rpc/README.md |
In the Task Manager, I don't see the "shared" GPU memory being used. It shows 0.
It is not "very" slow; it is still faster than using the CPU only: when generating images with Stable Diffusion, using shared memory is 10x slower than dedicated VRAM, but using the CPU with RAM is 10x slower than shared. Thus, pure CPU versus full GPU is around 100x slower (as of when I made those benchmarks; specifically, I have not benchmarked quantized versions of SD and how they perform on CPU). But here that is not the case, because when koboldcpp tries to process DeepSeek on my GPU, it stalls badly, probably because its layers are way too big.
There is no way I can add more than 128 GB of RAM, so swapping is unavoidable.
I want to maximize the speed as much as possible for DeepSeek. I see several aspects:
My benchmarking continues… Of course, there are other options available:
After finding out that CLBlast can use a 512 batch at more or less the same speed as pure CPU, I decided to try benchmarking drafting with CLBlast rather than CuBLAS without batches. …Turns out, Distill-Qwen-7B is AWFUL at predicting!! Most of the time it fails right away, occasionally guessing just 1 token correctly. Also, I feel that 16 BLAS threads work better for me than 8, but for normal generation 8 threads is still the best (my physical core count).
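A sketch of this kind of CLBlast + drafting launch, with separate thread counts for generation and for BLAS. The flag names (--useclblast taking a platform id and a device id, --blasthreads, --draftmodel) are assumed from recent KoboldCpp builds and the paths are placeholders; confirm with --help on your version:

```python
import subprocess

subprocess.run([
    "python", "koboldcpp.py",
    "--model", "DeepSeek-R1-UD-Q2_K_XL.gguf",  # hypothetical path
    "--useclblast", "0", "0",   # OpenCL platform 0, device 0
    "--gpulayers", "0",
    "--blasbatchsize", "512",   # CLBlast batching ran about as fast as pure CPU here
    "--threads", "8",           # generation: physical core count
    "--blasthreads", "16",      # BLAS: more threads seemed to help
    "--draftmodel", "DeepSeek-R1-Distill-Qwen-7B-Q5_K_M.gguf",
    "--contextsize", "4096",
])
```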
I wanna try benchmarking on Linux, possibly with a bare-minimum console installation (only koboldcpp, with the browser running on another machine over the LAN). Which distro is better to use? Like, "what Linux is the best for kobold"? I believe CUDA is not needed here, which simplifies things.
Can I use Alpine Linux to run the pre-compiled koboldcpp binary? The dynamic linker returns an error about missing libraries. Am I out of luck here, or would those libraries somehow get installed if I did a full Alpine installation?
It looks like your environment does not have GLIBC installed. KoboldCpp needs GLIBC 2.8+, so make sure you have that. I'm not familiar with Alpine, but I think a stripped-down Debian/Ubuntu would probably be better. Also, for completeness, make sure you are on the correct architecture; you can check that from the command line.
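One quick way to check both points (libc flavour/version and CPU architecture) from Python's standard library; on musl-based systems such as Alpine, libc_ver() typically reports no glibc version at all:

```python
import platform

print(platform.libc_ver())  # e.g. ('glibc', '2.35') on a glibc-based distro
print(platform.machine())   # e.g. 'x86_64' or 'aarch64'
```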
Thanks, the problem was indeed with GLIBC! After a lot of googling, I resolved it. After that, I made another 192 GB partition on the same NVMe where I keep the pagefile for Windows, added it as swap to Alpine (booted from USB), and mounted my main NTFS partition with the DeepSeek model. Koboldcpp was copied over, and with that setup I got my Alpine benchmark numbers:
Then another idea hit me: run Windows in Safe Mode to lower its RAM usage!
Task Manager shows 236 GB of virtual memory and 50-70% CPU during BLAS (also, the fans were spinning harder than I ever heard them with Alpine), then 30-50% CPU when generating. Umm… I do not understand. Somehow, BLAS on Windows is 3.8x faster than on Alpine, but at the same time, generation on Alpine is 2.6x faster than on Windows!? UPD: I noticed that the output text differs between them, even though I have top_k=1 and temp=0.01. Strange.
Could possibly be affected by your thread count |
@aleksusklim It's because musl, the C library Alpine uses, favors correctness and a small footprint over speed and performance. GLIBC, on the other hand, though a huge beast by comparison, is typically much more performant. Almost all of Alpine is built against musl, not GLIBC, so even though your KCPP might be built against GLIBC, many of your other supporting libraries are bound to musl.
I am trying to run the large DeepSeek-R1-UD-Q2_K_XL model with "Use CPU" while having a big pagefile. The model loads fine (even after using 100% of RAM) and runs slowly but usably, putting a heavy load on the CPU and a decent load on the NVMe. If I switch the BLAS backend to CuBLAS or Vulkan, even with 0 offloaded layers, prompt processing takes forever, with almost 0% load on both CPU and GPU and only occasional 5% spikes! It seems unable to use the computation power effectively, due to the model being very huge.
I tried to run the small DeepSeek-R1-Distill-Qwen-7B-Q5_K_M with CuBLAS and all layers offloaded. It works perfectly, very fast, fully in VRAM.
Now I want to use them both for drafting: the large model with "Use CPU" and the small model with CuBLAS fully offloaded. Can I do that without forcing "CuBLAS with 0 offloading" on the main model, since for DeepSeek that hurts performance so much?
I also tried to play with the "Quantize KV cache" option, but for the main DeepSeek model it throws an error about incompatible hidden-dimension sizes, which prevents the required FlashAttention from being applied. Is this because of the MoE architecture?
I can collect the required logs if needed.
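For what it's worth, a sketch of the configuration this issue is asking about, under the assumption that the draft-model flags (--draftmodel, --draftgpulayers) exist and behave as in recent KoboldCpp releases: the big MoE model stays on the CPU (zero offloaded layers, BLAS batch disabled) while the small draft model is fully offloaded. Whether the two models can truly use different backends in one process is exactly the open question here, so this is the configuration to test, not a confirmed answer:

```python
import subprocess

subprocess.run([
    "python", "koboldcpp.py",
    "--model", "DeepSeek-R1-UD-Q2_K_XL.gguf",  # big main model, kept on CPU
    "--usecublas",
    "--gpulayers", "0",        # nothing from the main model on the GPU
    "--blasbatchsize", "-1",   # keep prompt processing off the GPU as well
    "--draftmodel", "DeepSeek-R1-Distill-Qwen-7B-Q5_K_M.gguf",
    "--draftgpulayers", "99",  # assumed flag: fully offload the draft model
    "--threads", "8",
    "--contextsize", "4096",
])
```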