
CUDA Kernel Compatibility Error with Tesla V100 (Volta, sm_70) GPUs #1390

Open
deepseven opened this issue Feb 25, 2025 · 14 comments · Fixed by ggml-org/llama.cpp#12098

@deepseven

I'm experiencing CUDA compatibility issues when trying to run KoboldCpp with Tesla V100 GPUs (Volta architecture, CUDA compute capability 7.0). Despite the log suggesting arch 700 is supported, the kernel fails to load.

Environment
Hardware: NVIDIA DGX node with 8x Tesla V100-SXM2-32GB GPUs (Volta architecture)
GPU Compute Capability: 7.0 (sm_70)
OS: Ubuntu 22.04.5 LTS
KoboldCpp versions tried: 1.84.2 and 1.83.1
CUDA version: 12.4.131 (from OpenCL info)

Error Message
When using mmq with CUDA, I get the following error:

ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 700. ggml-cuda.cu was compiled for: 500,520,530,600,610,620,700,720,750,800,860,870,890,900

This is confusing because the error message states that the binary was compiled for arch 700 (among others), yet it claims there's no compatible device code for arch 700.
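If I had to guess at the mechanism: the arch list in the message seems to describe the whole binary, while an individual kernel can still have its body excluded for a given arch by a preprocessor guard, and that only surfaces when the kernel is actually launched. Here's a purely illustrative sketch of that pattern (hypothetical, not the actual ggml-cuda.cu code; the guard, names, and message are made up):

// Hypothetical sketch: the fatbin targets sm_70, but this one kernel's body
// is compiled out for sm_70 by a guard, so it only fails once it is launched.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void guarded_kernel(int *out) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ != 700   // guard excludes sm_70
    *out = 1;                                         // real work
#else
    printf("ERROR: kernel has no usable device code for this arch\n");
    __trap();                                         // abort the kernel
#endif
}

int main() {
    int *out = nullptr;
    cudaMalloc(&out, sizeof(int));
    guarded_kernel<<<1, 1>>>(out);
    // on sm_70 the trap shows up here as a launch failure
    printf("kernel status: %s\n", cudaGetErrorString(cudaDeviceSynchronize()));
    cudaFree(out);
    return 0;
}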

Steps to Reproduce

  • Configure KoboldCpp with CUDA backend using "usecublas": ["normal", "0", "mmq", "1"]
  • Attempt to load a DeepSeek R1 model
  • During inference, the error occurs repeatedly

Additional Information

  • I've tried multiple context sizes from 4096 to 8192
  • Setting nommq instead of mmq prevents the specific error, but causes very slow performance (2 tokens/sec)
@LostRuins
Owner

Which binary are you using?

@deepseven
Author

CUDA 12 for Linux.

@LostRuins
Owner

When you use nommq mode, how many --gpulayers did you set? Do you see GPU utilization (i.e. are the weights offloaded to the GPU)?

@LostRuins
Owner

The main difference I can see is that upstream builds (with cmake) only for compute capability targets 50;61;70;75;80, whereas I build with -arch=all on Linux. There might be some edge condition that's not being correctly selected for, though since both support cc 7.0 directly, I'm not entirely sure.
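If it helps narrow this down, here's a small standalone diagnostic (just a sketch, nothing koboldcpp-specific; the kernel and names are placeholders) that prints the device's compute capability and the binary version the CUDA runtime resolved for a kernel, which is a quick way to check whether a particular build really carries sm_70 device code for a given kernel:

// Standalone diagnostic: compare the GPU's compute capability with the
// binary version the CUDA runtime resolved for a kernel in this binary.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void probe_kernel() {}   // stand-in for a real kernel

int main() {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, 0);
    printf("device 0: %s, compute capability %d.%d\n",
           prop.name, prop.major, prop.minor);

    cudaFuncAttributes attr{};
    cudaError_t err = cudaFuncGetAttributes(&attr, probe_kernel);
    if (err != cudaSuccess) {
        printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // binaryVersion is the SASS arch actually picked (e.g. 70);
    // ptxVersion is what could be JIT-compiled if no SASS matches.
    printf("kernel binaryVersion=%d ptxVersion=%d\n",
           attr.binaryVersion, attr.ptxVersion);
    return 0;
}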

@LostRuins
Owner

Also, just wondering, are you running with flash attention enabled?

Possibly related: #1362

@deepseven
Author

Thanks for your quick response!

To answer your questions:

  • I set --gpulayers to 62 (all model layers) and I can confirm I'm seeing GPU utilization, so the weights are definitely being offloaded to the GPU.
  • I do not have flash attention enabled.

Re: #1362, I initially thought it might be related, but I'm experiencing the same problem with both the latest and previous release.

@LostRuins
Owner

nvidia-smi from the setups I tested:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-FHHL-16GB           On  | 00000000:01:00.0 Off |                    0 |
| N/A   30C    P0              28W / 150W |   1332MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-PCIE-16GB           On  |   00000000:C1:00.0 Off |                    0 |
| N/A   29C    P0             36W /  250W |    3238MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla V100-PCIE-16GB           On  |   00000000:E1:00.0 Off |                  Off |
| N/A   25C    P0             35W /  250W |    3352MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

I couldn't find an equivalent setup to your NVIDIA DGX node with 8x Tesla V100-SXM2-32GB GPUs; the best I could rent was 2x V100 PCIe. On this setup I tested a Llama3 8B model and everything worked fine, with no issues on either the cu11 or the cu12 binary. I also tried a single V100 with an older CUDA version and that worked too.

What does your nvidia-smi show? Also, have you tried any older versions of koboldcpp before 1.83, and do they work? (Try v1.77, since that predates a lot of the new CUDA refactors; maybe try a smaller model just to check first.)

@LostRuins
Owner

Maybe you can also see if the same thing happens in upstream llama.cpp?

At the risk of overstepping, I'll just add a quick ping to @JohannesGaessler here, even though I can't repro any issues on a regular Nvidia V100.

There might be something different about the NVIDIA DGX node with 8x Tesla V100-SXM2-32GB, but I cannot get my hands on one or rent that configuration anywhere to test.

tl;dr: koboldcpp was compiled for CUDA with -arch=all, and the user gets: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 700. ggml-cuda.cu was compiled for: 500,520,530,600,610,620,700,720,750,800,860,870,890,900

@JohannesGaessler

Should be fixed by ggml-org#12098. What I assume happened is that the koboldcpp code was compiled with GGML_CUDA_FORCE_MMQ and until now simply no one noticed that this causes issues with V100s.

@JohannesGaessler

@LostRuins How did you test the code? The bug should only manifest for large batches where MMQ template specializations with > 64 parallel tokens per CUDA block are selected.
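For readers following along, here's a very rough sketch of that kind of selection (hypothetical, not the real MMQ dispatch; the tile sizes and guard are invented for illustration): only some tile-size specializations end up with usable device code on a given arch, so small batches work while a large batch picks a specialization that was effectively compiled out.

// Hypothetical sketch of arch-gated tile specializations; not the real MMQ code.
#include <cstdio>
#include <cuda_runtime.h>

template <int ntokens>
__global__ void mmq_like() {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ == 700
    if (ntokens > 64) {   // pretend the big-tile variant was compiled out on sm_70
        __trap();         // surfaces as a launch failure on the host
        return;
    }
#endif
    // real work for supported (arch, tile size) combinations would go here
}

static void launch_for_batch(int n_tokens) {
    // pick the largest tile that fits the batch, like a dispatcher might
    if (n_tokens > 64) {
        mmq_like<128><<<1, 32>>>();   // fails on sm_70 in this sketch
    } else {
        mmq_like<64><<<1, 32>>>();    // fine everywhere
    }
    printf("batch of %d tokens -> %s\n", n_tokens,
           cudaGetErrorString(cudaDeviceSynchronize()));
}

int main() {
    launch_for_batch(16);    // small batch: works on sm_70
    launch_for_batch(512);   // large batch: hits the missing specialization
    return 0;
}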

@deepseven
Author

deepseven commented Feb 27, 2025

@LostRuins Yes, I can confirm this is an actual DGX setup with 8x V100s. Here's the nvidia-smi output:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-SXM2-32GB           Off |   00000000:06:00.0 Off |                    0 |
| N/A   37C    P0             58W /  300W |   13894MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla V100-SXM2-32GB           Off |   00000000:07:00.0 Off |                    0 |
| N/A   41C    P0             58W /  300W |    9706MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  Tesla V100-SXM2-32GB           Off |   00000000:0A:00.0 Off |                    0 |
| N/A   41C    P0             59W /  300W |    9544MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  Tesla V100-SXM2-32GB           Off |   00000000:0B:00.0 Off |                    0 |
| N/A   39C    P0             59W /  300W |    9706MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  Tesla V100-SXM2-32GB           Off |   00000000:85:00.0 Off |                    0 |
| N/A   40C    P0             57W /  300W |    9706MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  Tesla V100-SXM2-32GB           Off |   00000000:86:00.0 Off |                    0 |
| N/A   40C    P0             59W /  300W |    9544MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  Tesla V100-SXM2-32GB           Off |   00000000:89:00.0 Off |                    0 |
| N/A   43C    P0             62W /  300W |    9706MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  Tesla V100-SXM2-32GB           Off |   00000000:8A:00.0 Off |                    0 |
| N/A   38C    P0             57W /  300W |    9650MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

I haven't tried versions before 1.83, but I'll give 1.77 a try with a smaller model as you suggested.

I'll wait for the fix from PR ggml-org#12098 to be incorporated. Do you have an estimate for when the fix might make it into a KoboldCpp release?

@LostRuins
Owner

@JohannesGaessler thanks for your reply.

Indeed, it's compiled with GGML_CUDA_FORCE_MMQ. For koboldcpp the batch size varies based on what fields the user sends in their request, but it will never exceed the configured batch size parameter (which defaults to 512); the same value is used for both the physical and logical batch (n_batch and n_ubatch).

@deepseven I usually do binary releases about once every 2-3 weeks. I'll let you know again when it's out.

@LostRuins
Owner

@deepseven, can you try the latest release and see if that works?

@deepseven
Author

Yes, tried it and it works perfectly now. Thanks!
