
CUDA Kernel Compatibility Error with Tesla V100 (Volta, sm_70) GPUs #1390

Open
deepseven opened this issue Feb 25, 2025 · 14 comments · Fixed by ggml-org/llama.cpp#12098

@deepseven

I'm experiencing CUDA compatibility issues when trying to run KoboldCpp with Tesla V100 GPUs (Volta architecture, CUDA compute capability 7.0). Despite the log suggesting arch 700 is supported, the kernel fails to load.

Environment
Hardware: NVIDIA DGX node with 8x Tesla V100-SXM2-32GB GPUs (Volta architecture)
GPU Compute Capability: 7.0 (sm_70)
OS: Ubuntu 22.04.5 LTS
KoboldCpp versions tried: 1.84.2 and 1.83.1
CUDA version: 12.4.131 (from OpenCL info)

Error Message
When using mmq with CUDA, I get the following error:

ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 700. ggml-cuda.cu was compiled for: 500,520,530,600,610,620,700,720,750,800,860,870,890,900

This is confusing because the error message states that the binary was compiled for arch 700 (among others), yet it claims there's no compatible device code for arch 700.
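If I had to guess at the mechanism: the arch list in the message seems to describe the whole binary, while an individual kernel can still have its body excluded for a given arch by a preprocessor guard, and that only surfaces when the kernel is actually launched. Here's a purely illustrative sketch of that pattern (hypothetical, not the actual ggml-cuda.cu code; the guard, names, and message are made up):

// Hypothetical sketch: the fatbin targets sm_70, but this one kernel's body
// is compiled out for sm_70 by a guard, so it only fails once it is launched.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void guarded_kernel(int *out) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ != 700   // guard excludes sm_70
    *out = 1;                                         // real work
#else
    printf("ERROR: kernel has no usable device code for this arch\n");
    __trap();                                         // abort the kernel
#endif
}

int main() {
    int *out = nullptr;
    cudaMalloc(&out, sizeof(int));
    guarded_kernel<<<1, 1>>>(out);
    // on sm_70 the trap shows up here as a launch failure
    printf("kernel status: %s\n", cudaGetErrorString(cudaDeviceSynchronize()));
    cudaFree(out);
    return 0;
}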

Steps to Reproduce

  • Configure KoboldCpp with CUDA backend using "usecublas": ["normal", "0", "mmq", "1"]
  • Attempt to load a DeepSeek R1 model
  • During inference, the error occurs repeatedly

Additional Information

  • I've tried multiple context sizes from 4096 to 8192
  • Setting nommq instead of mmq prevents the specific error, but causes very slow performance (2 tokens/sec)
@LostRuins
Owner

Which binary are you using?

@deepseven
Author

CUDA 12 for Linux.

@LostRuins
Owner

When you use nommq mode, how many --gpulayers did you set? Do you see GPU utilization (i.e. are the weights offloaded to the GPU)?

@LostRuins
Owner

The main difference I can see is that upstream builds (with cmake) only for compute capability targets 50;61;70;75;80, whereas I build with -arch=all on Linux. There might be some edge condition that's not being correctly selected for, though since both support cc 7.0 directly, I'm not entirely sure.
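If it helps narrow this down, here's a small standalone diagnostic (just a sketch, nothing koboldcpp-specific; the kernel and names are placeholders) that prints the device's compute capability and the binary version the CUDA runtime resolved for a kernel, which is a quick way to check whether a particular build really carries sm_70 device code for a given kernel:

// Standalone diagnostic: compare the GPU's compute capability with the
// binary version the CUDA runtime resolved for a kernel in this binary.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void probe_kernel() {}   // stand-in for a real kernel

int main() {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, 0);
    printf("device 0: %s, compute capability %d.%d\n",
           prop.name, prop.major, prop.minor);

    cudaFuncAttributes attr{};
    cudaError_t err = cudaFuncGetAttributes(&attr, probe_kernel);
    if (err != cudaSuccess) {
        printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // binaryVersion is the SASS arch actually picked (e.g. 70);
    // ptxVersion is what could be JIT-compiled if no SASS matches.
    printf("kernel binaryVersion=%d ptxVersion=%d\n",
           attr.binaryVersion, attr.ptxVersion);
    return 0;
}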

@LostRuins
Owner

Also, just wondering, are you running with flash attention enabled?

Possibly related: #1362

@deepseven
Author

Thanks for your quick response!

To answer your questions:

  • I set --gpulayers to 62 (all model layers) and I can confirm I'm seeing GPU utilization, so the weights are definitely being offloaded to the GPU.
  • I do not have flash attention enabled.

Re: #1362, I initially thought it might be related, but I'm experiencing the same problem with both the latest and previous release.

@LostRuins
Owner

nvidia-smi from the setups I tested:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-FHHL-16GB           On  | 00000000:01:00.0 Off |                    0 |
| N/A   30C    P0              28W / 150W |   1332MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-PCIE-16GB           On  |   00000000:C1:00.0 Off |                    0 |
| N/A   29C    P0             36W /  250W |    3238MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla V100-PCIE-16GB           On  |   00000000:E1:00.0 Off |                  Off |
| N/A   25C    P0             35W /  250W |    3352MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

I couldn't find an equivalent setup to your NVIDIA DGX node with 8x Tesla V100-SXM2-32GB GPUs; the best I could rent was 2x V100 PCIe. On this setup I tested a Llama3 8B model and everything worked fine, with no issues on either the cu11 or the cu12 binary. I also tried a single V100 with an older CUDA version and that worked too.

What does your nvidia-smi show? Also, have you tried any older versions of koboldcpp before 1.83, and do they work? (Try v1.77, since that predates a lot of the new CUDA refactors; maybe try a smaller model just to check first.)

@LostRuins
Owner

Maybe you can also see if the same thing happens in upstream llama.cpp?

At the risk of overstepping, I'll just add a quick ping to @JohannesGaessler here, even though I can't repro any issues on a regular Nvidia V100.

There might be something different about the NVIDIA DGX node with 8x Tesla V100-SXM2-32GB, but I cannot get my hands on one or rent that configuration anywhere to test.

tl;dr: koboldcpp was compiled for CUDA with -arch=all, and the user gets: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 700. ggml-cuda.cu was compiled for: 500,520,530,600,610,620,700,720,750,800,860,870,890,900

@JohannesGaessler

Should be fixed by ggml-org#12098. What I assume happened is that the koboldcpp code was compiled with GGML_CUDA_FORCE_MMQ and until now simply no one noticed that this causes issues with V100s.

@JohannesGaessler

@LostRuins How did you test the code? The bug should only manifest for large batches where MMQ template specializations with > 64 parallel tokens per CUDA block are selected.
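For readers following along, here's a very rough sketch of that kind of selection (hypothetical, not the real MMQ dispatch; the tile sizes and guard are invented for illustration): only some tile-size specializations end up with usable device code on a given arch, so small batches work while a large batch picks a specialization that was effectively compiled out.

// Hypothetical sketch of arch-gated tile specializations; not the real MMQ code.
#include <cstdio>
#include <cuda_runtime.h>

template <int ntokens>
__global__ void mmq_like() {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ == 700
    if (ntokens > 64) {   // pretend the big-tile variant was compiled out on sm_70
        __trap();         // surfaces as a launch failure on the host
        return;
    }
#endif
    // real work for supported (arch, tile size) combinations would go here
}

static void launch_for_batch(int n_tokens) {
    // pick the largest tile that fits the batch, like a dispatcher might
    if (n_tokens > 64) {
        mmq_like<128><<<1, 32>>>();   // fails on sm_70 in this sketch
    } else {
        mmq_like<64><<<1, 32>>>();    // fine everywhere
    }
    printf("batch of %d tokens -> %s\n", n_tokens,
           cudaGetErrorString(cudaDeviceSynchronize()));
}

int main() {
    launch_for_batch(16);    // small batch: works on sm_70
    launch_for_batch(512);   // large batch: hits the missing specialization
    return 0;
}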

@deepseven
Author

deepseven commented Feb 27, 2025

@LostRuins Yes, I can confirm this is an actual DGX setup with 8x V100s. Here's the nvidia-smi output:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-SXM2-32GB           Off |   00000000:06:00.0 Off |                    0 |
| N/A   37C    P0             58W /  300W |   13894MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla V100-SXM2-32GB           Off |   00000000:07:00.0 Off |                    0 |
| N/A   41C    P0             58W /  300W |    9706MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  Tesla V100-SXM2-32GB           Off |   00000000:0A:00.0 Off |                    0 |
| N/A   41C    P0             59W /  300W |    9544MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  Tesla V100-SXM2-32GB           Off |   00000000:0B:00.0 Off |                    0 |
| N/A   39C    P0             59W /  300W |    9706MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  Tesla V100-SXM2-32GB           Off |   00000000:85:00.0 Off |                    0 |
| N/A   40C    P0             57W /  300W |    9706MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  Tesla V100-SXM2-32GB           Off |   00000000:86:00.0 Off |                    0 |
| N/A   40C    P0             59W /  300W |    9544MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  Tesla V100-SXM2-32GB           Off |   00000000:89:00.0 Off |                    0 |
| N/A   43C    P0             62W /  300W |    9706MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  Tesla V100-SXM2-32GB           Off |   00000000:8A:00.0 Off |                    0 |
| N/A   38C    P0             57W /  300W |    9650MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

I haven't tried versions before 1.83, but I'll give 1.77 a try with a smaller model as you suggested.

I'll wait for the fix from PR ggml-org#12098 to be incorporated. Do you have an estimate for when the fix might make it into a KoboldCpp release?

@LostRuins
Owner

@JohannesGaessler thanks for your reply.

Indeed, it's compiled with GGML_CUDA_FORCE_MMQ. For koboldcpp the batch size varies based on what fields the user sends in their request, but it will never exceed the configured batch size parameter (which defaults to 512); the same value is used for both the physical and logical batch (n_batch and n_ubatch).

@deepseven I usually do binary releases about once every 2-3 weeks. I'll let you know again when it's out.

@LostRuins
Owner

@deepseven, can you try the latest release and see if that works?

@deepseven
Author

Yes, tried it and it works perfectly now. Thanks!
