CUDA Kernel Compatibility Error with Tesla V100 (Volta, sm_70) GPUs #1390
Comments
Which binary are you using?
CUDA 12 for Linux.
When you use nommq mode, how many …
The main difference I can see is that upstream builds (with CMake) only for compute capability targets 50;61;70;75;80, whereas I build with …
Also, just wondering: are you running with flash attention enabled? Possibly related: #1362
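For context on the compute-capability list mentioned above, upstream llama.cpp's CMake build selects device-code targets via the standard CMake variable. A sketch of an explicit build covering sm_70 (option names are from upstream at the time of writing and may differ between versions; older trees used `-DLLAMA_CUBLAS=ON`):

```shell
# Hedged sketch: configure and build upstream llama.cpp for an explicit
# set of CUDA compute capabilities, including 70 (Volta / V100).
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="50;61;70;75;80"
cmake --build build --config Release
```

If `CMAKE_CUDA_ARCHITECTURES` omits 70 and no PTX fallback is embedded, kernels will fail to load on a V100 even though the build succeeds.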
Thanks for your quick response! To answer your questions:
Re: #1362, I initially thought it might be related, but I'm experiencing the same problem with both the latest and the previous release.
I couldn't find an equivalent setup to your NVIDIA DGX node with 8x Tesla V100-SXM2-32GB GPUs; the best I could rent was 2x V100 PCIe. On that setup, I tested a Llama 3 8B model and everything worked fine, with no issues on either cu11 or cu12. I also tried a single V100 with an older version of CUDA and that worked too. What does your nvidia-smi show? Also, have you tried any versions of KoboldCpp older than 1.83, and do they work? (Try v1.77, since that predates a lot of the new CUDA refactors; maybe try a smaller model first just to check.)
Maybe you can also see if the same thing happens in upstream llama.cpp? At the risk of overstepping, I'll just add a quick ping to @JohannesGaessler here, even though I can't repro any issues on a regular NVIDIA V100. There might be something different about the NVIDIA DGX node with 8x Tesla V100-SXM2-32GB, but I cannot get my hands on one, or rent your GPU anywhere to test. tl;dr: was compiled for CUDA with …
Should be fixed by ggml-org#12098. What I assume happened is that the koboldcpp code was compiled with …
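The apparent contradiction discussed in this thread (a binary "compiled for arch 700" that still reports no device code for arch 700) can arise because each kernel in a CUDA fat binary carries its own set of cubins: the binary as a whole may target sm_70, while one template specialization was only instantiated for newer archs. A minimal Python sketch of that matching logic (arch sets and the kernel split are hypothetical, not taken from koboldcpp; PTX JIT fallback is ignored for brevity):

```python
def find_device_code(kernel_archs, device_arch):
    """Return the matching cubin arch for a device, or None.

    A SASS cubin runs only on devices of the same compute capability,
    so an exact match is required in this simplified model.
    """
    return device_arch if device_arch in kernel_archs else None

# The binary as a whole was built for these targets ...
binary_targets = {500, 610, 700, 750, 800}

# ... but suppose one MMQ template specialization was only instantiated
# for Turing and newer (e.g. guarded by __CUDA_ARCH__ >= 750):
mmq_large_batch_archs = {750, 800}

# Ordinary kernels load fine on a V100 (compute capability 7.0) ...
assert find_device_code(binary_targets, 700) == 700
# ... but the large-batch MMQ specialization has no sm_70 device code.
assert find_device_code(mmq_large_batch_archs, 700) is None
```

This also matches the observation below that the bug only manifests for large batches, where those particular specializations get selected.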
@LostRuins How did you test the code? The bug should only manifest for large batches where MMQ template specializations with > 64 parallel tokens per CUDA block are selected. |
@LostRuins Yes, I can confirm this is an actual DGX setup with 8x V100s. Here's the nvidia-smi output:
I haven't tried versions before 1.83, but I'll give 1.77 a try with a smaller model as you suggested. I'll wait for the fix from PR ggml-org#12098 to be incorporated. Do you have an estimate for when the fix might make it into a KoboldCpp release?
@JohannesGaessler thanks for your reply. Indeed, it's compiled with …
@deepseven I usually do binary releases about once every 2-3 weeks. I'll let you know again when it's out.
@deepseven Can you try the latest release and see if that works?
Yes, tried it and it works perfectly now. Thanks! |
I'm experiencing CUDA compatibility issues when trying to run KoboldCpp with Tesla V100 GPUs (Volta architecture, CUDA compute capability 7.0). Despite the log suggesting arch 700 is supported, the kernel fails to load.
Environment
Hardware: NVIDIA DGX node with 8x Tesla V100-SXM2-32GB GPUs (Volta architecture)
GPU Compute Capability: 7.0 (sm_70)
OS: Ubuntu 22.04.5 LTS
KoboldCpp versions tried: 1.84.2 and 1.83.1
CUDA version: 12.4.131 (from OpenCL info)
Error Message
When using mmq with CUDA, I get the following error:
This is confusing because the error message states that the binary was compiled for arch 700 (among others), yet it claims there's no compatible device code for arch 700.
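One way to resolve that confusion is to inspect what device code the binary actually embeds, rather than relying on the error text, using the `cuobjdump` tool from the CUDA toolkit. A sketch, assuming the CUDA backend library is named `libggml-cuda.so` (the actual filename in a KoboldCpp install may differ):

```shell
# List the SASS cubins embedded in the fat binary; each entry names an
# arch such as sm_70. If sm_70 is absent, there is no Volta device code
# even though the error message mentions arch 700 among the targets.
cuobjdump --list-elf libggml-cuda.so

# List embedded PTX, which the driver could still JIT-compile for sm_70:
cuobjdump --list-ptx libggml-cuda.so
```

Note that even when sm_70 appears in these listings, an individual kernel specialization can still be missing for that arch, which is consistent with the fix referenced later in this thread.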
Steps to Reproduce
Additional Information