[SYCL] fix mul_mat_vec_q error #9939
Conversation
Are we considering setting a warp size of 32 for all mmvq kernels? Why not just change the default warp size for all Intel GPUs instead of using a separate define? |
Crash log:
llama_new_context_with_model: freq_scale = 1
Native API failed. Native API returns: -999 (Unknown PI error) -999 (Unknown PI error)
The model is Qwen2.5-32B-Instruct-Q3_K_M.gguf (14.8 GB), but I think it is not related to #9612 |
This issue is due to insufficient memory. |
WARP_SIZE=16 could speed up common cases on Intel GPUs, which is why WARP_SIZE is defined as 16 for Intel GPUs. |
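For context, here is a minimal sketch of what a compile-time warp-size selection like the one discussed above could look like. This is not the actual ggml-sycl source: GGML_SYCL_TARGET_INTEL is a hypothetical flag used only to illustrate the 16-vs-32 choice.

```cpp
// Illustrative sketch only, not the real ggml-sycl code.
// GGML_SYCL_TARGET_INTEL is an assumed flag, not an actual build option.
#include <cstdio>

#if defined(GGML_SYCL_TARGET_INTEL)
static constexpr int WARP_SIZE = 16;  // Intel GPUs commonly expose a sub-group size of 16
#else
static constexpr int WARP_SIZE = 32;  // typical warp/wavefront assumption elsewhere
#endif

int main() {
    std::printf("compiled with WARP_SIZE = %d\n", WARP_SIZE);
    return 0;
}
```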
@NeoZhangJianyu Indeed. I did discuss this with the author when they were working on it. Kernels which use the Q_K data types were having problems, and we had to revert to 32 to fix them, at a performance penalty. Will the oneDNN GEMM path fix this? oneDNN GEMM was implemented by the same author in ggml-sycl. I think this entire issue is related to an old version of the driver (working driver version is 1.3.30872; 1.3.30714 is affected). |
Got it. I was hoping that this would work the same way as the Vulkan backend, where system RAM is used when there is not enough VRAM. |
I hope to fix all the Q_K issues with this PR; some users are blocked by them. Better optimization via oneDNN GEMM is another big topic, so I suggest creating a separate discussion for it. Maybe there are more solutions for it. |
LGTM
Co-authored-by: arthw <[email protected]>
It fixes the issues:
#9612
#9106
When WARP_SIZE is changed to 16 for Intel GPUs, mul_mat_vec_q() would enter an infinite loop.
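A minimal host-side sketch of one way such a hang can arise, assuming (hypothetically) that the kernel derives its loop stride from WARP_SIZE by integer division; the constants vdr, qi, and blocks_per_row below are illustrative and are not the real mul_mat_vec_q parameters.

```cpp
// Illustrative only, not the actual ggml-sycl kernel: an integer stride derived
// from WARP_SIZE can round down to 0, so the loop counter never advances.
#include <cstdio>

// Hypothetical stride formula used only for illustration.
static int blocks_per_iter(int warp_size, int vdr, int qi) {
    return warp_size * vdr / qi;
}

int main() {
    const int vdr            = 2;   // assumed values-per-dot-product factor
    const int qi             = 64;  // assumed quant-block width
    const int blocks_per_row = 8;   // assumed row length in blocks

    const int warp_sizes[] = {32, 16};
    for (int warp_size : warp_sizes) {
        const int stride = blocks_per_iter(warp_size, vdr, qi);
        std::printf("WARP_SIZE=%d -> stride=%d: ", warp_size, stride);
        if (stride == 0) {
            // A real kernel with "i += stride" would spin forever here.
            std::printf("for (i = 0; i < %d; i += 0) never terminates\n", blocks_per_row);
            continue;
        }
        int iters = 0;
        for (int i = 0; i < blocks_per_row; i += stride) {
            ++iters;
        }
        std::printf("loop finishes after %d iterations\n", iters);
    }
    return 0;
}
```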