
[SYCL] fix mul_mat_vec_q error #9939

Merged
merged 1 commit into ggml-org:master on Oct 21, 2024

Conversation

NeoZhangJianyu
Collaborator

This PR fixes the following issues:
#9612
#9106

When WARP_SIZE was changed to 16 for Intel GPUs, mul_mat_vec_q() could enter an infinite loop.

@github-actions github-actions bot added the SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language label Oct 18, 2024
@qnixsynapse
Contributor

Are we considering setting warp size of 32 for all mmvq kernels? Why not just change the default warp size for all Intel GPUs instead of using a separate define QK_WARP_SIZE here?

@characharm

crash log

llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
| ID | Device Type        | Name                    | Version | Max compute units | Max work group | Max sub group | Global mem size | Driver version |
|----|--------------------|-------------------------|---------|-------------------|----------------|---------------|-----------------|----------------|
| 0  | [level_zero:gpu:0] | Intel Arc A770 Graphics | 1.5     | 512               | 1024           | 32            | 16704M          | 1.3.30714      |

Native API failed. Native API returns: -999 (Unknown PI error) -999 (Unknown PI error)
Exception caught at file:S:/LLM/SYCL/llama.cpp/ggml/src/ggml-sycl.cpp, line:434, func:operator()
SYCL error: CHECK_TRY_ERROR((*stream) .memset(ctx->dev_ptr, value, buffer->size) .wait()): Meet error in this line code!
in function ggml_backend_sycl_buffer_clear at S:/LLM/SYCL/llama.cpp/ggml/src/ggml-sycl.cpp:434
S:\LLM\SYCL\llama.cpp\ggml\src\ggml-sycl\common.hpp:107: SYCL error

Qwen2.5-32B-Instruct-Q3_K_M.gguf is 14.8 GB. But I think this is not related to #9612 or #9106.

@NeoZhangJianyu
Collaborator Author

NeoZhangJianyu commented Oct 19, 2024

crash log
[full device table and SYCL error log quoted from the previous comment]
Qwen2.5-32B-Instruct-Q3_K_M.gguf is 14.8 GB. But I think this is not related to #9612 or #9106.

This issue is due to insufficient memory.
The model is 14.8GB, but the Arc A770 only has 16GB.
Loading the model requires additional memory, so the total needed is more than 16GB.
Please use a smaller model.

@NeoZhangJianyu
Collaborator Author

Are we considering setting warp size of 32 for all mmvq kernels? Why not just change the default warp size for all Intel GPUs instead of using a separate define QK_WARP_SIZE here?

WARP_SIZE=16 speeds up common cases on Intel GPUs.
But it clearly has side effects, so QK_WARP_SIZE is used for the cases that need the value 32.
This PR fixes the known issues.

For Intel GPUs, WARP_SIZE is defined as 16.
For the cases that need 32, QK_WARP_SIZE is used.
This was introduced by a9554e2.
I have discussed it with the author, and we decided to keep WARP_SIZE=16.

@qnixsynapse
Contributor

qnixsynapse commented Oct 19, 2024

@NeoZhangJianyu Indeed. I did discuss this with the author when they were working on it. Kernels which use the Q_K data type were having problems, and we had to revert to 32 to fix them, with a performance penalty.

Will the oneDNN GEMM path fix this? oneDNN GEMM was implemented by the same author in ggml-sycl. I think this entire issue is related to an old version of the driver (the working driver version is 1.3.30872; 1.3.30714 is affected).

@characharm

This issue is due to insufficient memory.

Got it. I was hoping this would work the same way as the Vulkan backend, where system RAM is used when there is not enough VRAM.

@NeoZhangJianyu
Collaborator Author

@NeoZhangJianyu Indeed. I did discuss this with the author when they were working on it. Kernels which use the Q_K data type were having problems, and we had to revert to 32 to fix them, with a performance penalty.

Will the oneDNN GEMM path fix this? oneDNN GEMM was implemented by the same author in ggml-sycl. I think this entire issue is related to an old version of the driver (the working driver version is 1.3.30872; 1.3.30714 is affected).

I hope to fix all of the Q_K issues with this PR. Some users are blocked by them.

Better optimization via oneDNN GEMM is another big topic. I suggest creating a separate discussion for it; there may be more possible solutions.

@luoyu-intel
Contributor

LGTM

@NeoZhangJianyu NeoZhangJianyu merged commit 1db8c84 into ggml-org:master Oct 21, 2024
53 checks passed
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
arthw added a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw added a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024