
Bug: Intel Arc - not working at all #9106

Closed
Eugeniusz-Gienek opened this issue Aug 20, 2024 · 28 comments
Assignees
Labels
bug-unconfirmed critical severity Used to report critical severity bugs in llama.cpp (e.g. Crashing, Corrupted, Dataloss) stale SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language

Comments

@Eugeniusz-Gienek

What happened?

Going through the manual - the SYCL one, I mean. Everything compiles okay, but running it always throws an error. I can't make it work. OS used: Gentoo Linux. P.S. Docker doesn't work either. P.P.S. The device IS listed in the device list.

Name and Version

./build/bin/llama-cli --version

version: 3609 (2f3c146)
built with Intel(R) oneAPI DPC++/C++ Compiler 2024.2.1 (2024.2.1.20240711) for x86_64-unknown-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

# ZES_ENABLE_SYSMAN=1 ./build/bin/llama-cli -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm none -mg 0
Log start
main: build = 3609 (2f3c1466)
main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.2.1 (2024.2.1.20240711) for x86_64-unknown-linux-gnu
main: seed  = 1724182694
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from models/llama-2-7b.Q4_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: max token length = 48
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 2 SYCL devices:
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  3577.56 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: yes
found 2 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A380 Graphics|    1.3|    128|    1024|   32|  6064M|            1.3.29735|
| 1| [level_zero:gpu:1]|                 Intel UHD Graphics 630|    1.3|     24|     256|   32| 46333M|            1.3.29735|
llama_kv_cache_init:      SYCL0 KV buffer size =  2048.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =   296.00 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    16.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
Unexpected pattern!
UNREACHABLE executed at /var/tmp/portage/dev-util/spirv-llvm-translator-15.0.0-r1/work/SPIRV-LLVM-Translator-15.0.0/lib/SPIRV/SPIRVUtil.cpp:2037!
The program was built for 1 devices
Build program log for 'Intel(R) Arc(TM) A380 Graphics':
 -11 (PI_ERROR_BUILD_PROGRAM_FAILURE)Exception caught at file:/home/username/llama/ggml/src/ggml-sycl.cpp, line:2722
@Eugeniusz-Gienek Eugeniusz-Gienek added bug-unconfirmed critical severity Used to report critical severity bugs in llama.cpp (e.g. Crashing, Corrupted, Dataloss) labels Aug 20, 2024
@Eugeniusz-Gienek
Author

Additionally:

# ./build/bin/llama-ls-sycl-device
found 2 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A380 Graphics|    1.3|    128|    1024|   32|  6064M|            1.3.29735|
| 1| [level_zero:gpu:1]|                 Intel UHD Graphics 630|    1.3|     24|     256|   32| 46333M|            1.3.29735|

And

# sycl-ls
[opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz OpenCL 3.0 (Build 0) [2024.18.7.0.11_160000]
[opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A380 Graphics OpenCL 3.0 NEO  [24.22.029735]
[opencl:gpu][opencl:2] Intel(R) OpenCL Graphics, Intel(R) UHD Graphics 630 OpenCL 3.0 NEO  [24.22.029735]
[level_zero:gpu][level_zero:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A380 Graphics 1.3 [1.3.29735]
[level_zero:gpu][level_zero:1] Intel(R) Level-Zero, Intel(R) UHD Graphics 630 1.3 [1.3.29735]

@Eugeniusz-Gienek
Author

...and just in case:

# lspci -k
[...]
03:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A380] (rev 05)
        Subsystem: ASRock Incorporation Device 6006
        Kernel driver in use: i915
        Kernel modules: xe
00:02.0 VGA compatible controller: Intel Corporation CoffeeLake-S GT2 [UHD Graphics 630]
        Subsystem: Hewlett-Packard Company Device 83ee
        Kernel driver in use: i915

@qnixsynapse
Contributor

Run ./test-backend-ops -b SYCL0 and paste the output here.

@NeoZhangJianyu
Collaborator

@Eugeniusz-Gienek
Did you build it with FP16?
If yes, please build it with FP32 first.
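
For reference, a minimal sketch of the two build variants as the SYCL docs described them around that time (oneAPI environment sourced first; exact CMake flags may differ on other versions of llama.cpp):

source /opt/intel/oneapi/setvars.sh

# FP32 build (the default, which is what is being asked for here)
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j

# FP16 build differs only by one extra flag
cmake -B build -DGGML_SYCL=ON -DGGML_SYCL_F16=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j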

@airMeng
Collaborator

airMeng commented Aug 22, 2024

@ClarkChin08 please give a review

@ngxson ngxson added the SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language label Aug 22, 2024
@ProjectMoon

I am having this exact same issue, on Gentoo, with the exact same LLVM spirv library.

I built llama.cpp according to the SYCL docs, using the example build script, which builds with FP32 by default. So I have this issue when using FP32.

@ClarkChin08
Contributor

-sm none -mg 0

Hi, you used '-sm none -mg 0'. This keeps the whole model on GPU 0, whose memory is only 6064M. Please use '-sm layer' instead of '-sm none -mg 0'; this will split the graph across the two GPUs.
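
For clarity, the suggested change applied to the original invocation from the top of this issue (a sketch, only the split-mode flags differ):

ZES_ENABLE_SYSMAN=1 ./build/bin/llama-cli -m models/llama-2-7b.Q4_0.gguf \
    -p "Building a website can be done in 10 simple steps:" \
    -n 400 -e -ngl 33 -sm layer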

@NeoZhangJianyu
Collaborator

@Eugeniusz-Gienek
I see your CPU is an i5-8500.
Intel Arc GPUs are supported starting from 11th-gen Core CPUs or newer.
I'm not sure your CPU + Arc A380 combination is supported well by oneAPI. It looks like it is not.

@Eugeniusz-Gienek
Author

@Eugeniusz-Gienek
I see your CPU is an i5-8500.
Intel Arc GPUs are supported starting from 11th-gen Core CPUs or newer.
I'm not sure your CPU + Arc A380 combination is supported well by oneAPI. It looks like it is not.

Oh, that's very important information which I would assume belongs in the SYCL installation manual... Thanks for sharing!

@ProjectMoon

Hmm, I have a 10th gen CPU. That might be why mine does not work, either.

@Eugeniusz-Gienek
Author

To be honest, I bought the Intel Arc A380 specifically with the intent to use it in an HP ProDesk 600 G4 SFF, as there are not a lot of options out there in the SFF format. If I had known in advance it wouldn't work, I wouldn't have bought it. Anyway, I've managed to return it to the seller and, out of curiosity, checked whether the onboard Intel GPU would work at least somehow. The only way to get it working was Vulkan, which on that GPU is TERRIBLY slow. So, to sum up - nope, you probably don't want to use any Intel GPU if your CPU is not Intel 11th gen or newer. Ah, and actually I've switched to another PC for llama.cpp, with an Intel 10th-gen CPU and an NVidia 3060, which works great with GPT4ALL but to my surprise didn't work as well with llama.cpp: it turns out that if you want to run a webserver, the RAM requirement is also a thing, not only VRAM - so for an 8 GB llama model my 16 GB is .... not enough. Anyway, now it's a waiting game for additional RAM to arrive, and maybe the local GPT webserver will finally be functional on the network.

@NeoZhangJianyu
Collaborator

It's limited by oneAPI/SYCL support, in fact.

Some 10th-gen Core motherboards support Arc GPUs because they support the Resizable BAR feature.
Please check: https://www.intel.com/content/www/us/en/support/articles/000090831/graphics.html

@Eugeniusz-Gienek
Author

I would propose including this information in the manual, so that potential new users are not frustrated by purchasing hardware that won't fulfill their expectations... What do you think? Does it make sense?

@piDack
Contributor

piDack commented Aug 28, 2024

I suggest you consider using the Vulkan version of llama.cpp, as it is compatible with almost all graphics cards. Intel's GPU software & drivers are bad.

@Eugeniusz-Gienek
Author

I suggest you consider using the Vulkan version of llama.cpp, as it is compatible with almost all graphics cards. Intel's GPU software & drivers are bad.

This is exactly what I did, to be honest, but that implementation is extremely slow...

@NeoZhangJianyu
Collaborator

I would propose including this information in the manual, so that potential new users are not frustrated by purchasing hardware that won't fulfill their expectations... What do you think? Does it make sense?

Yes.

The Resizable BAR feature can be checked in the BIOS.
It's hard to list all the hardware that supports it.

If your PC with a 10th-gen Core CPU has Resizable BAR, it could support an Arc GPU.

@NeoZhangJianyu
Collaborator

With an i7-8700K on a GIGABYTE Z370d motherboard and "Above 4G Decoding" enabled in the BIOS, the Arc A770 is supported.
But because the "Resizable BAR" feature is missing, the performance is at 50%.

@jshield

jshield commented Sep 11, 2024

Technically, resizable BARs have been supported since PCIe 2.0; you can most likely enable them by modifying the DSDTs/DXE of your UEFI BIOS. There is a framework for that here: https://github.com/xCuri0/ReBarUEFI

If you do have Above 4G Decoding as an option, the Linux kernel will itself attempt to resize the BARs when loading the drivers, although I'm not sure whether the Arc drivers do that. The amdgpu drivers do, however.
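
If it helps, a rough way to check from Linux whether the card actually got a full-size BAR (the device address is taken from the lspci output earlier in this thread; the exact capability wording varies between lspci versions):

# inspect the Arc card's memory regions and resizable-BAR capability
sudo lspci -vv -s 03:00.0 | grep -iE "memory at|bar"

# see whether the kernel logged any BAR resizing at boot
sudo dmesg | grep -iE "resiz|BAR"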

@NineMeowICT

NineMeowICT commented Oct 2, 2024

@NeoZhangJianyu
This is mine:
[opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Core(TM) i3-9100 CPU @ 3.60GHz OpenCL 3.0 (Build 0) [2024.18.7.0.11_160000]
[opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO [24.35.30872.22]
[level_zero:gpu][level_zero:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.5 [1.3.30872]

The command that works properly:
./llama-server -m /media/ninemeow/Delta/NLP/Lumimaid-v0.2-8B-Q6_K-imat.gguf -c 48000

And I got this error after I specified the -ngl option:

MKL Warning: Incompatible OpenCL driver version. GPU performance may be reduced.
Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) -2 (PI_ERROR_DEVICE_NOT_AVAILABLE)
Exception caught at file:/home/ninemeow/llama.cpp/ggml/src/ggml-sycl.cpp, line:3438, func:operator()
SYCL error: CHECK_TRY_ERROR(dpct::gemm_batch( *main_stream, oneapi::mkl::transpose::trans, oneapi::mkl::transpose::nontrans, ne01, ne11, ne10, alpha, (const void **)(ptrs_src.get() + 0 * ne23), dpct::library_data_t::real_half, nb01 / nb00, (const void **)(ptrs_src.get() + 1 * ne23), dpct::library_data_t::real_half, nb11 / nb10, beta, (void **)(ptrs_dst.get() + 0 * ne23), cu_data_type, ne01, ne23, cu_compute_type)): Meet error in this line code!
  in function ggml_sycl_mul_mat_batched_sycl at /home/ninemeow/llama.cpp/ggml/src/ggml-sycl.cpp:3438
/home/ninemeow/llama.cpp/ggml/src/ggml-sycl/common.hpp:107: SYCL error
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
Aborted (core dumped)

llama.cpp: build: 3848 (c919d5d)
oneAPI: 24.2.1
Intel OpenCL compute runtime: 24.35.30872.22
Kernel: 6.8
ReBAR and Above 4G Decoding are enabled. (I use an i3-9100, though. The B360 chipset does in fact support ReBAR, and ASUS has pushed a BIOS update that adds ReBAR support in software.)

My A770 can run ComfyUI flawlessly in the meantime, so it doesn't look like a oneAPI compatibility issue.

@NeoZhangJianyu
Collaborator

NeoZhangJianyu commented Oct 12, 2024

@NineMeowICT
Could you remove the "build" folder and compile again with the default settings?
And please share the build command and the run command.

Additionally, does it work well with llama2-7b-q4, as in the example in SYCL.md?

@NineMeowICT

@NeoZhangJianyu
I tried what you said and got the same error.

build cmd: ./examples/sycl/build.sh
running cmd: ZES_ENABLE_SYSMAN=1 ./llama-server -m /media/ninemeow/Delta/NLP/Lumimaid-v0.2-8B-Q6_K-imat.gguf -ngl 33 -c 48000

But it does pass the run-llama2.sh test.

@NeoZhangJianyu
Collaborator

@NineMeowICT
I have reproduced this issue with b3848.
I am checking it.

@qnixsynapse
Contributor

@NineMeowICT Seems like #9612 (comment)

@NeoZhangJianyu
Collaborator

@NineMeowICT
Before we fix it, please try commit id c35e586 - it should work well.
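
Something along these lines should get you onto that commit (a sketch reusing the build script mentioned earlier in this thread):

git checkout c35e586
rm -rf build
./examples/sycl/build.sh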

@NeoZhangJianyu
Collaborator

@Eugeniusz-Gienek
Could you test with the latest code?
I have fixed it.

@Eugeniusz-Gienek
Author

Thank you, I will test it, hopefully today after work :)

@NineMeowICT

@NeoZhangJianyu Thank you. I just built it and tested several models and no errors occurred.

@github-actions github-actions bot added the stale label Dec 3, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
