
[SYCL] Optimize mul_mat for Q4_0 on Intel GPU #12035

Merged
merged 10 commits into ggml-org:master on Feb 24, 2025

Conversation

NeoZhangJianyu
Collaborator

@NeoZhangJianyu NeoZhangJianyu commented Feb 23, 2025

Optimize MUL_MAT Q4_0 on Intel GPU.

  • Change the number of threads used by the kernel function.
  • Reorder the Q4_0 block to separate the quantized weights from the dequantization scales.
    The reorder is executed only once, during compute graph execution (a layout sketch follows this list).
  • Detect the Intel device type for the optimization and save the result in ggml_backend_sycl_context.
  • Show during startup whether the detected hardware supports the optimization feature:
  SYCL Optimization Feature:
  |ID|        Device Type|Reorder|
  |--|-------------------|-------|
  | 0| [level_zero:gpu:0]|      Y|
  • Save the optimization status in the tensor's extra field.
  • The optimization currently covers only Q4_0. The framework can easily be extended to other QX_Y types in the future.
  • Add the environment variable GGML_SYCL_DISABLE_OPT for debugging.
    Its value is shown at startup:
    Running with Environment Variables:
  GGML_SYCL_DEBUG: 0
  GGML_SYCL_DISABLE_OPT: 0
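
As a rough illustration of the reorder above, here is a minimal host-side sketch. The names and types are simplified and illustrative; the PR itself performs the reorder on the device in reorder_qw().

#include <cstdint>
#include <cstring>

// Simplified Q4_0 block: one fp16 scale followed by 32 packed 4-bit weights.
constexpr int QK4_0 = 32;
struct block_q4_0 {
    uint16_t d;              // fp16 scale (raw bits, for simplicity)
    uint8_t  qs[QK4_0 / 2];  // 32 x 4-bit quantized weights
};

// Split an array-of-blocks buffer into [all qs][all d] so a kernel can read
// the weights and the scales with contiguous, unit-stride accesses.
void reorder_q4_0(const block_q4_0 * src, uint8_t * dst, int nblocks) {
    uint8_t  * qs_out = dst;
    uint16_t * d_out  = reinterpret_cast<uint16_t *>(dst + nblocks * (QK4_0 / 2));
    for (int ib = 0; ib < nblocks; ++ib) {
        std::memcpy(qs_out + ib * (QK4_0 / 2), src[ib].qs, QK4_0 / 2);
        d_out[ib] = src[ib].d;
    }
}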

It passes the local CI.

Here is the performance increase on Intel GPUs (dGPUs, and iGPUs starting from MTL), tested with llama-2-7b.Q4_0.gguf:

|GPU     |Base tokens/s|Increased tokens/s|
|--------|-------------|------------------|
|PVC 1550|           39|                73|
|Flex 170|           39|                50|
|Arc770  |           42|                55|
|MTL     |           13|                16|
|ARL-H   |           14|                17|

For iGPUs older than MTL, the optimization does not improve performance, so support for them is skipped for now; more study is needed.

For non-Intel GPUs, the optimization feature detection never enables the feature in code. It is unknown whether the code works well on non-Intel GPUs; this could be verified.

I hope this optimization is the seed of further optimization work on Intel GPUs. This solution is not the best possible one for Intel GPUs; there is still huge untapped potential, and more study is needed in the future.

Thanks to @luoyu-intel and @airMeng for contributing to the solution.

@github-actions github-actions bot added the examples, ggml (changes relating to the ggml tensor library for machine learning) and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) labels Feb 23, 2025
@github-actions github-actions bot added the documentation (Improvements or additions to documentation) label Feb 24, 2025
@NeoZhangJianyu NeoZhangJianyu merged commit 08d5986 into ggml-org:master Feb 24, 2025
47 checks passed
@Alcpz Alcpz left a comment

@NeoZhangJianyu I was finishing a review and just clicked the merge button. I disagree with some of your approaches for the backend. Please at least answer the comments I am leaving here.

Comment on lines +197 to +198
// int nsm; // number of streaming multiprocessors
// size_t smpb; // max. shared memory per block

Since there is no use for this right now, it could be good to start removing these as well.

arch == syclex::architecture::intel_gpu_arl_u ||
arch == syclex::architecture::intel_gpu_arl_s ||
arch == syclex::architecture::intel_gpu_arl_h ||
arch == syclex::architecture::intel_gpu_bmg_g21 ||

I have access to a BMG GPU; I'll reply later with perf numbers, since I guess you'd want to add them to the README.
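
For context, this check is built on the sycl_ext_oneapi_device_architecture extension. A minimal stand-alone sketch of the query (the helper name is illustrative, and the architecture list is just the subset visible in this excerpt):

#include <sycl/sycl.hpp>

namespace syclex = sycl::ext::oneapi::experimental;

// Illustrative helper (not the PR's code): query the device architecture and
// compare it against a whitelist of Intel GPU generations.
static bool is_arl_or_bmg(const sycl::device & dev) {
    const auto arch = dev.get_info<syclex::info::device::architecture>();
    return arch == syclex::architecture::intel_gpu_arl_u ||
           arch == syclex::architecture::intel_gpu_arl_s ||
           arch == syclex::architecture::intel_gpu_arl_h ||
           arch == syclex::architecture::intel_gpu_bmg_g21;
}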

Comment on lines +49 to +51
const dfloat d = (const dfloat)*((const sycl::half*)d_ptr+ib);

const int vui = *((const uint8_t *)qs+iqs);

The overall contribution is great. I was doing similar work for the Q4_K quantization, and this is quite helpful.

Comment on lines +56 to +65
#ifdef GGML_SYCL_F16
// v = v - {8.0f, 8.0f};
// v = v * {d, d};
v.s0() = (v.s0() - 8.0f) * d;
v.s1() = (v.s1() - 8.0f) * d;

#else
v.x() = (v.x() - 8.0f) * d;
v.y() = (v.y() - 8.0f) * d;
#endif // GGML_SYCL_F16

A lot of the new code has the same functionality, with very minor differences in how the data is accessed. I worry about the combinatorial explosion of having duplicated code in order to maintain support for all non-reordered and reordered quants. Long term, I think it's best for the backend to avoid going in this direction.
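
One way to contain that duplication, sketched here purely to illustrate the direction (the accessor type and names below are hypothetical, not part of ggml-sycl), is to funnel the layout difference through a small accessor so the dequantization arithmetic is written once:

#include <cstdint>
#include <sycl/sycl.hpp>

// Hypothetical accessor: knows how to fetch the scale and the packed weights
// of block `ib` for the reordered layout; a sibling type would do the same
// for the plain interleaved layout.
struct q4_0_reordered {
    const uint8_t    * qs_base;  // contiguous weights region
    const sycl::half * d_base;   // contiguous scales region
    sycl::half scale(int ib) const          { return d_base[ib]; }
    uint8_t    quant(int ib, int iqs) const { return qs_base[ib * 16 + iqs]; }
};

// The shared math: unpack the two nibbles, shift to [-8, 7] and scale.
template <typename Layout>
inline sycl::float2 dequant_q4_0(const Layout & l, int ib, int iqs) {
    const float   d   = static_cast<float>(l.scale(ib));
    const uint8_t vui = l.quant(ib, iqs);
    return sycl::float2(((vui & 0xF) - 8.0f) * d, ((vui >> 4) - 8.0f) * d);
}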

const int tid = item_ct1.get_local_id(2);


const int ncols_left = ncols % (QK4_0*WARP_SIZE);

I see multiple references to QK4_0 directly in here instead of using qk. Is this intended?
My understanding is that you chose the block size to distribute work nicely between the threads and then process the non-aligned columns later, but I was expecting it to be based on the templated qk, and not on a specific quantization.
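
To make that concrete, the remainder handling could be expressed in terms of the templated block size rather than QK4_0. A hedged sketch, assuming qk and warp_size are the template parameter and constant used by the surrounding kernel:

// Illustrative only: the same column split, parameterized on the templated qk.
template <int qk, int warp_size>
inline void split_cols(const int ncols, int & ncols_aligned, int & ncols_left) {
    // Each pass of the main loop consumes qk values per lane across a warp,
    // i.e. qk * warp_size columns; the remainder goes to a tail loop.
    ncols_left    = ncols % (qk * warp_size);
    ncols_aligned = ncols - ncols_left;
}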


I've seen that below you only have this enabled for Q4_0; I still think this could be generalized.

@@ -3570,6 +3389,7 @@ static void ggml_sycl_mul_mat(ggml_backend_sycl_context & ctx, const ggml_tensor
ggml_sycl_mul_mat_batched_sycl(ctx, src0, src1, dst);
} else if (use_dequantize_mul_mat_vec) {
ggml_sycl_op_mul_mat(ctx, src0, src1, dst, ggml_sycl_op_dequantize_mul_mat_vec, false);
// save_tensor_txt("1/dst_1.txt", (float*) dst->data, src0->ne[1], sizeof(float), ctx.stream());

Suggested change
// save_tensor_txt("1/dst_1.txt", (float*) dst->data, src0->ne[1], sizeof(float), ctx.stream());

@@ -4251,10 +4071,72 @@ catch (sycl::exception const &exc) {
std::exit(1);
}

void reorder_qw(char *data_device, const int ncols, const int nrows,
size_t size, size_t offset, dpct::queue_ptr stream) {
auto tmp_buf = sycl::malloc_shared<char>(size, *stream);

tmp_buf doesn't seem to be used on the host side. Unless I am missing something, this can be a device memory pointer.
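
A minimal sketch of that suggestion, assuming stream is a dpct::queue_ptr (effectively a sycl::queue *) and the temporary is only ever read on the device:

#include <sycl/sycl.hpp>

// Sketch only: device allocation plus an explicit copy instead of malloc_shared.
static void reorder_with_device_tmp(char * data_device, size_t size, sycl::queue & q) {
    char * tmp_buf = sycl::malloc_device<char>(size, q);
    q.memcpy(tmp_buf, data_device, size).wait();
    // ... launch the reorder kernel that reads tmp_buf and writes data_device ...
    sycl::free(tmp_buf, q);
}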

Comment on lines +4080 to +4081
GGML_ASSERT((size % sizeof(block_q4_0) == 0));
GGML_ASSERT((offset % sizeof(block_q4_0) == 0));

Suggested change
GGML_ASSERT((size % sizeof(block_q4_0) == 0));
GGML_ASSERT((offset % sizeof(block_q4_0) == 0));
GGML_ASSERT(size % sizeof(block_q4_0) == 0);
GGML_ASSERT(offset % sizeof(block_q4_0) == 0);

These are also probably nicer at the beginning of the function; there's no point in allocating device memory and copying data if an error is found.

GGML_ASSERT((size % sizeof(block_q4_0) == 0));
GGML_ASSERT((offset % sizeof(block_q4_0) == 0));
int offset_blks = offset / sizeof(block_q4_0);
auto qs_ptr = (uint8_t*)data_device + offset_blks * QK4_0 / 2;;

Suggested change
auto qs_ptr = (uint8_t*)data_device + offset_blks * QK4_0 / 2;;
auto qs_ptr = (uint8_t*)data_device + offset_blks * QK4_0 / 2;

static ggml_status ggml_backend_sycl_graph_compute(ggml_backend_t backend, ggml_cgraph * cgraph) {
ggml_backend_sycl_context * sycl_ctx = (ggml_backend_sycl_context *)backend->context;
ggml_sycl_set_main_device(sycl_ctx->device);

if (!g_ggml_sycl_disable_optimize) optimize_graph_once(cgraph, sycl_ctx);

According to #8432 (mentioning because @luoyu-intel was part of the discussion), the suggested approach for having different data layouts is to modify both the set and get tensors from the backend. If we did this, we would not need to do this data reorganization. Have you looked into this approach to see if it was possible?
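
For reference, that alternative would roughly mean doing the reorder in the buffer's set_tensor path instead of walking the graph and tagging tensor->extra. A conceptual sketch, with the helper functions declared here purely as hypothetical placeholders:

#include "ggml.h"
#include "ggml-backend.h"

// Hypothetical helpers, declared only to keep the sketch self-contained.
void sycl_upload(ggml_backend_buffer_t buf, ggml_tensor * t, const void * data, size_t offset, size_t size);
bool sycl_reorder_supported(ggml_backend_buffer_t buf);
void sycl_reorder_q4_0_on_device(ggml_backend_buffer_t buf, ggml_tensor * t, size_t offset, size_t size);

// Reorder at upload time, so no per-tensor `extra` is required. get_tensor
// would then need the inverse transform so the host always sees the GGUF layout.
static void sycl_buffer_set_tensor_reordered(ggml_backend_buffer_t buffer,
                                             ggml_tensor * tensor,
                                             const void * data,
                                             size_t offset, size_t size) {
    sycl_upload(buffer, tensor, data, offset, size);
    if (tensor->type == GGML_TYPE_Q4_0 && sycl_reorder_supported(buffer)) {
        sycl_reorder_q4_0_on_device(buffer, tensor, offset, size);
    }
}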

@qnixsynapse
Contributor

@Alcpz I am worried that this PR will break RPC (since it adds tensor->extra in the init tensor function again). I added a review, but it was never attended to for some reason.

@Alcpz
Collaborator

Alcpz commented Feb 24, 2025

@qnixsynapse, if you can test whether it works or not, I am happy to revert it if needed.

@qnixsynapse
Contributor

Unfortunately, I do not have multiple PCs with Intel GPUs. But it was previously disabled for the reason mentioned. Please see #5277 (reply in thread).

@Alcpz
Collaborator

Alcpz commented Feb 24, 2025

I will find out if I can set up something on my side. Thanks for your help!

@Alcpz
Collaborator

Alcpz commented Feb 24, 2025

@qnixsynapse I've confirmed that the RPC server is broken by this build. Thank you for your pointer:

❯ bin/rpc-server -p 50052
create_backend: using SYCL backend
Running with Environment Variables:
  GGML_SYCL_DEBUG: 0
  GGML_SYCL_DISABLE_OPT: 0
Build with Macros:
  GGML_SYCL_FORCE_MMQ: no
  GGML_SYCL_F16: yes
Found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|         Intel Data Center GPU Max 1100|  12.60|    448|    1024|   32| 51539M|         1.6.32224+14|
SYCL Optimization Feature:
|ID|        Device Type|Reorder|
|--|-------------------|-------|
| 0| [level_zero:gpu:0]|      Y|
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
Starting RPC server on 127.0.0.1:50052, backend memory: 49152 MB
Accepted client connection, free_mem=51539607552, total_mem=51539607552
Client connection closed
Accepted client connection, free_mem=51539607552, total_mem=51539607552
Client connection closed
Accepted client connection, free_mem=51539607552, total_mem=51539607552
Client connection closed
Accepted client connection, free_mem=51539607552, total_mem=51539607552
Client connection closed
Accepted client connection, free_mem=51539607552, total_mem=51539607552
./sources/llama.cpp-master/ggml/src/ggml-sycl/ggml-sycl.cpp:4120: GGML_ASSERT(extra) failed
[New LWP 2471816]
Registering SYCL extensions for gdb
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007faece36842f in __GI___wait4 (pid=2472291, stat_loc=0x7fff0df43b5c, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0  0x00007faece36842f in __GI___wait4 (pid=2472291, stat_loc=0x7fff0df43b5c, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x00007faece7f8ae6 in ggml_print_backtrace () from ./sources/llama.cpp-master/build/bin/libggml-base.so
#2  0x00007faece7f8a80 in ggml_abort () from ./sources/llama.cpp-master/build/bin/libggml-base.so
#3  0x00007faece90a1ce in optimize_graph_once(ggml_cgraph*, ggml_backend_sycl_context*) () from ./sources/llama.cpp-master/build/bin/libggml-sycl.so
#4  0x00007faece92f509 in ggml_backend_sycl_graph_compute(ggml_backend*, ggml_cgraph*) () from sources/llama.cpp-master/build/bin/libggml-sycl.so
#5  0x00007faece80e437 in ggml_backend_graph_compute () from ./sources/llama.cpp-master/build/bin/libggml-base.so
#6  0x00007faecebfbfda in rpc_server::graph_compute(std::vector<unsigned char, std::allocator<unsigned char> > const&, rpc_msg_graph_compute_rsp&) () from ./sources/llama.cpp-master/build/bin/libggml-rpc.so
#7  0x00007faecebfcaf5 in ggml_backend_rpc_start_server () from ./sources/llama.cpp-master/build/bin/libggml-rpc.so
#8  0x0000000000404f42 in main ()
[Inferior 1 (process 2471814) detached]
[1]    2471814 IOT instruction  bin/rpc-server -p 50052

As you suspected, it is due to the extra field in the tensors. If I set GGML_SYCL_DISABLE_OPT=1, I've confirmed that it works as expected, since the extra field is not used in that case. Even though the patch only breaks models that use Q4_0, I think we should revert this and find a way to enable these optimizations without relying on the extra field.

Edit: Added working logs

❯ bin/rpc-server -p 50052
create_backend: using SYCL backend
Running with Environment Variables:
  GGML_SYCL_DEBUG: 0
  GGML_SYCL_DISABLE_OPT: 1
Build with Macros:
  GGML_SYCL_FORCE_MMQ: no
  GGML_SYCL_F16: yes
Found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|         Intel Data Center GPU Max 1100|  12.60|    448|    1024|   32| 51539M|         1.6.32224+14|
SYCL Optimization Feature:
|ID|        Device Type|Reorder|
|--|-------------------|-------|
| 0| [level_zero:gpu:0]|      Y|
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
Starting RPC server on 127.0.0.1:50052, backend memory: 49152 MB
Accepted client connection, free_mem=51539607552, total_mem=51539607552
Client connection closed
Accepted client connection, free_mem=51539607552, total_mem=51539607552
Client connection closed
Accepted client connection, free_mem=51539607552, total_mem=51539607552
Client connection closed
Accepted client connection, free_mem=51539607552, total_mem=51539607552
Client connection closed
Accepted client connection, free_mem=51539607552, total_mem=51539607552
bin/llama-cli -m ./llama3-med42-8b-q4_0.gguf -p "Hello my name is" --repeat-penalty 1.0 -n 64 --rpc 127.0.0.1:50052 -ngl 99
build: 4767 (08d59862) with gcc (GCC) 12.3.0 for x86_64-pc-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device RPC[127.0.0.1:50052] (RPC[127.0.0.1:50052]) - 49152 MiB free
llama_model_loader: loaded meta data with 31 key-value pairs and 291 tensors from ./llama3-med42-8b-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama

// .. Removed to avoid verbosity

llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 4.33 GiB (4.64 BPW) 
load: special tokens cache size = 256
load: token to piece cache size = 0.8000 MB
print_info: arch             = llama

// ...

load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors: RPC[127.0.0.1:50052] model buffer size =  4155.99 MiB
load_tensors:   CPU_Mapped model buffer size =   281.81 MiB

// ...



system_info: n_threads = 48 (n_threads_batch = 48) / 96 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | AMX_INT8 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: interactive mode on.
sampler seed: 503630248
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 64, n_keep = 1

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

system

Hello my name is


> hello
It seems like you started to introduce yourself, but it got cut off. No worries! If you'd like to continue, you can simply say, "My name is [your name], and..." or "I'm [your name], and..." If you're feeling
> 

@qnixsynapse
Contributor

@Alcpz Thank you so much for taking the initiative to test it.

> As you suspected, it is due to the extra field in the tensors. If I set GGML_SYCL_DISABLE_OPT=1, I've confirmed that it works as expected, since the extra field is not used in that case. Even though the patch only breaks models that use Q4_0, I think we should revert this and find a way to enable these optimizations without relying on the extra field.

Yes indeed.

orca-zhang pushed a commit to orca-zhang/llama.cpp that referenced this pull request Feb 26, 2025
* opt performance by reorder for Intel GPU

* detect hw type and save opt feature, and print opt feature

* correct name

* support optimize graph once when compute graph, record the opt status in tensor->extra, make CI passed

* add env variable GGML_SYCL_DISABLE_OPT for debug

* use syclex::architecture replace the custom hw define, update the guide for GGML_SYCL_DISABLE_OPT

* add performance data

* mv getrows functions to separeted files

* fix global variables

---------

Co-authored-by: arthw <[email protected]>
arthw added a commit to arthw/llama.cpp that referenced this pull request Feb 26, 2025
* opt performance by reorder for Intel GPU

* detect hw type and save opt feature, and print opt feature

* correct name

* support optimize graph once when compute graph, record the opt status in tensor->extra, make CI passed

* add env variable GGML_SYCL_DISABLE_OPT for debug

* use syclex::architecture replace the custom hw define, update the guide for GGML_SYCL_DISABLE_OPT

* add performance data

* mv getrows functions to separeted files

* fix global variables

---------

Co-authored-by: arthw <[email protected]>