
[SYCL] Optimize mul_mat for Q4_0 on Intel GPU #12035

Merged
merged 10 commits into ggml-org:master on Feb 24, 2025

Conversation

NeoZhangJianyu
Collaborator

@NeoZhangJianyu NeoZhangJianyu commented Feb 23, 2025

Optimize MUL_MAT Q4_0 on Intel GPU.

  • Change the number of threads used by the kernel function.
  • Reorder the Q4_0 block to separate the quantized weights from the dequantization scales.
    The reorder is executed only once, during compute graph execution (a layout sketch follows this list).
  • Detect the Intel device type for the optimization and save the result in ggml_backend_sycl_context.
  • Show during startup whether the detected hardware supports the optimization feature:
  SYCL Optimization Feature:
  |ID|        Device Type|Reorder|
  |--|-------------------|-------|
  | 0| [level_zero:gpu:0]|      Y|
  • Save the optimization status in the tensor's extra field.
  • The optimization currently covers only Q4_0. The framework can easily be extended to other QX_Y types in the future.
  • Add the environment variable GGML_SYCL_DISABLE_OPT for debugging.
    Its value is shown at startup:
    Running with Environment Variables:
  GGML_SYCL_DEBUG: 0
  GGML_SYCL_DISABLE_OPT: 0
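
As a rough illustration of the reorder above, here is a minimal host-side sketch. The names and types are simplified and illustrative; the PR itself performs the reorder on the device in reorder_qw().

#include <cstdint>
#include <cstring>

// Simplified Q4_0 block: one fp16 scale followed by 32 packed 4-bit weights.
constexpr int QK4_0 = 32;
struct block_q4_0 {
    uint16_t d;              // fp16 scale (raw bits, for simplicity)
    uint8_t  qs[QK4_0 / 2];  // 32 x 4-bit quantized weights
};

// Split an array-of-blocks buffer into [all qs][all d] so a kernel can read
// the weights and the scales with contiguous, unit-stride accesses.
void reorder_q4_0(const block_q4_0 * src, uint8_t * dst, int nblocks) {
    uint8_t  * qs_out = dst;
    uint16_t * d_out  = reinterpret_cast<uint16_t *>(dst + nblocks * (QK4_0 / 2));
    for (int ib = 0; ib < nblocks; ++ib) {
        std::memcpy(qs_out + ib * (QK4_0 / 2), src[ib].qs, QK4_0 / 2);
        d_out[ib] = src[ib].d;
    }
}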

It passes the local CI.

Here is the performance increase on Intel GPUs (dGPUs, and iGPUs starting from MTL), tested with llama-2-7b.Q4_0.gguf:

|GPU     |Base tokens/s|Increased tokens/s|
|--------|-------------|------------------|
|PVC 1550|           39|                73|
|Flex 170|           39|                50|
|Arc770  |           42|                55|
|MTL     |           13|                16|
|ARL-H   |           14|                17|

For iGPUs older than MTL, the optimization does not improve performance, so support for them is skipped for now; more study is needed.

For non-Intel GPUs, the optimization feature detection never enables the feature in code. It is unknown whether the code works well on non-Intel GPUs; this could be verified.

I hope this optimization is the seed of further optimization work on Intel GPUs. This solution is not the best possible one for Intel GPUs; there is still huge untapped potential, and more study is needed in the future.

Thanks to @luoyu-intel and @airMeng for contributing to the solution.

@github-actions github-actions bot added the examples, ggml (changes relating to the ggml tensor library for machine learning) and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) labels Feb 23, 2025
@github-actions github-actions bot added the documentation (Improvements or additions to documentation) label Feb 24, 2025
@NeoZhangJianyu NeoZhangJianyu merged commit 08d5986 into ggml-org:master Feb 24, 2025
47 checks passed
@Alcpz Alcpz left a comment

@NeoZhangJianyu I was finishing a review and just clicked the merge button. I disagree with some of your approaches for the backend. Please at least answer the comments I am leaving here.

Comment on lines +197 to +198
// int nsm; // number of streaming multiprocessors
// size_t smpb; // max. shared memory per block

Since there is no use for this right now, it could be good to start removing these as well.

arch == syclex::architecture::intel_gpu_arl_u ||
arch == syclex::architecture::intel_gpu_arl_s ||
arch == syclex::architecture::intel_gpu_arl_h ||
arch == syclex::architecture::intel_gpu_bmg_g21 ||

I have access to a BMG GPU; I'll reply later with perf numbers, since I guess you'd want to add them to the README.
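
For context, this check is built on the sycl_ext_oneapi_device_architecture extension. A minimal stand-alone sketch of the query (the helper name is illustrative, and the architecture list is just the subset visible in this excerpt):

#include <sycl/sycl.hpp>

namespace syclex = sycl::ext::oneapi::experimental;

// Illustrative helper (not the PR's code): query the device architecture and
// compare it against a whitelist of Intel GPU generations.
static bool is_arl_or_bmg(const sycl::device & dev) {
    const auto arch = dev.get_info<syclex::info::device::architecture>();
    return arch == syclex::architecture::intel_gpu_arl_u ||
           arch == syclex::architecture::intel_gpu_arl_s ||
           arch == syclex::architecture::intel_gpu_arl_h ||
           arch == syclex::architecture::intel_gpu_bmg_g21;
}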

Comment on lines +49 to +51
const dfloat d = (const dfloat)*((const sycl::half*)d_ptr+ib);

const int vui = *((const uint8_t *)qs+iqs);

The overall contribution is great. I was doing similar work for the Q4_K quantization, and this is quite helpful.

Comment on lines +56 to +65
#ifdef GGML_SYCL_F16
// v = v - {8.0f, 8.0f};
// v = v * {d, d};
v.s0() = (v.s0() - 8.0f) * d;
v.s1() = (v.s1() - 8.0f) * d;

#else
v.x() = (v.x() - 8.0f) * d;
v.y() = (v.y() - 8.0f) * d;
#endif // GGML_SYCL_F16

A lot of the new code has the same functionality, with very minor differences in how the data is accessed. I worry about the combinatorial explosion of having duplicated code in order to maintain support for all non-reordered and reordered quants. Long term, I think it's best for the backend to avoid going in this direction.
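
One way to contain that duplication, sketched here purely to illustrate the direction (the accessor type and names below are hypothetical, not part of ggml-sycl), is to funnel the layout difference through a small accessor so the dequantization arithmetic is written once:

#include <cstdint>
#include <sycl/sycl.hpp>

// Hypothetical accessor: knows how to fetch the scale and the packed weights
// of block `ib` for the reordered layout; a sibling type would do the same
// for the plain interleaved layout.
struct q4_0_reordered {
    const uint8_t    * qs_base;  // contiguous weights region
    const sycl::half * d_base;   // contiguous scales region
    sycl::half scale(int ib) const          { return d_base[ib]; }
    uint8_t    quant(int ib, int iqs) const { return qs_base[ib * 16 + iqs]; }
};

// The shared math: unpack the two nibbles, shift to [-8, 7] and scale.
template <typename Layout>
inline sycl::float2 dequant_q4_0(const Layout & l, int ib, int iqs) {
    const float   d   = static_cast<float>(l.scale(ib));
    const uint8_t vui = l.quant(ib, iqs);
    return sycl::float2(((vui & 0xF) - 8.0f) * d, ((vui >> 4) - 8.0f) * d);
}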

const int tid = item_ct1.get_local_id(2);


const int ncols_left = ncols % (QK4_0*WARP_SIZE);

I see multiple references to QK4_0 directly in here instead of using qk. Is this intended?
My understanding is that you chose the block size to distribute work nicely between the threads and then process the non-aligned columns later, but I was expecting it to be based on the templated qk, and not on a specific quantization.
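
To make that concrete, the remainder handling could be expressed in terms of the templated block size rather than QK4_0. A hedged sketch, assuming qk and warp_size are the template parameter and constant used by the surrounding kernel:

// Illustrative only: the same column split, parameterized on the templated qk.
template <int qk, int warp_size>
inline void split_cols(const int ncols, int & ncols_aligned, int & ncols_left) {
    // Each pass of the main loop consumes qk values per lane across a warp,
    // i.e. qk * warp_size columns; the remainder goes to a tail loop.
    ncols_left    = ncols % (qk * warp_size);
    ncols_aligned = ncols - ncols_left;
}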


I've seen that below you only have this enabled for Q4_0; I still think this could be generalized.

@@ -3570,6 +3389,7 @@ static void ggml_sycl_mul_mat(ggml_backend_sycl_context & ctx, const ggml_tensor
ggml_sycl_mul_mat_batched_sycl(ctx, src0, src1, dst);
} else if (use_dequantize_mul_mat_vec) {
ggml_sycl_op_mul_mat(ctx, src0, src1, dst, ggml_sycl_op_dequantize_mul_mat_vec, false);
// save_tensor_txt("1/dst_1.txt", (float*) dst->data, src0->ne[1], sizeof(float), ctx.stream());

Suggested change
// save_tensor_txt("1/dst_1.txt", (float*) dst->data, src0->ne[1], sizeof(float), ctx.stream());

@@ -4251,10 +4071,72 @@ catch (sycl::exception const &exc) {
std::exit(1);
}

void reorder_qw(char *data_device, const int ncols, const int nrows,
size_t size, size_t offset, dpct::queue_ptr stream) {
auto tmp_buf = sycl::malloc_shared<char>(size, *stream);

tmp_buf doesn't seem to be used on the host side. Unless I am missing something, this can be a device memory pointer.
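
A minimal sketch of that suggestion, assuming stream is a dpct::queue_ptr (effectively a sycl::queue *) and the temporary is only ever read on the device:

#include <sycl/sycl.hpp>

// Sketch only: device allocation plus an explicit copy instead of malloc_shared.
static void reorder_with_device_tmp(char * data_device, size_t size, sycl::queue & q) {
    char * tmp_buf = sycl::malloc_device<char>(size, q);
    q.memcpy(tmp_buf, data_device, size).wait();
    // ... launch the reorder kernel that reads tmp_buf and writes data_device ...
    sycl::free(tmp_buf, q);
}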

Comment on lines +4080 to +4081
GGML_ASSERT((size % sizeof(block_q4_0) == 0));
GGML_ASSERT((offset % sizeof(block_q4_0) == 0));

Suggested change
GGML_ASSERT((size % sizeof(block_q4_0) == 0));
GGML_ASSERT((offset % sizeof(block_q4_0) == 0));
GGML_ASSERT(size % sizeof(block_q4_0) == 0);
GGML_ASSERT(offset % sizeof(block_q4_0) == 0);

These are also probably nicer at the beginning of the function; there's no point in allocating device memory and copying data if an error is found.

GGML_ASSERT((size % sizeof(block_q4_0) == 0));
GGML_ASSERT((offset % sizeof(block_q4_0) == 0));
int offset_blks = offset / sizeof(block_q4_0);
auto qs_ptr = (uint8_t*)data_device + offset_blks * QK4_0 / 2;;

Suggested change
auto qs_ptr = (uint8_t*)data_device + offset_blks * QK4_0 / 2;;
auto qs_ptr = (uint8_t*)data_device + offset_blks * QK4_0 / 2;

static ggml_status ggml_backend_sycl_graph_compute(ggml_backend_t backend, ggml_cgraph * cgraph) {
ggml_backend_sycl_context * sycl_ctx = (ggml_backend_sycl_context *)backend->context;
ggml_sycl_set_main_device(sycl_ctx->device);

if (!g_ggml_sycl_disable_optimize) optimize_graph_once(cgraph, sycl_ctx);

According to #8432 (mentioning because @luoyu-intel was part of the discussion), the suggested approach for having different data layouts is to modify both the set and get tensors from the backend. If we did this, we would not need to do this data reorganization. Have you looked into this approach to see if it was possible?
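
For reference, that alternative would roughly mean doing the reorder in the buffer's set_tensor path instead of walking the graph and tagging tensor->extra. A conceptual sketch, with the helper functions declared here purely as hypothetical placeholders:

#include "ggml.h"
#include "ggml-backend.h"

// Hypothetical helpers, declared only to keep the sketch self-contained.
void sycl_upload(ggml_backend_buffer_t buf, ggml_tensor * t, const void * data, size_t offset, size_t size);
bool sycl_reorder_supported(ggml_backend_buffer_t buf);
void sycl_reorder_q4_0_on_device(ggml_backend_buffer_t buf, ggml_tensor * t, size_t offset, size_t size);

// Reorder at upload time, so no per-tensor `extra` is required. get_tensor
// would then need the inverse transform so the host always sees the GGUF layout.
static void sycl_buffer_set_tensor_reordered(ggml_backend_buffer_t buffer,
                                             ggml_tensor * tensor,
                                             const void * data,
                                             size_t offset, size_t size) {
    sycl_upload(buffer, tensor, data, offset, size);
    if (tensor->type == GGML_TYPE_Q4_0 && sycl_reorder_supported(buffer)) {
        sycl_reorder_q4_0_on_device(buffer, tensor, offset, size);
    }
}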

@qnixsynapse
Contributor

@Alcpz I am worried that this PR will break RPC (since it adds tensor->extra in the init tensor function again). I added a review, but it was never attended to for some reason.

@Alcpz
Collaborator

Alcpz commented Feb 24, 2025

@qnixsynapse, if you can test whether it works or not, I am happy to revert it if needed.

@qnixsynapse
Contributor

Unfortunately, I do not have multiple PCs with Intel GPUs. But it was previously disabled for the reason mentioned. Please see #5277 (reply in thread).

@Alcpz
Collaborator

Alcpz commented Feb 24, 2025

I will find out if I can set up something on my side. Thanks for your help!

@Alcpz
Collaborator

Alcpz commented Feb 24, 2025

@qnixsynapse I've confirmed that the RPC server is broken by this build. Thank you for your pointer:

❯ bin/rpc-server -p 50052
create_backend: using SYCL backend
Running with Environment Variables:
  GGML_SYCL_DEBUG: 0
  GGML_SYCL_DISABLE_OPT: 0
Build with Macros:
  GGML_SYCL_FORCE_MMQ: no
  GGML_SYCL_F16: yes
Found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|         Intel Data Center GPU Max 1100|  12.60|    448|    1024|   32| 51539M|         1.6.32224+14|
SYCL Optimization Feature:
|ID|        Device Type|Reorder|
|--|-------------------|-------|
| 0| [level_zero:gpu:0]|      Y|
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
Starting RPC server on 127.0.0.1:50052, backend memory: 49152 MB
Accepted client connection, free_mem=51539607552, total_mem=51539607552
Client connection closed
Accepted client connection, free_mem=51539607552, total_mem=51539607552
Client connection closed
Accepted client connection, free_mem=51539607552, total_mem=51539607552
Client connection closed
Accepted client connection, free_mem=51539607552, total_mem=51539607552
Client connection closed
Accepted client connection, free_mem=51539607552, total_mem=51539607552
./sources/llama.cpp-master/ggml/src/ggml-sycl/ggml-sycl.cpp:4120: GGML_ASSERT(extra) failed
[New LWP 2471816]
Registering SYCL extensions for gdb
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007faece36842f in __GI___wait4 (pid=2472291, stat_loc=0x7fff0df43b5c, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0  0x00007faece36842f in __GI___wait4 (pid=2472291, stat_loc=0x7fff0df43b5c, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x00007faece7f8ae6 in ggml_print_backtrace () from ./sources/llama.cpp-master/build/bin/libggml-base.so
#2  0x00007faece7f8a80 in ggml_abort () from ./sources/llama.cpp-master/build/bin/libggml-base.so
#3  0x00007faece90a1ce in optimize_graph_once(ggml_cgraph*, ggml_backend_sycl_context*) () from ./sources/llama.cpp-master/build/bin/libggml-sycl.so
#4  0x00007faece92f509 in ggml_backend_sycl_graph_compute(ggml_backend*, ggml_cgraph*) () from sources/llama.cpp-master/build/bin/libggml-sycl.so
#5  0x00007faece80e437 in ggml_backend_graph_compute () from ./sources/llama.cpp-master/build/bin/libggml-base.so
#6  0x00007faecebfbfda in rpc_server::graph_compute(std::vector<unsigned char, std::allocator<unsigned char> > const&, rpc_msg_graph_compute_rsp&) () from ./sources/llama.cpp-master/build/bin/libggml-rpc.so
#7  0x00007faecebfcaf5 in ggml_backend_rpc_start_server () from ./sources/llama.cpp-master/build/bin/libggml-rpc.so
#8  0x0000000000404f42 in main ()
[Inferior 1 (process 2471814) detached]
[1]    2471814 IOT instruction  bin/rpc-server -p 50052

As you suspected, it is due to the extra field in the tensors. If I set GGML_SYCL_DISABLE_OPT=1, I've confirmed that it works as expected, since the extra field is not used in that case. Even though the patch only breaks models that use Q4_0, I think we should revert this and find a way to enable these optimizations without relying on the extra field.

Edit: Added working logs

❯ bin/rpc-server -p 50052
create_backend: using SYCL backend
Running with Environment Variables:
  GGML_SYCL_DEBUG: 0
  GGML_SYCL_DISABLE_OPT: 1
Build with Macros:
  GGML_SYCL_FORCE_MMQ: no
  GGML_SYCL_F16: yes
Found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|         Intel Data Center GPU Max 1100|  12.60|    448|    1024|   32| 51539M|         1.6.32224+14|
SYCL Optimization Feature:
|ID|        Device Type|Reorder|
|--|-------------------|-------|
| 0| [level_zero:gpu:0]|      Y|
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
Starting RPC server on 127.0.0.1:50052, backend memory: 49152 MB
Accepted client connection, free_mem=51539607552, total_mem=51539607552
Client connection closed
Accepted client connection, free_mem=51539607552, total_mem=51539607552
Client connection closed
Accepted client connection, free_mem=51539607552, total_mem=51539607552
Client connection closed
Accepted client connection, free_mem=51539607552, total_mem=51539607552
Client connection closed
Accepted client connection, free_mem=51539607552, total_mem=51539607552
bin/llama-cli -m ./llama3-med42-8b-q4_0.gguf -p "Hello my name is" --repeat-penalty 1.0 -n 64 --rpc 127.0.0.1:50052 -ngl 99
build: 4767 (08d59862) with gcc (GCC) 12.3.0 for x86_64-pc-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device RPC[127.0.0.1:50052] (RPC[127.0.0.1:50052]) - 49152 MiB free
llama_model_loader: loaded meta data with 31 key-value pairs and 291 tensors from ./llama3-med42-8b-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama

// .. Removed to avoid verbosity

llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 4.33 GiB (4.64 BPW) 
load: special tokens cache size = 256
load: token to piece cache size = 0.8000 MB
print_info: arch             = llama

// ...

load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors: RPC[127.0.0.1:50052] model buffer size =  4155.99 MiB
load_tensors:   CPU_Mapped model buffer size =   281.81 MiB

// ...



system_info: n_threads = 48 (n_threads_batch = 48) / 96 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | AMX_INT8 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: interactive mode on.
sampler seed: 503630248
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 64, n_keep = 1

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

system

Hello my name is


> hello
It seems like you started to introduce yourself, but it got cut off. No worries! If you'd like to continue, you can simply say, "My name is [your name], and..." or "I'm [your name], and..." If you're feeling
> 

@qnixsynapse
Contributor

@Alcpz Thank you so much for taking the initiative to test it.

> As you suspected, it is due to the extra field in the tensors. If I set GGML_SYCL_DISABLE_OPT=1, I've confirmed that it works as expected, since the extra field is not used in that case. Even though the patch only breaks models that use Q4_0, I think we should revert this and find a way to enable these optimizations without relying on the extra field.

Yes indeed.

orca-zhang pushed a commit to orca-zhang/llama.cpp that referenced this pull request Feb 26, 2025
* opt performance by reorder for Intel GPU

* detect hw type and save opt feature, and print opt feature

* correct name

* support optimize graph once when compute graph, record the opt status in tensor->extra, make CI passed

* add env variable GGML_SYCL_DISABLE_OPT for debug

* use syclex::architecture replace the custom hw define, update the guide for GGML_SYCL_DISABLE_OPT

* add performance data

* mv getrows functions to separeted files

* fix global variables

---------

Co-authored-by: arthw <[email protected]>
arthw added a commit to arthw/llama.cpp that referenced this pull request Feb 26, 2025
* opt performance by reorder for Intel GPU

* detect hw type and save opt feature, and print opt feature

* correct name

* support optimize graph once when compute graph, record the opt status in tensor->extra, make CI passed

* add env variable GGML_SYCL_DISABLE_OPT for debug

* use syclex::architecture replace the custom hw define, update the guide for GGML_SYCL_DISABLE_OPT

* add performance data

* mv getrows functions to separeted files

* fix global variables

---------

Co-authored-by: arthw <[email protected]>