[SYCL] Optimize mul_mat for Q4_0 on Intel GPU #12035
Conversation
@NeoZhangJianyu I was finishing a review when the merge button was clicked. I disagree with some of your approaches for the backend. Please at least answer the comments I am leaving here.
// int nsm; // number of streaming multiprocessors
// size_t smpb; // max. shared memory per block
Since there is no use for this right now, it could be good to start removing these as well.
arch == syclex::architecture::intel_gpu_arl_u ||
arch == syclex::architecture::intel_gpu_arl_s ||
arch == syclex::architecture::intel_gpu_arl_h ||
arch == syclex::architecture::intel_gpu_bmg_g21 ||
I have access to a BMG gpu, I'll reply later with perf numbers, since I guess you'd want to add them to the README.
const dfloat d = (const dfloat)*((const sycl::half*)d_ptr+ib);

const int vui = *((const uint8_t *)qs+iqs);
The overall contribution is great. I was doing similar work for the Q4_K quantization, and this is quite helpful.
#ifdef GGML_SYCL_F16
// v = v - {8.0f, 8.0f};
// v = v * {d, d};
v.s0() = (v.s0() - 8.0f) * d;
v.s1() = (v.s1() - 8.0f) * d;

#else
v.x() = (v.x() - 8.0f) * d;
v.y() = (v.y() - 8.0f) * d;
#endif // GGML_SYCL_F16
The overall contribution is great. A lot of the new code has the same functionality, with very minor differences in how data is accessed. I worry about the combinatorial explosion of having duplicated code in order to maintain support for all non-reordered and reordered quants. Long term, I think it's best for the backend to avoid going in this direction.
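For illustration only, here is a minimal sketch (assumed names, not the PR's code) of one way to keep a single Q4_0 dequantization routine and let the reordered and non-reordered callers differ only in how they compute the scale/quant pointers:

#include <sycl/sycl.hpp>
#include <cstdint>

// Shared Q4_0 dequantization helper (illustrative): callers pass the scale and
// quant pointers for whichever layout they use, so the math is written once.
static inline void dequantize_q4_0_from_ptrs(const sycl::half * d_ptr, const uint8_t * qs,
                                             int ib, int iqs, sycl::float2 & v) {
    const float d   = static_cast<float>(d_ptr[ib]);  // per-block scale
    const int   vui = qs[iqs];                        // two packed 4-bit values
    v.x() = ((vui & 0xF) - 8.0f) * d;                 // low nibble
    v.y() = ((vui >> 4)  - 8.0f) * d;                 // high nibble
}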
const int tid = item_ct1.get_local_id(2);

const int ncols_left = ncols % (QK4_0*WARP_SIZE);
I see multiple references to QK4_0 directly in here instead of using qk. Is this intended?
My understanding is that you chose the block size to distribute work nicely between the threads and then process the non-aligned columns later, but I was expecting it to be based on the templated qk, and not on a specific quantization.
I've seen that below you only have this enabled for Q4_0; I still think this could be generalized.
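To make the generalization point concrete, a tiny sketch (illustrative helper, not the PR's code) of computing the remainder from the templated block size instead of the Q4_0 constant:

// Remainder columns based on the kernel's template parameter, so the same logic
// would work for any quantization with block size qk (illustrative only).
template <int qk, int warp_size>
static inline int leftover_cols(const int ncols) {
    return ncols % (qk * warp_size);   // instead of ncols % (QK4_0 * WARP_SIZE)
}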
@@ -3570,6 +3389,7 @@ static void ggml_sycl_mul_mat(ggml_backend_sycl_context & ctx, const ggml_tensor
ggml_sycl_mul_mat_batched_sycl(ctx, src0, src1, dst);
} else if (use_dequantize_mul_mat_vec) {
ggml_sycl_op_mul_mat(ctx, src0, src1, dst, ggml_sycl_op_dequantize_mul_mat_vec, false);
// save_tensor_txt("1/dst_1.txt", (float*) dst->data, src0->ne[1], sizeof(float), ctx.stream());
// save_tensor_txt("1/dst_1.txt", (float*) dst->data, src0->ne[1], sizeof(float), ctx.stream());
@@ -4251,10 +4071,72 @@ catch (sycl::exception const &exc) {
std::exit(1);
}

void reorder_qw(char *data_device, const int ncols, const int nrows,
size_t size, size_t offset, dpct::queue_ptr stream) {
auto tmp_buf = sycl::malloc_shared<char>(size, *stream);
tmp_buf doesn't seem to be used on the host side. Unless I am missing something, this can be a device memory pointer.
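A rough sketch of what the suggested change could look like inside reorder_qw as quoted above, assuming the scratch buffer is only touched by kernels and queue copies (offset handling and the reorder kernel itself are omitted here):

// Device-only scratch allocation instead of shared USM (sketch, not the PR's code).
char * tmp_buf = sycl::malloc_device<char>(size, *stream);
// copy the original blocks into the scratch buffer, then launch the reorder
// kernel that reads tmp_buf and writes the new layout back into data_device
stream->memcpy(tmp_buf, data_device, size).wait();
// ... reorder kernel submission goes here ...
sycl::free(tmp_buf, *stream);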
GGML_ASSERT((size % sizeof(block_q4_0) == 0));
GGML_ASSERT((offset % sizeof(block_q4_0) == 0));
GGML_ASSERT((size % sizeof(block_q4_0) == 0));
GGML_ASSERT((offset % sizeof(block_q4_0) == 0));
Suggested change:
GGML_ASSERT(size % sizeof(block_q4_0) == 0);
GGML_ASSERT(offset % sizeof(block_q4_0) == 0);
These are also probably nicer at the beginning of the function; there is no point in allocating device memory and copying data if an error is found.
GGML_ASSERT((size % sizeof(block_q4_0) == 0));
GGML_ASSERT((offset % sizeof(block_q4_0) == 0));
int offset_blks = offset / sizeof(block_q4_0);
auto qs_ptr = (uint8_t*)data_device + offset_blks * QK4_0 / 2;;
auto qs_ptr = (uint8_t*)data_device + offset_blks * QK4_0 / 2;;
Suggested change:
auto qs_ptr = (uint8_t*)data_device + offset_blks * QK4_0 / 2;
static ggml_status ggml_backend_sycl_graph_compute(ggml_backend_t backend, ggml_cgraph * cgraph) {
ggml_backend_sycl_context * sycl_ctx = (ggml_backend_sycl_context *)backend->context;
ggml_sycl_set_main_device(sycl_ctx->device);

if (!g_ggml_sycl_disable_optimize) optimize_graph_once(cgraph, sycl_ctx);
According to #8432 (mentioning it because @luoyu-intel was part of the discussion), the suggested approach for supporting different data layouts is to modify both the set and get tensor functions of the backend. If we did this, we would not need this data reorganization. Have you looked into this approach to see if it is possible?
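For reference, a very rough sketch of that alternative, assuming the usual ggml backend buffer interface; the helper functions below are hypothetical, not existing backend code:

#include "ggml-backend.h"   // assumed include for ggml_backend_buffer_t / ggml_tensor

// Hypothetical helpers (not part of the backend):
static bool reorder_supported(const ggml_tensor * tensor);
static void upload_reordered_q4_0(ggml_backend_buffer_t buffer, ggml_tensor * tensor,
                                  const void * data, size_t offset, size_t size);
static void upload_as_is(ggml_backend_buffer_t buffer, ggml_tensor * tensor,
                         const void * data, size_t offset, size_t size);

// Sketch: reorder Q4_0 weights while uploading them, so no graph-time pass is needed.
// get_tensor would apply the inverse transform so host readers (e.g. RPC) still see
// the standard layout.
static void sycl_buffer_set_tensor_sketch(ggml_backend_buffer_t buffer, ggml_tensor * tensor,
                                          const void * data, size_t offset, size_t size) {
    if (tensor->type == GGML_TYPE_Q4_0 && reorder_supported(tensor)) {
        upload_reordered_q4_0(buffer, tensor, data, offset, size);
    } else {
        upload_as_is(buffer, tensor, data, offset, size);
    }
}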
@Alcpz I am worried that this PR will break RPC (since it adds tensor->extra in the init tensor function again). I added a review, but it was never addressed for some reason.
@qnixsynapse, if you can test whether it works or not, I am happy to revert it if needed.
Unfortunately, I do not have multiple PCs with Intel GPUs. But it was previously disabled because of the mentioned reason. Please see #5277 (reply in thread).
I will find out if I can set up something on my side. Thanks for your help!
@qnixsynapse I've confirmed that the RPC server is broken by this build. Thank you for your pointer.
As you suspected, it is due to the extra field in the tensors. If I set … (Edit: Added working logs)
@Alcpz Thank you so much for taking the initiative to test it.
Yes indeed.
* opt performance by reorder for Intel GPU
* detect hw type and save opt feature, and print opt feature
* correct name
* support optimize graph once when compute graph, record the opt status in tensor->extra, make CI passed
* add env variable GGML_SYCL_DISABLE_OPT for debug
* use syclex::architecture replace the custom hw define, update the guide for GGML_SYCL_DISABLE_OPT
* add performance data
* mv getrows functions to separeted files
* fix global variables
---------
Co-authored-by: arthw <[email protected]>
Optimize MUL_MAT Q4_0 on Intel GPU.
The reorder is executed once, during compute graph execution.
It will be shown at startup:
Running with Environment Variables:
It passes the local CI.
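As a side note, a minimal sketch (illustrative helper name, not the PR's code) of how the GGML_SYCL_DISABLE_OPT debug switch from the commit list can be read:

#include <cstdlib>

// Returns true when the user exported GGML_SYCL_DISABLE_OPT=1 (illustrative helper).
static bool sycl_opt_disabled() {
    const char * v = std::getenv("GGML_SYCL_DISABLE_OPT");
    return v != nullptr && std::atoi(v) != 0;
}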
Here is the performance increase on Intel GPUs (dGPUs, and iGPUs from MTL onward):
(Tested with llama-2-7b.Q4_0.gguf)
For iGPUs older than MTL, the optimization does not increase performance, so support for them is skipped for now; more study is needed.
For non-Intel GPUs, the feature detection in the code does not enable this optimization. It is unknown whether the code behaves well on non-Intel GPUs; this could be verified.
I hope this optimization is the seed of further optimization on Intel GPU. This solution is not the best possible one for Intel GPU; there is still huge untapped potential, and more study is needed in the future.
Thanks to @luoyu-intel and @airMeng for contributing to the solution.