llama : change fallback type IQ4_NL -> Q4_0 #8489
Conversation
This will result in (possibly significantly) higher ppl; eventually somebody will notice and waste a lot of time trying to figure out why.
@slaren hmm I think it's worth it for the increased compatibility. It would be easy to find an unaffected model to do an apples-to-apples comparison when evaluating perplexity. We did the same thing in #2148, so if it's a concern we could tag on a similar warning during the quantization process when the fallback is triggered, or even add such a warning when calculating perplexity.
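A rough illustration of what such a quantize-time warning could look like (a hypothetical sketch, not existing llama.cpp code; the helper name is made up, and only ggml_type_name is a real API):

```cpp
// Hypothetical helper: log when a tensor's row size forces a fallback type, so
// users know the resulting mix differs from the requested preset and that
// perplexity comparisons against a "pure" quant may be skewed.
#include "ggml.h"
#include <cstdio>

static enum ggml_type warn_and_fallback(const char * tensor_name,
                                        enum ggml_type wanted,
                                        enum ggml_type fallback) {
    fprintf(stderr,
            "WARNING: %s: row size is incompatible with %s, falling back to %s "
            "(perplexity may differ from a model without fallback tensors)\n",
            tensor_name, ggml_type_name(wanted), ggml_type_name(fallback));
    return fallback;
}
```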
I am not opposed to the change, but it is not a minor change and it should be made much clearer (it's not even in the PR title).
Tested on https://huggingface.co/bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF
Builds and quantizes successfully, producing a Q3_K_S model with Q4_0 fallback tensors.
However, for some reason it still did not work on Vulkan, perhaps due to an unrelated issue: GGML_ASSERT: ggml/src/ggml-vulkan.cpp:3446: n_as <= 8. Pinging @0cc4m if they wish to take a look. Could it be MoE-related?
Running in llama-cli, I can observe the tensor mix is indeed changed from
llama_model_loader: - type f32: 108 tensors
llama_model_loader: - type q3_K: 241 tensors
llama_model_loader: - type q6_K: 1 tensors
llama_model_loader: - type iq4_nl: 27 tensors
to
llama_model_loader: - type f32: 108 tensors
llama_model_loader: - type q4_0: 27 tensors
llama_model_loader: - type q3_K: 241 tensors
llama_model_loader: - type q6_K: 1 tensors
Removing that assert and generating with a very short prompt produces coherent output, so I can confirm that the new type mix is working, though perhaps this model is not the best to test with due to unrelated blockers.
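As a hedged aside (not part of this PR), per-type tensor counts like the ones above can also be reproduced straight from the GGUF file with the gguf C API (declared in ggml.h at this point in the tree), without loading the model; a minimal sketch:

```cpp
// Minimal sketch: count tensors per quantization type in a GGUF file, mirroring
// the "llama_model_loader: - type ..." summary.
#include "ggml.h"
#include <cstdio>
#include <map>
#include <string>

int main(int argc, char ** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }

    struct gguf_init_params params = { /*.no_alloc =*/ true, /*.ctx =*/ nullptr };
    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (!ctx) { fprintf(stderr, "failed to open %s\n", argv[1]); return 1; }

    std::map<std::string, int> counts;
    for (int i = 0; i < gguf_get_n_tensors(ctx); ++i) {
        counts[ggml_type_name(gguf_get_tensor_type(ctx, i))]++;
    }
    for (const auto & kv : counts) {
        printf("- type %8s: %4d tensors\n", kv.first.c_str(), kv.second);
    }

    gguf_free(ctx);
    return 0;
}
```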
I don't have the system resources required to run Qwen 72B myself, but if anyone knows of another model that uses quantized fallbacks, I'll be happy to test that!
@LostRuins There's a bunch of very tiny models with weird sizes, one of which is https://huggingface.co/delphi-suite/v0-llama2-12.8m (12.8M, yes, millions), with an embedding size of 384 (not a multiple of 256). If you were asking for bigger models, there's https://huggingface.co/1bitLLM/bitnet_b1_58-3B (3B), which has an embedding size of 3200 (not a multiple of 256). For a cursed model there's also https://huggingface.co/1bitLLM/bitnet_b1_58-xl (1.3B), which has an FFN size of 5460, which is not even a multiple of 32. (There are also tiny cursed models in https://huggingface.co/delphi-suite, but by necessity (small size) rather than accidentally.)
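To make the divisibility point concrete, here is a quick standalone check of the row sizes mentioned above (a sketch, not llama.cpp code): K- and IQ-quants need rows divisible by 256 (QK_K), while block-32 types such as Q4_0, Q5_0 and IQ4_NL only need divisibility by 32.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // 384: delphi v0-llama2-12.8m embedding; 3200: bitnet 3B embedding; 5460: bitnet 1.3B FFN
    const int64_t sizes[] = { 384, 3200, 5460 };
    for (int64_t n : sizes) {
        printf("%4lld: %% 256 -> %3lld (%s for K-quants), %% 32 -> %2lld (%s for block-32 types)\n",
               (long long) n,
               (long long) (n % 256), n % 256 == 0 ? "ok" : "needs fallback",
               (long long) (n % 32),  n % 32  == 0 ? "ok" : "not usable either");
    }
    return 0;
}
```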
I moved the styling changes into a separate PR (#8502) and updated this one to contain just the type change.
I am not a big fan of this change.
@sorasoras May I ask why? I doubt the perplexity of an already heavily compressed q2k model changes all that much when a single tensor falls back from iq4_nl to q4_0. Also remember that the current fallback for larger quants like q4_k_s is simply good old q5_0, and nobody has an issue with that.
If q4_0 is objectionable, then simply fall back to q5_0 like the larger quants do; the size difference will be a few percent at most anyway.
Even in Q2_K_S, falling back to IQ4_NL vs Q4_0 can have an outsized influence on PPL. Falling back to Q4_0 should be the default behavior, but only if there is an option to enable the IQ4_NL fallback. It's a kind of special case, but it does affect PPL quite a bit for translation with Qwen1 14B.
I would also suggest checking the assumption that this change will not increase the perplexity significantly. As far as I know, the main reason iq4_nl exists is precisely to be used as a fallback quant when the row size is not a multiple of 256. If this use case is removed, iq4_nl may as well be removed entirely. |
@slaren I would think that IQ4_NL would be the superior option when combined with other I-quant mixes and distributed as an I-quant model, but ideally it should keep within its own category - for the same reason we avoid taking K-quanted tensors and mixing them into a q5_0 or q4_1 model. I have not done my own perplexity tests, but @sorasoras did highlight a change from Q2_K_S 5.1283 to 5.1090 with IQ4_NL on Qwen, or about a 0.3% difference - to me that represents a mostly negligible change, but those who find it important can continue using the dedicated I-quanted model types.
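For reference, the relative difference between those two figures (assuming 5.1283 is the Q2_K_S PPL without the IQ4_NL fallback and 5.1090 the one with it) works out to

$$\frac{5.1283 - 5.1090}{5.1283} \approx 0.0038 \approx 0.4\%,$$

i.e. a few tenths of a percent either way.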
I don't see why that is ideal. I understand that the main motivation for the request to make this change is to work around the issue of the Vulkan backend not supporting IQ quants. I am not sure that we should be making these changes based on the limitations of some backends. Quoting ikawrakow in the PR where he introduced IQ4_NL:
I am not opposed to the existence of IQ4_NL quants, but then the resulting format should not be considered a K-quant. When people do
Note that
You are making a decision to increase Q2_K_S PPL on a lot of backends just to make a few backends usable.
If Vulkan really is the only backend that lacks iq4_nl support and it's really only one quant that's needed, it might be better to put in the effort to implement it there. I haven't looked at IQ quants yet because I don't have that much time and new quants are tedious to implement.
Anyway, please don't take this the wrong way - I am simply highlighting the feedback other users and I have observed when using various community-created/quantized models in the wild. I will support whatever decision gg makes.
Good points on both sides. Ultimately, if you're going to damage more-popular backends, I think it makes more sense to isolate this change to Vulkan, no?
This can't be isolated to a single backend, because it affects new models that get quantized and shared. The Vulkan backend supports k-quants, but it does not support the latest q2_k and q3_k (?) models because they use a single iq4_nl tensor. That's what this is about.
Admittedly, I might not understand exactly what's going on here, but I hear this sort of statement a lot around here ("can't"), and that's rarely true. I think it should be replaced with "I haven't figured out how to yet" (see the recent situation involving breaking stuff due to renaming - there were ways to do it without breaking things, they just didn't think of them). Considering this is proposing damaging more-popular backends to benefit the less-popular, I think it's worth brainstorming ways to do what you say can't be done. Again, I admit this is above my paygrade, but I have a hard time believing it's impossible. For example, why can't one convert the tensor to Q4_0 if they want to run Vulkan? Also, why can't Vulkan run IQ4_NL in the first place? Is it something llama.cpp can implement/fix/adapt/whatever? Just brainstorming.
Obsoleted by #8613 |
#7805 (comment)
Fall back to Q4_0 instead of IQ4_NL, due to lack of implementation of the latter in some backends.
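For context, a minimal sketch of what the proposed fallback selection amounts to (simplified and paraphrased; not the literal diff, and only the cases discussed in this thread are shown):

```cpp
#include "ggml.h"

// Simplified sketch of the fallback choice for tensors whose row size is not a
// multiple of QK_K (256). With this PR the substitute for the small K-quant
// mixes becomes Q4_0 instead of IQ4_NL.
static enum ggml_type fallback_type(enum ggml_type wanted, int64_t n_per_row) {
    if (n_per_row % 256 == 0) {
        return wanted; // row size compatible with K-/IQ-quants: keep the requested type
    }
    switch (wanted) {
        case GGML_TYPE_Q2_K:
        case GGML_TYPE_Q3_K:
            return GGML_TYPE_Q4_0; // previously GGML_TYPE_IQ4_NL
        case GGML_TYPE_Q4_K:
            return GGML_TYPE_Q5_0; // larger quants already fall back to block-32 types
        default:
            return wanted;
    }
}
```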