
llama : change fallback type IQ4_NL -> Q4_0 #8489

Closed

Conversation

ggerganov
Member

#7805 (comment)

Fall back to Q4_0 instead of IQ4_NL due to the lack of an implementation of the latter in some backends.
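For context, a minimal sketch of the kind of fallback path this change touches. It is illustrative only: the `quant_fallback` helper is made up, the constant 256 stands in for ggml's QK_K block size, and only the cases discussed in this thread are shown.

```cpp
// Illustrative sketch, not the actual llama.cpp tensor-type selection code.
// When a tensor's row size is not a multiple of the k-quant block size
// (QK_K == 256), the chosen k-quant cannot be used and a compatible type
// is substituted instead.
#include "ggml.h"
#include <cstdint>

static enum ggml_type quant_fallback(enum ggml_type chosen, int64_t n_per_row) {
    if (n_per_row % 256 == 0) {
        return chosen;             // k-quant blocks fit, no fallback needed
    }
    switch (chosen) {
        case GGML_TYPE_Q2_K:
        case GGML_TYPE_Q3_K:
            return GGML_TYPE_Q4_0; // this PR: previously GGML_TYPE_IQ4_NL
        case GGML_TYPE_Q4_K:
            return GGML_TYPE_Q5_0; // existing fallback, quoted later in the thread
        default:
            return chosen;         // other cases omitted from this sketch
    }
}
```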

@ggerganov ggerganov requested a review from LostRuins July 15, 2024 07:28
@slaren
Member

slaren commented Jul 15, 2024

This will result in (possibly significantly) higher ppl; eventually somebody will notice and waste a lot of time trying to figure out why.

@LostRuins
Collaborator

LostRuins commented Jul 15, 2024

@slaren hmm I think it's worth it for the increased compatibility. It would be easy to find an unaffected model to do an apples-to-apples comparison when evaluating perplexity.

We did the same thing in #2148 , so if it's a concern we could tag on a similar warning during the quantization process when the fallback is triggered, or even add such a warning when calculating perplexity.
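For illustration, a hedged sketch of what such a quantize-time warning could look like; the helper name and message wording are hypothetical and not existing llama.cpp code (only ggml_type_name() is a real ggml function).

```cpp
// Hypothetical warning emitted when an incompatible row size forces a type fallback.
#include <cstdio>
#include <cstdint>
#include "ggml.h"

static void warn_on_fallback(const char * tensor_name, int64_t n_per_row,
                             enum ggml_type requested, enum ggml_type used) {
    if (requested != used) {
        fprintf(stderr,
                "WARNING: %s: row size %lld is not a multiple of 256, "
                "falling back from %s to %s\n",
                tensor_name, (long long) n_per_row,
                ggml_type_name(requested), ggml_type_name(used));
    }
}
```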

@slaren
Member

slaren commented Jul 15, 2024

I am not opposed to the change, but it is not a minor change and should be much more clear (it's not even in the PR title).

Collaborator

@LostRuins LostRuins left a comment


Tested on https://huggingface.co/bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF

Builds and quantizes successfully, outputting a Q3_K_S model with Q4_0 fallback tensors.

However, for some reason it still did not work on Vulkan, perhaps due to an unrelated issue: GGML_ASSERT: ggml/src/ggml-vulkan.cpp:3446: n_as <= 8, pinging @0cc4m if they wish to take a look. May be MoE related?

Running in llama-cli, I can observe the tensor mix is indeed changed from

llama_model_loader: - type  f32:  108 tensors
llama_model_loader: - type q3_K:  241 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq4_nl:   27 tensors

to

llama_model_loader: - type  f32:  108 tensors
llama_model_loader: - type q4_0:   27 tensors
llama_model_loader: - type q3_K:  241 tensors
llama_model_loader: - type q6_K:    1 tensors

Removing that Assert and generating with a very short prompt provides coherent output, so I can confirm that the new type mix is working, though perhaps this model is not the best to test with due to unrelated blockers.

I don't have the system resources required to run Qwen 72B myself, but if anyone knows of another model that uses quantized fallbacks I'll be happy to test that!

@compilade
Collaborator

> I don't have the system resources required to run Qwen 72B myself, but if anyone knows of another model that uses quantized fallbacks I'll be happy to test that!

@LostRuins There's a bunch of very tiny models with weird sizes, one of which is https://huggingface.co/delphi-suite/v0-llama2-12.8m (12.8M, yes millions), with an embedding size of 384 (not a multiple of 256).

If you were asking for bigger models, there's https://huggingface.co/1bitLLM/bitnet_b1_58-3B (3B) which has an embedding size of 3200 (not a multiple of 256).

(for a cursed model there's also https://huggingface.co/1bitLLM/bitnet_b1_58-xl (1.3B) which has an FFN size of 5460, which is not even a multiple of 32 (there are also tiny cursed models in https://huggingface.co/delphi-suite, but by necessity (small size) instead of accidentally))
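As a quick sanity check on why those sizes hit the fallback path (plain arithmetic on the numbers quoted above):

```cpp
// k-quants need rows divisible by 256; 384 and 3200 both leave a remainder of 128,
// and 5460 is not even divisible by the 32-wide legacy quant blocks.
static_assert(384  % 256 != 0, "delphi v0-llama2-12.8m embedding size");
static_assert(3200 % 256 != 0, "bitnet_b1_58-3B embedding size");
static_assert(5460 % 32  != 0, "bitnet_b1_58-xl FFN size");
```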

@ggerganov ggerganov force-pushed the gg/quantize-fallback branch from d92f6b4 to f6ea7a0 on July 16, 2024 07:01
@ggerganov ggerganov changed the title from "llama : valign + remove unused ftype" to "llama : change fallback type IQ4_NL -> Q4_0" Jul 16, 2024
@ggerganov
Member Author

I moved the styling changes into a separate PR (#8502) and updated this one to contain just the type change.

@sorasoras

I am not a huge fan of this change.
I think this should also add an option to allow the use of IQ4_NL instead of Q4_0 as the fallback.

@LostRuins
Collaborator

@sorasoras may I ask why? I doubt the perplexity of an already heavily compressed q2k model is all that different with one tensor falling back from iq4_nl to q4_0.

Also remember that the current fallback for larger quants like q4_k_s is simply good old q5_0... and nobody has an issue with that.

case GGML_TYPE_Q4_K:   new_type = GGML_TYPE_Q5_0

If you're opposed to q4_0, then simply fall back to q5_0 like the larger quants do; the size difference will be under a few percent at most anyway.

@sorasoras

sorasoras commented Jul 16, 2024

> @sorasoras may I ask why? I doubt the perplexity of an already heavily compressed q2k model is all that different with one tensor falling back from iq4_nl to q4_0.
>
> Also remember that the current fallback for larger quants like q4_k_s is simply good old q5_0... and nobody has an issue with that.
>
> case GGML_TYPE_Q4_K:   new_type = GGML_TYPE_Q5_0
>
> If you're opposed to q4_0, then simply fall back to q5_0 like the larger quants do; the size difference will be under a few percent at most.

Even in Q2_K_S, falling back to IQ4_NL vs Q4_0 could have an outsized influence on PPL. Making it Q5_0 would make the model larger.

Fallback to Q4_0 should be the default behavior, but only with an option to enable the IQ4_NL fallback.
Q5_0 is a lot bigger than Q4_0, and especially than IQ4_NL.
Otherwise, I have to maintain a separate repo. Q4_0 is a lot worse than IQ4_NL in my use case.

It's a kind of special case, but it does affect PPL quite a bit for translation with Qwen1 14B:
Q2KS 5.1283 +/- 0.04957 old
Q2KS 5.1090 +/- 0.04962 new with IQ4NL
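For reference, the relative change between those two measurements works out to

$$\frac{5.1283 - 5.1090}{5.1283} \approx 0.0038,$$

i.e. roughly 0.4%, which is also smaller than the quoted +/- 0.05 uncertainty on each value.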

@slaren
Member

slaren commented Jul 17, 2024

I would also suggest checking the assumption that this change will not increase the perplexity significantly. As far as I know, the main reason iq4_nl exists is precisely to be used as a fallback quant when the row size is not a multiple of 256. If this use case is removed, iq4_nl may as well be removed entirely.

@LostRuins
Collaborator

@slaren I would think that IQ4_NL would be the superior option when combined with other I-Quant mixes and distributed as an I-quant model, but ideally it should stay within its own category - for the same reason we avoid taking K-quanted tensors and mixing them into a q5_0 or q4_1 model.

I have not done my own perplexity tests but @sorasoras did highlight a change from Q2KS 5.1283 to Q2KS 5.1090 with IQ4NL on Qwen, or about 0.3% difference - to me that represents a mostly negligible change, but for those who find that important they can continue using the dedicated i-quanted model types.

@slaren
Member

slaren commented Jul 17, 2024

I don't see why that is ideal. I understand that the main motivation for the request to make this change is to work around the issue of the Vulkan backend not supporting IQ quants. I am not sure that we should be making these changes based on the limitations of some backends.

Quoting ikawrakow in the PR where he introduced IQ4_NL:

> The main purpose of this PR is to provide a 4-bit quantization type that can be used when k- and i-quants that use blocks of 256 are not available (because the number of columns in some tensors are not a multiple of 256).

@LostRuins
Collaborator

LostRuins commented Jul 17, 2024

I am not opposed to the existence of IQ4_NL quants, but then the resulting format should not be considered a K-Quant.

When people run quantize and select Q2_K or Q3_K_S, the output should be K-Quant compatible and work on K-Quant compatible backends. If we want to utilize an IQ4_NL type, by all means, but then the output should be labelled as IQX_X instead and it should be made clear in the quantization options that you are getting an I-Quant.

@compilade
Collaborator

> for the same reason we avoid taking K-quanted tensors and mixing them into a q5_0 or q4_1 model.

Note that q6_K is still used for output.weight when making a q4_0, q4_1 or q5_0 model, when possible.

https://github.com/ggerganov/llama.cpp/blob/5e116e8dd51775f8f1c090570be148d5d7eea6c3/src/llama.cpp#L17863-L17864
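Paraphrasing the referenced lines as a sketch (the helper name is made up and the real conditions in llama.cpp are more involved; this only illustrates that the output tensor gets a higher-precision type even in legacy-quant models):

```cpp
// Illustrative paraphrase, not the verbatim llama.cpp code: output.weight is
// upgraded to q6_K even when the rest of the model is q4_0/q4_1/q5_0,
// provided its row size fits q6_K's 256-wide blocks.
#include "ggml.h"
#include <cstdint>

static enum ggml_type output_weight_type(int64_t n_per_row) {
    if (n_per_row % 256 != 0) {
        return GGML_TYPE_Q8_0;  // assumption: a block-32 high-precision fallback
    }
    return GGML_TYPE_Q6_K;      // mixed into legacy-quant models too, per the link above
}
```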

@sorasoras

> I am not opposed to the existence of IQ4_NL quants, but then the resulting format should not be considered a K-Quant.
>
> When people run quantize and select Q2_K or Q3_K_S, the output should be K-Quant compatible and work on K-Quant compatible backends. If we want to utilize an IQ4_NL type, by all means, but then the output should be labelled as IQ2_S instead and it should be made clear in the quantization options that you are getting an I-Quant.

You are making a decision to increase Q2_K_S PPL for a lot of backends just to make a few backends usable.
I think you might want to add an option to exclude I-quants during quantization (call it a compatibility mode), or default to K-quants only with an option to include an IQ-quant mix.
Ideally, it should allow the user to choose a different quant mix to reflect different backend needs.
Q2_K_S and IQ2_S are very different in that Q2_K_S doesn't need an imatrix whereas IQ2_S does.
Q2_K_S with IQ4_NL is a lot more potent in my translation tests than the old variant.

@0cc4m
Collaborator

0cc4m commented Jul 17, 2024

> I would also suggest checking the assumption that this change will not increase the perplexity significantly. As far as I know, the main reason iq4_nl exists is precisely to be used as a fallback quant when the row size is not a multiple of 256. If this use case is removed, iq4_nl may as well be removed entirely.

If Vulkan really is the only backend that lacks iq4_nl support and it's really only one quant that's needed, it might be better to put in the effort to implement it there. I haven't looked at IQ quants yet because I don't have that much time and new quants are tedious to implement.

@LostRuins
Collaborator

Anyway please don't take this the wrong way - I am simply highlighting the feedback other users and myself observed, when using various community created/quantized models in the wild. I will support whatever decision gg makes.

@oldgithubman

> Anyway please don't take this the wrong way - I am simply highlighting the feedback other users and myself observed, when using various community created/quantized models in the wild. I will support whatever decision gg makes.

Good points on both sides. Ultimately, if you're going to damage more-popular backends, I think it makes more sense to isolate this change to Vulkan, no?

@0cc4m
Collaborator

0cc4m commented Jul 17, 2024

> > Anyway please don't take this the wrong way - I am simply highlighting the feedback other users and myself observed, when using various community created/quantized models in the wild. I will support whatever decision gg makes.
>
> Good points on both sides. Ultimately, if you're going to damage more-popular backends, I think it makes more sense to isolate this change to Vulkan, no?

This can't be isolated to a single backend, because it affects new models that get quantized and shared. The Vulkan backend supports k-quants, but it does not support the latest q2_k and q3_k (?) models because they use a single iq4_nl tensor. That's what this is about.

@oldgithubman

oldgithubman commented Jul 17, 2024

> > > Anyway please don't take this the wrong way - I am simply highlighting the feedback other users and myself observed, when using various community created/quantized models in the wild. I will support whatever decision gg makes.
> >
> > Good points on both sides. Ultimately, if you're going to damage more-popular backends, I think it makes more sense to isolate this change to Vulkan, no?
>
> This can't be isolated to a single backend

Admittedly, I might not understand exactly what's going on here, but I hear this sort of statement a lot around here ("can't"), and it's rarely true. I think it should be replaced with "I haven't figured out how to yet" (see the recent situation involving breaking things due to renaming: there were ways to do it without breaking anything; they just didn't think of them). Considering this proposes damaging more-popular backends to benefit a less-popular one, I think it's worth brainstorming ways to do what you say can't be done. Again, I admit this is above my pay grade, but I have a hard time believing it's impossible. For example, why can't one convert the tensor to Q4_0 if they want to run Vulkan? Also, why can't Vulkan run IQ4_NL in the first place? Is it something llama.cpp can implement/fix/adapt/whatever? Just brainstorming.

@mofosyne mofosyne added the Review Complexity : Low label Jul 19, 2024
@0cc4m 0cc4m mentioned this pull request Jul 21, 2024
@ggerganov
Member Author

Obsoleted by #8613

@ggerganov ggerganov closed this Jul 26, 2024