
llama : change fallback type IQ4_NL -> Q4_0 #8489

Closed

Conversation

ggerganov
Member

#7805 (comment)

Fall back to Q4_0 instead of IQ4_NL due to the lack of an implementation of the latter in some backends.
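For context, a minimal sketch of the kind of fallback path this change touches. It is illustrative only: the `quant_fallback` helper is made up, the constant 256 stands in for ggml's QK_K block size, and only the cases discussed in this thread are shown.

```cpp
// Illustrative sketch, not the actual llama.cpp tensor-type selection code.
// When a tensor's row size is not a multiple of the k-quant block size
// (QK_K == 256), the chosen k-quant cannot be used and a compatible type
// is substituted instead.
#include "ggml.h"
#include <cstdint>

static enum ggml_type quant_fallback(enum ggml_type chosen, int64_t n_per_row) {
    if (n_per_row % 256 == 0) {
        return chosen;             // k-quant blocks fit, no fallback needed
    }
    switch (chosen) {
        case GGML_TYPE_Q2_K:
        case GGML_TYPE_Q3_K:
            return GGML_TYPE_Q4_0; // this PR: previously GGML_TYPE_IQ4_NL
        case GGML_TYPE_Q4_K:
            return GGML_TYPE_Q5_0; // existing fallback, quoted later in the thread
        default:
            return chosen;         // other cases omitted from this sketch
    }
}
```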

@ggerganov ggerganov requested a review from LostRuins July 15, 2024 07:28
@slaren
Member

slaren commented Jul 15, 2024

This will result in (possibly significantly) higher ppl; eventually somebody will notice and waste a lot of time trying to figure out why.

@LostRuins
Collaborator

LostRuins commented Jul 15, 2024

@slaren hmm I think it's worth it for the increased compatibility. It would be easy to find an unaffected model to do an apples-to-apples comparison when evaluating perplexity.

We did the same thing in #2148 , so if it's a concern we could tag on a similar warning during the quantization process when the fallback is triggered, or even add such a warning when calculating perplexity.
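For illustration, a hedged sketch of what such a quantize-time warning could look like; the helper name and message wording are hypothetical and not existing llama.cpp code (only ggml_type_name() is a real ggml function).

```cpp
// Hypothetical warning emitted when an incompatible row size forces a type fallback.
#include <cstdio>
#include <cstdint>
#include "ggml.h"

static void warn_on_fallback(const char * tensor_name, int64_t n_per_row,
                             enum ggml_type requested, enum ggml_type used) {
    if (requested != used) {
        fprintf(stderr,
                "WARNING: %s: row size %lld is not a multiple of 256, "
                "falling back from %s to %s\n",
                tensor_name, (long long) n_per_row,
                ggml_type_name(requested), ggml_type_name(used));
    }
}
```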

@slaren
Member

slaren commented Jul 15, 2024

I am not opposed to the change, but it is not a minor change and should be much more clear (it's not even in the PR title).

Collaborator

@LostRuins LostRuins left a comment


Tested on https://huggingface.co/bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF

Builds and quantizes successfully, outputting a Q3_K_S model with Q4_0 fallback tensors.

However, for some reason it still did not work on Vulkan, perhaps due to an unrelated issue: GGML_ASSERT: ggml/src/ggml-vulkan.cpp:3446: n_as <= 8, pinging @0cc4m if they wish to take a look. May be MoE related?

Running in llama-cli, I can observe the tensor mix is indeed changed from

llama_model_loader: - type  f32:  108 tensors
llama_model_loader: - type q3_K:  241 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq4_nl:   27 tensors

to

llama_model_loader: - type  f32:  108 tensors
llama_model_loader: - type q4_0:   27 tensors
llama_model_loader: - type q3_K:  241 tensors
llama_model_loader: - type q6_K:    1 tensors

Removing that Assert and generating with a very short prompt provides coherent output, so I can confirm that the new type mix is working, though perhaps this model is not the best to test with due to unrelated blockers.

I don't have the system resources required to run Qwen 72B myself, but if anyone knows of another model that uses quantized fallbacks I'll be happy to test that!

@compilade
Collaborator

> I don't have the system resources required to run Qwen 72B myself, but if anyone knows of another model that uses quantized fallbacks I'll be happy to test that!

@LostRuins There's a bunch of very tiny models with weird sizes, one of which is https://huggingface.co/delphi-suite/v0-llama2-12.8m (12.8M, yes millions), with an embedding size of 384 (not a multiple of 256).

If you were asking for bigger models, there's https://huggingface.co/1bitLLM/bitnet_b1_58-3B (3B) which has an embedding size of 3200 (not a multiple of 256).

(for a cursed model there's also https://huggingface.co/1bitLLM/bitnet_b1_58-xl (1.3B) which has an FFN size of 5460, which is not even a multiple of 32 (there are also tiny cursed models in https://huggingface.co/delphi-suite, but by necessity (small size) instead of accidentally))
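As a quick sanity check on why those sizes hit the fallback path (plain arithmetic on the numbers quoted above):

```cpp
// k-quants need rows divisible by 256; 384 and 3200 both leave a remainder of 128,
// and 5460 is not even divisible by the 32-wide legacy quant blocks.
static_assert(384  % 256 != 0, "delphi v0-llama2-12.8m embedding size");
static_assert(3200 % 256 != 0, "bitnet_b1_58-3B embedding size");
static_assert(5460 % 32  != 0, "bitnet_b1_58-xl FFN size");
```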

@ggerganov ggerganov force-pushed the gg/quantize-fallback branch from d92f6b4 to f6ea7a0 on July 16, 2024 07:01
@ggerganov ggerganov changed the title from "llama : valign + remove unused ftype" to "llama : change fallback type IQ4_NL -> Q4_0" Jul 16, 2024
@ggerganov
Member Author

I moved the styling changes into a separate PR (#8502) and updated this one to contain just the type change.

@sorasoras

I am not a huge fan of this change.
I think this should also add an option to allow the use of IQ4_NL instead of Q4_0 as the fallback.

@LostRuins
Collaborator

@sorasoras may I ask why? I doubt the perplexity of an already heavily compressed q2k model is all that different with one tensor falling back from iq4_nl to q4_0.

Also remember that the current fallback for larger quants like q4_k_s is simply good old q5_0... and nobody has an issue with that.

case GGML_TYPE_Q4_K:   new_type = GGML_TYPE_Q5_0

If you're opposed to q4_0, then simply fall back to q5_0 like the larger quants do; the size difference will be under a few percent at most anyway.

@sorasoras

sorasoras commented Jul 16, 2024

> @sorasoras may I ask why? I doubt the perplexity of an already heavily compressed q2k model is all that different with one tensor falling back from iq4_nl to q4_0.
>
> Also remember that the current fallback for larger quants like q4_k_s is simply good old q5_0... and nobody has an issue with that.
>
> case GGML_TYPE_Q4_K:   new_type = GGML_TYPE_Q5_0
>
> If you're opposed to q4_0, then simply fall back to q5_0 like the larger quants do; the size difference will be under a few percent at most.

Even in Q2_K_S, falling back to IQ4_NL vs Q4_0 could have an outsized influence on PPL. Making it Q5_0 would make the model larger.

Fallback to Q4_0 should be the default behavior, but only with an option to enable the IQ4_NL fallback.
Q5_0 is a lot bigger than Q4_0, and especially than IQ4_NL.
Otherwise, I have to maintain a separate repo. Q4_0 is a lot worse than IQ4_NL in my use case.

It's a kind of special case, but it does affect PPL quite a bit for translation with Qwen1 14B:
Q2KS 5.1283 +/- 0.04957 old
Q2KS 5.1090 +/- 0.04962 new with IQ4NL
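For reference, the relative change between those two measurements works out to

$$\frac{5.1283 - 5.1090}{5.1283} \approx 0.0038,$$

i.e. roughly 0.4%, which is also smaller than the quoted +/- 0.05 uncertainty on each value.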

@slaren
Member

slaren commented Jul 17, 2024

I would also suggest checking the assumption that this change will not increase the perplexity significantly. As far as I know, the main reason iq4_nl exists is precisely to be used as a fallback quant when the row size is not a multiple of 256. If this use case is removed, iq4_nl may as well be removed entirely.

@LostRuins
Collaborator

@slaren I would think that IQ4_NL would be the superior option when combined with other I-Quant mixes and distributed as an I-quant model, but ideally it should stay within its own category - for the same reason we avoid taking K-quanted tensors and mixing them into a q5_0 or q4_1 model.

I have not done my own perplexity tests but @sorasoras did highlight a change from Q2KS 5.1283 to Q2KS 5.1090 with IQ4NL on Qwen, or about 0.3% difference - to me that represents a mostly negligible change, but for those who find that important they can continue using the dedicated i-quanted model types.

@slaren
Member

slaren commented Jul 17, 2024

I don't see why that is ideal. I understand that the main motivation for the request to make this change is to work around the issue of the Vulkan backend not supporting IQ quants. I am not sure that we should be making these changes based on the limitations of some backends.

Quoting ikawrakow in the PR where he introduced IQ4_NL:

> The main purpose of this PR is to provide a 4-bit quantization type that can be used when k- and i-quants that use blocks of 256 are not available (because the number of columns in some tensors are not a multiple of 256).

@LostRuins
Collaborator

LostRuins commented Jul 17, 2024

I am not opposed to the existence of IQ4_NL quants, but then the resulting format should not be considered a K-Quant.

When people run quantize and select Q2_K or Q3_K_S, the output should be K-Quant compatible and work on K-Quant compatible backends. If we want to utilize an IQ4_NL type, by all means, but then the output should be labelled as IQX_X instead and it should be made clear in the quantization options that you are getting an I-Quant.

@compilade
Collaborator

> for the same reason we avoid taking K-quanted tensors and mixing them into a q5_0 or q4_1 model.

Note that q6_K is still used for output.weight when making a q4_0, q4_1 or q5_0 model, when possible.

https://github.com/ggerganov/llama.cpp/blob/5e116e8dd51775f8f1c090570be148d5d7eea6c3/src/llama.cpp#L17863-L17864
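Paraphrasing the referenced lines as a sketch (the helper name is made up and the real conditions in llama.cpp are more involved; this only illustrates that the output tensor gets a higher-precision type even in legacy-quant models):

```cpp
// Illustrative paraphrase, not the verbatim llama.cpp code: output.weight is
// upgraded to q6_K even when the rest of the model is q4_0/q4_1/q5_0,
// provided its row size fits q6_K's 256-wide blocks.
#include "ggml.h"
#include <cstdint>

static enum ggml_type output_weight_type(int64_t n_per_row) {
    if (n_per_row % 256 != 0) {
        return GGML_TYPE_Q8_0;  // assumption: a block-32 high-precision fallback
    }
    return GGML_TYPE_Q6_K;      // mixed into legacy-quant models too, per the link above
}
```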

@sorasoras

> I am not opposed to the existence of IQ4_NL quants, but then the resulting format should not be considered a K-Quant.
>
> When people run quantize and select Q2_K or Q3_K_S, the output should be K-Quant compatible and work on K-Quant compatible backends. If we want to utilize an IQ4_NL type, by all means, but then the output should be labelled as IQ2_S instead and it should be made clear in the quantization options that you are getting an I-Quant.

You are making a decision to increase Q2_K_S PPL for a lot of backends just to make a few backends usable.
I think you might want to add an option to exclude I-quants during quantization (call it a compatibility mode), or default to K-quants only with an option to include an IQ-quant mix.
Ideally, it should allow the user to choose a different quant mix to reflect different backend needs.
Q2_K_S and IQ2_S are very different in that Q2_K_S doesn't need an imatrix whereas IQ2_S does.
Q2_K_S with IQ4_NL is a lot more potent in my translation tests than the old variant.

@0cc4m
Collaborator

0cc4m commented Jul 17, 2024

> I would also suggest checking the assumption that this change will not increase the perplexity significantly. As far as I know, the main reason iq4_nl exists is precisely to be used as a fallback quant when the row size is not a multiple of 256. If this use case is removed, iq4_nl may as well be removed entirely.

If Vulkan really is the only backend that lacks iq4_nl support and it's really only one quant that's needed, it might be better to put in the effort to implement it there. I haven't looked at IQ quants yet because I don't have that much time and new quants are tedious to implement.

@LostRuins
Collaborator

Anyway please don't take this the wrong way - I am simply highlighting the feedback other users and myself observed, when using various community created/quantized models in the wild. I will support whatever decision gg makes.

@oldgithubman

> Anyway please don't take this the wrong way - I am simply highlighting the feedback other users and myself observed, when using various community created/quantized models in the wild. I will support whatever decision gg makes.

Good points on both sides. Ultimately, if you're going to damage more-popular backends, I think it makes more sense to isolate this change to Vulkan, no?

@0cc4m
Collaborator

0cc4m commented Jul 17, 2024

> > Anyway please don't take this the wrong way - I am simply highlighting the feedback other users and myself observed, when using various community created/quantized models in the wild. I will support whatever decision gg makes.
>
> Good points on both sides. Ultimately, if you're going to damage more-popular backends, I think it makes more sense to isolate this change to Vulkan, no?

This can't be isolated to a single backend, because it affects new models that get quantized and shared. The Vulkan backend supports k-quants, but it does not support the latest q2_k and q3_k (?) models because they use a single iq4_nl tensor. That's what this is about.

@oldgithubman

oldgithubman commented Jul 17, 2024

> > > Anyway please don't take this the wrong way - I am simply highlighting the feedback other users and myself observed, when using various community created/quantized models in the wild. I will support whatever decision gg makes.
> >
> > Good points on both sides. Ultimately, if you're going to damage more-popular backends, I think it makes more sense to isolate this change to Vulkan, no?
>
> This can't be isolated to a single backend

Admittedly, I might not understand exactly what's going on here, but I hear this sort of statement a lot around here ("can't"), and it's rarely true. I think it should be replaced with "I haven't figured out how to yet" (see the recent situation involving breaking things due to renaming: there were ways to do it without breaking anything; they just didn't think of them). Considering this proposes damaging more-popular backends to benefit a less-popular one, I think it's worth brainstorming ways to do what you say can't be done. Again, I admit this is above my pay grade, but I have a hard time believing it's impossible. For example, why can't one convert the tensor to Q4_0 if they want to run Vulkan? Also, why can't Vulkan run IQ4_NL in the first place? Is it something llama.cpp can implement/fix/adapt/whatever? Just brainstorming.

@mofosyne mofosyne added the Review Complexity : Low label Jul 19, 2024
@0cc4m 0cc4m mentioned this pull request Jul 21, 2024
@ggerganov
Member Author

Obsoleted by #8613

@ggerganov ggerganov closed this Jul 26, 2024