Pulling new quantization format Q4_1_O into upstream ggml #89
Do you have an estimate of how much that change affected the perplexity in isolation from the [...]? We are doing some active work in [...]. All this is upstream in [...]. There are additional improvements pending to the [...]. And an additional idea for keeping one of the tensors in full precision is also likely to be added. But this is not related to [...]. With that said, I am hoping after all this work is done to try and see if RWKV still breaks down using the existing formats. There also might be an option of replacing [...].
As a rough estimate, here is RWKV 169M ppl on a small, private dataset:
This is relevant for RWKV and has already been applied -- we decided not to quantize both [...].
I understand, sounds reasonable. There's work to do even before attempting to pull (BTW, it's [...]).
We now support 1D and 2D custom mapping operators in [...]. One option is to implement [...]. These results are very useful.
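For context, a rough sketch of how such a custom mapping operator can be wired up, assuming the ggml_map_unary_f32 API that was added to ggml around this time (the exact callback signature is an assumption and may differ from what rwkv.cpp actually uses):

```c
#include <math.h>
#include "ggml.h"

// Element-wise callback: dst[i] = 1 / (1 + exp(-src[i])).
// Signature follows the (count, dst, src) convention of ggml's
// unary mapping callbacks, to the best of my knowledge.
static void sigmoid_f32(const int n, float * dst, const float * src) {
    for (int i = 0; i < n; i++) {
        dst[i] = 1.0f / (1.0f + expf(-src[i]));
    }
}

// Builds a graph node that applies sigmoid_f32 element-wise to x,
// without adding a new operator to ggml itself.
static struct ggml_tensor * my_sigmoid(struct ggml_context * ctx, struct ggml_tensor * x) {
    return ggml_map_unary_f32(ctx, x, sigmoid_f32);
}
```

This kind of hook is presumably what lets rwkv.cpp implement activations that ggml doesn't provide built-in.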
Already did it, works flawlessly! Thanks @KerfuffleV2. It really saved me time. I will now focus on porting [...].
I'll try to document exactly what setup I use for perplexity measurement, so it's reproducible. Unfortunately, I don't want to run wikitext perplexity tests because they take days, so I do much smaller tests on a single file of ~4K tokens. Not ideal, but I believe it is still representative and good enough.
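For reference, the perplexity number itself is just the exponentiated average negative log-likelihood over the test tokens. A minimal sketch in C, assuming the per-token probabilities have already been collected from the model (how they are collected is model-specific and not shown):

```c
#include <math.h>
#include <stddef.h>

// Perplexity = exp(-(1/N) * sum_i log p(token_i | tokens_<i)).
// probs[i] is the model's probability of the i-th test token given
// its preceding context.
double perplexity(const double * probs, size_t n_tokens) {
    double sum_neg_log_prob = 0.0;
    for (size_t i = 0; i < n_tokens; i++) {
        sum_neg_log_prob += -log(probs[i]);
    }
    return exp(sum_neg_log_prob / (double) n_tokens);
}
```

On a ~4K-token file this is cheap to compute once the forward passes are done, which is why the small test is so much faster than a full wikitext run.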
I've updated [...]. Perplexity for RWKV 169M:
Per-token latency for RWKV 1B5 on Windows with AVX2:
Interestingly, [...]
I'll wait until "Investigate the performance (speed and perplexity) of Q4_0 with 2x F16 factors" is done, and will then pull the changes and redo the measurements.
The [...]
@ggerganov Hi! I have a small question about ggml. When calculating [...]. BTW, [...]
If it is not too much work for you, could you perform the perplexity / speed tests on the new [...]?
ggml version: [...]. Measuring set-up: [...]. Perplexity for RWKV 169M:
Per-token latency for RWKV 1B5 on Windows with AVX2:
I need to do more testing to decide whether [...].
There are now new quantization formats: https://github.com/ggerganov/llama.cpp#quantization. Would be interesting to see how they perform with RWKV.
@ggerganov Great! I guess I'll need to test [...]. Do you plan to add more quantization formats in the near future?
A day without finding a new quantization format just means you forgot to pull the repo. (I actually love the rapid progress and iteration, so don't take that as any kind of complaint.)
I think for the near future we will support these formats. The [...]
Tested Q5 and Q8 formats with the same settings:
I decided to remove [...]. BTW, the only thing for which I still need to fork [...]. @ggerganov If it's not too much work and you have time, could you check the changes and maybe comment on how I can resolve the build issues for which I need these changes? (I can also open a separate issue if that would be more efficient.)
Thank you for the information! I don't have a Windows machine to test on, but I think I have fixed the build issues that you are experiencing.
@ggerganov It works, thanks! Along with [...].
Yes, you probably want to build [...]. Awesome work on [...]!
When developing rwkv.cpp, I've discovered that the existing quantization formats Q4_0 and Q4_1 break RWKV (that is, perplexity becomes 10x higher and the output is garbage). I've documented my observations in this issue. Looks like this is caused both by outliers in weights and by outliers in activations.

To solve this, I've created a new format, Q4_1_O. Commit in rwkv.cpp. Comparisons.

Most important things about the format:

- like Q4_1, but stores the min & delta values in FP16, not FP32
- each block stores a single FP16 value (called "outlier") and its index in the block; all other values are quantized as if there was no outlier
- the dot product is done in FP32, that is, I dequantize the matrix and multiply it by activations already in FP32
- [...] the same as [...]
- 40% slower than FP32 / FP16 (on my machine)
- [...] FP16, but the principle "it's better to use quantized X+1 model than FP16 X model" holds

TL;DR: store a single outlier value per block unquantized; do the dot product in FP32.
Recently, it became clear that my ggml fork and upstream ggml (in llama.cpp / here) began to greatly diverge: "Code difference is getting more between ggml and rwkv.cpp".

I would like to keep interventions in my copy of ggml as small as possible, so I can pull the latest optimizations/fixes without the need to apply all my changes again.

Specifically, I ask: does it sound like the Q4_1_O format belongs to upstream ggml? If so, I can create a PR here.