llama : initial Mamba-2 support #9126
base: master
Conversation
* ggml : improve ggml_mul speed when masking recurrent states
* ggml : make the ggml_mul fast broadcast path more consistently formatted
Force-pushed from e9b0d19 to aff9692
Hey @compilade, thanks for implementing this! I tried converting https://huggingface.co/mistralai/Mamba-Codestral-7B-v0.1 using […]
Nevertheless, I successfully converted a Mamba-Codestral model and ran it (remember to select the correct chat template, since the model does not come with one):
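A minimal sketch of such an invocation (hypothetical: the model path and the choice of the built-in `llama2` Mistral-style template are assumptions, not the exact command that was used):

```
# Hypothetical example: pick whichever built-in chat template fits the model.
$ ./build/bin/llama-cli -m /path/to/Mamba-Codestral-7B-v0.1-Q8_0.gguf -cnv --chat-template llama2
```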
The result looks promising, but I have no idea why there are […]
Link to download GGUF: https://huggingface.co/ngxson/codestral-mamba-llamacpp-test/tree/main
The steps I took to convert Mamba-Codestral-7B-v0.1 are the following:
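Roughly, they amount to the standard convert-then-quantize flow (a sketch with placeholder paths, based on the commands shown later in this thread, not necessarily the exact steps used):

```
# Convert the HF checkpoint to an F16 GGUF (paths are placeholders)
$ python3 convert_hf_to_gguf.py --outtype f16 --outfile /tmp/Mamba-Codestral-7B-v0.1-F16.gguf /path/to/Mamba-Codestral-7B-v0.1/
# Optionally quantize, e.g. to Q8_0
$ ./build/bin/llama-quantize /tmp/Mamba-Codestral-7B-v0.1-F16.gguf /tmp/Mamba-Codestral-7B-v0.1-Q8_0.gguf q8_0
```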
I did not have tokenization problems in my tests, maybe because I was using the original SentencePiece tokenizer instead of a BPE tokenizer. There are probably still problems with the SentencePiece tokenizer too, but I think it should be preferred for this model; it should be easier to handle without workarounds. I should change that in […]
The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.
Thanks for the guide! I've successfully converted the original repository to GGUF by following your steps. For the […] I'm wondering if […] (Also cc @Vaibhavs10 since he's the maintainer of gguf-my-repo.)
Hey @compilade / @ngxson - JFYI - the transformers weights are now merged in the main repo: https://huggingface.co/mistralai/Mamba-Codestral-7B-v0.1
If you face any issues with the conversion, could you open an issue on the repo for us to track! 🤗
Any updates on when Codestral Mamba will be supported?
Nice work! Just a note on the ssm_scan kernel performance: a better fused implementation by the flash-linear-attention project provides functionality equivalent to Mamba-2's original kernel (fla-org/flash-linear-attention#49) and runs 2x faster (fla-org/flash-linear-attention#50).
Hi @compilade! I worked on the repo conversion for the transformers-compatible mamba2 version; let us know if you need anything from us to move forward with this PR :)
It sounds like having a simple fallback of expected filenames would be a reasonable thing to include here? I don't know that we want to maintain a ton of different ones, but adding a second layer of fallbacks for alternate filenames doesn't feel arduous.
That's not really a problem anymore (at least for Mamba-Codestral) since the official repo was updated in https://huggingface.co/mistralai/Mamba-Codestral-7B-v0.1/commit/88085f9cdfa832c3aca8a0315a4520cf7558c947 to use more standard names. What is currently blocking this is that the Metal and CUDA kernels for the new `ggml_ssm_scan` still need to be adapted.
Any updates on this?
The max index is 31, so trimming the arguments is necessary.
Whoops, this is needed for the offset in the concatenated output.
This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.
This makes the weight buft detection in src/llama.cpp simpler.

* convert : transpose Mamba-2 A, D and reshape SSM_NORM

  This breaks existing conversions of Mamba-2 models to avoid some reshapes. Not sure if it's a good idea, but it makes the graph slightly cleaner.

* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
Very excited for this PR! Thanks @compilade!!
Hi @compilade, thank you for your impressive implementation. I am building support for Bi-Mamba (https://arxiv.org/abs/2411.11843) on top of your implementation. However, I first tried the mamba2-2.7b model and computed the ppl on the wiki dataset (https://huggingface.co/datasets/ggml-org/ci/blob/main/wikitext-2-raw-v1.zip). The ppl is pretty bad, at more than 3500. So, have you ever tested the performance of your implementation before? My script is:
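Presumably something along these lines (a hypothetical sketch with placeholder paths, not the exact script):

```
$ ./build/bin/llama-perplexity -m /path/to/mamba2-2.7b.gguf -f /path/to/wikitext-2-raw/wiki.test.raw
```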
Bi-Mamba seems amazing!
I did test it when working on it, and it did work, but it has been a while. I will get back to this to find out what has broken.
Sounds like the issue might be related to the state rollback which @compilade previously mentioned.
This could be causing the high perplexity values, since the model has to reprocess previous content and repeat with each generation. It is also the problem I am encountering now; I cannot directly use llama.cpp to evaluate accuracy because the model does not generate EOS.
@Tangshengku Which model exactly is causing you problems? I can't reproduce the problem with a freshly-converted mamba2-370m. Perplexity seems fine (on 8 chunks of wiki.test.raw):

```
$ ./bin/llama-perplexity -m /path/to/mamba2-370M-Q8_0.gguf -f /path/to/wikitext-2-raw/wiki.test.raw --chunks 8
...
[1]10.6303,[2]13.3056,[3]14.2217,[4]14.3134,[5]13.8906,[6]13.9622,[7]14.4598,[8]14.9276,
Final estimate: PPL = 14.9276 +/- 0.92728
```

I've also tried with mamba2-2.7B (Q4_K_M):

```
$ ./bin/llama-perplexity -m /path/to/mamba2-2.7B-Q4_K_M.gguf -f /path/to/wikitext-2-raw/wiki.test.raw --chunks 8
...
[1]7.4471,[2]8.9320,[3]9.1891,[4]9.3914,[5]9.2716,[6]9.4961,[7]9.8894,[8]10.2744,
Final estimate: PPL = 10.2744 +/- 0.59925
```

I'm not sure what could be causing what you've seen. Note that I tested […]. I would suggest trying to re-convert the model with […]. For example, assuming this is run from a checkout of this branch, this results in a Q4_K_M model:

```
$ python3 convert_hf_to_gguf.py --outtype f16 --outfile /somewhere/tmp/mamba2-2.7b-F16.gguf /somewhere/src/mamba2-2.7b/
$ ./build/bin/llama-quantize /somewhere/tmp/mamba2-2.7b-{F16,Q4_K_M}.gguf q4_k_m
```

@Tangshengku Alternatively, the problem may be related to GPU support... All my tests were on a CPU-only build (with AVX and AVX2). If you find out how to fix the problem you've noticed, please do share. Looking forward to helping you make Bi-Mamba work with llama.cpp.
@EthanFS State rollback not being properly handled shouldn't affect perplexity; it's only relevant when partially removing tokens from the context (as opposed to clearing the context, which is handled properly). Partial removal with recurrent models is currently handled by recomputing the context from the beginning, if I recall correctly. That should not affect perplexity, only the efficiency when rolling back.
@compilade Thanks for your explanation.
@EthanFS What I would suggest would be to either use a repetition penalty, or a stop string, or other models which are instruction-trained. For example, to use a stop string, you can use:

```
$ ./bin/llama-cli -m /path/to/mamba2-370m-Q8_0.gguf -n 1024 -p "### Question: What is quantum computing?\n### Answer:" -r "### Question:"
```

I hope this helps!

@Tangshengku Script to prepare the `pytorch_model.bin` of Bi-Mamba into a proper `model.safetensors` for conversion:

```python
import torch
from safetensors.torch import save_file

model = torch.load("pytorch_model.bin", map_location="cpu", weights_only=True, mmap=True)

new_model = {}
for name, data in model.items():
    if ".in_proj." in name or ".out_proj." in name:
        if name.endswith(".weight"):
            # Binarize: keep only the sign of the weight, then apply the per-tensor scale and bias.
            prefix = name.removesuffix(".weight")
            wscale = model[prefix + ".wscale"]
            wbias = model[prefix + ".wbias"]
            data = wscale * torch.sign(data) + wbias
        else:
            # The separate .wscale and .wbias tensors are folded into the weights above.
            continue
    new_model[name] = data
    print(name, data.shape)

save_file(new_model, "model.safetensors")
```

NOTE: This takes around 20 GiB of free RAM to run with the 10 GiB F32 Bi-Mamba 2.7B model.

The important bit is that only the sign of the weights is used along with the scale and bias. Otherwise the model this produces does not have good perplexity. And so with your model (which is using the Mamba-2 architecture) I get good perplexity results:

```
$ ./bin/llama-perplexity -m /path/to/bimamba-2.7B-F16.gguf -f /path/to/wikitext-2-raw/wiki.test.raw --chunks 8
...
llama_model_loader: - kv 0: general.architecture str = mamba2
...
[1]7.9030,[2]9.0506,[3]9.5904,[4]10.9340,[5]10.9741,[6]10.7500,[7]11.0106,[8]10.9961,
Final estimate: PPL = 10.9961 +/- 0.65000
```

Unfortunately, it seems like […]. From having made […]. A 1-bit type with 256-element blocks with a […]
@compilade Hi, thank you for your quick reply! Sorry, I found that my issue is that I accidentally used the llama2 tokenizer instead of the tokenizer used in the original Mamba. After using the correct tokenizer, I can replicate the exact ppl results you provided for both Mamba-2.7B and Bi-Mamba on an M4 Pro CPU.

Instead of computing the w_scale and w_bias during tensor transformation, I compute them during inference on the activations, which is mathematically equivalent to the operation on the binarized weight, like this:

```cpp
// in function llm_build_mamba2()
....
struct ggml_tensor * cur_scale = ggml_mul(ctx, cur, model.layers[il].ssm_in_wscale);
struct ggml_tensor * bias_term = llm_build_lora_mm(lctx, ctx, model.layers[il].ssm_in_wbias, cur);
// {n_embd, d_in_proj} @ {n_embd, n_seq_tokens, n_seqs} => {d_in_proj, n_seq_tokens, n_seqs}
struct ggml_tensor * zxBCdt = llm_build_lora_mm(lctx, ctx, model.layers[il].ssm_in, cur_scale);
zxBCdt = ggml_add(ctx, zxBCdt, bias_term);
....
....
struct ggml_tensor * y_scale = ggml_mul(ctx, y, model.layers[il].ssm_out_wscale);
struct ggml_tensor * bias_term_out = llm_build_lora_mm(lctx, ctx, model.layers[il].ssm_out_wbias, y);
cur = llm_build_lora_mm(lctx, ctx, model.layers[il].ssm_out, y_scale);
cur = ggml_add(ctx, cur, bias_term_out);
```

In fact, I am quite new to llama.cpp, but I guess this operation on activations could be beneficial for actual binary computation (hope so...). If not, we can start over. I am happy to contribute to the later development of the new type and even the GPU support (I just need more time to get familiar with this codebase) :).

@EthanFS Hi, thanks for the help. I agree with @compilade that the model shows the non-stop pattern because the models are not instruction-tuned. You can check the generation examples in Figures 10, 11 and 12 in the appendix of our paper: https://arxiv.org/pdf/2411.11843
Yes, this is in line with the eventual goal of making an appropriate quantization type for binary models. The scale and bias would be applied at runtime, during matmul, without necessarily having to cast it all to F16 before the matmul. That would allow using appropriate integer SIMD on CPU too. With such a binary type, changing the model graphs would not be necessary, which means it would work for Bi-Mamba, and also FBI-LLM and any other binarized model based on a supported model architecture.
This operation would help, and that's why something similar is built into most quantization types in `ggml`.

We all have to start somewhere :) If you want, I can start making a prototype for a binary type in `ggml`. There's also something interesting with binary weights with a scale and bias, because the ideal rounding for […]. It's impressive that Mamba-2 with binarized weights can work (as you've shown with Bi-Mamba). At some point, the states will take more memory than the weights. I wonder how that would affect speed.
Follow-up from #8519 (comment). This should fix #7727 and fix #8519.
I've implemented the fully recurrent mode of Mamba-2, because it's very similar to Mamba-1, and also because it seems like the most appropriate mode for text generation.
This does not implement the sequentially semistructured matrix mode, because I'm not yet sure how the block decomposition would fit within the `batch` and `ubatch` framework of `llama.cpp`, and how the chunk size should be chosen. If the recurrent mode is faster at single-user auto-regressive text generation, then I'm not sure how to keep the graph node structure constant when using the most appropriate technique for the batch size.

If the sequentially semistructured matrix mode is eventually implemented, it should help with prompt processing speed for large prompts.
What to expect
(mostly taken from #8519 (comment))
The state in Mamba-2 is bigger than I thought; Mamba-Codestral-7B-v0.1 takes 263.5 MiB (in `F32`) per sequence (e.g. with `-np 1`), compared to 38 MiB (also in `F32`) for Falcon-Mamba-7B (which is based on Mamba-1). But that remains constant whatever the context size. Mamba-2 is easier to implement efficiently, so the bigger state does not really impede inference speed.
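For reference, that figure is consistent with the published Mamba-Codestral config (assuming 64 layers, `d_inner = 8192`, `d_state = 128`, `n_groups = 8`, `d_conv = 4`, and a conv state of `d_conv - 1` columns over the concatenated `xBC` projection): 64 × 8192 × 128 × 4 bytes = 256 MiB of SSM state, plus 64 × 3 × (8192 + 2 × 8 × 128) × 4 bytes ≈ 7.5 MiB of conv state, i.e. ≈ 263.5 MiB per sequence in `F32`.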
However, a big downside right now with recurrent models in `llama.cpp` is the lack of state rollback (which is implemented through state checkpoints in #7531, but needs to be re-adapted to #8526), so the prompt will be reprocessed a lot if using `llama-server`. I think using `llama-cli` in conversation mode does not have this problem, however (or maybe only the bare interactive mode with `--in-prefix` and `--in-suffix`, not sure).

This initial implementation is CPU-only, but uses SIMD for the SSM scan, so even though the state is bigger than for Mamba-1 models, in my tests the speed of Mamba2-130M is similar or better than Mamba-130M (but still not that fast compared to transformer-based models with an empty context), when both are run on CPU. The speed of Mamba-2 models seems comparable to Transformer-based models when the latter have 2k to 4k tokens in their context.
Summary of changes
- `Mamba2ForCausalLM` is supported by the convert script (including the official Mamba-2 models, and Mamba-Codestral-7B-v0.1).
  - `config.json` needs to contain `"architectures": ["Mamba2ForCausalLM"],` for the convert script to properly detect the architecture.
- Mamba-1 models are treated as having `d_inner` (aka `2 * n_embd`) heads of size 1.
- Mamba-2 support in `ggml_ssm_scan` in `ggml` (`ssm_a` is broadcast).
- `ssm_d` is fused into `ggml_ssm_scan`.
- The SSM scan uses `GGML_SIMD`; this is possible because there is no `expf` in the state update, unlike with Mamba-1.

Other

Here's my favorite quote from Section 3.3 of https://arxiv.org/abs/2405.21060:

TODO

- Rebase on `master` after merging llama : simplify Mamba with advanced batch splits #8526.
- Remove the `GGML_MUL` fast broadcast path because it's not used anymore to mask the states.
- Maybe use a new metadata key instead of `{arch}.ssm.time_step_rank` for the number of heads of Mamba-2, because it's not really the rank of the time step (well, maybe kind of).
- Keep `ssm_d` in `ggml_ssm_scan`?
- Split `ggml_ssm_scan` to separate the implementations for Mamba-1 and Mamba-2, although they do have a lot in common.