
Add implementations of Mamba 2 into FLA #39

Merged · 8 commits · Aug 5, 2024

Conversation

DanFosing (Contributor)

Modified version of https://github.com/huggingface/transformers/tree/add_codestral_mamba2/src/transformers/models/mamba2 (huggingface/transformers#32080) to work with FLA.
I haven't tested it yet, but I'm pretty sure it will work. It could probably be made a bit faster by implementing a gated RMSNorm that uses SiLU.
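As a rough, untested sketch of what I mean by a gated RMSNorm with SiLU (just the math, not a fused kernel; the class name here is made up):

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedRMSNormSiLU(nn.Module):
    # Reference (non-fused) gated RMSNorm: y = RMSNorm(x) * SiLU(gate).
    def __init__(self, hidden_size: int, eps: float = 1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
        # normalize in fp32 for stability, then cast back to the input dtype
        dtype = x.dtype
        x = x.float()
        x = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return (self.weight * x.to(dtype)) * F.silu(gate)

A fused Triton kernel would do the normalization and the SiLU gating in one pass instead of materializing the intermediate tensors.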

@yzhangcs (Member) commented Aug 5, 2024

@DanFosing Great job! I'll run some tests soon. Thank you for the contribution.

@DanFosing (Contributor, Author) commented Aug 5, 2024

I think it should work with FLA now.

@DanFosing (Contributor, Author)

Unfortunately, because the upstream pull request into the transformers package is still in progress, there is a possibility that it may not fully work in some specific cases; that's why it needs some testing.

@yzhangcs (Member) commented Aug 5, 2024

@DanFosing Hi, it looks like there are still some errors:

  File "flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 607, in forward
    return self.cuda_kernels_forward(hidden_states, cache_params, cache_position, attention_mask)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 333, in cuda_kernels_forward
    rmsnorm_weight=self.norm.weight,
                   ^^^^^^^^^^^^^^^^
  File "anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'GatedRMSNorm' object has no attribute 'weight'

@yzhangcs (Member) commented Aug 5, 2024

@DanFosing You can check your implementation by running

python -m benchmarks.benchmark_training_throughput --name mamba2

@yzhangcs (Member) commented Aug 5, 2024

Also, it would be better to tidy up your code style with pre-commit so that it follows the PEP 8 guidelines.
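For reference, the usual pre-commit workflow is roughly the following (assuming the repo ships a .pre-commit-config.yaml):

pip install pre-commit
pre-commit install
pre-commit run --all-files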

@yzhangcs (Member) commented Aug 5, 2024

You can refer to https://github.com/sustcsonglin/flash-linear-attention/blob/main/fla/layers/gla.py#L219-L228 for our implementation of the gated norm. I think there is no need to wrap it in another GatedNorm.

yzhangcs merged commit 0eacb4c into fla-org:main on Aug 5, 2024
@yzhangcs (Member) commented Aug 5, 2024

@DanFosing Appreciate your hard work and quick response.

@DanFosing (Contributor, Author)

Are you sure it works? When I tried it on Kaggle I got assertion errors, but Kaggle often has some weird problems, so it may be an issue on their side (I can't test it on my PC right now).

@yzhangcs (Member) commented Aug 5, 2024

Could you paste more detailed info?

@yzhangcs (Member) commented Aug 5, 2024

@DanFosing Here is the output

$ python -m benchmarks.benchmark_training_throughput --name mamba2
Initializing mamba2 model from the config:
Mamba2Config {
  "bos_token_id": 1,
  "chunk_size": 256,
  "conv_kernel": 4,
  "eos_token_id": 2,
  "expand": 2,
  "fuse_cross_entropy": true,
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.1,
  "layer_norm_epsilon": 1e-05,
  "model_type": "mamba2",
  "n_groups": 8,
  "norm_before_gate": true,
  "num_heads": 64,
  "num_hidden_layers": 48,
  "pad_token_id": 0,
  "rescale_prenorm_residual": false,
  "residual_in_fp32": true,
  "rms_norm": true,
  "state_size": 128,
  "tie_word_embeddings": false,
  "time_step_floor": 0.0001,
  "time_step_limit": [
    0.0,
    Infinity
  ],
  "time_step_max": 0.1,
  "time_step_min": 0.001,
  "time_step_rank": 128,
  "transformers_version": "4.43.3",
  "use_bias": false,
  "use_cache": true,
  "use_conv_bias": true,
  "vocab_size": 32000
}

Mamba2ForCausalLM(
  (backbone): Mamba2Model(
    (embeddings): Embedding(32000, 2048)
    (layers): ModuleList(
      (0-47): 48 x Mamba2Block(
        (norm): RMSNorm(2048, eps=1e-05)
        (mixer): Mamba2Mixer(
          (act): SiLU()
          (conv1d): Conv1d(6144, 6144, kernel_size=(4,), stride=(1,), padding=(3,), groups=6144)
          (in_proj): Linear(in_features=2048, out_features=10304, bias=False)
          (norm): FusedRMSNormSwishGate(4096, eps=1e-05)
          (out_proj): Linear(in_features=4096, out_features=2048, bias=False)
        )
      )
    )
    (norm_f): RMSNorm(2048, eps=1e-05)
  )
  (lm_head): Linear(in_features=2048, out_features=32000, bias=False)
)
Number of parameters in total: 1548430336 (1.44GiB)
Allocated memory after initialization: 2.89GiB
Max memory allocated: 42.01GiB: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [01:11<00:00,  4.44s/it]
Thoughput:   32048.92 tokens/s: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:16<00:00,  1.96it/s]

@DanFosing (Contributor, Author)

Turns out it was just a problem with the T4 not supporting some things in the causal_conv1d package; the same issue occurs with Mamba-1 and Mamba-2 from mamba-ssm. I don't remember exactly what it was, but there was a workaround that fixed it by turning off some low-memory mode in PyTorch or something. By the way, torch.compile support was added to Mamba-1, but it requires some changes to MambaCache and the Mamba modeling code; I'll try to implement it when I have some time.

@learning-chip (Contributor) commented Aug 5, 2024

One question on this PR:

From the FLA paper, both Mamba-1 and Mamba-2 can be written in the gated linear attention formulation:

[table from the FLA paper: gated-linear-attention formulations of existing models, including Mamba]

So, the SSM part of Mamba-2 (excluding conv1d, normalization, ...) should permit the same "Chunkwise Parallel Form" as advocated by the FLA paper, no?

[figure from the FLA paper: the chunkwise parallel form]

Then, the modeling_mamba2.py implementation here shouldn't need to import mamba_chunk_scan_combined from the original mamba_ssm repository, right? It could use a custom kernel similar to ops/gla/chunk_fuse.py in this repo.

Otherwise, the modeling_mamba2.py in the current PR only uses the custom RMSNorm/FusedRMSNormSwishGate kernels, while the SSM part is no different from the original mamba_ssm repository. Then the performance will remain largely the same as the original repo, and you cannot tell whether the FLA formulation is faster...
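To make the question concrete, here is a naive PyTorch sketch of the chunkwise parallel form I have in mind (decay/gating omitted for clarity, and the function name is just illustrative; the real FLA kernels are written in Triton and also handle the gates):

import torch

def chunkwise_linear_attention(q, k, v, chunk_size=64):
    # Naive chunkwise-parallel causal linear attention.
    # q, k, v: (batch, heads, seq_len, dim); seq_len must be divisible by chunk_size.
    B, H, T, D = q.shape
    C = chunk_size
    q, k, v = (x.reshape(B, H, T // C, C, D) for x in (q, k, v))
    S = q.new_zeros(B, H, D, D)  # inter-chunk state: running sum of k^T v
    mask = torch.tril(torch.ones(C, C, dtype=torch.bool, device=q.device))
    out = torch.empty_like(v)
    for i in range(T // C):
        qi, ki, vi = q[:, :, i], k[:, :, i], v[:, :, i]
        inter = qi @ S  # contribution of all previous chunks
        intra = (qi @ ki.transpose(-1, -2)).masked_fill(~mask, 0) @ vi  # causal part within the chunk
        out[:, :, i] = inter + intra
        S = S + ki.transpose(-1, -2) @ vi  # update the state for the next chunk
    return out.reshape(B, H, T, D)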

@learning-chip (Contributor) commented Aug 5, 2024

The Mamba-2 blog did mention that the FLA chunkwise-parallel algorithm "turns out to be essentially equivalent to the SSD algorithm specialized to a restricted case":

[excerpt from the Mamba-2 blog post on the equivalence to the chunkwise parallel form]

Will the FLA formulation be essentially identical to mamba_chunk_scan_combined in the original Mamba-2 code, or is there still some chance to improve on the original version?

@DanFosing (Contributor, Author)

Indeed, I think it can be made faster if a kernel similar to chunk_fuse.py is used (maybe it would even be possible to just modify the GLA one, since GLA and Mamba-2 are extremely similar to each other). Unfortunately I'm not familiar with Triton, so I can't really do it myself; I think it would be best if you opened an issue, unless one of the FLA devs replies here. The current implementation is a lot like the Mamba-1 implementation: both mostly follow the original mamba-ssm ones, with some minor speedup from FusedCrossEntropyLoss and the custom RMSNorm, and both are compatible with Hugging Face transformers.

@DanFosing (Contributor, Author)

If you take a look at this paper: https://arxiv.org/pdf/2406.06484, you can also see that, in terms of recurrence and memory read-out, Mamba-2 is very similar to RetNet and linear attention:
[table from arXiv:2406.06484 comparing recurrence and memory read-out across these models]
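Roughly (my paraphrase of the recurrences, not the table verbatim), with a matrix-valued state S_t and the memory read-out o_t = S_t q_t in all three cases:

Linear attention:  S_t = S_{t-1} + v_t k_t^T
RetNet:            S_t = γ S_{t-1} + v_t k_t^T      (fixed scalar decay γ)
Mamba-2:           S_t = γ_t S_{t-1} + v_t k_t^T    (data-dependent scalar decay γ_t)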

@learning-chip (Contributor) commented Aug 6, 2024

I think it would be best if you opened an issue, unless one of the FLA devs replies here.

Will look into it!

Also pinging @yzhangcs for any suggestions -- for example, which existing Triton kernel is the best starting point to re-implement the SSM part of Mamba-2? From the table above, the RetNet kernels should be the closest ones?

The Mamba-2 blog also mentions the close connection between RetNet and Mamba-2 (under their SSD framework):

Prior examples include the original linear attention as well as the recent Retentive Network (RetNet) model [18]. These can be viewed as direct special cases of SSD.

@yzhangcs (Member) commented Aug 6, 2024

@learning-chip Hi, you may refer to simple GLA by @sustcsonglin, which provides data-dependent decay on top of RetNet.
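For intuition, here is a toy recurrent reference of what simple GLA computes (a RetNet-style update with a data-dependent scalar decay per head and step, which is also the form of Mamba-2's SSD recurrence); it is only a correctness reference, not the Triton kernel, and the function name is made up:

import torch

def simple_gla_recurrent(q, k, v, g):
    # q, k, v: (batch, heads, seq_len, dim); g: (batch, heads, seq_len), decay values in (0, 1).
    B, H, T, D = q.shape
    S = q.new_zeros(B, H, D, D)  # matrix-valued state per head
    out = torch.empty_like(v)
    for t in range(T):
        # decay the old state, then accumulate the outer product k_t v_t^T
        S = g[:, :, t, None, None] * S + k[:, :, t, :, None] @ v[:, :, t, None, :]
        out[:, :, t] = (q[:, :, t, None, :] @ S).squeeze(-2)  # read-out o_t = q_t S_t
    return out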
