Jianwei Yang*1† Reuben Tan1† Qianhui Wu1† Ruijie Zheng2‡ Baolin Peng1‡ Yongyuan Liang2‡
Yu Gu1 Mu Cai3 Seonghyeon Ye4 Joel Jang5 Yuquan Deng5 Lars Liden1 Jianfeng Gao1▽
1 Microsoft Research; 2 University of Maryland; 3 University of Wisconsin-Madison
4 KAIST; 5 University of Washington
* Project lead † First authors ‡ Second authors ▽ Leadership
[arXiv Paper] [Project Page] [Model Coming Soon!]
- Digital and Physical Worlds: Magma is the first-ever foundation model for multimodal AI agents, designed to handle complex interactions across both virtual and real environments!
- Versatile Capabilities: As a single model, Magma not only possesses generic image and video understanding ability, but also generates goal-driven visual plans and actions, making it versatile for different agentic tasks!
- State-of-the-art Performance: Magma achieves state-of-the-art performance on various multimodal tasks, including UI navigation, robotics manipulation, and generic image and video understanding, in particular spatial understanding and reasoning!
- Scalable Pretraining Strategy: Magma is designed to learn scalably from unlabeled videos in the wild in addition to existing agentic data, giving it strong generalization ability and making it suitable for real-world applications!
- [2025.02.20] Magma has reached the top spot on Hacker News!
- [2025.02.19] We will be releasing our code, model, and UI navigation demo by the MSR Forum on 02.25 (next Tuesday)!
- [2025.02.18] Our flagship MSR project Magma is released on arXiv!
We will be releasing the following:
- Model inference code
- Model checkpoint
- Comprehensive user guide
- Pretraining code
- Pretraining data
Magma is a foundation model for multimodal AI agents. As the bedrock for multimodal agentic models, it needs strong capabilities both to perceive and ground the multimodal world AND to take goal-driven actions precisely (see the figure above). With this in mind, we are striving for the following goals:
- Verbal and spatial-temporal intelligence: Magma should possess both strong verbal and spatial-temporal intelligence to understand images and videos, ground its actions in its observations, and further translate an external goal into an action plan and its execution.
- Digital and physical worlds: Magma should not be limited to either the digital world (e.g., web navigation) or the physical world (e.g., robotics manipulation), but rather be able to work across both worlds, just like humans do.
To this end, we developed a new pretraining dataset, consisting mostly of unlabeled videos in the wild plus existing annotated agentic data, and a new pretraining framework that unifies the training of all three modalities (text, image, and action), to train a new foundation model for multimodal AI agents, named Magma.
We pursue the goal through two dimensions:
- Large-scale heterogeneous training data: we curate a large amount of data in the wild, including existing multimodal understanding data, UI navigation data, robotics manipulation data, and unlabeled videos in the wild. We also propose a new data collection pipeline for unlabeled in-the-wild videos that is scalable and cost-effective. To obtain useful action supervision from raw videos and robotics trajectories, we meticulously remove camera motion from the videos and then transform the remaining motion into "action" supervision for model training. These provide unique signals for the model to learn cross-modal connections and long-horizon action prediction and planning.
- Universal pretraining objectives: text and action tokens are inherently different, which creates a large gap between them, while visual tokens are continuous. We propose a universal pretraining framework that unifies the training of all three modalities, and we show that this is crucial for the model to learn cross-modal connections. More specifically, we propose Set-of-Mark and Trace-of-Mark as auxiliary pretraining tasks that bridge the different output modalities. In this way, we build strong alignment between the text and action modalities, as well as between the image and action modalities (see the illustrative sketch below).
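To make the idea concrete, below is a minimal sketch of how Set-of-Mark and Trace-of-Mark supervision could be derived from a frame and a set of tracked points. It is an illustration only, not our actual pretraining pipeline: it assumes the point tracks have already been compensated for camera motion, and the mark drawing style, coordinate quantization, and token format are all illustrative assumptions.

# Minimal sketch of Set-of-Mark (SoM) and Trace-of-Mark (ToM) supervision.
# Assumes tracked 2D points are already camera-motion-compensated; the drawing
# style and the quantized token format below are illustrative only.
from PIL import Image, ImageDraw
import numpy as np

def draw_set_of_marks(image, points):
    """Overlay numbered marks on candidate locations (Set-of-Mark)."""
    canvas = image.copy()
    draw = ImageDraw.Draw(canvas)
    for idx, (x, y) in enumerate(points, start=1):
        draw.ellipse([x - 8, y - 8, x + 8, y + 8], outline="red", width=2)
        draw.text((x + 10, y - 10), str(idx), fill="red")
    return canvas

def trace_of_mark_tokens(track, image_size, num_bins=256):
    """Quantize a mark's future trajectory into discrete tokens (Trace-of-Mark)."""
    w, h = image_size
    tokens = []
    for x, y in track:
        bx = int(np.clip(x / w * num_bins, 0, num_bins - 1))
        by = int(np.clip(y / h * num_bins, 0, num_bins - 1))
        tokens.append(f"<{bx},{by}>")  # hypothetical token format
    return tokens

# Toy example: one frame, three marked points, and a short future track for mark 1.
frame = Image.new("RGB", (640, 480), "white")
marks = np.array([[100, 200], [320, 240], [500, 120]])
future_track = np.array([[100, 200], [120, 210], [150, 230]])
marked_frame = draw_set_of_marks(frame, marks)
print(trace_of_mark_tokens(future_track, frame.size))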
- Clone this repo to your local machine:
git clone https://github.com/microsoft/Magma
cd Magma
- Install the dependencies:
conda create -n magma python=3.10 -y
conda activate magma
pip install --upgrade pip
pip install -e .
- Install packages for training:
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
We have uploaded the model to the Hugging Face Hub. You can easily load the model and processor with the following code.
from PIL import Image
import torch
from transformers import AutoModelForCausalLM
from transformers import AutoProcessor
model = AutoModelForCausalLM.from_pretrained("microsoft/Magma-8B", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("microsoft/Magma-8B", trust_remote_code=True)
model.to("cuda")
# Inference
image = Image.open("./assets/images/magma_logo.jpg").convert("RGB")
convs = [
    {"role": "system", "content": "You are agent that can see, talk and act."},
    {"role": "user", "content": "<image_start><image><image_end>\nWhat is the letter on the robot?"},
]
prompt = processor.tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)
inputs = processor(images=[image], texts=prompt, return_tensors="pt")
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
inputs = inputs.to("cuda")
generation_args = {
    "max_new_tokens": 500,
    "temperature": 0.0,
    "do_sample": False,
    "use_cache": True,
    "num_beams": 1,
}
with torch.inference_mode():
    generate_ids = model.generate(**inputs, **generation_args)
generate_ids = generate_ids[:, inputs["input_ids"].shape[-1] :]
response = processor.decode(generate_ids[0], skip_special_tokens=True).strip()
print(response)
If you want to debug our model, we also provide local code for inference. You can run the following code to load the model.
from magma.processing_magma import MagmaProcessor
from magma.modeling_magma import MagmaForCausalLM
model = MagmaForCausalLM.from_pretrained("microsoft/Magma-8B", trust_remote_code=True)
processor = MagmaProcessor.from_pretrained("microsoft/Magma-8B", trust_remote_code=True)
model.to("cuda")
To facilitate the quantitative evaluation of our model, we also provide a model class for lmms-eval. Please refer to lmms-eval-magma for the code.
After installing lmms-eval, copy 'tools/lmms_eval_magma/magma.py' into the 'lmms-eval/lmms_eval/models' folder.
Remember to register our model by modifying the 'lmms-eval/lmms_eval/models/__init__.py' file as follows:
AVAILABLE_MODELS = {
    # many previously registered models
    "magma": Magma,
}
Once everything is ready, you can run the following code to evaluate our model.
sh scripts/lmms_eval_magma.sh
This model is intended for broad research use in English. It takes images and text as inputs and produces textual outputs for the following uses:
- Image/Video-Conditioned Text Generation: The model can generate text (e.g., descriptions, answers) based on the input text and image.
- Visual Planning Capabilities: The model can also produce a visual trace as a future plan to accomplish a task (e.g., move an object from one place to another).
- Agentic Capabilities: The model can also generate UI grounding actions (e.g., click the "search" button) and robotics manipulations (e.g., 7-DoF actions for the robot gripper); see the illustrative snippet after this list.
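As an illustration of the agentic capability above, the snippet below shows how one might query the model for a UI grounding action, reusing the model and processor loaded in the quick-start example. The screenshot path and the question wording are assumptions, and the exact prompt conventions for action outputs may differ from our official demo.

# Hedged sketch: UI grounding query, reusing `model` and `processor` from the
# quick-start snippet above. The screenshot path and prompt are assumptions.
from PIL import Image
import torch

screenshot = Image.open("./assets/images/example_screenshot.png").convert("RGB")
convs = [
    {"role": "system", "content": "You are agent that can see, talk and act."},
    {"role": "user", "content": "<image_start><image><image_end>\nWhich element should be clicked to start a search? Return its location."},
]
prompt = processor.tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)
inputs = processor(images=[screenshot], texts=prompt, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].unsqueeze(0)
inputs["image_sizes"] = inputs["image_sizes"].unsqueeze(0)
inputs = inputs.to("cuda")

with torch.inference_mode():
    out_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False, use_cache=True)
out_ids = out_ids[:, inputs["input_ids"].shape[-1]:]
print(processor.decode(out_ids[0], skip_special_tokens=True).strip())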
Our model is designed only for research purposes and aims at knowledge sharing and accelerating research in multimodal AI, in particular multimodal agentic AI.
The model can be further finetuned for different downstream tasks, such as:
- Image Captioning and QA: We can further finetune this model for image captioning and QA tasks within the multimodal LLM pipeline. Based on our experiments, the finetuned model achieves competitive performance and better spatial understanding and reasoning on these tasks.
- Video Captioning and QA: We can further finetune this model for video captioning and QA tasks within the multimodal LLM pipeline. Based on our experiments, the finetuned model achieves competitive performance and better temporal understanding and reasoning on these tasks.
- UI Navigation: We can finetune this model for specific UI navigation tasks, such as web or mobile navigation, where it achieves superior performance.
- Robotics Manipulation: Our model can be further finetuned for robotics tasks given its general agentic capabilities as a vision-language-action model. After finetuning, it significantly outperforms state-of-the-art models such as OpenVLA on robotics manipulation tasks; a finetuning sketch follows this list.
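As a starting point for such finetuning, the sketch below shows one way to wrap the released checkpoint with parameter-efficient LoRA adapters using the peft library. This is not our official finetuning recipe; the choice of LoRA, the target module names, and the hyperparameters are assumptions to be adapted to your task and hardware.

# Hedged sketch: parameter-efficient finetuning with LoRA via the `peft` library.
# Not the official recipe; module names and hyperparameters are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from peft import LoraConfig, get_peft_model

# Load the released checkpoint (same as in the quick-start snippet).
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Magma-8B", trust_remote_code=True, torch_dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained("microsoft/Magma-8B", trust_remote_code=True)

# Attach LoRA adapters; the target module names below are assumed, not verified
# against the actual Magma architecture.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# From here, plug the wrapped model into a standard training loop or the HF Trainer
# with task-specific data (e.g., UI navigation or robot manipulation trajectories).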
Please note that this model is not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case.
If you use this model in your research, please consider citing:
@misc{yang2025magmafoundationmodelmultimodal,
  title={Magma: A Foundation Model for Multimodal AI Agents},
  author={Jianwei Yang and Reuben Tan and Qianhui Wu and Ruijie Zheng and Baolin Peng and Yongyuan Liang and Yu Gu and Mu Cai and Seonghyeon Ye and Joel Jang and Yuquan Deng and Lars Liden and Jianfeng Gao},
  year={2025},
  eprint={2502.13130},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2502.13130},
}
Our work is supported by Microsoft Research. We thank all the contributors for their efforts in building this project.
Our work is built on top of some amazing open-source projects, including Transformers, LLaVA, OpenVLA, SeeClick, and Mind2Web; a number of awesome open-source datasets, including Ego4D, Epic-Kitchens, Something-Something v2, and Open-X-Embodiment; and a number of evaluation benchmarks, including SimplerEnv and Libero.
This project is licensed under the MIT License. See the LICENSE file for details.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.