MathArena is a platform for evaluating LLMs on the latest math competitions and olympiads, hosted at matharena.ai. This repository contains all the code used to evaluate models on these competitions. This README explains how to run your own models or how to add a new competition.
MathArena uses uv to manage dependencies. Install it with:

- macOS and Linux:

  ```bash
  curl -LsSf https://astral.sh/uv/install.sh | sh
  ```

- Windows:

  ```powershell
  powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
  ```
Create a configuration file in the `configs/` folder. Each config must include:

- Required parameters:
  - `model`: Model name. The reasoning effort of OpenAI models can be set by appending `--[low/medium/high]` to the model name, e.g., `o3-mini--high`.
  - `api`: API provider. Supported options, with their corresponding API keys:
    - openai: `OPENAI_API_KEY`
    - anthropic: `ANTHROPIC_API_KEY`
    - together: `TOGETHER_API_KEY`
    - google: `GOOGLE_API_KEY`
    - deepseek: `DEEPSEEK_API_KEY`
    - openrouter: `OPENROUTER_API_KEY`
    - vllm: runs locally; no API key required
  - `human_readable_id`: A unique, descriptive identifier.
- Optional parameters:
  - API settings such as `temperature`, `top_p`, and `top_k` (default: `temperature` is taken from the competition config; see Adding a Competition).
  - `max_tokens`: Maximum number of tokens for the model (default: from the competition config; see Adding a Competition).
  - `concurrent_requests`: Number of parallel requests to the API (default: 30).
  - `timeout`: Request timeout in seconds (default: 500).
  - `max_retries`: Number of retry attempts to the API (default: 50).
  - `read_cost` & `write_cost`: Cost per million input and output tokens in USD (default: 1 each).
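For illustration, a config might look like the following. The field values here are assumptions chosen for the example, not defaults shipped with the repository:

```yaml
# Hypothetical example: configs/openai/gpt-4o.yaml
model: gpt-4o
api: openai
human_readable_id: gpt-4o
# Optional overrides (values are illustrative):
temperature: 0.6
max_tokens: 8000
concurrent_requests: 30
read_cost: 2.5   # USD per million input tokens
write_cost: 10   # USD per million output tokens
```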
Execute the following command to evaluate a model on a competition:

```bash
uv run python scripts/run.py --configs path/to/your/config --comp path/to/competition
```

- `path/to/your/config`: Relative path from the `configs/` folder to the model configuration.
- `path/to/competition`: Relative path from the `data/` folder to the competition folder.

Example:

```bash
uv run python scripts/run.py --configs openai/gpt-4o.yaml --comp aime/aime_2025_I
```
Additional flags:

- `skip_existing`: Skip problems already processed by this model.
- `n`: Number of runs per problem (default: 4).
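For example, assuming these flags follow the same `--flag` convention as `--configs` and `--comp`:

```bash
uv run python scripts/run.py --configs openai/gpt-4o.yaml --comp aime/aime_2025_I --skip_existing --n 8
```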
If using a local model with vllm, start the server first:

```bash
vllm serve [[model_name]] --dtype auto --api-key token-abc123
```
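For instance, with one concrete (illustrative) model name; any model served by vllm works the same way:

```bash
vllm serve Qwen/Qwen2.5-7B-Instruct --dtype auto --api-key token-abc123
```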
Adding a competition can be done in several quick steps.
In the `data/` folder, create a new directory for your competition with the following structure:

- Problems:
  - Create a subfolder `problems/` and add each problem as a separate LaTeX file named `1.tex`, `2.tex`, ..., `{k}.tex`, where `k` is the number of problems in your competition. You can skip a problem if you want or need to.
- Answers:
  - Add an `answers.csv` file with columns `id` and `answer` (see the example after this list).
    - `id`: The problem filename (without the `.tex` extension).
    - `answer`: The integer answer.
- Competition config:
  - Create a `config.yaml` (see the example after this list) with:
    - `instruction`: Instructions for the model. They must require the final answer to be given in `\boxed{}` (for correct parsing).
    - `default_temperature`: Default temperature for runs.
    - `default_max_tokens`: Default maximum number of tokens.
    - `strict_parsing`: `true` for strict format matching (e.g., only `\boxed{43}` is accepted) or `false` for lenient parsing.
    - `n_problems`: Total number of problems.
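For illustration, the two files could look as follows. The concrete values (answers, instruction wording, temperature, token limit) are assumptions made for the example:

```csv
id,answer
1,70
2,588
3,16
```

```yaml
# Hypothetical example: data/my_competition/config.yaml
instruction: "Solve the following problem. Give your final answer inside \\boxed{}."
default_temperature: 0.6
default_max_tokens: 8000
strict_parsing: false
n_problems: 3
```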
Ensure your LaTeX problems compile correctly:

```bash
uv run python scripts/check_latex.py --comp path/to/competition
```

Then build `latex/main.tex` to generate a PDF and confirm that all problems appear as expected.
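For example, one way to build it with `pdflatex` (any standard LaTeX toolchain, such as `latexmk`, works as well):

```bash
cd latex && pdflatex main.tex
```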
To run multiple models (possibly across different APIs), use:

```bash
uv run python scripts/run_multiple.py --apis openai google anthropic together --comp path/to/competition
```

This runs models from the same API sequentially and models from different APIs concurrently. Options:

- `--simul`: Run all models in parallel, even if they use the same API.
- `models`: Space-separated regex patterns used to filter models. A model is only run if it matches at least one of the regexes.
- `skip_existing`: Skip problems already processed by the model.
- `n`: Number of runs per problem (default: 4).
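For example, assuming `models` and `skip_existing` follow the same `--flag` convention as `--simul`:

```bash
uv run python scripts/run_multiple.py --apis openai anthropic --comp aime/aime_2025_I --models "gpt.*" "claude.*" --skip_existing
```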
Note: For local vllm usage, ensure the vllm server is running as described above. Logs can be found in the `logs/` folder.
Launch a local web server to inspect the results:

```bash
uv run python scripts/app.py --comp path/to/competition
```

Access the app at http://localhost:5001/. Warning signs on solutions indicate a potential problem with the model run and should be verified manually. Every warning is caused by one of the following problems:

- 💀: The parser threw an error or encountered something unexpected.
- ⚠️: The correct answer might be present in the model answer, but it was not extracted.
- ❕: The model likely hit the max token limit.
If issues are found, delete the corresponding output file, or fix the parser, and rerun the model with `skip_existing`. If the parser requires a manual override, you can edit `src/matharena/parse_manual.py` and add a key-value pair mapping the model solution to a parseable solution, as sketched below.
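A hypothetical sketch of such a mapping, assuming the overrides live in a plain Python dict (the actual names and structure in `parse_manual.py` may differ):

```python
# src/matharena/parse_manual.py (illustrative sketch)
# Maps a raw model answer that the parser cannot handle to a parseable form.
MANUAL_OVERRIDES = {
    "The final answer is therefore 43 (mod 100).": r"\boxed{43}",
}
```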
You can find logs from our evaluation, containing full reasoning traces (if available) and the solutions produced by the models, at the following link: https://files.sri.inf.ethz.ch/matharena/matharena_data.zip.
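For example, to download and extract the archive:

```bash
curl -O https://files.sri.inf.ethz.ch/matharena/matharena_data.zip
unzip matharena_data.zip
```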
To post-process results for inclusion on our website, run:

```bash
bash scripts/website/postprocess.sh path/to/competition
```

Once processed, copy the files in `web_outputs/path/to/competition` to our website repository.
If you use MathArena in your work, please cite it as:

```bibtex
@misc{balunovic_srimatharena_2025,
  title = {MathArena: Evaluating LLMs on Uncontaminated Math Competitions},
  author = {Balunović, Mislav and Dekoninck, Jasper and Petrov, Ivo and Jovanović, Nikola and Vechev, Martin},
  copyright = {MIT},
  url = {https://matharena.ai/},
  publisher = {SRI Lab, ETH Zurich},
  month = feb,
  year = {2025},
}
```