This repository contains a set of tools for reinforcement learning with LLMs in verifiable environments.
For now, it supports the TRL implementation of the GRPO algorithm via a fork (open PR), and requires vLLM for inference.
PyPI coming soon once a couple more features are added, just clone it for now and run:
(uv) pip install -e .
Ensure your wandb
and huggingface-cli
logins are set up (or set report_to=None
in training_args
).
Tested with Python 3.11 and this image. If you encounter version issues, please confirm that you are able to run basic TRL training in your environment before opening an issue. flash-attn
and liger-kernel
are included for performance reasons. Recommended usage is via accelerate
with DeepSpeed ZeRO 3 (example config) but torchrun
works in my tests as well.
# script.py
import verifiers as vf
from trl import GRPOTrainer
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
model, tokenizer = vf.get_model_and_tokenizer(model_name)
vf_env = vf.DoubleCheckEnv(dataset="gsm8k")
trainer = GRPOTrainer(
model=model,
processing_class=tokenizer,
env=vf_env,
reward_funcs=vf_env.get_rubric(),
args=vf.get_default_grpo_config(run_name="doublecheck", num_gpus=1),
train_dataset=vf_env.get_dataset(),
)
trainer.train()
# vf_env.eval(batch_size=32) (coming soon)
See examples
for additional usage examples.
To create your own multi-step environment, inherit from MultiStepEnv
and implement:
def get_dataset(self, **kwargs: Any) -> Dataset:
pass
def get_rubric(self, **kwargs: Any) -> List[RewardFunc]:
pass
def is_completed(self, messages: List[Dict[str, str]], **kwargs: Any) -> bool:
pass
def env_response(self, messages: List[Dict[str, str]], **kwargs: Any) -> Dict[str, str]:
pass
Accelerate:
accelerate launch --config_file /path/to/deepspeed_zero3.yaml --num_processes [N-1] script.py
Torchrun:
torchrun --nproc_per_node=[N-1] script.py
- Environments:
SimpleEnv
,MathEnv
,DoubleCheckEnv
,CodeEnv
- Multi-step code execution in
CodeEnv
- Dataset formatting
- Rubrics for math correctness + response formatting
- Rubrics for code correctness + response formatting
- Defaults for GRPO, model, tokenizer, etc.
There are a number of features we're planning to support in the near future:
- Integrated evals
- TextArena games
- LLM judges
- Claude-generated rubrics
- A range of other environments (suggestions welcome!)
- PPO
- Potential interoperability with other RL libraries (veRL, OpenRLHF, open-instruct, oat, etc.)
Community contributions are appreciated and encouraged!
If you use this code in your research, please cite:
@article{brown2025verifiers,
title={Verifiers: Reinforcement Learning with LLMs in Verifiable Environments},
author={Brown, William},
year={2025}
}