R1-Style distributed GRPO #2326
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2326
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 2ba4a97 with merge base e6cba25.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
I'm extremely interested in this.
- Reorganize some recipes
- Add SFT dataset
- SFT recipe for GSM8k
- Clean up reward function
- New (untested) generation function
- New recipe config
@RedTachyon thanks for the implementation! Please let me know if there's anything you'd like support on; I'm happy to help! For my own use, could you please share how you set up the environment and the commands you use to launch training?
- Manual resharding (?)
- Mostly-working 3B GRPO config
- SFT recipe for gsm8k
Hi everyone, glad to see people interested in contributing to this implementation! I'm happy to say that the core implementation "just works": on a recent run I SFT-trained a base model on GSM8k to teach it the output format, then continued training it with GRPO.

At the moment, everything follows the R1 paper relatively closely: the model is expected to produce a chain of thought and a final answer wrapped in XML-style tags, and reward computation is done by parsing those tags out of the response and checking the answer for an exact match against the ground truth (a failed parse gives 0 reward). A rough sketch of this parsing logic is included below.

The base model by itself struggles with the format (and with knowing when to output <|eos|>), so its performance is just bad. The SFT-trained model follows the format well but is fairly weak at math, so it gets about a 10% success rate on GSM8k. Continuing the training with GRPO, the model climbs up to a ~60% success rate (and can probably go higher if it keeps running for longer).

Caveat: this estimate is based on the training-set questions, without a separate eval, but a majority of the improvement happens before the first epoch finishes.

Which brings me to the list of things that still need to be done; I'll probably move it to the top comment on the PR later.
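To make the reward scheme concrete, here is a minimal sketch of the parse-and-match reward described above. The tag name, helper names, and reward values are illustrative assumptions rather than the exact code in this PR, and a simple regex stands in for proper XML parsing:

```python
import re
from typing import Optional

# Illustrative tag name; the recipe's actual format tags may differ.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(completion: str) -> Optional[str]:
    """Return the text inside the <answer> tags, or None if parsing fails."""
    match = ANSWER_RE.search(completion)
    return match.group(1).strip() if match is not None else None


def answer_reward(completion: str, ground_truth: str) -> float:
    """Exact-match reward: correct, well-formatted answer -> 1.0, everything else -> 0.0."""
    answer = extract_answer(completion)
    if answer is None:
        # A response that can't be parsed (missing or malformed tags) gets zero reward.
        return 0.0
    return 1.0 if answer == ground_truth.strip() else 0.0
```

In practice one might also add a smaller reward component for getting the format right even when the answer is wrong, but the sketch only covers the exact-match-or-zero behaviour described in this comment.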
I'll start going through some of these things, and if someone wants to contribute, please mention it so that we don't duplicate the effort. Later today I'll also push my sbatch files to run the full pipeline so that anyone can run it on Slurm; for single-node experiments, the regular torchtune runner should work fine.

Regarding the organization: it's probably best to keep everything under a shared PR (i.e. this one) and make other intermediate PRs into this branch. That way various people can contribute parts of the GRPO setup, we merge them into this branch, and when it's all done the maintainers can review the complete thing and merge it into main.
@ebsmothers @SalmanMohammadi So I did a refactor and moved all "weird" stuff into /dev. Notably, I also took my changes to the…
@RedTachyon thanks for doing that! At least at first glance this looks much easier for us to land, will give a proper review tomorrow.
@ebsmothers Gentle nudge - I'm hoping to have this merged sooner rather than later, so that I can start iterating on some more experimental stuff without getting into a branching nightmare.
Thanks for your patience @RedTachyon! A few more small comments and I think this is good to go (now that everything is in dev I won't be as pedantic about some of the design considerations). The main thing is to make sure that everything is runnable out of the box. Also, if you're able to share some of the logged metrics (successes, rewards, etc.) from your latest runs, that would be great as well.
```python
self._log_peak_memory_stats = False

self.fsdp_cpu_offload = cfg.get("fsdp_cpu_offload", False)
self._enable_async_checkpointing = cfg.get("enable_async_checkpointing", False)
```
Bumping this comment: I don't believe this is actually used anywhere. If it isn't, can we just remove it?
Co-authored-by: ebsmothers <[email protected]>
Hey @RedTachyon, just wanted to thank you for being so responsive and putting the effort into this. We all appreciate it :)
Happy to contribute, and thanks for all the help in bringing this to a publishable state! I applied the final fixes and launched another run to make sure it still learns; so far it's just about the same, and I don't really expect anything to be different. (EDIT: about 2 hours in, it's going up the same way it was before, so it's most likely all good.)

As for existing metrics: right now I don't have anything very pretty, but just to give a sense of what to expect, here are some wandb graphs (which I unfortunately can't share as actual wandb reports at the moment). There are 8 curves, varying across whether training starts from the base model or from an SFT-initialized model, and across the number of nodes (so effectively the batch size: 2 nodes give an effective batch size of 16, 4 nodes give 32, and 8 nodes give 64). Note that the hardware isn't super consistent, so the time graphs are meant as a very rough estimate.
Thank you so much @RedTachyon! This was a serious PR. Really appreciate your patience through the review process and we're so glad to have GRPO thanks to your efforts!
Co-authored-by: Felipe Mello <[email protected]> Co-authored-by: ebsmothers <[email protected]> Co-authored-by: salman <[email protected]>
Context
What is the purpose of this PR?
After some discussions on another PR and on Discord, this is the current state of my distributed GRPO implementation. I'm still iterating on this, with the immediate priority being to check whether it actually works.
I have some early successes, but it's too soon to proclaim victory. Soon I'll probably also adapt it to a multinode workflow when #2301 is merged (or just snatch some code from there), because RL is sufficiently resource-hungry that single-node training isn't really an option for anything even moderately serious.
Right now the repo/PR is very messy and in a research-y state, aimed at finding something that works. Once it does work, I'll start cleaning it up to meet OSS standards. I'm putting it here to keep track of the diffs and to enable discussion.
The rest will be filled in when possible/relevant:
Changelog
What are the changes made in this PR?
*
Test plan
Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these, just ask and we will happily help. We also have a contributing page for some guidance on contributing.
- run pre-commit hooks and linters (install them first via `pre-commit install`)
- run unit tests via `pytest tests`
- run recipe tests via `pytest tests -m integration_test`
UX
If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example and a tutorial example.