The recipe definitely works (as in, I can run it and reach roughly a 60% success rate on GSM8K with a 3B model), but it's somewhat barebones and underoptimized. Here, I want to keep track of the most important features and improvements that I think are missing. I'll probably work through this at some point, at my own pace, but if anyone else wants to contribute, feel free to grab something from this list.
Improvements
Figure out how to move things from dev into the main repository (decide on the final APIs etc.)
A proper eval workflow, so that every once in a while we run a full eval on the test set. Alternatively, a separate evaluation recipe? (note: I already have a working separate eval recipe that follows the same paradigm as GRPO, goes through the full test dataset, computes success/reward, and saves it to a file inside the checkpoint - happy to make it a PR)
Adding proper (unit) tests, compliant with the normal torchtune testing workflow
Adding proper documentation to everything
Refactoring of the GRPO losses (there's also a research-y question of what loss should be used - there's some ambiguity in the paper and reference implementations)
More modular approach to reward computation (probably as a component with the regular OmegaConf setup)
Step-based checkpointing (Implement step based checkpointing #2384) - this is pretty important, since one epoch can be very long, leading to very infrequent checkpoints
Memory profiling and experiments on "controlled" hardware (a recipe tuned to work ~optimally on a node of 8xH100, or on a single H100, or on smaller hardware with e.g. LoRA)
Optimization of the default recipe - maybe we can get a big performance boost e.g. by doing ppo_epochs>1?
Dataset improvements (gsm8k is functional but could use some polish; there are also the MATH and DeepScaleR datasets with a similar structure that could be added)
A single-device version - should be pretty simple, but probably also slow. It might require gradient accumulation to work properly, since I tend to get bad results with small batch sizes (see the sketch after this list).
Try to improve generation speed by using vLLM (or something else)?
Probably more to be found soon
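For the single-device point above, here is a minimal gradient-accumulation sketch. The model, optimizer, and data are placeholders, not the actual recipe; the idea is just that accumulating over accum_steps micro-batches approximates a batch that is accum_steps times larger before each optimizer step.

```python
import torch
import torch.nn.functional as F

# Toy model and optimizer standing in for the policy and its optimizer.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
accum_steps = 8  # effective batch size = micro-batch size * accum_steps

for step in range(32):
    x, y = torch.randn(4, 10), torch.randn(4, 1)   # one small micro-batch
    loss = F.mse_loss(model(x), y) / accum_steps   # scale so the accumulated gradient averages correctly
    loss.backward()                                # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```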
Bugs
Because of course I found a bug right after everything was finalized. There will likely be more, so this subsection might or might not be useful.
Right now, generate_trajectory_batched can crash when the different generations have different lengths. For example, one batch of completions generated the full 512 tokens, but another got truncated at 300 because every sequence hit a stop token. So you have tensors of shapes [16, 512] and [16, 300] and try to concatenate them along the zeroth axis, which fails because the sequence dimensions don't match. The tensors need to be padded to a consistent length first.
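A minimal sketch of the kind of padding fix this needs (the helper name and pad value are illustrative, not the actual recipe code):

```python
import torch
import torch.nn.functional as F

def pad_and_cat(batches: list[torch.Tensor], pad_id: int) -> torch.Tensor:
    """Right-pad each [batch, seq_len] tensor to the longest seq_len, then concat on dim 0."""
    max_len = max(b.shape[1] for b in batches)
    padded = [F.pad(b, (0, max_len - b.shape[1]), value=pad_id) for b in batches]
    return torch.cat(padded, dim=0)

# e.g. one batch ran the full 512 tokens, another stopped early at 300
a = torch.randint(0, 100, (16, 512))
b = torch.randint(0, 100, (16, 300))
out = pad_and_cat([a, b], pad_id=0)  # shape [32, 512]
```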
Very, very rarely, an invalid token is sampled - for example token 128011, which is an undefined special token with the standard config. When we try to decode it for the reward computation, the entire program crashes because tiktoken can't handle the unknown token. This can probably be handled by replacing undefined generated tokens with pad_id or something similar. As to why these tokens are ever sampled: the model probably gives them a very low probability, say 1e-7, but if you sample 1e7 new tokens, chances are it will happen at some point.
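One possible guard, assuming we can build the set of token ids the tokenizer actually knows how to decode (the function and variable names here are hypothetical, not existing recipe code):

```python
import torch

def replace_undecodable(tokens: torch.Tensor, valid_ids: torch.Tensor, pad_id: int) -> torch.Tensor:
    """Swap any sampled id the tokenizer can't decode (e.g. undefined special tokens
    like 128011) for pad_id, so decoding for the reward computation never crashes."""
    mask = torch.isin(tokens, valid_ids)
    return torch.where(mask, tokens, torch.full_like(tokens, pad_id))

# valid_ids would come from the tokenizer: its regular vocab plus the special
# tokens that are actually defined in the config. Placeholder values below.
valid_ids = torch.arange(0, 128_000)
completions = torch.tensor([[1, 5, 128011, 7]])                # 128011 is not decodable
safe = replace_undecodable(completions, valid_ids, pad_id=0)   # -> [[1, 5, 0, 7]]
```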
Note to maintainers - I took the liberty of creating this centralized checklist since I still have all the necessary improvements in my context window. In principle, each bullet point could be a separate issue, but that would probably be a nightmare. We can coordinate the effort around this issue and start adding the improvements, one PR at a time.
Thank you so much for creating this checklist @RedTachyon! It's great to have all these items in one place. Actually a couple of the improvements you listed are horizontal changes we've been wanting to enable across the repo anyways -- I'm thinking specifically of vLLM, eval datasets, and step-based checkpointing. Step-based checkpointing is already in progress and I think some basic eval shouldn't be too hard (I think #2238 was going in that direction and may just need some minor changes). Proper vLLM integration may be a bigger effort, but we are planning to get going on this asap. And thanks @krammnic for already working on a couple of these!