Policy gradient RL algorithms like GRPO have been used to improve LLMs’ performance on verifiable tasks like math and coding problems.
However, existing implementations of GRPO simply sample a group of completions from the same prompt. On a difficult problem, it is possible that every completion in the group fails. If all completions in the group receive a reward of 0, then all the advantages are 0 and the model gets no learning signal.
Compute is then spent generating new attempts in parallel instead of using the feedback from earlier attempts to generate improved ones.
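To make the failure mode concrete, here is a minimal sketch of the group-relative advantage computation used by GRPO (normalization details vary across implementations): when every reward in the group is zero, every advantage is zero and the policy gradient vanishes.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    # Group-relative advantage: each completion's reward, normalized by the
    # mean and standard deviation of its group.
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))  # [0. 0. 0. 0.] -- no learning signal
print(grpo_advantages([1.0, 0.0, 0.0, 0.0]))  # one positive, three negative entries
```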
We propose SRRL. Let $x$ be the original problem prompt. If we generate a group of completions $y_1, \dots, y_G \sim \pi_\theta(\cdot \mid x)$, we can generate refined completions as follows: for each $y_i$, collect the environment feedback $f_i$ on it (e.g. a failing test case or checker output), build a refinement prompt $x_i'$ from $x$, $y_i$, and $f_i$, and sample a refined completion $y_i' \sim \pi_\theta(\cdot \mid x_i')$.
Then, instead of simply training on the pairs $(x, y_i)$, we train on a group that comprises all of the following pairs: $(x, y_1), \dots, (x, y_G)$ together with $(x, y_1'), \dots, (x, y_G')$.
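The following is a minimal sketch of how such a group might be assembled. The helpers `sample`, `env_feedback`, and `make_refinement_prompt` are hypothetical placeholders for the generation, evaluation, and prompt-construction steps, not part of any existing library or of the SRRL codebase.

```python
def build_srrl_group(x, G, sample, env_feedback, make_refinement_prompt):
    """Assemble an SRRL training group for a single problem prompt x."""
    group = []
    for _ in range(G):
        # First attempt: y_i ~ pi_theta(. | x)
        y = sample(x)
        group.append({"train_prompt": x, "gen_prompt": x, "completion": y})

        # Refinement: show the model its attempt plus environment feedback
        # (e.g. failing tests) and sample y'_i ~ pi_theta(. | x'_i).
        f = env_feedback(x, y)
        x_ref = make_refinement_prompt(x, y, f)
        y_ref = sample(x_ref)

        # The refined completion is trained as an answer to the ORIGINAL
        # prompt x, even though it was generated from x'_i; this prompt
        # mismatch is why importance sampling is needed (see below).
        group.append({"train_prompt": x, "gen_prompt": x_ref, "completion": y_ref})
    return group
```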
For some difficult problems where environment feedback can correct the LLM’s mistakes, this helps augment the reward signal: even if every first attempt fails, a successful refinement gives the group a nonzero spread of rewards and therefore nonzero advantages.
Do note that the refined completions are trained with a different prompt than the one used to generate them, so an importance-sampling correction must be used.
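One way such a correction could look, assuming per-token log-probabilities are available, is a GRPO/PPO-style clipped surrogate where the importance ratio is taken between the training prompt and the prompt that actually generated the tokens. This is a sketch under those assumptions, not the actual SRRL implementation.

```python
import torch

def srrl_clipped_loss(logp_train, logp_gen, advantages, clip_eps=0.2):
    """Clipped policy-gradient loss with a cross-prompt importance ratio.

    logp_train: log pi_theta(y_t | x,    y_<t) -- current policy, training prompt x
    logp_gen:   log pi_old(y_t   | x'_i, y_<t) -- behavior policy and the prompt
                that actually generated the tokens (x'_i for refined completions)
    advantages: group-relative advantages, broadcast over tokens
    """
    ratio = torch.exp(logp_train - logp_gen)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()
```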
We recently implemented this in an experimental codebase, running RL on a 3B model over 60 problems, and obtained a success rate of 66.7% with SRRL vs. 62.7% with GRPO. Do note, however, that the difference is not statistically significant (p = 0.36).
Ibrahim Ahmed