Reinforcement Learning with Verifiable Rewards: How “Wrong” Signals Can Still Improve AI Models
Research
Dr Nadine Kroher
Chief Scientific Officer
Reinforcement Learning with Verifiable Rewards (RLVR) is a common method to improve LLM performance by fine-tuning on prompts where the model’s answer can easily be verified as correct or incorrect. For example, math problems can be used in an RL framework where the reward can be as simple as R = 1 for a correct answer and R = 0 for an incorrect one.
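For concreteness, here is a minimal sketch of such a binary verifiable reward, assuming the final answer has already been extracted from the model’s response as a string; the function and argument names are illustrative and not taken from the paper’s code:

```python
# Minimal sketch of a binary verifiable reward for math RLVR.
# Real pipelines normalise answers (e.g. strip LaTeX wrappers) before comparing.
def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Return 1.0 if the extracted final answer matches the reference, else 0.0."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0
```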
The paper “Spurious Rewards: Rethinking Training Signals in RLVR” investigates how far the quality of the reward signal actually matters to the fine-tuning process.
Key Insights from the Study
The astonishing finding here is that certain models can still improve through RLVR if the rewards are noisy or even simply wrong. In other words, an LLM can get better at math by fine-tuning on wrong answers!
This is a very important discovery that may change the way we understand the LLM training process.
Why This Method Stands Out
The authors compared performance after fine-tuning LLMs from the Qwen family on different reward variants using the same data.
The baseline was the straightforward case: The LLM received a reward of 1 if it gave the correct answer and 0 if it gave a wrong answer to the math problem. In this case, fine-tuning boosted the LLM’s performance by 29% on a set of difficult math questions.
They then tested a series of “spurious” reward frameworks which - surprisingly - also improved the LLM’s performance (minimal code sketches of these variants follow the list below):
In the “majority reward” case, the reward was 1 if the LLM’s main answer matched the most common answer among alternative samples drawn from the same LLM. Note that this reward never consults the ground-truth solution. Yet, performance still improved by 27%!
For a random reward (yes, completely random, assigned independently of the model’s answer), the performance gain was still 21.4%.
Perhaps most shockingly, a scheme that rewards common incorrect answers still improved accuracy by 24%.
Finally, a variant that only rewards a specific formatting of the answer (again, unrelated to correctness) yielded a 13.8% improvement.
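To make the differences concrete, here are minimal sketches of these reward variants. The function names, the tie handling in the majority vote, and the \boxed{} check used for the format reward are our own simplifying assumptions, not the authors’ implementation:

```python
import random
import re
from collections import Counter


def majority_reward(main_answer: str, sampled_answers: list[str]) -> float:
    """Reward agreement with the most frequent answer among samples from the same model."""
    # Assumes at least one sampled answer is available.
    most_common_answer, _ = Counter(a.strip() for a in sampled_answers).most_common(1)[0]
    return 1.0 if main_answer.strip() == most_common_answer else 0.0


def random_reward(p: float = 0.5) -> float:
    """Reward 1.0 with probability p, independent of the model's answer."""
    return 1.0 if random.random() < p else 0.0


def incorrect_reward(main_answer: str, incorrect_label: str) -> float:
    """Reward only answers that match a known *incorrect* label."""
    return 1.0 if main_answer.strip() == incorrect_label.strip() else 0.0


def format_reward(response: str) -> float:
    """Reward responses containing a \\boxed{...} expression (assumed format criterion)."""
    return 1.0 if re.search(r"\\boxed\{.+?\}", response) else 0.0
```

None of these variants rewards the model for actually being correct, which is precisely what makes the reported gains so surprising.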
Now here is the catch: this extreme behaviour was only observed for some models of the Qwen family. Llama, for example, showed no improvement when trained on these spurious rewards.
The current hypothesis is that this is related to the “code-style” reasoning Qwen models often use to solve complex math problems. It turns out that fine-tuning, even with spurious rewards, implicitly drove the model to use this technique more frequently, which in turn improved its performance.
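A rough way to probe this hypothesis, under the assumption that “code-style” reasoning shows up as Python snippets in the model’s sampled responses, is simply to track how often responses contain a fenced code block before and after fine-tuning:

```python
# Heuristic: count sampled responses that contain a fenced Python snippet.
# The fence marker is built from chr(96) (a backtick) so it does not appear
# literally inside this code block.
CODE_FENCE = chr(96) * 3 + "python"


def code_reasoning_rate(responses: list[str]) -> float:
    """Fraction of sampled responses that contain a Python code block."""
    if not responses:
        return 0.0
    return sum(CODE_FENCE in r for r in responses) / len(responses)
```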
Why It Matters / Passion Labs’ Take
This paper showed us researchers and practitioners how little we actually know, or can safely assume, about the inner mechanics of parameter optimization and the root causes of model performance. The authors may have found, by accident, a characteristic of the Qwen models that, when reinforced, boosts performance. There might be many more such performance gain opportunities that we haven’t come across yet. In addition, the quality of the training signal, which seems like an obvious success factor, turned out to be rather irrelevant in this specific scenario.
Reference
[1] Shao, Rulin, et al. "Spurious Rewards: Rethinking Training Signals in RLVR." arXiv preprint arXiv:2506.10947 (2025).