RiVER: Reinforcement Learning for LLMs Without Ground-Truth Solutions

Why it matters

RiVER enables LLM training on a wider range of tasks where ground-truth answers are unavailable, such as complex coding problems. This expands the applicability of RL-based training for improving LLM coding abilities, potentially leading to more versatile and capable AI models for developers.

What changed

Researchers have developed a new framework called RiVER (Ranking-induced Verifiable) designed to train Large Language Models (LLMs) using reinforcement learning (RL) without the need for ground-truth solutions. Traditional RL methods for LLMs often rely on verifiable rewards, which require correct answers in order to assign rewards. This restricts their use in scenarios where such ground-truth data is unknown or difficult to obtain. RiVER overcomes this by training LLMs on score-based optimization tasks. It uses deterministic execution feedback, which provides continuous-valued supervision, as a substitute for explicit ground-truth answers.

The framework addresses two key challenges that arise when applying group-relative RL to these continuous rewards: scale dominance, where the magnitude of scores across different test instances can skew policy updates, and frequency dominance, where frequently sampled suboptimal solutions might overshadow rarer but superior candidates. RiVER mitigates these issues with calibrated reward shaping, which uses instance-wise comparisons and prioritizes top-ranked solvers while still incorporating bounded feedback for other valid solutions.
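To make scale dominance and instance-wise comparison concrete, here is a minimal Python sketch. It is not the paper's formulation: the function names, the win-fraction ranking rule, the `cap` parameter, and the toy scores are all assumptions chosen for illustration. It compares group-relative advantages computed from summed raw scores with a per-instance ranking reward, then applies a hypothetical shaping step that gives the top-ranked valid candidate full reward and every other valid candidate a bounded one.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards):
    """Group-relative normalization: subtract the group mean, divide by the group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def raw_rewards(scores):
    """Baseline: sum raw execution scores over instances (sensitive to score scale)."""
    return [sum(row) for row in scores]


def instance_rank_rewards(scores):
    """Illustrative instance-wise comparison (an assumption, not the paper's rule).

    For each instance, a candidate earns the fraction of other candidates it
    beats (ties count half). The result is averaged over instances, so every
    instance contributes equally regardless of its score magnitude.
    """
    n, m = len(scores), len(scores[0])
    rewards = []
    for i in range(n):
        total = 0.0
        for k in range(m):
            wins = sum(1.0 for j in range(n) if j != i and scores[i][k] > scores[j][k])
            ties = sum(0.5 for j in range(n) if j != i and scores[i][k] == scores[j][k])
            total += (wins + ties) / (n - 1)
        rewards.append(total / m)
    return rewards


def shaped_rewards(rank_rewards, valid, cap=0.5):
    """Hypothetical shaping: top-ranked valid candidates get 1.0, other valid
    candidates get a reward bounded by `cap`, invalid candidates get 0.0."""
    valid_values = [r for r, ok in zip(rank_rewards, valid) if ok]
    if not valid_values:
        return [0.0 for _ in rank_rewards]
    best = max(valid_values)
    shaped = []
    for r, ok in zip(rank_rewards, valid):
        if not ok:
            shaped.append(0.0)
        elif r == best:
            shaped.append(1.0)
        else:
            shaped.append(cap * r)  # rank rewards lie in [0, 1], so this is <= cap
    return shaped


# Toy data: scores[i][k] is candidate i's score on instance k.
# Instance 0 has scores near 1e6; instance 1 has scores near 10.
scores = [
    [1_000_100, 1],   # candidate 0: best on the large-scale instance only
    [1_000_000, 10],  # candidate 1: second on instance 0, best on instance 1
    [999_900, 9],     # candidate 2: last on instance 0, second on instance 1
]

raw = raw_rewards(scores)
ranked = instance_rank_rewards(scores)            # [0.5, 0.75, 0.25]
shaped = shaped_rewards(ranked, [True, True, True])  # [0.25, 1.0, 0.125]

print("raw advantages:   ", group_relative_advantages(raw))
print("ranked advantages:", group_relative_advantages(ranked))
print("shaped rewards:   ", shaped)
```

With summed raw scores, candidate 0 receives the largest advantage because the large-scale instance swamps the small one. With per-instance ranking, candidate 1 comes out ahead because it performs well on both instances. The sketch does not model frequency dominance.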

RiVER was trained on 12 AtCoder Heuristic Contest tasks and then evaluated on several benchmarks, including the Algorithm Engineering Benchmark (ALE-Bench), LiveCodeBench, and USACO. It improved the ALE rating rank of Qwen3-8B and GLM-Z1-9B-0414 by 8.9% and 9.4%, respectively. Notably, although RiVER was trained exclusively on score-based tasks without any ground-truth solutions, it also improved both models on exact-solution benchmarks such as LiveCodeBench and USACO, with absolute average gains of 2.4% and 3.5%, respectively. In contrast, baseline models trained on raw execution scores improved on ALE rating but did not transfer effectively to exact-solution benchmarks. This suggests that score-based optimization tasks, when paired with appropriate reward calibration such as RiVER's, can serve as effective training environments for developing general coding abilities in LLMs, even in the absence of ground-truth solutions.

Why it matters for builders

RiVER's ability to train LLMs without ground-truth solutions is a significant advancement for AI builders. It opens up new avenues for improving LLM capabilities in domains where obtaining perfect, verifiable answers is impractical or impossible. This includes many real-world coding challenges and complex problem-solving tasks. By leveraging score-based feedback, developers can more easily fine-tune models for specific applications, potentially leading to more robust and adaptable AI coding assistants and tools. The framework's success in improving performance on exact-solution benchmarks, despite being trained on score-based tasks, indicates a promising path towards developing more generalized coding intelligence in LLMs.

Practical impact

For developers working with LLMs, RiVER offers a more flexible and accessible training paradigm. It reduces the dependency on curated datasets with ground-truth labels, which can be expensive and time-consuming to create. This makes it feasible to train or fine-tune models for niche programming languages, specialized algorithms, or proprietary codebases where ground-truth data is scarce. The framework's demonstrated improvements on established benchmarks like ALE-Bench, LiveCodeBench, and USACO suggest that RiVER-trained models could offer enhanced performance in competitive programming environments, automated code generation, and debugging tools. The ability to transfer learning from score-based tasks to exact-solution tasks implies that RiVER can contribute to building LLMs with a more comprehensive understanding of code quality and correctness, beyond simply matching a predefined answer.

Caveats and source limits

This summary is based on a single source, an arXiv preprint. While the findings are promising, they are preliminary and have not yet undergone formal peer review. The reported performance gains (e.g., 8.9% and 9.4% improvements) are tied to the specific models (Qwen3-8B and GLM-Z1-9B-0414) and benchmarks (ALE-Bench, LiveCodeBench, USACO) used in the study. Generalizability to other LLMs or different types of tasks would require further investigation. The paper focuses on coding tasks, and its applicability to other domains where ground-truth solutions are absent but score-based feedback might be available is not explicitly detailed. The exact implementation details of the calibrated reward shaping and its sensitivity to different scoring mechanisms are also not fully elaborated in the provided excerpt.

Featured on AI Radar