DistIL: Reinforcement Learning from Rich Feedback with Distributional DAgger

Researchers have introduced DistIL, a new approach to reinforcement learning that leverages rich feedback beyond simple binary rewards. DistIL uses a distributional variant of the DAgger imitation learning algorithm with a forward cross-entropy objective, which allows for more effective credit assignment and guarantees monotonic policy improvement. This method has shown empirical improvements over traditional RL from verifiable rewards (RLVR) and self-distillation baselines in tasks like scientific reasoning, coding, and complex mathematical problem-solving.


Why it matters

The DistIL approach offers a more robust and efficient way to train reasoning models by utilizing diverse feedback types, such as execution traces and expert corrections, which are often available but underutilized in current RL methods. By ensuring monotonic policy improvement and optimizing for teacher-weighted likelihood of success, DistIL could lead to more reliable and performant AI systems in complex problem-solving domains.

Current reinforcement learning from verifiable rewards (RLVR) often relies on a narrow feedback mechanism: a single bit indicating the correctness of a final answer. However, many real-world scenarios provide richer feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations. The paper "Reinforcement Learning from Rich Feedback with Distributional DAgger" introduces DistIL, an approach designed to effectively utilize this richer feedback.

DistIL employs a distributional variant of the classic DAgger imitation learning algorithm. As in DAgger, the learner accesses an expert on states visited by its own current policy; here the expert supplies a distribution, and the learner is trained against it with a forward cross-entropy objective. This objective supports sequence-level gradient propagation for rich credit assignment, tracing future expert-student disagreements back to earlier decisions.
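To make the idea of training on learner-visited states concrete, here is a minimal sketch of a DAgger-style loop with a forward cross-entropy loss on a toy tabular problem. It is not the paper's implementation: the chain environment, the function names (`rollout`, `forward_ce`, `train`), and the hyperparameters are all assumptions made for illustration. It also treats visited states as fixed when taking gradients, so it does not reproduce the sequence-level gradient propagation described above.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def rollout(logits, rng, horizon=6):
    """Collect the states visited by the *learner* policy (hypothetical chain env)."""
    n_states = logits.shape[0]
    state, visited = 0, []
    for _ in range(horizon):
        visited.append(state)
        probs = softmax(logits[state])
        action = rng.choice(len(probs), p=probs)
        # Assumed toy dynamics: action 1 moves right, action 0 moves left.
        state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    return visited

def forward_ce(expert_probs, logits, states):
    """Forward cross-entropy: -E_s sum_a p_expert(a|s) log pi(a|s)."""
    learner = softmax(logits[states])
    return float(-(expert_probs[states] * np.log(learner)).sum(axis=-1).mean())

def train(expert_probs, n_iters=500, lr=1.0, seed=0):
    """DAgger-style loop: roll out the learner, query the expert, descend on CE."""
    rng = np.random.default_rng(seed)
    logits = np.zeros_like(expert_probs)
    for _ in range(n_iters):
        states = rollout(logits, rng)
        grad = np.zeros_like(logits)
        for s in states:
            # Gradient of forward CE w.r.t. softmax logits is (pi - p_expert).
            grad[s] += softmax(logits[s]) - expert_probs[s]
        logits -= lr * grad / len(states)
    return logits

if __name__ == "__main__":
    expert = np.array([[0.2, 0.8]] * 3)  # assumed expert distribution per state
    learned = train(expert)
    print(softmax(learned))
```

The point of the sketch is the data flow: states come from the learner's own rollouts, while the training target at each of those states is the expert's distribution rather than a single expert action.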

Prior RL methods with self-distillation objectives (e.g., based on reverse KL or Jensen-Shannon divergence) may not guarantee monotonic policy improvement. DistIL's forward cross-entropy objective does, and it also provides guarantees on regret. The approach further optimizes a lower bound on the teacher-weighted likelihood of success, contributing to improved Pass@N scores. Empirical evaluations show that DistIL outperforms RLVR and existing self-distillation baselines across several domains, including scientific reasoning, coding, and challenging mathematical problems.
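The objectives named above can behave quite differently on the same pair of distributions. The following sketch compares forward KL (which equals forward cross-entropy minus the expert's entropy), reverse KL, and Jensen-Shannon divergence on a hand-picked example in which the learner has nearly dropped an action the expert favors. The distributions are assumptions chosen for illustration, and the example only shows how the objectives differ in sensitivity; it says nothing about the paper's monotonic-improvement or regret results.

```python
import numpy as np

def cross_entropy(p, q):
    """Cross-entropy H(p, q) = -sum_a p(a) log q(a)."""
    return float(-np.sum(p * np.log(q)))

def kl(p, q):
    """KL divergence KL(p || q) for strictly positive distributions."""
    return float(np.sum(p * np.log(p / q)))

def js(p, q):
    """Jensen-Shannon divergence, a symmetrized and bounded variant of KL."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Assumed toy distributions: the expert splits mass over two actions,
# while the learner has collapsed onto the first one.
expert = np.array([0.50, 0.49, 0.01])
learner = np.array([0.98, 0.01, 0.01])

forward_kl = kl(expert, learner)   # expert-weighted: penalizes missing expert actions
reverse_kl = kl(learner, expert)   # learner-weighted: tolerant of dropped actions
js_div = js(expert, learner)
expert_entropy = cross_entropy(expert, expert)

print(f"forward KL: {forward_kl:.3f}")
print(f"reverse KL: {reverse_kl:.3f}")
print(f"JS:         {js_div:.3f}")
print(f"forward CE: {cross_entropy(expert, learner):.3f}")
```

Because the forward direction weights the log-probabilities by the expert's distribution, the learner is penalized heavily for assigning almost no probability to an action the expert uses often. The reverse direction weights by the learner's own distribution, so the dropped action contributes little to the loss.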
