Why it matters
DistIL offers a way to train reasoning models on diverse feedback, such as execution traces and expert corrections, that is often available but underused by current RL methods. Because its objective guarantees monotonic policy improvement and optimizes a lower bound on the teacher-weighted likelihood of success, it could lead to more reliable and better-performing AI systems in complex problem-solving domains.

Current reinforcement learning from verifiable rewards (RLVR) often relies on a narrow feedback mechanism: a single bit indicating the correctness of a final answer. However, many real-world scenarios provide richer feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations. The paper "Reinforcement Learning from Rich Feedback with Distributional DAgger" introduces DistIL, an approach designed to effectively utilize this richer feedback.

DistIL is a distributional variant of the classic DAgger imitation learning algorithm. The learner queries an expert distribution on the states visited by its own current policy and is trained against that distribution with a forward cross-entropy objective. Gradients of this objective propagate at the sequence level, so a future disagreement between expert and student can be traced back to the earlier decisions that led to it, which gives richer credit assignment than a single correctness bit.
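The core idea of scoring the student against an expert distribution on the student's own visited states can be illustrated with a small sketch. This is not the paper's implementation: the state names, the probability tables, and the helper names `forward_cross_entropy` and `on_policy_loss` are all hypothetical, and the sketch leaves out the sequence-level gradient propagation entirely. It only shows the per-state forward cross-entropy averaged over on-policy states.

```python
import math


def forward_cross_entropy(expert_probs, student_probs, eps=1e-12):
    """H(expert, student) = -sum_a expert(a) * log student(a) for one state."""
    return -sum(
        q * math.log(max(p, eps))  # eps guards against log(0)
        for q, p in zip(expert_probs, student_probs)
    )


def on_policy_loss(visited_states, expert, student):
    """Average forward cross-entropy over states the student itself visited."""
    losses = [forward_cross_entropy(expert(s), student(s)) for s in visited_states]
    return sum(losses) / len(losses)


# Hypothetical toy problem: two states, three actions per state.
expert_table = {"s0": [0.7, 0.2, 0.1], "s1": [0.1, 0.1, 0.8]}
student_table = {"s0": [0.4, 0.4, 0.2], "s1": [0.3, 0.3, 0.4]}

# Assumed to come from rolling out the *student* policy (the DAgger-style step).
visited = ["s0", "s1", "s1"]

# Loss of the current student against the expert distribution.
loss = on_policy_loss(visited, expert_table.__getitem__, student_table.__getitem__)

# Loss of a student that matches the expert exactly (the expert's own entropy).
matched = on_policy_loss(visited, expert_table.__getitem__, expert_table.__getitem__)
```

In this toy setting `loss` is larger than `matched`, since on any fixed set of states the forward cross-entropy is smallest when the student distribution equals the expert distribution. In a real training loop the visited states would be resampled from the updated student after each step.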

Prior approaches that combine RL with self-distillation use objectives such as reverse KL or Jensen-Shannon divergence, which may not guarantee monotonic policy improvement. DistIL's forward cross-entropy objective, by contrast, guarantees monotonic policy improvement and comes with regret guarantees. It also optimizes a lower bound on the teacher-weighted likelihood of success, which contributes to improved Pass@N scores. In empirical evaluations, DistIL outperforms RLVR and existing self-distillation baselines across scientific reasoning, coding, and challenging mathematical problems.
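For readers unfamiliar with the objectives named above, the textbook definitions are given below. The notation is an assumption made for illustration and may differ from the paper's: $\pi_\theta$ is the student policy, $q$ is the expert (teacher) distribution, and $d_{\pi_\theta}$ is the distribution of states visited by the student.

```latex
% Forward cross-entropy: expectation under the expert, log of the student
\mathcal{L}_{\mathrm{fwd}}(\theta)
  = \mathbb{E}_{s \sim d_{\pi_\theta}}
    \Big[ -\sum_{a} q(a \mid s)\, \log \pi_\theta(a \mid s) \Big]

% Reverse KL: expectation under the student
\mathcal{L}_{\mathrm{rev}}(\theta)
  = \mathbb{E}_{s \sim d_{\pi_\theta}}
    \Big[ \sum_{a} \pi_\theta(a \mid s)\,
      \log \frac{\pi_\theta(a \mid s)}{q(a \mid s)} \Big]

% Jensen-Shannon divergence, with mixture m = (q + \pi_\theta)/2
\mathrm{JS}(q, \pi_\theta)
  = \tfrac{1}{2}\, \mathrm{KL}(q \,\|\, m)
  + \tfrac{1}{2}\, \mathrm{KL}(\pi_\theta \,\|\, m)
```

The forward objective weights the student's log-probabilities by the expert's distribution, while reverse KL weights by the student's own distribution. That difference in weighting is what separates the two families of objectives compared in the paper.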
