Current reinforcement learning from verifiable rewards (RLVR) often relies on a narrow feedback mechanism: a single bit indicating the correctness of a final answer. However, many real-world scenarios provide richer feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations. The paper "Reinforcement Learning from Rich Feedback with Distributional DAgger" introduces DistIL, an approach designed to effectively utilize this richer feedback.
DistIL is a distributional variant of the classic DAgger imitation learning algorithm. As in DAgger, the learner queries the expert on states visited by its own current policy; here the expert supplies a distribution rather than a single action label, and the learner is trained against it with a forward cross-entropy objective. This objective supports sequence-level gradient propagation for rich credit assignment, tracing future expert-student disagreements back to earlier decisions.
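To make the objective concrete, here is a minimal sketch of a per-state forward cross-entropy loss and its gradient with respect to the student's logits. It is an illustration under stated assumptions, not the paper's implementation: the function names, the toy batch, and the use of NumPy are all hypothetical, and the sketch covers only the per-state loss, not the sequence-level gradient propagation the paper describes.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def forward_cross_entropy(expert_probs, student_logits):
    """Mean over states of -sum_a p_expert(a|s) * log p_student(a|s)."""
    log_q = log_softmax(student_logits)
    return float(-(expert_probs * log_q).sum(axis=-1).mean())

def forward_ce_grad(expert_probs, student_logits):
    """Gradient of the mean loss w.r.t. the student logits.

    For each state this is softmax(logits) - expert_probs,
    divided by the number of states because the loss is a mean.
    """
    q = np.exp(log_softmax(student_logits))
    return (q - expert_probs) / expert_probs.shape[0]

# Toy batch (hypothetical): 2 states, assumed to come from the student's
# own rollouts as in DAgger, with 3 possible actions each.
expert_probs = np.array([[0.7, 0.2, 0.1],
                         [0.1, 0.1, 0.8]])
student_logits = np.zeros((2, 3))  # uniform student policy

loss = forward_cross_entropy(expert_probs, student_logits)
grad = forward_ce_grad(expert_probs, student_logits)
```

With a uniform student the loss equals log(3) regardless of the expert distribution, and the gradient is negative exactly on the actions the expert weights above 1/3, so a gradient step raises the student's probability on those actions.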
Prior approaches that combine RL with self-distillation use objectives such as reverse KL or Jensen-Shannon divergence, which may not guarantee monotonic policy improvement. DistIL's forward cross-entropy objective, by contrast, ensures monotonic policy improvement and comes with regret guarantees. It also optimizes a lower bound on the teacher-weighted likelihood of success, which contributes to improved Pass@N scores. Empirically, DistIL outperforms RLVR and existing self-distillation baselines across various domains, including scientific reasoning, coding, and challenging mathematical problems.
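For readers who want the contrast between the objectives spelled out, the standard textbook definitions are given below. The notation is an assumption made for this summary and may differ from the paper's: $\pi_E$ is the expert (teacher) distribution, $\pi_\theta$ the student policy, and $d_{\pi_\theta}$ the distribution of states visited by the student.

```latex
% Forward cross-entropy: the expectation over actions is taken under the expert.
\mathcal{L}_{\mathrm{CE}}(\theta)
  = \mathbb{E}_{s \sim d_{\pi_\theta}}
    \Big[ -\sum_{a} \pi_E(a \mid s) \, \log \pi_\theta(a \mid s) \Big]

% Reverse KL: the expectation over actions is taken under the student.
\mathcal{L}_{\mathrm{RKL}}(\theta)
  = \mathbb{E}_{s \sim d_{\pi_\theta}}
    \Big[ \sum_{a} \pi_\theta(a \mid s) \,
          \log \frac{\pi_\theta(a \mid s)}{\pi_E(a \mid s)} \Big]
```

The forward form penalizes the student wherever the expert places mass that the student does not cover, while the reverse form only penalizes the student on actions it already samples.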