Paper
Escaping the Verifier: Learning to Reason via Demonstrations
arXiv:2511.21667v4 Announce Type: replace Abstract: Training Large Language Models (LLMs) to reason often relies on Reinforcement Learning (RL) with task-specific verifiers. However, many real-world reasoning-intensive tasks lack verifiers, despite offering abundant expert demonstrations that remain under-utilized for reasoning-focused training. We introduce RARO (Relativistic Adversarial Reasoning Optimization), which learns strong reasoning capabilities from expert demonstrations alone via Inverse Reinforcement Learning. RARO sets up an adversarial game between a policy and a relativistic c…
Authors:
Topics
Relevant entities
People
Linked people will appear here.
Related coverage
Linked coverage will appear here.
Related events
Linked events will appear here.
Related discussions
Related discussion nodes will appear here.