Paper
Policy-Conditioned Counterfactual Credit for Verifiable Reinforcement Learning of Long-Horizon Language Agents
arXiv:2606.05263v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards improves reasoning and tool use, yet long-horizon language agents still learn unsupported evidence chains, belief drift, and shortcut actions that satisfy terminal checks. Existing process rewards are mostly correlational: they reward retrieval-, reflection-, or verification-like steps without estimating whether the step contributes to final verified success under a specified intervention. We propose CVT-RL, a constrained policy-gradient algorithm with dense verifiable rewards, intervention-validity…
Authors:
Topics
Relevant entities
People
Linked people will appear here.
Related coverage
Linked coverage will appear here.
Related events
Linked events will appear here.
Related discussions
Related discussion nodes will appear here.