Reinforcement learning (RL) promises to unlock capabilities beyond imitation learning for
Vision-Language-Action (VLA) models, but its requirement for massive real-world interaction
prevents direct deployment on physical robots. Recent work attempts to use learned world
models as simulators for policy optimization, yet closed-loop imagined rollouts inevitably
suffer from hallucination and long-horizon error accumulation. Such errors not only degrade
visual fidelity but also mislead policy optimization by providing unreliable learning
signals.
We propose WoVR, a reliable world-model-based RL framework for post-training
VLA policies. Instead of assuming a faithful world model, WoVR explicitly regulates how RL
interacts with imperfect imagined dynamics. It improves rollout stability through a
controllable action-conditioned video world model, reshapes imagined interaction to reduce
effective error depth via Keyframe-Initialized Rollouts (KIR), and maintains policy-simulator
alignment through World Model-Policy co-evolution.
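To illustrate why initializing imagined rollouts from real keyframes can reduce effective error depth, here is a toy sketch. It is not WoVR's implementation: the 1-D additive dynamics, the constant per-step model bias, and every function name are hypothetical assumptions chosen only to show the compounding effect.

```python
# Toy illustration only: 1-D dynamics s' = s + a, and a hypothetical world
# model that adds a constant bias at every imagined step (an assumption).

def true_rollout(start_state, actions):
    """Ground-truth states after each action under s' = s + a."""
    state, states = start_state, []
    for a in actions:
        state = state + a
        states.append(state)
    return states

def imagined_rollout(start_state, actions, step_error):
    """Imagined states; each step compounds the model's bias."""
    state, states = start_state, []
    for a in actions:
        state = state + a + step_error
        states.append(state)
    return states

def max_error_full(actions, step_error):
    """Worst drift when imagining the whole horizon from the initial state."""
    real = true_rollout(0.0, actions)
    imag = imagined_rollout(0.0, actions, step_error)
    return max(abs(r - i) for r, i in zip(real, imag))

def max_error_keyframe(actions, step_error, keyframe_every):
    """Worst drift when each imagined segment restarts from a real keyframe."""
    real = [0.0] + true_rollout(0.0, actions)  # real[t] is the state before action t
    worst = 0.0
    for k in range(0, len(actions), keyframe_every):
        segment = actions[k:k + keyframe_every]
        imag = imagined_rollout(real[k], segment, step_error)
        truth = real[k + 1:k + 1 + len(segment)]
        worst = max(worst, max(abs(r - i) for r, i in zip(truth, imag)))
    return worst

if __name__ == "__main__":
    actions = [1.0] * 20
    print(max_error_full(actions, 0.1))
    print(max_error_keyframe(actions, 0.1, keyframe_every=5))
```

Under these assumptions, drift grows with the number of consecutive imagined steps, so restarting from a real keyframe every 5 steps bounds the drift by roughly 5 steps' worth of bias rather than 20.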
Extensive experiments demonstrate that WoVR enables stable long-horizon imagined rollouts and
effective policy optimization, achieving superior performance on the LIBERO benchmark and consistent real-world
gains across multiple robotic platforms. These results show that world models can serve as
practical simulators for RL when hallucination is explicitly controlled.