Project Page Anonymous Submission

WoVR: World Models as Reliable Simulators for Post-Training VLA Policies with RL

Figure: WoVR framework overview. Reliability-aware optimization in imagination: a controllable simulator, hallucination-aware interaction, and policy-simulator co-evolution.

WoVR turns learned world models into reliable simulators for reinforcement learning by explicitly controlling hallucination at the simulator, interaction, and alignment levels.

Real-World Execution Video

OpenVLA-OFT + Franka

Video grid. Columns: SFT Failure Mode 1, SFT Failure Mode 2, WoVR RL Success. Rows: Pick Banana, Pick Bread, Open Drawer.

Visualization of real-world execution before and after WoVR policy optimization.

AgileX + OpenVLA-OFT

Video grid. Columns: SFT Failure Mode 1, SFT Failure Mode 2, WoVR RL Success. Rows: Pick Cube, Pick Tomato, Fold Towel.

AgileX + OpenPI-0.5

Video grid. Columns: SFT Failure Mode 1, SFT Failure Mode 2, WoVR RL Success. Rows: Pick Cube, Pick Tomato, Fold Towel.

Overview

Abstract

Reinforcement learning (RL) promises to unlock capabilities beyond imitation learning for Vision-Language-Action (VLA) models, but its requirement for massive real-world interaction prevents direct deployment on physical robots. Recent work attempts to use learned world models as simulators for policy optimization, yet closed-loop imagined rollouts inevitably suffer from hallucination and long-horizon error accumulation. Such errors not only degrade visual fidelity, but also mislead policy optimization by providing unreliable learning signals.

We propose WoVR, a reliable world-model-based RL framework for post-training VLA policies. Instead of assuming a faithful world model, WoVR explicitly regulates how RL interacts with imperfect imagined dynamics. It improves rollout stability through a controllable action-conditioned video world model, reshapes imagined interaction to reduce effective error depth via Keyframe-Initialized Rollouts (KIR), and maintains policy-simulator alignment through policy-simulator co-evolution (PACE).

Extensive experiments demonstrate that WoVR enables stable long-horizon imagined rollouts and effective policy optimization, achieving superior LIBERO performance and consistent real-world gains across multiple robotic platforms. These results show that world models can serve as practical simulators for RL when hallucination is explicitly controlled.

The central question

If world models inevitably hallucinate under closed-loop rollouts, how can reinforcement learning remain reliable instead of learning to exploit simulator errors?

Method

A Reliability-Driven World-Model RL Pipeline

WoVR regulates reliability at three interconnected levels: simulator-level control, interaction-level reshaping, and alignment-level co-evolution.

Figure: Overview of WoVR. Overall architecture: WoVR builds a reliability-driven reinforcement learning framework entirely around the learned world model.

01. Stabilized Action-Conditioned World Model

WoVR upgrades a video diffusion backbone into an action-controllable, rollout-stable simulator with dual-channel action injection and first-frame anchoring. This reduces long-horizon drift and keeps imagined trajectories responsive to policy actions.

  • Dual-path action conditioning for local modulation and global control.
  • First-frame anchoring to preserve scene structure across autoregressive chunks.
  • Noisy context training to narrow the train-inference gap.
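
The anchoring and noisy-context ideas above can be illustrated with a toy rollout loop. This is a minimal sketch under stated assumptions, not the WoVR implementation: frames and actions are plain floats, `toy_world_model` is a hypothetical stand-in for the video diffusion model, Gaussian noise is an assumed form of context perturbation, and dual-path action injection is not modeled at all.

```python
import random
from typing import Callable, List, Sequence

Frame = float   # stand-in for an image tensor (assumption of this toy example)
Action = float  # stand-in for a robot action vector
WorldModel = Callable[[Sequence[Frame], Sequence[Action]], List[Frame]]


def build_context(first_frame: Frame, generated: Sequence[Frame], context_len: int) -> List[Frame]:
    """First-frame anchoring: always condition on the initial frame,
    followed by up to `context_len` of the most recently generated frames."""
    recent = list(generated[-context_len:]) if context_len > 0 else []
    return [first_frame] + recent


def add_context_noise(context: Sequence[Frame], sigma: float, rng: random.Random) -> List[Frame]:
    """Noisy context (training time only): perturb the conditioning frames so
    the model sees imperfect context, as it will when fed its own predictions.
    The Gaussian form is an assumption, not taken from the source."""
    return [frame + rng.gauss(0.0, sigma) for frame in context]


def rollout(world_model: WorldModel, first_frame: Frame, actions: Sequence[Action],
            chunk_size: int, context_len: int) -> List[Frame]:
    """Autoregressive rollout in chunks; every chunk is conditioned on the anchor frame."""
    generated: List[Frame] = []
    for start in range(0, len(actions), chunk_size):
        chunk_actions = actions[start:start + chunk_size]
        context = build_context(first_frame, generated, context_len)
        generated.extend(world_model(context, chunk_actions))
    return generated


def toy_world_model(context: Sequence[Frame], chunk_actions: Sequence[Action]) -> List[Frame]:
    """Hypothetical dynamics: each action is added to the latest frame value."""
    current = context[-1]
    frames = []
    for action in chunk_actions:
        current = current + action
        frames.append(current)
    return frames


frames = rollout(toy_world_model, first_frame=0.0, actions=[1.0] * 6,
                 chunk_size=2, context_len=2)
print(frames)  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

The point of the sketch is the shape of the context: however long the rollout gets, the first element of every conditioning window is the original frame, which is what lets the scene structure persist across chunks.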

Figure: World model architecture.

02. Hallucination-Aware Interaction with KIR

Instead of always rolling out from the episode start, Keyframe-Initialized Rollouts (KIR) start part of the imagined trajectories near task-critical states. This shortens the effective prediction depth and keeps policy learning closer to physically meaningful transitions.

  • Initialize rollouts near critical contacts and failure states.
  • Shorten the prefix the world model must predict before decisive moments.
  • Reduce hallucination compounding and spurious long-horizon success.
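
A rough sketch of how rollout start states might be mixed between episode starts and keyframes is given below. Everything here is illustrative: the `keyframe_ratio` parameter, the `keyframes` dictionary, and the function names are hypothetical, and how WoVR actually selects keyframes is not specified in this page, so they are simply taken as given input.

```python
import random
from typing import Dict, List, Tuple


def sample_rollout_starts(keyframes: Dict[str, List[int]], num_rollouts: int,
                          keyframe_ratio: float, rng: random.Random) -> List[Tuple[str, int]]:
    """Return (episode_id, start_step) pairs for imagined rollouts.

    Roughly a fraction `keyframe_ratio` of rollouts starts at a recorded
    keyframe; the rest start at step 0, the episode start."""
    episode_ids = sorted(keyframes)
    starts = []
    for _ in range(num_rollouts):
        episode_id = rng.choice(episode_ids)
        candidates = keyframes[episode_id]
        if candidates and rng.random() < keyframe_ratio:
            starts.append((episode_id, rng.choice(candidates)))
        else:
            # No keyframe available, or this rollout was assigned a full-episode start.
            starts.append((episode_id, 0))
    return starts


def effective_depth(start_step: int, decisive_step: int) -> int:
    """Number of steps the world model must predict before the decisive moment."""
    return max(decisive_step - start_step, 0)


# Hypothetical keyframe annotations: steps near contacts or failure states.
keyframes = {"ep0": [90, 110], "ep1": [75], "ep2": []}
starts = sample_rollout_starts(keyframes, num_rollouts=8, keyframe_ratio=0.5,
                               rng=random.Random(0))

# If the decisive contact happens at step 120, starting at keyframe 110
# leaves 10 predicted steps instead of 120.
print(effective_depth(0, 120), effective_depth(110, 120))  # 120 10
```

Keeping some rollouts at step 0 means the policy still practices the full task, while the keyframe starts concentrate learning signal where the world model has had the least opportunity to drift.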

Figure: Keyframe-Initialized Rollouts illustration.

03. PACE: Policy-Simulator Co-Evolution

Policy optimization changes the action distribution, which can make a previously reliable world model drift out of regime. PACE periodically refreshes the simulator with new data gathered under the updated policy so the world model stays aligned with the behavior it is asked to simulate.

  • Mitigates distribution shift.
  • Preserves simulator reliability.
  • Supports scalable on-policy optimization in imagination.
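
The alternation described above can be written as a short outer loop. This is a sketch under assumptions rather than the authors' training code: the three callables (`optimize_in_imagination`, `collect_rollouts`, `refresh_world_model`) and the schedule parameters are hypothetical names, and the toy versions below only track version numbers to show that the simulator ends up refreshed on data from the latest policy.

```python
from typing import Any, Callable, Tuple


def pace_loop(policy: Any, world_model: Any,
              optimize_in_imagination: Callable[[Any, Any], Any],
              collect_rollouts: Callable[[Any], Any],
              refresh_world_model: Callable[[Any, Any], Any],
              num_rounds: int, rl_steps_per_round: int) -> Tuple[Any, Any]:
    """Alternate policy optimization in imagination with simulator refreshes."""
    for _round in range(num_rounds):
        # 1) Update the policy using imagined rollouts only.
        for _step in range(rl_steps_per_round):
            policy = optimize_in_imagination(policy, world_model)
        # 2) Gather fresh data under the updated policy.
        fresh_data = collect_rollouts(policy)
        # 3) Refresh the simulator so it matches the new action distribution.
        world_model = refresh_world_model(world_model, fresh_data)
    return policy, world_model


# Toy stand-ins: the policy is a version counter, and the world model
# records which policy version its latest data came from.
def toy_optimize(policy: int, world_model: dict) -> int:
    return policy + 1


def toy_collect(policy: int) -> dict:
    return {"policy_version": policy}


def toy_refresh(world_model: dict, data: dict) -> dict:
    return {"trained_on": data["policy_version"]}


policy, world_model = pace_loop(0, {"trained_on": 0}, toy_optimize, toy_collect,
                                toy_refresh, num_rounds=3, rl_steps_per_round=5)
print(policy, world_model)  # 15 {'trained_on': 15}
```

The choice of `rl_steps_per_round` sets how far the policy may move before the simulator is refreshed; a smaller value keeps the two more tightly aligned at the cost of more data collection.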

Results

The latest paper version expands evaluation to stronger initialization settings and multi-platform real-world transfer while keeping the same core conclusion: reliable imagination leads to better downstream RL.

World Model Quality

WoVR achieves stronger long-horizon video quality than prior world-model baselines while maintaining 23 FPS inference throughput.

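| Method | Rollout | FPS ↑ | LPIPS ↓ | FID ↓ | FVD ↓ | FloLPIPS ↓ |
|---|---|---|---|---|---|---|
| EVAC | 512 | 1.35 | 0.146 | 46.528 | 345.818 | 0.205 |
| Cosmos-Predict2 | 512 | 3.50 | 0.315 | 165.862 | 275.737 | 0.265 |
| OpenSora | 512 | 7.00 | 0.105 | 38.478 | 89.391 | 0.156 |
| WoVR (Ours) | 512 | 23.0 | 0.091 | 34.252 | 68.011 | 0.154 |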
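
To put the throughput and FVD columns in relative terms, the short calculation below derives speedup factors and the FVD reduction against the strongest baseline. It uses only the FPS and FVD values reported in the table; the variable names are illustrative.

```python
# FPS and FVD values copied from the World Model Quality table.
fps = {"EVAC": 1.35, "Cosmos-Predict2": 3.50, "OpenSora": 7.00, "WoVR": 23.0}
fvd = {"EVAC": 345.818, "Cosmos-Predict2": 275.737, "OpenSora": 89.391, "WoVR": 68.011}

# Throughput of WoVR relative to each baseline.
speedup = {name: round(fps["WoVR"] / value, 1)
           for name, value in fps.items() if name != "WoVR"}
print(speedup)  # {'EVAC': 17.0, 'Cosmos-Predict2': 6.6, 'OpenSora': 3.3}

# Relative FVD reduction versus the best baseline (lowest baseline FVD).
best_baseline = min((name for name in fvd if name != "WoVR"), key=fvd.get)
fvd_reduction_pct = round(100 * (fvd[best_baseline] - fvd["WoVR"]) / fvd[best_baseline], 1)
print(best_baseline, fvd_reduction_pct)  # OpenSora 23.9
```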

LIBERO Policy Performance

Under both weak and strong SFT initialization, WoVR achieves the best average success rate across LIBERO Spatial, Object, Goal, and Long suites.

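| Setting | Method | Spatial | Object | Goal | Long | Avg |
|---|---|---|---|---|---|---|
| One-Trajectory SFT | OpenVLA-OFT | 63.6 | 36.4 | 48.2 | 13.8 | 40.5 |
| One-Trajectory SFT | w/ GRPO | 66.6 | 45.2 | 52.2 | 14.6 | 44.6 |
| One-Trajectory SFT | w/ WMPO | 67.8 | 65.4 | 56.6 | 13.8 | 50.9 |
| One-Trajectory SFT | w/ WoVR | 84.2 | 80.8 | 77.4 | 35.8 | 69.5 |
| Full-Trajectory SFT | OpenVLA-OFT | 93.6 | 83.0 | 90.0 | 85.6 | 88.1 |
| Full-Trajectory SFT | w/ GRPO | 94.6 | 86.2 | 92.2 | 85.8 | 89.7 |
| Full-Trajectory SFT | w/ WMPO | 95.0 | 94.8 | 92.8 | 87.0 | 92.4 |
| Full-Trajectory SFT | w/ WoVR | 98.8 | 98.8 | 94.8 | 91.4 | 96.0 |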

Real-World Transfer

WoVR improves real-world success rates on both Franka Emika Panda and AgileX Piper without additional online real-world reinforcement learning.

Figure: Real-world experiments on Franka and AgileX Piper. Top row: Franka Emika Panda tasks. Bottom row: AgileX Piper tasks.

| Platform | Method | Task 1 | Task 2 | Task 3 | Avg | Delta |
|---|---|---|---|---|---|---|
| Franka | OpenVLA-OFT | 36.7 | 70.0 | 46.7 | 51.1 | - |
| Franka | w/ WoVR | 86.7 | 90.0 | 63.3 | 80.0 | +28.9 |
| AgileX Piper | OpenVLA-OFT | 10.0 | 23.3 | 13.3 | 15.5 | - |
| AgileX Piper | w/ WoVR | 20.0 | 33.3 | 33.3 | 28.9 | +13.4 |
| AgileX Piper (π0.5) | Base | 20.0 | 60.0 | 23.3 | 34.4 | - |
| AgileX Piper (π0.5) | w/ WoVR | 30.0 | 86.7 | 53.3 | 56.7 | +22.3 |

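
The Avg and Delta columns can be recomputed from the per-task success rates. The snippet below is a small check using only the table's numbers; it suggests that Delta is the difference between the rounded averages (for AgileX Piper with OpenVLA-OFT, the unrounded difference would round to 13.3 rather than the reported 13.4). The `summarize` helper is an illustrative name.

```python
from typing import List, Tuple


def summarize(base: List[float], tuned: List[float]) -> Tuple[float, float, float]:
    """Return (base average, WoVR average, delta), each rounded to one decimal.
    Delta is taken between the rounded averages, which matches the table."""
    base_avg = round(sum(base) / len(base), 1)
    tuned_avg = round(sum(tuned) / len(tuned), 1)
    return base_avg, tuned_avg, round(tuned_avg - base_avg, 1)


# Per-task success rates (Task 1, Task 2, Task 3) copied from the table.
rows = {
    "Franka": ([36.7, 70.0, 46.7], [86.7, 90.0, 63.3]),
    "AgileX Piper": ([10.0, 23.3, 13.3], [20.0, 33.3, 33.3]),
    "AgileX Piper (pi0.5)": ([20.0, 60.0, 23.3], [30.0, 86.7, 53.3]),
}

summary = {platform: summarize(base, tuned) for platform, (base, tuned) in rows.items()}
for platform, values in summary.items():
    print(platform, values)
# Franka (51.1, 80.0, 28.9)
# AgileX Piper (15.5, 28.9, 13.4)
# AgileX Piper (pi0.5) (34.4, 56.7, 22.3)
```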
Simulator

WoVR is both faster and more stable than prior world-model baselines under long-horizon rollout.

Policy

Reliable imagination consistently outperforms limited-budget online RL and previous world-model RL.

Transfer

Performance gains carry over to different platforms and even to a different VLA backbone.

Media: Supplementary Videos

Project Presentation