Project Page Anonymous Submission

WoVR: World Models as Reliable Simulators for Post-Training VLA Policies with RL

Reliability-aware optimization in imagination: a controllable simulator, hallucination-aware interaction, and policy-simulator co-evolution.

WoVR turns learned world models into reliable simulators for reinforcement learning by explicitly controlling hallucination at the simulator, interaction, and alignment levels.

Real-World Execution Video

Franka + OpenVLA-OFT

Task	SFT Failure Mode 1	SFT Failure Mode 2	WoVR RL Success
Pick Banana
Pick Bread
Open Drawer

Visualization of real-world execution before and after WoVR policy optimization.

AgileX + OpenVLA-OFT

Task	SFT Failure Mode 1	SFT Failure Mode 2	WoVR RL Success
Pick Cube
Pick Tomato
Fold Towel

AgileX + OpenPI-0.5

Task	SFT Failure Mode 1	SFT Failure Mode 2	WoVR RL Success
Pick Cube
Pick Tomato
Fold Towel

Overview

Abstract

Reinforcement learning (RL) promises to unlock capabilities beyond imitation learning for Vision-Language-Action (VLA) models, but its requirement for massive real-world interaction prevents direct deployment on physical robots. Recent work attempts to use learned world models as simulators for policy optimization, yet closed-loop imagined rollouts inevitably suffer from hallucination and long-horizon error accumulation. Such errors not only degrade visual fidelity, but also mislead policy optimization by providing unreliable learning signals.

We propose WoVR, a reliable world-model-based RL framework for post-training VLA policies. Instead of assuming a faithful world model, WoVR explicitly regulates how RL interacts with imperfect imagined dynamics. It improves rollout stability through a controllable action-conditioned video world model, reshapes imagined interaction to reduce effective error depth via Keyframe-Initialized Rollouts (KIR), and maintains policy-simulator alignment through World Model-Policy co-evolution.

Extensive experiments demonstrate that WoVR enables stable long-horizon imagined rollouts and effective policy optimization, achieving superior LIBERO performance and consistent real-world gains across multiple robotic platforms. These results show that world models can serve as practical simulators for RL when hallucination is explicitly controlled.

The central question

If world models inevitably hallucinate under closed-loop rollouts, how can reinforcement learning remain reliable instead of learning to exploit simulator errors?

Method

A Reliability-Driven World-Model RL Pipeline

WoVR regulates reliability at three interconnected levels: simulator-level control, interaction-level reshaping, and alignment-level co-evolution.

Overview of WoVR (overall architecture): WoVR builds a reliability-driven reinforcement learning framework entirely around the learned world model.

Stabilized Action-Conditioned World Model

WoVR upgrades a video diffusion backbone into an action-controllable, rollout-stable simulator with dual-channel action injection and first-frame anchoring. This reduces long-horizon drift and keeps imagined trajectories responsive to policy actions.

Dual-path action conditioning for local modulation and global control.
First-frame anchoring to preserve scene structure across autoregressive chunks.
Noisy context training to narrow the train-inference gap.
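
The last two items above can be illustrated with a minimal sketch. It assumes that noisy context training means perturbing the conditioning frames with Gaussian noise during training, and that the anchored first frame is kept clean; neither detail is confirmed by this page. The names `add_context_noise`, `build_conditioning`, and `max_sigma` are hypothetical, and frames are flat lists of floats to keep the example dependency-free.

```python
import random

def add_context_noise(context_frames, max_sigma=0.1, rng=None):
    """Perturb conditioning frames with Gaussian noise of a randomly drawn scale.

    context_frames: list of frames, each a flat list of floats in [0, 1].
    max_sigma: hypothetical upper bound on the noise standard deviation.
    """
    rng = rng or random.Random()
    sigma = rng.uniform(0.0, max_sigma)  # one noise level per training sample
    noisy = [
        [min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in frame]
        for frame in context_frames
    ]
    return noisy, sigma

def build_conditioning(first_frame, recent_frames, rng=None):
    """Assumed form of first-frame anchoring: the clean initial frame leads the context."""
    noisy_recent, sigma = add_context_noise(recent_frames, rng=rng)
    return [list(first_frame)] + noisy_recent, sigma
```

Training on perturbed context exposes the model to imperfect inputs that resemble its own generated frames, which is the train-inference gap the list refers to.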

Hallucination-Aware Interaction with KIR

Instead of always rolling out from the episode start, Keyframe-Initialized Rollouts (KIR) start part of the imagined trajectories near task-critical states. This shortens the effective prediction depth and keeps policy learning closer to physically meaningful transitions.

Initialize rollouts near critical contacts and failure states.
Shorten the prefix the world model must predict before decisive moments.
Reduce hallucination compounding and spurious long-horizon success.

Keyframe-Initialized Rollouts illustration
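
As a rough sketch of the idea, the snippet below picks a start state for each imagined rollout and counts how many steps the world model must predict before a decisive moment. It assumes keyframes are indices into a recorded trajectory; `sample_rollout_start`, `imagined_steps_before`, and `keyframe_prob` are hypothetical names, not the authors' implementation.

```python
import random

def sample_rollout_start(episode, keyframe_indices, keyframe_prob=0.5, rng=None):
    """Pick the index of the recorded state an imagined rollout starts from.

    episode: list of recorded states (index 0 is the episode start).
    keyframe_indices: indices of task-critical states, e.g. just before contact.
    keyframe_prob: hypothetical fraction of rollouts initialized at a keyframe.
    """
    rng = rng or random.Random()
    if keyframe_indices and rng.random() < keyframe_prob:
        return rng.choice(keyframe_indices)
    return 0  # fall back to the episode start

def imagined_steps_before(decisive_step, start_index):
    """Number of steps the world model must predict before the decisive moment."""
    return max(0, decisive_step - start_index)
```

For instance, if a grasp happens at step 150 and a keyframe sits at step 140, a rollout from the episode start must imagine 150 steps before the grasp, while a keyframe-initialized rollout imagines only 10.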

PACE: Policy-Simulator Co-Evolution

Policy optimization changes the action distribution, which can make a previously reliable world model drift out of regime. PACE periodically refreshes the simulator with new data gathered under the updated policy so the world model stays aligned with the behavior it is asked to simulate.

Mitigates distribution shift.
Preserves simulator reliability.
Supports scalable on-policy optimization in imagination.
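
The alternation described above can be sketched as a simple loop. This is an illustration under assumptions, not the authors' training code: the callables `rl_update`, `collect_rollouts`, and `finetune_wm`, and the `refresh_every` schedule, are hypothetical placeholders, and the page does not specify where the refresh data is collected.

```python
def co_evolve(policy, world_model, rl_update, collect_rollouts, finetune_wm,
              num_iterations=100, refresh_every=10):
    """Alternate RL in imagination with periodic world-model refreshes."""
    refreshes = 0
    for it in range(1, num_iterations + 1):
        # Policy improvement uses only imagined rollouts from the current simulator.
        policy = rl_update(policy, world_model)
        if it % refresh_every == 0:
            # Gather data under the updated policy and realign the simulator to it.
            data = collect_rollouts(policy)
            world_model = finetune_wm(world_model, data)
            refreshes += 1
    return policy, world_model, refreshes
```

The refresh interval trades data-collection cost against how far the policy's action distribution can move away from what the simulator was trained on.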

Results

The latest paper version expands evaluation to stronger initialization settings and multi-platform real-world transfer while keeping the same core conclusion: reliable imagination leads to better downstream RL.

World Model Quality

At a rollout length of 512, WoVR achieves the best LPIPS, FID, FVD, and FloLPIPS among the compared world-model baselines while running at 23 FPS, versus 7 FPS for the next-fastest baseline (OpenSora).

Method	Rollout	FPS ↑	LPIPS ↓	FID ↓	FVD ↓	FloLPIPS ↓
EVAC	512	1.35	0.146	46.528	345.818	0.205
Cosmos-Predict2	512	3.50	0.315	165.862	275.737	0.265
OpenSora	512	7.00	0.105	38.478	89.391	0.156
WoVR (Ours)	512	23.0	0.091	34.252	68.011	0.154

LIBERO Policy Performance

Under both weak (one-trajectory) and strong (full-trajectory) SFT initialization, WoVR achieves the best average success rate across the LIBERO Spatial, Object, Goal, and Long suites: 69.5 and 96.0, respectively.

Setting	Method	Spatial	Object	Goal	Long	Avg
One-Trajectory SFT	OpenVLA-OFT	63.6	36.4	48.2	13.8	40.5
	w/ GRPO	66.6	45.2	52.2	14.6	44.6
	w/ WMPO	67.8	65.4	56.6	13.8	50.9
	w/ WoVR	84.2	80.8	77.4	35.8	69.5
Full-Trajectory SFT	OpenVLA-OFT	93.6	83.0	90.0	85.6	88.1
	w/ GRPO	94.6	86.2	92.2	85.8	89.7
	w/ WMPO	95.0	94.8	92.8	87.0	92.4
	w/ WoVR	98.8	98.8	94.8	91.4	96.0

Real-World Transfer

WoVR improves real-world success rates on both Franka Emika Panda and AgileX Piper without additional online real-world reinforcement learning.

Real-world experiments on Franka and AgileX Piper. Top row: Franka Emika Panda tasks. Bottom row: AgileX Piper tasks.

Platform	Method	Task 1	Task 2	Task 3	Avg	Delta
Franka	OpenVLA-OFT	36.7	70.0	46.7	51.1	-
Franka	w/ WoVR	86.7	90.0	63.3	80.0	+28.9
AgileX Piper	OpenVLA-OFT	10.0	23.3	13.3	15.5	-
AgileX Piper	w/ WoVR	20.0	33.3	33.3	28.9	+13.4
AgileX Piper (π_0.5)	Base	20.0	60.0	23.3	34.4	-
AgileX Piper (π_0.5)	w/ WoVR	30.0	86.7	53.3	56.7	+22.3
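
The Avg and Delta columns can be reproduced from the per-task numbers. The short check below assumes Avg is the mean of the three task success rates rounded to one decimal and Delta is the difference of the rounded averages, which matches every row of the table; the helper names are illustrative.

```python
def avg(task_rates):
    """Mean success rate over tasks, rounded to one decimal as in the table."""
    return round(sum(task_rates) / len(task_rates), 1)

def delta(base_rates, wovr_rates):
    """Gain of the WoVR policy over its base policy, from the rounded averages."""
    return round(avg(wovr_rates) - avg(base_rates), 1)

# Per-task success rates copied from the table above: (base, w/ WoVR).
table = {
    "Franka": ([36.7, 70.0, 46.7], [86.7, 90.0, 63.3]),
    "AgileX Piper": ([10.0, 23.3, 13.3], [20.0, 33.3, 33.3]),
    "AgileX Piper (pi_0.5)": ([20.0, 60.0, 23.3], [30.0, 86.7, 53.3]),
}

for platform, (base, wovr) in table.items():
    print(f"{platform}: {avg(base)} -> {avg(wovr)} (+{delta(base, wovr)})")
```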

Simulator

WoVR is both faster and more stable than prior world-model baselines under long-horizon rollout.

Policy

Reliable imagination consistently outperforms limited-budget online RL and previous world-model RL.

Transfer

Performance gains carry over to different platforms and even to a different VLA backbone.