Preprint · 2026

Coupled Local and Global World Models for Efficient First Order RL

Joseph Amigo*, Rooholla Khorrambakht*, Nicolas Mansard, Ludovic Righetti

* equal contribution

A decoupled first-order gradient (FoG) RL method that trains policies from scratch entirely inside a large-scale world model — also learned from scratch from real-world data — and deploys them zero-shot on the real robot.

Real-world tasks solved zero-shot by policies trained entirely inside our learned world models: Push-T with a tabletop manipulator (left), Ego-Centric Grasp and Lift with a G1 humanoid (centre), and Ego-Centric Push Cube with a Go2 quadruped (right).

Abstract

World models offer a promising avenue for capturing complex environment dynamics where simulators face challenges. However, large-scale world models required for complex real-world settings are computationally expensive to adopt in popular RL approaches. We introduce a novel first-order RL method that enables policy training via a decoupled first-order gradient (FoG): a large-scale world model generates accurate forward trajectories while a lightweight latent-space surrogate approximates its local dynamics for efficient gradient computation. This coupled local-and-global formulation allows high-fidelity forward dynamics alongside the computationally efficient differentiation needed for model-based RL. Across a range of real-world robotic tasks we demonstrate tractable RL and zero-shot deployment, with significantly better sample efficiency than PPO on a canonical real-world Push-T benchmark and similar gains on more complex ego-centric manipulation and grasping.

Method

We extend Decoupled forward-backward Model-based policy Optimization (DMO) to the simulator-free setting in image space. A pretrained large-scale world model (DIAMOND for Push-T, DreamerV4 for the ego-centric tasks) acts as the global simulator surrogate and generates the forward trajectories. Alongside it, a lightweight Recurrent State-Space Model (RSSM) serves as a local, low-dimensional latent dynamics model that supplies stable, low-variance gradients without backpropagating through the heavy global model. Forward accuracy and backward tractability can therefore be optimised independently, which makes FoG-MBRL practical and efficient with large-scale pixel-space world models.

Decoupled local-and-global world model architecture for first-order RL
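
As a minimal sketch of the decoupling idea (not the authors' implementation), the toy below uses a scalar state and a linear policy a = k * s. The function `global_step` is a hypothetical stand-in for the heavy world model and is only ever called in the forward direction. The constants `A_LOCAL` and `B_LOCAL` are an assumed linear surrogate whose Jacobians drive the backward pass along the global trajectory. The reward and all numerical values are illustrative assumptions.

```python
import numpy as np

def global_step(s, a):
    # Hypothetical stand-in for the large world model: forward only, never differentiated.
    return 0.9 * s + 0.5 * a + 0.05 * np.sin(s)

# Assumed local surrogate s' ~ A_LOCAL * s + B_LOCAL * a; its Jacobians are cheap constants.
A_LOCAL, B_LOCAL = 0.9, 0.5

def decoupled_gradient(k, s0, horizon):
    """Return (total_reward, approximate d total_reward / d k) for the policy a = k * s."""
    # Forward pass: the trajectory comes from the global model only.
    states = [s0]
    for _ in range(horizon):
        a = k * states[-1]
        states.append(global_step(states[-1], a))
    total_reward = -sum(s ** 2 for s in states[1:])  # illustrative reward: drive s to 0

    # Backward pass: reverse-mode accumulation with the surrogate Jacobians,
    # evaluated at the states visited by the global model.
    grad_k = 0.0
    adj = 0.0  # sensitivity of future reward to s_{t+1}
    for t in reversed(range(horizon)):
        adj += -2.0 * states[t + 1]            # direct reward term at s_{t+1}
        grad_k += adj * B_LOCAL * states[t]    # path s_{t+1} <- a_t <- k
        adj *= A_LOCAL + B_LOCAL * k           # path s_{t+1} <- s_t (direct and via a_t)
    return total_reward, grad_k
```

Calling `decoupled_gradient(0.0, 1.0, 10)` returns the rollout reward together with an approximate gradient. The gradient is biased wherever the surrogate Jacobians differ from those of the global model, so in this toy the surrogate only needs to be accurate locally, around the states the global model actually visits.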

Sample efficiency

On Push-T we match or beat PPO with an order of magnitude fewer environment samples, and we see the same pattern on the more complex G1 ego-centric grasp-and-lift and Go2 ego-centric push-cube tasks. All curves are computed over 4 seeds.

Push-T (tabletop): sample efficiency
G1 ego-centric grasp-and-lift: sample efficiency
Go2 ego-centric push-cube: sample efficiency
Figure 5. Aggregated performance metrics across all evaluation tasks. Left: normalised reward versus normalised sample number. Right: normalised reward versus normalised wall-clock time. Each experiment was run with 4 seeds.
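
The page does not specify how rewards and sample counts are normalised. One plausible reading, sketched below purely as an assumption, is min-max scaling of rewards per task, scaling sample counts by the task's maximum, and then taking the mean and standard deviation over seeds. The name `normalise_curves` and its arguments are hypothetical.

```python
import numpy as np

def normalise_curves(rewards, samples):
    """Rescale one task's learning curves to [0, 1] and aggregate over seeds.

    rewards: array-like of shape (n_seeds, n_points)
    samples: array-like of shape (n_points,), environment samples at each point
    Returns (x, mean, std), each of shape (n_points,).
    """
    rewards = np.asarray(rewards, dtype=float)
    samples = np.asarray(samples, dtype=float)
    lo, hi = rewards.min(), rewards.max()
    if hi == lo:
        raise ValueError("rewards are constant; min-max scaling is undefined")
    norm_r = (rewards - lo) / (hi - lo)   # assumed min-max normalisation per task
    x = samples / samples.max()           # assumed scaling by the largest sample count
    return x, norm_r.mean(axis=0), norm_r.std(axis=0)
```

Curves normalised this way share the same axes, so tasks with different reward scales and budgets can be averaged into a single plot.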

Real-robot rollouts

Policies trained entirely inside the learned world models, deployed zero-shot on hardware.

G1 humanoid · grasp & lift · ten successes in a row

Only the grasp-and-lift motion is policy-driven. Between trials a human teleoperator returns the box to the table, and the policy is restarted from scratch for the next attempt.

Push-T · tabletop manipulator

Go2 quadruped · ego-centric push-cube — DMO (ours) vs PPO vs BC (ACT)

Scene 1 — going straight to the goal

DMO (ours)
PPO
BC (ACT)

Scene 2 — discovery behaviour when the cube is initially out of view

DMO (ours)
PPO
BC (ACT)

Scene 3 — same task with a different low-level locomotion policy

DMO (ours)
PPO
BC (ACT)

World-model fidelity

A side-by-side rollout on the G1 humanoid grasp-and-lift task lets us inspect how the global diffusion world model and the lightweight local RSSM each track the real trajectory over a 5.5-second horizon. The global model preserves visual fidelity and contact dynamics long after the local model loses high-frequency detail — which is exactly why we keep the global model in the forward pass and reserve the local model for cheap backward gradients.

Figure 4. Unrolling of real and model-predicted trajectories on the G1 humanoid manipulation task. We compare the real camera observation (top), the local DreamerV3 RSSM trajectory (middle), and the global DreamerV4 diffusion trajectory (bottom) at time steps 0, 5, 10, …, 55 (5.5 s at 10 Hz), with both models initialised from the first real frame. The global diffusion model maintains high visual fidelity and accurate contact dynamics over the long horizon.

Patching world-model exploits

When an RL policy is trained to maximise reward entirely inside a learned world model, it inevitably finds physical hallucinations: situations where the world model breaks its own physics in a way that yields high reward. Our pipeline deploys those exploit policies on the real robot to harvest targeted patching data, then fine-tunes the world model on it. The figure below shows the same exploit policy rolled out on the original and the patched model: the hallucination disappears.

Figure 7. Visual comparison of the world model before and after patching on the Unitree G1 Grab Box task. Rows labelled original display the unpatched world model where the RL policy discovered physical exploits to artificially maximise reward. In the first rollout (top two rows), the policy exploits the model by moving the hand into a lifting pose, causing the box to spontaneously teleport into its grasp. In the second rollout (bottom two rows), the hand remains distant but pulls the box toward itself using an "invisible force". Rows labelled patched show the corrected world model rollouts after being fine-tuned with real-world data collected by deploying these exploiting policies, effectively eliminating the physical hallucinations.
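
The patching procedure described in this section can be sketched as a short loop. This is an outline under stated assumptions, not the authors' code: `train_policy`, `collect_real_rollouts` and `fine_tune` are hypothetical callables supplied by the user, and both the option to repeat for several rounds and the choice to fine-tune on all data gathered so far are assumptions.

```python
def patch_world_model(world_model, train_policy, collect_real_rollouts,
                      fine_tune, n_rounds=1):
    """Alternate policy training in the model with real-world data collection.

    train_policy(world_model) -> policy            (RL inside the model; may exploit it)
    collect_real_rollouts(policy) -> list of data  (deploy on hardware, record outcomes)
    fine_tune(world_model, data) -> world_model    (update the model on the patch data)
    """
    patch_data = []
    for _ in range(n_rounds):
        policy = train_policy(world_model)
        # Real rollouts of an exploiting policy cover exactly the states
        # where the model's predictions diverge from reality.
        patch_data.extend(collect_real_rollouts(policy))
        world_model = fine_tune(world_model, patch_data)
    return world_model, patch_data
```

With `n_rounds=1` this matches the single deploy-then-fine-tune pass described above; larger values are only one possible extension.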

BibTeX

@misc{amigo2026coupledlocalglobalworld,
      title={Coupled Local and Global World Models for Efficient First Order RL},
      author={Joseph Amigo and Rooholla Khorrambakht and Nicolas Mansard and Ludovic Righetti},
      year={2026},
      eprint={2602.06219},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2602.06219},
}