Grounding Generated Videos in Feasible Plans via World Models

1Imperial College London, 2UC Berkeley
GVP-WM teaser panel

GVP-WM grounds video-generated plans into feasible action sequences using a pre-trained action-conditioned world model via video-guided latent collocation.

Abstract

Large-scale video generative models have shown emerging capabilities as zero-shot visual planners, yet video-generated plans often violate temporal consistency and physical constraints, leading to failures when mapped to executable actions. To address this, we propose Grounding Video Plans with World Models (GVP-WM), a planning method that grounds video-generated plans into feasible action sequences using a pre-trained action-conditioned world model. At test time, GVP-WM first generates a video plan from initial and goal observations, then projects the video guidance onto the manifold of dynamically feasible latent trajectories via video-guided latent collocation. In particular, we formulate grounding as a goal-conditioned latent-space trajectory optimization problem that jointly optimizes latent states and actions under world-model dynamics while preserving semantic alignment with the video-generated plan. Empirically, GVP-WM recovers feasible long-horizon plans from zero-shot image-to-video–generated and motion-blurred videos that violate physical constraints, across navigation and manipulation simulation tasks.

I2V-Generated Video Plans Violate Physics

Spatial Bilocation

Object Teleportation

Rigid-Body Violation

Object Disintegration

Morphological Drift

GVP-WM Grounds I2V-Generated Plans via World Models

Top: I2V-Generated Plan  —  Bottom: GVP-WM Grounding

(a) Morphological Drift (Success)

(b) Morphological Drift (Failure)

(c) Rigid-Object Physics Violation (Failure)

(d) Domain-Adapted Video Guidance (Success)

(e) Spatial Bilocation (Failure)

(f) Spatial Bilocation (Success)

GVP-WM Projects Video Plans onto the Manifold of Feasible Trajectories in World Models via Latent Collocation

Overview of GVP-WM

Overview of GVP-WM. A video plan, which may contain physically infeasible transitions (e.g., motion blur or object teleportation), is encoded into a sequence of latent states \(\{z^{\mathrm{vid}}_{t:T-1}\}\) using a pretrained visual encoder \(E_\phi\) of the world model. Video-guided latent collocation optimizes a latent trajectory \(\{z_{t+1:T}\}\) and corresponding actions \(\{a_{t:T-1}\}\) by minimizing an augmented Lagrangian objective, which balances video alignment (\(\mathcal{L}_{\mathrm{vid}}\)), terminal goal reaching (\(\mathcal{L}_{\mathrm{goal}}\)), and world-model dynamics (\(\mathcal{L}_{\mathrm{dyn}}\)). The actions from the optimal latent trajectory satisfying the world-model dynamics constraints are executed using model predictive control.


Video Plan Guidance

A video plan \(\tau_{\mathrm{vid}}\) is a sequence of visual observations generated by a conditional video generative model \(\mathcal{G}\) that provides temporally ordered visual foresight for completing a task. \(\mathcal{G}\) can be instantiated using an image-to-video (I2V) diffusion-based video model that produces temporally coherent visual transitions between the start and goal observations. We map the generated video plan into the latent state space of the action-conditioned world model using its underlying visual encoder, yielding a sequence of latent states \(z^{\mathrm{vid}}_{0:T}\).
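The encoding step above can be sketched as follows. This is a minimal illustration, not the paper's actual encoder: `E_phi` here is any callable mapping a frame to a latent vector, and the toy mean-intensity encoder is purely an assumption for the example.

```python
import numpy as np

def encode_video(frames, E_phi):
    """Map each video-plan frame to a world-model latent z^vid_t.

    E_phi stands in for the world model's pretrained visual encoder;
    here it is any callable frame -> latent vector (illustrative only).
    """
    return np.stack([E_phi(f) for f in frames])

# Toy encoder: mean pixel intensity as a 1-D "latent" (an assumption,
# not the paper's encoder), applied to three constant dummy frames.
E_phi = lambda f: np.array([f.mean()])
frames = [np.full((4, 4), v) for v in (0.0, 0.5, 1.0)]
z_vid = encode_video(frames, E_phi)
print(z_vid.shape)  # (3, 1)
```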

To preserve semantic alignment between the optimized latent trajectory and the video plan, we employ a video alignment loss, defined as

\[ \mathcal{L}_{\mathrm{vid}}(z_t, z^{\mathrm{vid}}_t) = \left\| \phi(z_t) - \phi(z^{\mathrm{vid}}_t) \right\|^2, \]

where both optimized and video latent states are projected onto the unit \(\ell_2\) hypersphere by \(\phi(z) = z / \|z\|_2\). This loss penalizes angular deviation between latent embeddings while remaining invariant to their magnitude.
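This loss can be written in a few lines; the sketch below assumes latents are plain vectors and only illustrates the magnitude invariance of the hypersphere projection.

```python
import numpy as np

def video_alignment_loss(z, z_vid):
    """L_vid: squared distance between unit-normalized latents.

    Projecting both latents onto the unit l2 hypersphere makes the loss
    depend only on the angle between them, not on their magnitudes.
    """
    phi = lambda v: v / np.linalg.norm(v)
    return float(np.sum((phi(z) - phi(z_vid)) ** 2))

# Rescaling a latent leaves the loss unchanged (angular invariance).
print(video_alignment_loss(np.array([1.0, 0.0]), np.array([5.0, 0.0])))  # 0.0
```

Note that \(\|\phi(z) - \phi(z^{\mathrm{vid}})\|^2 = 2 - 2\cos\theta\), so the loss ranges from 0 (aligned) to 4 (antipodal).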

Video-Guided Latent Collocation

We formulate grounding video plans into feasible action sequences as a video-guided direct collocation problem. In direct collocation, both the latent states \(\mathcal{Z} = \{z_0, \ldots, z_T\}\) and the actions \(\mathcal{A} = \{a_0, \ldots, a_{T-1}\}\) are treated as decision variables. Our goal is to solve for a trajectory \((\mathcal{Z}^*, \mathcal{A}^*)\) that minimizes the divergence between the latent states of the optimized trajectory and the video plan \(z^{\mathrm{vid}}\), while satisfying the world-model dynamics \(f_\psi\), leading to the following constrained optimization problem:

\[ \begin{aligned} \min_{\mathcal{Z}, \mathcal{A}} \quad & \lambda_{\text{v}} \sum_{t=1}^{T-1} \mathcal{L}_{\text{vid}}(z_t, z^{\text{vid}}_t) + \lambda_{\text{g}} \mathcal{L}_{\text{goal}}(z_T, z_g) + \lambda_{\text{r}} \sum_{t=0}^{T-1} \| a_t \|^2 \\ \text{s.t.} \quad & z_{t+1} = f_\psi(z_{t-H:t}, a_{t-H:t}), \quad \forall t \in \{0, \ldots, T-1\}, \\ & a_{\min} \preceq a_t \preceq a_{\max}, \quad \forall t \in \{0, \ldots, T-1\}. \end{aligned} \]

To efficiently solve this nonlinear constrained optimization problem, we employ the Augmented Lagrangian Method (ALM). The augmented Lagrangian is defined as:

\[ \mathcal{L}_\rho(\mathcal{Z}, \mathcal{A}, \Lambda) = \tilde{C}(\mathcal{Z}, \mathcal{A}) + \sum_{t=0}^{T-1} \left(\lambda_t^\top \mathcal{L}_{\mathrm{dyn}}^{t} + \frac{\rho}{2} \| \mathcal{L}_{\mathrm{dyn}}^{t} \|^2 \right), \]

where \(\tilde{C}\) denotes the latent cost objective. \(\Lambda = \{\lambda_0, \ldots, \lambda_{T-1}\}\) are the Lagrange multipliers, and \(\rho > 0\) is a scalar penalty parameter. The term \(\mathcal{L}_{\mathrm{dyn}}^{t}\) corresponds to the dynamics constraint violation at timestep \(t\), defined as:

\[ \mathcal{L}_{\mathrm{dyn}}^{t}(\mathcal{Z}, \mathcal{A}; f_\psi) = z_{t+1} - f_\psi(z_{t-H:t}, a_{t-H:t}). \]

Optimization proceeds using a standard primal–dual approach, alternating between gradient-based updates of the primal variables \((\mathcal{Z}, \mathcal{A})\) and the dual variables \(\Lambda\). In particular, we perform \(I_{\mathrm{ALM}}\) inner (primal) iterations per outer ALM step, updating the Lagrange multipliers and increasing the penalty parameter after each outer iteration.
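The primal–dual scheme above can be made concrete on a toy problem. The sketch below is an illustrative stand-in, not the paper's implementation: it uses 1-D linear dynamics \(z_{t+1} = z_t + a_t\) in place of the learned \(f_\psi\), a quadratic tracking cost in place of \(\mathcal{L}_{\mathrm{vid}}\), and hand-derived gradients; all hyperparameter values are assumptions chosen for the example.

```python
import numpy as np

# Toy ALM collocation: 1-D linear "world model" z_{t+1} = z_t + a_t
# stands in for the learned f_psi (illustrative assumption).
T = 5
ref = np.linspace(0.0, 1.0, T + 1)   # video-plan latents to track, z^vid_{0:T}
lam_v, lam_r = 1.0, 1e-3             # video-alignment and action-reg weights

z = ref.copy()                       # primal: latent states (z[0] stays fixed)
a = np.zeros(T)                      # primal: actions
lam = np.zeros(T)                    # dual: one multiplier per dynamics constraint
rho, gamma, rho_max = 1.0, 2.0, 10.0

for outer in range(6):               # outer ALM iterations
    eta = 0.25 / (rho + 1.0)         # shrink the step as the penalty grows
    for inner in range(300):         # inner primal (gradient) iterations
        c = z[1:] - z[:-1] - a       # dynamics residuals L_dyn^t
        w = lam + rho * c            # shared multiplier-plus-penalty term
        gz = 2 * lam_v * (z - ref)   # tracking gradient
        gz[1:] += w                  # c_t depends on z_{t+1} with +1
        gz[:-1] -= w                 # ... and on z_t with -1
        ga = 2 * lam_r * a - w       # action gradient
        z[1:] -= eta * gz[1:]        # primal update (z[0] is the fixed start)
        a -= eta * ga
    c = z[1:] - z[:-1] - a
    lam += rho * c                   # dual update
    rho = min(gamma * rho, rho_max)  # penalty update

print(np.max(np.abs(c)))             # remaining dynamics violation
```

After optimization, the residuals are driven near zero, so the actions reproduce the tracked trajectory under the (toy) dynamics: the grounding effect the method relies on.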

GVP-WM Jointly Optimizes Latent States and Actions under World-Model Dynamics while Preserving Video Alignment


Require: initial state \(s_0 = (o_0, p_0)\); goal observation \(o_g\); video model \(\mathcal{G}\); world model \((E_\phi, f_\psi)\); MPC parameters \((T, K)\); ALM iteration and penalty parameters


1:  Generate a video plan: \(\tau_{\mathrm{vid}} \sim \mathcal{G}(\cdot \mid o_0, o_g, c)\)
2:  Encode the video plan into latent space: \(z^{\mathrm{vid}}_{0:T} \leftarrow E_\phi(\tau_{\mathrm{vid}})\)
3:  Initialize primal variables: \(\mathcal{Z} \leftarrow z^{\mathrm{vid}}_{0:T}\), \(\mathcal{A} \leftarrow \mathbf{0}\)
4:  for \(t = 0\) to \(T-1\) step \(K\) do
5:      Set the initial latent state of the trajectory: \(z_t \leftarrow E_\phi(s_t)\)
6:      Initialize dual variables: \(\lambda \leftarrow \mathbf{0}\), \(\rho \leftarrow \rho_0\)
7:      for \(k = 1\) to \(O_{\mathrm{ALM}}\) do
8:          for \(i = 1\) to \(I_{\mathrm{ALM}}\) do
9:              Video guidance: \(\mathcal{L}_{\mathrm{vid}}(z_{t+1:T-1}, z^{\mathrm{vid}}_{t+1:T-1})\)
10:             Terminal latent goal: \(\mathcal{L}_{\mathrm{goal}}(z_T, z_g)\)
11:             Dynamics constraints: \(\mathcal{L}_{\mathrm{dyn}}^{t}(z_{t-H:t+1}, a_{t-H:t}; f_\psi)\)
12:             Compute the augmented Lagrangian \(\mathcal{L}_{\rho}\)
13:             Primal update: \((\mathcal{Z}, \mathcal{A})\) via a gradient step on \(\mathcal{L}_{\rho}\)
14:         end for
15:         Dual update: \(\lambda \leftarrow \lambda + \rho \, \mathcal{L}_{\mathrm{dyn}}\)
16:         Penalty update: \(\rho \leftarrow \min(\gamma \rho, \rho_{\max})\)
17:     end for
18:     Execute \(a_{t:t+K-1}\) from \(\mathcal{A}\) with sampling refinement
19:     Update the current state: \(s_{t+K} \leftarrow (o_{t+K}, p_{t+K})\)
20: end for
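The receding-horizon structure of the algorithm (re-solve from the current state, execute only the first \(K\) actions, repeat) can be sketched as below. The proportional update inside `ground_plan` is a deliberately trivial stand-in for the full ALM solve, and the 1-D additive dynamics replace the real environment; both are assumptions for illustration only.

```python
import numpy as np

def ground_plan(z0, z_vid):
    """Stand-in for video-guided latent collocation (returns an action sequence).

    A trivial difference rule replaces the ALM solve here, purely to
    illustrate the MPC structure; all names and dynamics are illustrative.
    """
    return z_vid[1:] - np.concatenate(([z0], z_vid[1:-1]))

def step_env(z, a):
    return z + a  # toy 1-D dynamics standing in for the real environment

z_vid = np.linspace(0.0, 1.0, 11)  # encoded video plan, z^vid over 11 steps
z, K, t = 0.0, 2, 0                # current latent state; K actions per replan
while t < 10:
    a_seq = ground_plan(z, z_vid[t:])  # re-solve from the current state
    for a in a_seq[:K]:                # execute only the first K actions
        z = step_env(z, a)
    t += K
print(round(z, 6))
```

Replanning every \(K\) steps keeps the executed actions anchored to the latest observed state, which is what makes the scheme robust when early actions do not land exactly where the plan predicted.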

Grounding I2V-Generated Video Plans

Table 1. Success Rate comparison of methods on PushT and Wall for planning horizons T. WAN-0S: zero-shot; WAN-FT: domain-adapted; ORACLE: expert (upper bound).

Method             PushT                  Wall
                   T=25    T=50    T=80   T=25    T=50
MPC-CEM            0.74    0.28    0.06   0.92    0.74
MPC-GD             0.32    0.04    0.00   0.04    0.04
UniPi (WAN-0S)     0.00    0.00    0.00   0.30    0.14
UniPi (WAN-FT)     0.10    0.00    0.00   0.40    0.18
UniPi (ORACLE)     0.52    0.18    0.08   1.00    1.00
GVP-WM (WAN-0S)    0.56    0.12    0.04   0.86    0.76
GVP-WM (WAN-FT)    0.80    0.30    0.06   0.94    0.90
GVP-WM (ORACLE)    0.98    0.72    0.36   1.00    1.00

Robustness to Motion Blur

Table 3. Success Rate under increasing levels of motion blur. MB-k denotes temporal averaging over k consecutive frames.

Source   Method    PushT                  Wall
                   T=25    T=50    T=80   T=25    T=50
MB-10    UniPi     0.03    0.00    0.00   0.00    0.02
         GVP-WM    0.82    0.46    0.08   0.94    1.00
MB-5     UniPi     0.04    0.00    0.06   0.40    0.52
         GVP-WM    0.94    0.54    0.16   1.00    1.00
MB-3     UniPi     0.16    0.00    0.00   0.52    0.82
         GVP-WM    0.96    0.70    0.34   1.00    1.00
ORACLE   UniPi     0.52    0.18    0.08   1.00    1.00
         GVP-WM    0.98    0.72    0.36   1.00    1.00

BibTeX

@article{ziakas2026grounding,
  title={Grounding Generated Videos in Feasible Plans via World Models},
  author={Ziakas, Christos and Bar, Amir and Russo, Alessandra},
  journal={arXiv},
  year={2026},
  eprint={2602.01960},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}