🌍 LatentWorld: Grounded Latents for Entity-Centric 4D Scene Generation

Jinhyung Park¹ Navyata Sanghvi¹ Erica Weng¹ Shawn Hunt³ Shinya Tanaka³ Hironobu Fujiyoshi² Kris Kitani¹

¹Carnegie Mellon University ²DENSO Corporation ³DENSO International America, Inc.
CVPR 2026

🎬 Generated 4D Driving Scenes

LatentWorld generates temporally coherent 4D semantic-occupancy driving scenes. Because every foreground actor is represented by a single grounded latent and ego motion is applied as an explicit rigid transform over all latents, actors keep their identity through time, which reduces the flickering, merging, and splitting common to dense voxel generators.

🛠️ How LatentWorld Works

We represent a scene as a sparse set of grounded 3D latents, each a point with a position (x, y, z), a semantic class, a BEV yaw, and a feature vector. Exactly one latent is assigned to each foreground actor (vehicle, pedestrian, cyclist, and so on) to preserve identity and enable direct control, while background regions (road, buildings, vegetation) are covered by many latents for fine-grained structure. A VAE encodes semantic voxels into this latent set and decodes each latent into a small set of semantic 3D Gaussians that are splatted back to an occupancy grid.
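As a rough illustration of the representation described above, a grounded latent can be modeled as a small record. The `GroundedLatent` type, its field names, the class names, and the feature dimension are all hypothetical choices made for this sketch; they are not taken from the released code.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical foreground label set; the real one depends on the dataset.
FOREGROUND = {"vehicle", "pedestrian", "cyclist"}


@dataclass
class GroundedLatent:
    x: float
    y: float
    z: float
    semantic_class: str
    yaw: float            # BEV heading in radians
    feature: List[float]  # per-latent feature vector (dimension assumed)

    @property
    def is_foreground(self) -> bool:
        return self.semantic_class in FOREGROUND


def count_foreground(scene: List[GroundedLatent]) -> int:
    """Count foreground latents. With one latent per actor, this is
    also the number of actors in the scene."""
    return sum(1 for latent in scene if latent.is_foreground)


# A toy scene: two actors plus two of the many background latents.
scene = [
    GroundedLatent(4.0, 1.5, 0.0, "vehicle", 0.0, [0.0] * 8),
    GroundedLatent(9.0, -3.0, 0.0, "pedestrian", 1.57, [0.0] * 8),
    GroundedLatent(2.0, 0.0, 0.0, "road", 0.0, [0.0] * 8),
    GroundedLatent(6.0, 0.0, 0.0, "road", 0.0, [0.0] * 8),
]
print(count_foreground(scene))
```

Keeping position, class, and yaw as explicit fields, separate from the feature vector, is what makes the layout directly editable.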

Method overview. (Semantic Voxels to Grounded 3D Latents) a VAE encodes voxels into an editable latent point set and decodes to semantic Gaussians for voxel splatting. (Controllable 3D Scene Generation) a layout diffusion transformer G_L generates positions, classes, and orientations, then a feature diffusion transformer G_F predicts per-latent geometry. (4D Motion Generation) a motion diffusion transformer G_M produces future ego and actor trajectories; moving the latents and unrolling the decoder yields coherent 4D occupancy.

Generation is factorized into three diffusion stages over the latent set:

Layout diffusion (G_L). Generates the editable scaffold: latent positions, semantic classes (encoded as bits so discrete classes share one continuous diffusion schedule), and foreground orientations.
Feature diffusion (G_F). Conditioned on the layout, generates a per-latent feature capturing fine local geometry, so the same layout can yield diverse fine-grained geometry.
Motion diffusion (G_M). Generates future waypoints and headings for the ego vehicle and dynamic actors; ego motion is applied as a rigid transform to all latents, and unbounded rollouts are produced via an outpainting scheme.
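To make the bit encoding of semantic classes in the layout stage concrete, here is a minimal sketch. The text above only states that classes are encoded as bits; the mapping to values in {-1, +1}, the most-significant-bit-first order, and the threshold-at-zero decoding are assumptions of this example, as are the function names.

```python
from typing import List


def class_to_bits(class_id: int, num_bits: int) -> List[float]:
    """Encode an integer class id as num_bits values in {-1.0, +1.0},
    most significant bit first (scaling is an assumption)."""
    if not 0 <= class_id < 2 ** num_bits:
        raise ValueError("class_id does not fit in num_bits")
    return [1.0 if (class_id >> i) & 1 else -1.0
            for i in reversed(range(num_bits))]


def bits_to_class(bits: List[float]) -> int:
    """Decode continuous values by thresholding each one at zero."""
    class_id = 0
    for b in bits:
        class_id = (class_id << 1) | (1 if b > 0 else 0)
    return class_id


clean = class_to_bits(5, num_bits=4)
# A denoised sample is continuous, so decoding must tolerate noise.
noisy = [-0.8, 0.3, -0.1, 0.9]
print(clean, bits_to_class(noisy))
```

Because the bits are real-valued, they can be concatenated with positions and orientations and denoised under the same continuous schedule.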

🎛️ Interpretable, Entity-Level Control

Because the layout is an explicit, interpretable point set, a user can directly edit it (move, insert, remove, or rotate individual actors) before committing to fine geometry. Re-running feature generation then produces high-fidelity geometry that follows the edits, and the same layout can be decoded into diverse realizations.
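The four edit operations mentioned above (move, insert, remove, rotate) are simple when the layout is a point set. The sketch below uses a hypothetical `LayoutPoint` record and helper functions invented for illustration; it covers only the layout edit, and the subsequent feature generation step is not shown.

```python
import math
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class LayoutPoint:
    actor_id: int
    cls: str
    x: float
    y: float
    yaw: float  # radians


def move(layout: List[LayoutPoint], actor_id: int, dx: float, dy: float) -> List[LayoutPoint]:
    """Translate one actor; all other points are returned unchanged."""
    return [replace(p, x=p.x + dx, y=p.y + dy) if p.actor_id == actor_id else p
            for p in layout]


def rotate(layout: List[LayoutPoint], actor_id: int, dyaw: float) -> List[LayoutPoint]:
    """Change one actor's heading, wrapping the result to [-pi, pi)."""
    def wrap(a: float) -> float:
        return (a + math.pi) % (2 * math.pi) - math.pi
    return [replace(p, yaw=wrap(p.yaw + dyaw)) if p.actor_id == actor_id else p
            for p in layout]


def insert(layout: List[LayoutPoint], point: LayoutPoint) -> List[LayoutPoint]:
    return layout + [point]


def remove(layout: List[LayoutPoint], actor_id: int) -> List[LayoutPoint]:
    return [p for p in layout if p.actor_id != actor_id]


layout = [LayoutPoint(0, "vehicle", 5.0, 0.0, 0.0),
          LayoutPoint(1, "pedestrian", 8.0, 3.0, 0.0)]
edited = rotate(move(layout, 0, dx=2.0, dy=-1.0), 0, math.pi / 2)
edited = insert(remove(edited, 1), LayoutPoint(2, "cyclist", 12.0, 1.0, 0.0))
print(edited)
```

Each helper returns a new list, so the original layout stays available for sampling alternative realizations.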

Controllable generation. Left: a generated layout and its decoded semantic grid. Right: we manually arrange the layout to compose a complex traffic scene and sample two sets of latent features. The grounded latents give explicit control while feature generation captures high-fidelity geometric variation.

Heading control. Applying a foreground latent's yaw to its predicted Gaussians (offsets and orientations) makes the user-specified heading directly reflected in the generated geometry, enabling reliable orientation control for downstream motion.
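One way to picture heading control is as a rotation about the vertical axis applied to each Gaussian's offset and orientation. The sketch below assumes the decoder predicts Gaussians in the actor's canonical (yaw = 0) frame and that each Gaussian's orientation can be summarized by a single BEV angle; both are simplifications for illustration, and `apply_yaw` is a hypothetical helper.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def apply_yaw(center: Vec3, yaw: float,
              offsets: List[Vec3],
              local_yaws: List[float]) -> Tuple[List[Vec3], List[float]]:
    """Rotate canonical-frame offsets about z by `yaw`, translate them to
    the latent's position, and add `yaw` to each Gaussian's own heading."""
    c, s = math.cos(yaw), math.sin(yaw)
    cx, cy, cz = center
    means = [(cx + c * ox - s * oy, cy + s * ox + c * oy, cz + oz)
             for ox, oy, oz in offsets]
    headings = [ly + yaw for ly in local_yaws]
    return means, headings


# A Gaussian 2 m ahead of the actor center in the canonical frame.
means, headings = apply_yaw(center=(10.0, 5.0, 0.0), yaw=math.pi / 2,
                            offsets=[(2.0, 0.0, 0.5)], local_yaws=[0.0])
print(means, headings)
```

After a 90 degree left turn, the Gaussian that was in front of the actor along x now lies in front of it along y, so the geometry follows the edited heading.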

🎞️ Factorized Generation: Layout → Geometry → Motion

Factorizing generation across the persistent latent set keeps coarse structure separate from fine geometry and from motion, so each stage stays interpretable and the final 4D scene remains consistent.

Row 1: generated latent layouts with generated actor waypoints, capturing coarse structure and multi-actor placement. Row 2: feature generation and decoding to semantic Gaussians (splatted to voxels) produce diverse, realistic 3D scenes faithful to the layout. Row 3: applying the generated motion to the same latents yields coherent 4D sequences with precise actor movement and stable background.

⚖️ Comparison with DynamicCity

Dense voxel generators encode ego motion only implicitly and bake actor motion into voxels, so they struggle to separate ego and actor motion during turns, causing flicker, merging and splitting, and unstable background. LatentWorld's grounded latents, with one latent per actor and an explicit ego transform, preserve identity and inter-actor separation.
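The explicit ego transform can be sketched in BEV as a single rigid motion applied to every latent. This example assumes latents are stored in the ego frame as (x, y, yaw) and that ego motion is given as a translation and heading change expressed in the previous ego frame; the function name and this convention are assumptions, not the paper's notation.

```python
import math
from typing import List, Tuple

Pose2D = Tuple[float, float, float]  # (x, y, yaw)


def to_new_ego_frame(latents: List[Pose2D], ego_dx: float, ego_dy: float,
                     ego_dyaw: float) -> List[Pose2D]:
    """Re-express latents in the ego frame after the ego vehicle moves by
    (ego_dx, ego_dy) and turns by ego_dyaw: p' = R(-dyaw) (p - t)."""
    c, s = math.cos(ego_dyaw), math.sin(ego_dyaw)
    out = []
    for x, y, yaw in latents:
        rx, ry = x - ego_dx, y - ego_dy
        out.append((c * rx + s * ry, -s * rx + c * ry, yaw - ego_dyaw))
    return out


latents = [(10.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
# Ego drives 5 m straight ahead: everything moves 5 m closer along x.
print(to_new_ego_frame(latents, 5.0, 0.0, 0.0))
# Ego turns 90 degrees left in place: a point on the left is now ahead.
print(to_new_ego_frame(latents, 0.0, 0.0, math.pi / 2))
```

Because the same rigid transform is applied to all latents, pairwise distances between actors and background are unchanged by ego motion, which is the property dense voxel generators have to learn implicitly.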

DynamicCity

LatentWorld (Ours)

Waymo qualitative comparison. DynamicCity's generations exhibit foreground flicker and frequent vehicle splitting, and ego turns yield background inconsistencies across frames. LatentWorld preserves identity and inter-actor separation, giving stable trajectories and consistent background. Zoom in for details.

📊 Quantitative Results

We compare feature distributions of generated vs. real scenes using pretrained 3D autoencoders, reporting MMD (lower is better) under three views: geometry, semantics, and geometry+semantics. Values shown are the primary Avg metric.
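For readers unfamiliar with MMD, the following is a minimal sketch of a squared-MMD estimate between two sets of feature vectors. The RBF kernel, its bandwidth `sigma`, and the biased estimator are assumptions chosen for simplicity; the features here are toy 2D vectors rather than outputs of the pretrained 3D autoencoders used in the evaluation.

```python
import math
from typing import Sequence


def rbf(a: Sequence[float], b: Sequence[float], sigma: float) -> float:
    """Gaussian (RBF) kernel between two feature vectors."""
    d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def mmd2(xs: Sequence[Sequence[float]], ys: Sequence[Sequence[float]],
         sigma: float = 1.0) -> float:
    """Biased estimate of squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    def mean_kernel(ps, qs):
        return sum(rbf(p, q, sigma) for p in ps for q in qs) / (len(ps) * len(qs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


real = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
close = [(0.1, 0.0), (1.0, 0.1), (0.0, 0.9)]
far = [(3.0, 3.0), (4.0, 3.0), (3.0, 4.0)]
print(mmd2(real, close), mmd2(real, far))
```

A generated set whose features are distributed like the real set gives a value near zero, which is why lower is better in the tables that follow.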

3D Scene Generation on CarlaSC

Method	Geometry ↓	Semantics ↓	Geo+Sem ↓
SemCity	10.47	15.90	10.04
PDD	12.36	13.17	13.27
DynamicCity	20.45	12.74	9.98
LatentWorld (Ours)	6.44	11.30	6.69

LatentWorld is best on all three metrics, with the largest gains on foreground classes (e.g., Vehicle, Pedestrian). The full per-class breakdown is in the paper.

4D Scene Generation on Waymo

Method	Geometry ↓	Semantics ↓	Geo+Sem ↓
DynamicCity	3.93	3.32	2.61
LatentWorld (Ours)	1.62	1.50	0.96

LatentWorld improves on DynamicCity across all three metrics, especially on foreground categories that require precise localization of small or fast-moving actors.

4D Occupancy Forecasting on Waymo

Horizon	0 s	1 s	2 s
mIoU ↑	96.8	73.1	60.1

Given 1 s of history, LatentWorld retains 60.1 mIoU at a 2 s horizon (Vehicle 60.7 / Pedestrian 50.6 IoU). Vehicle motion prediction reaches an ADE of 1.03 m and an FDE of 2.32 m, indicating accurate, temporally consistent rollouts rather than merely plausible generations.
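The forecasting metrics above follow standard definitions, sketched here for a single trajectory and a single voxel grid. How the reported numbers are aggregated over actors, frames, and classes is not specified here, so the averaging choices below (including skipping classes absent from both grids) are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def ade_fde(pred: Sequence[Point], gt: Sequence[Point]) -> Tuple[float, float]:
    """Average and final displacement error (same units as the inputs)."""
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("trajectories must be non-empty and equal length")
    errors = [math.hypot(px - gx, py - gy)
              for (px, py), (gx, gy) in zip(pred, gt)]
    return sum(errors) / len(errors), errors[-1]


def mean_iou(pred: Sequence[int], gt: Sequence[int],
             classes: Sequence[int]) -> float:
    """Mean IoU over flattened voxel labels, skipping classes that appear
    in neither grid (a convention assumed for this sketch)."""
    ious: List[float] = []
    for c in classes:
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


ade, fde = ade_fde(pred=[(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)],
                   gt=[(1.0, 0.0), (2.0, 1.0), (3.0, 2.0)])
miou = mean_iou(pred=[1, 1, 2, 2], gt=[1, 2, 2, 2], classes=[1, 2])
print(ade, fde, miou)
```

ADE averages the error over all waypoints while FDE looks only at the last one, so a rollout that drifts late shows up more strongly in FDE.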

🔬 Analysis & Ablations

Per-Actor Geometric Diversity

The per-latent features capture subtle but important local geometry. Generated vehicle dimensions closely match the ground-truth distribution across length, width, and height, spanning both small cars and large trucks, so the model does not collapse to an average vehicle template.

Vehicle sizes in Waymo: ground truth vs. our generations.

Pedestrians and Other Actors

LatentWorld generates and tracks all dynamic foreground classes: vehicles, pedestrians, and cyclists. Because each actor is assigned a single grounded latent, individual pedestrians stay temporally consistent even in crowded scenes, avoiding the merging and splitting seen in dense voxel methods.

A crowded street scene; individual pedestrians (blue) remain consistent across frames.

Unbounded Generation via Outpainting

As the ego vehicle advances, the scene must extend beyond the initial window. We freeze the existing latents and denoise only new latents in the newly exposed forward region, clipping them to the forward half and adding a quadratic "push" (weighted by λ) so that they fill the new area without disturbing already-generated content.

Push-weight ablation. From left: initial generation; forward shift (new region empty); naive outpainting (latents drift back, under-fill); clipping only; λ=2.0 (over-pushed, boundary gap); λ=1.0 (even coverage, stable prior content).

Push weight λ	Geo ↓	Sem ↓	Geo+Sem ↓
0.0	3.98	3.56	2.31
0.5	1.81	1.81	1.07
1.0	1.57	1.86	1.04
1.5	1.63	1.85	1.09
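The outpainting update described in this section can be sketched along the forward axis alone. The exact form of the push is not given here, so everything below is a guess made for illustration: the forward half is taken to be the interval [x_mid, x_max], the push grows quadratically with distance from the far edge, and `step`, `x_mid`, `x_max`, and the function name are hypothetical.

```python
from typing import List


def outpaint_step(xs: List[float], is_new: List[bool],
                  denoised_xs: List[float], x_mid: float, x_max: float,
                  lam: float, step: float = 0.1) -> List[float]:
    """One illustrative update of forward coordinates.

    Frozen latents keep their position. New latents take the denoiser's
    proposal, receive a forward push that is quadratic in their normalized
    distance from the far edge, and are clipped to [x_mid, x_max].
    """
    span = x_max - x_mid
    out = []
    for x, new, x_hat in zip(xs, is_new, denoised_xs):
        if not new:
            out.append(x)  # existing content is never modified
            continue
        depth = min(max((x_max - x_hat) / span, 0.0), 1.0)  # 0 at far edge
        pushed = x_hat + step * lam * depth ** 2 * span
        out.append(min(max(pushed, x_mid), x_max))
    return out


xs = [-5.0, 12.0, 12.0]
is_new = [False, True, True]
proposal = [-4.0, 12.0, 3.0]  # the last proposal drifts back into old content
print(outpaint_step(xs, is_new, proposal, x_mid=10.0, x_max=20.0, lam=1.0))
```

Under this assumed form, λ = 0 reduces to clipping only, while a large λ moves latents away from the seam with existing content, which is at least consistent with the boundary gap noted for the over-pushed setting.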

What the Motion Model Needs

Coherent 4D motion requires two signals: explicit foreground orientation (without it, vehicles drift or rotate with incorrect headings) and scene-wide context (conditioning the motion model only on the foreground, without background structure, lets trajectories leave drivable surfaces).

Conditioning signals for motion (two rollouts per setting). Left: no explicit orientation gives incorrect headings. Right: conditioning only on foreground lets trajectories leave drivable regions.

How Many Latents?

More latents widen the bottleneck and improve reconstruction and geometric fidelity, but very high counts place multiple latents in small neighborhoods and destabilize semantic class assignment. We select 768 latents, which gives the lowest holistic geometry+semantics MMD (6.69).

# Latents	Recon mIoU ↑	Geo ↓	Sem ↓	Geo+Sem ↓
256	85.45	13.32	10.95	7.33
512	92.90	9.90	11.11	6.97
768	93.63	6.44	11.30	6.69
1024	94.71	6.38	12.36	7.30

CarlaSC. mIoU ↑ is reconstruction quality; MMD ↓ is generation quality.