MosaicMem: Hybrid Spatial Memory for Controllable Video World Models

Abstract

Video diffusion models are moving beyond short, plausible clips toward world simulators that must remain consistent under camera motion, revisits, and intervention. Yet spatial memory is still a key bottleneck: explicit 3D structures can improve reprojection-based consistency but struggle to depict moving objects, while implicit memory often produces inaccurate camera motion even when given correct poses. We propose Mosaic Memory (MosaicMem), a hybrid spatial memory that lifts patches into 3D for reliable localization and targeted retrieval, while exploiting the model's native conditioning to preserve prompt-following generation. MosaicMem composes spatially aligned patches in the queried view via a patch-and-compose interface, preserving what should persist while letting the model inpaint what should evolve. Combined with PRoPE camera conditioning and two new memory-alignment methods, MosaicMem shows improved pose adherence over implicit memory and stronger dynamic modeling than explicit baselines. It further enables minute-level navigation, memory-based scene editing, and autoregressive rollout.
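The core geometric operation behind "lifting patches into 3D and composing them in the queried view" can be illustrated with a minimal sketch. This is not the paper's implementation: the pinhole camera model, function names, and all numeric values below are illustrative assumptions.

```python
import numpy as np

def unproject(uv, depth, K):
    """Lift a pixel (u, v) with known depth into a 3D camera-space point
    using a standard pinhole intrinsics matrix K (assumed model)."""
    u, v = uv
    x = (u - K[0, 2]) / K[0, 0] * depth
    y = (v - K[1, 2]) / K[1, 1] * depth
    return np.array([x, y, depth])

def reproject(point_cam_a, T_b_from_a, K):
    """Transform a 3D point from camera A's frame into camera B's frame
    (4x4 rigid transform) and project it back to pixel coordinates."""
    p = T_b_from_a[:3, :3] @ point_cam_a + T_b_from_a[:3, 3]
    uv_h = K @ p
    return uv_h[:2] / uv_h[2]

# Toy example: a patch center at pixel (64, 64) with depth 2.0, seen by a
# camera with focal length 100 and principal point (64, 64).
K = np.array([[100.0,   0.0, 64.0],
              [  0.0, 100.0, 64.0],
              [  0.0,   0.0,  1.0]])
p3d = unproject((64.0, 64.0), 2.0, K)   # point on the optical axis

T = np.eye(4)
T[0, 3] = 0.5                            # query camera offset along x
uv_query = reproject(p3d, T, K)          # where the patch lands in the query view
```

A spatial memory of this kind stores many such lifted patches; at query time, each is reprojected into the requested camera, the in-view patches are composed into a partial image, and the generative model inpaints the remainder.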

Publication
arXiv preprint arXiv:2603.17117

 

Yumeng Li
Applied Scientist

🤗 Dedicated to making generative models better for real-life applications.