A Geometry-Aware Video-Instruction Dataset for Embodied Navigation

Mingfei Han1,3, Liang Ma1, Kamila Zhumakhanova1, Ekaterina Radionova, Jingyi Zhang2, Xiaojun Chang1,4, Xiaodan Liang1,2, Ivan Laptev1
*Equal contribution
1Department of Computer Vision, MBZUAI, 2Shenzhen Campus of Sun Yat-Sen University, 3ReLER Lab, AAII, UTS, 4University of Science and Technology of China

Overview of our RoomTour3D data generation. Spatial and geometry awareness, object variety and open vocabulary are enabled by multiple expert models. Depth estimation, camera poses and texts are integrated to advance open-world embodied navigation.


The Vision-and-Language Navigation (VLN) field suffers from limited diversity in training data, primarily constrained by artificially curated simulators. To address this, we introduce RoomTour3D, a video-instruction dataset derived from web-based room tour videos that capture real-world indoor spaces and human walking demonstrations.

Unlike existing VLN datasets that rely on manual curation and structured navigational cues, RoomTour3D leverages these videos to generate open-ended human walking trajectories and open-world navigation instructions. We develop an automatic pipeline that integrates reconstructed 3D scenes, depth estimation, and open-vocabulary object tags, enabling geometry-aware and open-world capabilities.
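To make the pipeline concrete, here is a minimal sketch of how per-frame outputs from multiple expert models (depth estimation, camera poses from 3D reconstruction, and open-vocabulary object tags) could be merged into a single trajectory annotation. The `Frame` and `summarize_trajectory` names are illustrative assumptions, not the paper's actual implementation; real expert-model outputs are stubbed as plain data.

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    """Hypothetical per-frame record combining expert-model outputs."""
    timestamp: float          # seconds into the walking video
    depth_m: float            # estimated distance to the nearest obstacle ahead
    position: tuple           # (x, y, z) camera position from 3D reconstruction
    objects: list = field(default_factory=list)  # open-vocabulary object tags

def summarize_trajectory(frames):
    """Merge per-frame expert outputs into one trajectory-level annotation."""
    seen = []
    for f in frames:
        for obj in f.objects:
            if obj not in seen:
                seen.append(obj)  # keep first-seen order of passed objects
    # Accumulate Euclidean distance between consecutive camera positions.
    travelled = sum(
        ((b.position[0] - a.position[0]) ** 2
         + (b.position[1] - a.position[1]) ** 2
         + (b.position[2] - a.position[2]) ** 2) ** 0.5
        for a, b in zip(frames, frames[1:])
    )
    return {
        "objects_passed": seen,
        "distance_m": round(travelled, 2),
        "min_clearance_m": min(f.depth_m for f in frames),
    }

frames = [
    Frame(0.0, 2.1, (0.0, 0.0, 0.0), ["sofa", "lamp"]),
    Frame(0.5, 1.4, (1.0, 0.0, 0.0), ["lamp", "doorway"]),
    Frame(1.0, 0.9, (2.0, 0.0, 0.0), ["kitchen counter"]),
]
print(summarize_trajectory(frames))
```

Such a summary (objects passed, distance travelled, minimum clearance) is the kind of geometry-aware signal that an instruction-generation step could turn into a natural-language navigation instruction.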

The dataset includes ~100K open-ended trajectories with ~200K instructions and 2K geometric trajectories drawn from 1,847 room tour environments, along with intermediate products such as 3D scenes and object tags, all released publicly. Our experiments demonstrate RoomTour3D's potential for training robust embodied navigation agents: it improves performance across multiple VLN tasks, including CVDN, SOON, R2R, and REVERIE, with gains exceeding 6%, a 9.8% boost on SOON, and new state-of-the-art results. Moreover, RoomTour3D enables the development of a trainable zero-shot VLN agent, showcasing both the potential and the challenges of advancing towards open-world navigation.