The Vision-and-Language Navigation (VLN) field suffers from limited diversity in training data, primarily constrained by artificially curated simulators. To address this, we introduce RoomTour3D, a video-instruction dataset derived from web-based room tour videos that capture real-world indoor spaces and human walking demonstrations.
Unlike existing VLN datasets that rely on manual curation and structured navigational cues, RoomTour3D leverages these videos to generate open-ended human walking trajectories and open-world navigable instructions. We developed an automatic pipeline that integrates reconstructed 3D scenes, depth, and open-vocabulary object annotations to enhance geometric awareness and open-world capabilities.
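To make the pipeline description concrete, the following is a minimal sketch, not the released implementation, of the per-frame annotation step it implies: each sampled video frame is passed through a monocular depth estimator and an open-vocabulary object tagger, and the outputs are collected alongside the frame index. The function and data names are hypothetical placeholders for off-the-shelf components, and the stubs simply make the sketch self-contained.

```python
# Hypothetical sketch of per-frame annotation for a room tour video:
# depth estimation plus open-vocabulary object tagging. Placeholder
# names only; this is not the authors' released pipeline.

from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class FrameAnnotation:
    frame_index: int
    median_depth: float      # coarse geometry cue from a monocular depth model
    object_tags: List[str]   # open-vocabulary object names visible in the frame


def annotate_video(
    frames: Sequence[object],
    estimate_depth: Callable[[object], float],
    tag_objects: Callable[[object], List[str]],
) -> List[FrameAnnotation]:
    """Run depth estimation and open-vocabulary tagging on each sampled frame."""
    annotations: List[FrameAnnotation] = []
    for i, frame in enumerate(frames):
        annotations.append(
            FrameAnnotation(
                frame_index=i,
                median_depth=estimate_depth(frame),
                object_tags=tag_objects(frame),
            )
        )
    return annotations


if __name__ == "__main__":
    # Stub models so the sketch runs without external dependencies.
    dummy_frames = ["frame_0", "frame_1", "frame_2"]
    results = annotate_video(
        dummy_frames,
        estimate_depth=lambda f: 2.5,                       # placeholder depth (meters)
        tag_objects=lambda f: ["sofa", "doorway", "lamp"],  # placeholder tags
    )
    for ann in results:
        print(ann)
```

In the actual dataset construction, outputs of this kind would feed downstream steps such as 3D scene reconstruction and instruction generation, which this sketch omits.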
The dataset includes ∼100K open-ended trajectories with ∼200K instructions and 2K geometric trajectories from 1,847 room tour environments, along with intermediate products such as 3D scenes and object tags, all of which are released. Our experiments demonstrate RoomTour3D's value for training robust embodied navigation agents: it improves performance across multiple VLN tasks, including CVDN, SOON, R2R, and REVERIE, with gains exceeding 6%, a 9.8% boost on SOON, and new state-of-the-art results. Moreover, RoomTour3D facilitates the development of a trainable zero-shot VLN agent, showcasing both the potential and the challenges of advancing towards open-world navigation.