SysNav: Multi-Level Systematic Cooperation Enables Real-World, Cross-Embodiment Object Navigation

¹Carnegie Mellon University, ²New York University, ³Nanyang Technological University

Contributions

  • We propose SysNav, a three-level object navigation system that decouples semantic reasoning, navigation planning, and motion control. This decoupling lets each level focus on its own strengths and enables cross-embodiment generalization across wheeled robots, quadrupeds, and humanoids.
  • We design a hierarchical navigation strategy that treats rooms as the minimal decision-making units. The VLM performs high-level semantic reasoning over a structured scene representation to make room-level decisions, while efficient classical exploration methods handle in-room navigation. This split leverages both the VLM's semantic strengths and the spatial structure of indoor environments.
  • Beyond standard object navigation, our system supports conditional object navigation (for example, navigation conditioned on object attributes or spatial relations) through its structured scene representation.
  • We conduct extensive evaluations, including 190 real-world experiments across three robot embodiments that achieve a 4-5× improvement in navigation efficiency over existing baselines, and evaluations on four simulation benchmarks (HM3D-v1, HM3D-v2, MP3D, and HM3D-OVON) that achieve state-of-the-art performance. To the best of our knowledge, this is the first system capable of reliably and efficiently completing object navigation at building scale.
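The room-level decision loop described in the contributions can be illustrated with a minimal Python sketch. This is not the SysNav implementation: the `Room` class, `explore_room`, `navigate`, and `keyword_chooser` are hypothetical names, the "VLM" is replaced by a toy keyword lookup, and in-room exploration is reduced to a membership check. It only shows the division of labour, where a semantic chooser picks rooms and a separate routine handles everything inside a room.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Room:
    """One node of a hypothetical structured scene representation."""
    name: str
    objects: List[str] = field(default_factory=list)  # revealed only by exploring
    explored: bool = False


def explore_room(room: Room, target: str) -> bool:
    """Stand-in for classical in-room exploration (assumption: always exhaustive)."""
    room.explored = True
    return target in room.objects


def navigate(rooms: List[Room], target: str,
             choose_room: Callable[[List[Room], str], Room]) -> Optional[str]:
    """Room-level loop: the chooser decides which room to visit next,
    and in-room exploration runs without consulting the chooser."""
    while True:
        candidates = [r for r in rooms if not r.explored]
        if not candidates:
            return None  # every room explored, target not found
        room = choose_room(candidates, target)
        if explore_room(room, target):
            return room.name


def keyword_chooser(candidates: List[Room], target: str) -> Room:
    """Toy replacement for the VLM's semantic prior (hypothetical lookup table)."""
    priors = {"laptop": ["office", "meeting room"], "kettle": ["kitchen"]}
    likely = priors.get(target, [])
    for room in candidates:
        if room.name in likely:
            return room
    return candidates[0]  # no prior: fall back to the first unexplored room


rooms = [Room("corridor"), Room("kitchen", ["kettle"]), Room("meeting room", ["laptop"])]
found_in = navigate(rooms, "laptop", keyword_chooser)
print(found_in, [r.name for r in rooms if r.explored])
```

In this toy run the chooser sends the robot straight to the meeting room, so the corridor and kitchen are never explored. That is the kind of saving a room-level semantic decision is meant to provide.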

Wheeled Robot: Long-range Object Navigation

Wheeled Robot: Object Navigation

23 demos

Wheeled Robot: Object Navigation with Self-attribute Condition

10 demos

Wheeled Robot: Object Navigation with Spatial Condition

6 demos

Quadruped: Object Navigation

7 demos

Quadruped: Object Navigation with Self-attribute Condition

6 demos

Quadruped: Object Navigation with Spatial Condition

5 demos

Humanoid: Object Navigation

7 demos

Humanoid: Object Navigation with Self-attribute Condition

5 demos

Humanoid: Object Navigation with Spatial Condition

2 demos

Wheeled Robot: Semi-known Environment Object Navigation

Key Moments
Frame 1 0:10
Target object is not in the environment during the mapping run.
Frame 2 0:21
Robot completes the mapping run.
Frame 3 0:36
A laptop is put in the meeting room after the mapping run.
Frame 4 0:57
Robot finds the laptop near the desk in the meeting room and completes the task.

Quantitative Results in Real-world

Real-world Quantitative Results

Quantitative Results in Simulation Benchmarks

Simulation Quantitative Results

Supplementary Materials (RA-L Revision)

Additional figures and statistics supporting the revised manuscript.

15 Real-World Test Environments

Floor plans of all 15 evaluated scenes (4 floor-scale + 11 unit-test).


Environment Scale & Path Length

Setting                    | Scenes | Episodes | Area (m²) | GT Path Length (m) | Real Path Length (m)
Simulation Benchmarks      |     47 |    8,195 |     98.38 |               6.78 |                10.72
Real-world (all scenes)    |     15 |       78 |    432.17 |              16.24 |                48.24
Real-world (4 floor-scale) |      4 |        4 |  1,007.21 |              84.43 |               280.22

For simulation, Real Path Length is averaged over 4,840 successful episodes only. For real-world experiments, path-length statistics cover the 78 trials with complete rosbag recordings; environment-scale statistics cover all 15 scenes.

Per-Setting Distribution

Area distribution
GT geodesic distance distribution
Traversed path length distribution

Mean (bar) and interquartile range (error bar, P25–P75) of scene area, GT geodesic distance, and traversed path length. Real-world environments are roughly 4× larger than simulation benchmarks on average; the 4 floor-scale scenes reach ~10×.
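The "roughly 4×" and "~10×" figures can be checked directly against the Area column of the table above. The short Python snippet below does that arithmetic. The dictionary keys are illustrative labels, not names from the paper, and the snippet assumes the Area column reports mean scene area per setting.

```python
# Area values (m²) copied from the Environment Scale & Path Length table.
mean_area_m2 = {
    "simulation": 98.38,
    "real_all_scenes": 432.17,
    "real_floor_scale": 1007.21,
}

# Ratio of each real-world setting to the simulation benchmarks.
ratio_all = mean_area_m2["real_all_scenes"] / mean_area_m2["simulation"]
ratio_floor = mean_area_m2["real_floor_scale"] / mean_area_m2["simulation"]

print(f"all real-world scenes: {ratio_all:.1f}x")   # 4.4x
print(f"floor-scale scenes:    {ratio_floor:.1f}x")  # 10.2x
```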

VLM Runtime Profile

VLM latency histogram

Across 1,000 VLM calls, mean latency is 1.15 s and 95% of queries return within 2.03 s. The aggregate query rate across the three query types (Next-room, Early-stop, Target-verification) averages 6.4 queries/min. To keep VLM inference off the critical path, SysNav prefetches the next-room query as soon as the robot anticipates finishing the current room, so most queries complete while the robot is still in motion and the system rarely pauses on VLM responses.
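The prefetching idea can be sketched with Python's standard `concurrent.futures` module. This is an illustration under stated assumptions rather than SysNav's code: `query_vlm_next_room` and `finish_current_room` are hypothetical stand-ins that simply sleep, and the durations are arbitrary. The point is only the ordering: the query is submitted before the robot finishes the current room, so the result is usually ready by the time it is needed.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def query_vlm_next_room(explored: List[str]) -> str:
    """Hypothetical next-room VLM call; the sleep mimics inference latency."""
    time.sleep(0.05)
    return f"room_{len(explored)}"


def finish_current_room(seconds: float) -> None:
    """Hypothetical stand-in for the robot finishing its in-room exploration."""
    time.sleep(seconds)


def run_with_prefetch(num_rooms: int,
                      remaining_motion_s: float = 0.1) -> Tuple[List[str], List[float]]:
    explored: List[str] = []
    waits: List[float] = []  # time spent blocked on the VLM after motion ends
    current = "room_start"
    with ThreadPoolExecutor(max_workers=1) as pool:
        for _ in range(num_rooms):
            explored.append(current)
            # Submit the query as soon as the end of the room is anticipated.
            pending = pool.submit(query_vlm_next_room, list(explored))
            finish_current_room(remaining_motion_s)  # robot keeps moving meanwhile
            t0 = time.perf_counter()
            current = pending.result()  # returns immediately if already done
            waits.append(time.perf_counter() - t0)
    return explored, waits


explored, waits = run_with_prefetch(3)
print(explored, [f"{w * 1000:.1f} ms" for w in waits])
```

When the remaining motion time exceeds the query latency, as in this toy setup, the recorded waits should be close to zero. As a back-of-the-envelope figure from the numbers above, and assuming queries do not overlap, 6.4 queries/min at a mean latency of 1.15 s amounts to about 7.4 s of VLM inference per minute of operation.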
