Recent advances in vision-language navigation focus on making navigation agents more robust and efficient in complex environments. Notably, models such as WalkGPT and SPAN-Nav integrate spatial awareness and depth reasoning to address the limitations of existing large vision-language models, which often struggle with real-world navigation tasks. These models leverage pixel-grounded segmentation and occupancy predictions to provide more reliable navigation guidance, which is crucial for urban and accessibility applications. Frameworks such as HiMemVLN and HaltNav tackle memory retention and local adaptability, ensuring that agents can respond dynamically to changing environments without exhaustive prompts. The shift toward structured spatial representations, such as floor plans and topological maps, further enhances agents' ability to navigate from minimal instructions. Collectively, these developments signal a move toward more intelligent, context-aware navigation systems that operate effectively in real-world scenarios, with potential applications in robotics, autonomous vehicles, and smart urban planning.
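As a concrete illustration of why topological maps support navigation from minimal instructions, here is a minimal sketch in which planning reduces to graph search. The node names, positions, and the networkx dependency are illustrative assumptions, not details from any of the papers summarized below.

```python
# A minimal sketch of a structured spatial representation for navigation:
# a topological map whose nodes are viewpoints and whose edges are
# traversable connections. All names here are hypothetical.
import networkx as nx  # assumed dependency; any graph library works

def build_topo_map(viewpoints, edges):
    """viewpoints: {id: (x, y)} positions; edges: [(id_a, id_b), ...]."""
    g = nx.Graph()
    for vid, pos in viewpoints.items():
        g.add_node(vid, pos=pos)
    g.add_edges_from(edges)
    return g

def plan(g, start, goal):
    """With a topological map, a goal like 'go to the kitchen' becomes
    graph search instead of step-by-step instruction following."""
    return nx.shortest_path(g, start, goal)

# Usage: a three-room floor plan.
g = build_topo_map({"hall": (0, 0), "kitchen": (5, 0), "bedroom": (0, 4)},
                   [("hall", "kitchen"), ("hall", "bedroom")])
print(plan(g, "bedroom", "kitchen"))  # ['bedroom', 'hall', 'kitchen']
```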
LLM-based agents have demonstrated impressive zero-shot performance in vision-language navigation (VLN) tasks. However, most zero-shot methods primarily rely on closed-source LLMs as navigators, which...
Ensuring accessible pedestrian navigation requires reasoning about both semantic and spatial aspects of complex urban scenes, a challenge that existing Large Vision-Language Models (LVLMs) struggle to...
Recent embodied navigation approaches leveraging Vision-Language Models (VLMs) demonstrate strong generalization in versatile Vision-Language Navigation (VLN). However, reliable path planning in complex...
Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to learn complex reasoning from long-horizon human interactions. While Multi-modal Large Language Models (MLLMs) have driven...
Training-free Vision-Language Navigation (VLN) agents powered by foundation models can follow instructions and explore 3D environments. However, existing approaches rely on greedy frontier selection a...
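To make the critique of greedy frontier selection concrete, below is a minimal sketch of the strategy on a 2D occupancy grid. The grid encoding (0 = free, 1 = occupied, -1 = unknown) and the function names are assumptions for illustration; the nearest-frontier rule exhibits exactly the myopic, lookahead-free behavior such agents inherit.

```python
# A minimal sketch of greedy frontier selection on a 2D occupancy grid.
import numpy as np

def find_frontiers(grid: np.ndarray) -> list[tuple[int, int]]:
    """Return free cells (0) that border at least one unknown cell (-1)."""
    frontiers = []
    rows, cols = grid.shape
    for r in range(rows):
        for c in range(cols):
            if grid[r, c] != 0:
                continue
            window = grid[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2]
            if (window == -1).any():
                frontiers.append((r, c))
    return frontiers

def greedy_frontier(grid, agent):
    """Pick the frontier nearest the agent, with no lookahead and no
    instruction grounding -- the 'greedy' failure mode."""
    frontiers = find_frontiers(grid)
    if not frontiers:
        return None
    return min(frontiers, key=lambda f: (f[0] - agent[0]) ** 2 + (f[1] - agent[1]) ** 2)

# Usage: agent at (0, 0); the nearest frontier wins regardless of the goal.
grid = np.array([[ 0, 0, -1],
                 [ 0, 1,  0],
                 [-1, 0,  0]])
print(greedy_frontier(grid, (0, 0)))  # (0, 1): free cell adjacent to the unknown at (0, 2)
```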
Existing Vision-Language Navigation (VLN) tasks require agents to follow verbose instructions, ignoring potentially useful global spatial priors and limiting their capability to reason about spatial...
Vision-and-Language Navigation (VLN) is shifting from rigid, step-by-step instruction following toward open-vocabulary, goal-oriented autonomy. Achieving this transition without exhaustive routing prompts...
Existing aerial Vision-Language Navigation (VLN) methods predominantly adopt a detection-and-planning pipeline, which converts open-vocabulary detections into discrete textual scene graphs. These approaches...
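The pipeline this abstract describes, flattening open-vocabulary detections into a discrete textual scene graph for a language-model planner, can be sketched as follows. The Detection fields, the distance-based "near" relation, and the 3 m threshold are hypothetical simplifications, not the actual format used by these methods.

```python
# A hedged sketch of a detection-and-planning pipeline's first stage:
# open-vocabulary detections -> discrete textual scene graph.
from dataclasses import dataclass

@dataclass
class Detection:
    label: str                          # open-vocabulary class name
    center: tuple[float, float, float]  # (x, y, z) in meters, world frame

def to_scene_graph_text(detections: list[Detection], near_thresh: float = 3.0) -> str:
    """Emit 'A is near B' edges for object pairs within near_thresh meters."""
    lines = [f"objects: {', '.join(d.label for d in detections)}"]
    for i, a in enumerate(detections):
        for b in detections[i + 1:]:
            dist = sum((p - q) ** 2 for p, q in zip(a.center, b.center)) ** 0.5
            if dist < near_thresh:
                lines.append(f"{a.label} is near {b.label} ({dist:.1f} m)")
    return "\n".join(lines)

# Usage with two hypothetical detections.
dets = [Detection("bench", (0.0, 0.0, 0.0)), Detection("trash can", (1.2, 0.5, 0.0))]
print(to_scene_graph_text(dets))
# objects: bench, trash can
# bench is near trash can (1.3 m)
```

Discretizing continuous geometry into text like this is exactly the step such abstracts tend to identify as lossy: fine-grained spatial detail disappears once the planner only sees the graph.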
Vision-and-Language Navigation (VLN) increasingly relies on large vision-language models, but their inference cost conflicts with real-time deployment. Token caching is a promising training-free strategy...
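A hedged sketch of what training-free token caching can look like for a VLM navigator: when an observation repeats, its visual tokens are reused instead of re-encoded. The fingerprinting scheme, LRU eviction policy, and encode callback are assumptions for illustration, not any specific paper's mechanism.

```python
# A minimal sketch of training-free token caching for visual observations.
import hashlib
from collections import OrderedDict

class TokenCache:
    """LRU cache mapping an observation fingerprint to its visual tokens."""

    def __init__(self, capacity: int = 128):
        self.capacity = capacity
        self._store: OrderedDict[str, list[float]] = OrderedDict()

    def _key(self, image_bytes: bytes) -> str:
        return hashlib.sha256(image_bytes).hexdigest()

    def get_tokens(self, image_bytes: bytes, encode) -> list[float]:
        key = self._key(image_bytes)
        if key in self._store:
            self._store.move_to_end(key)   # cache hit: skip the encoder entirely
            return self._store[key]
        tokens = encode(image_bytes)       # cache miss: pay the full encoding cost once
        self._store[key] = tokens
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least-recently-used entry
        return tokens

# Usage with a stand-in encoder.
cache = TokenCache(capacity=2)
fake_encode = lambda b: [float(len(b))]    # placeholder for a real visual encoder
cache.get_tokens(b"frame-0", fake_encode)  # miss: encodes
cache.get_tokens(b"frame-0", fake_encode)  # hit: reuses cached tokens
```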
Vision-and-Language Navigation (VLN) requires agents to interpret natural language instructions and act coherently in visually rich environments. However, most existing methods rely on reactive state-...