ARXIV:2604.02190 · AI FOR AUTONOMOUS VEHICLES · SUBMITTED 03 APR · 20:30 UTC · FRESHNESS STALE

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving

Yongkang Li · Lijun Zhou · Sixu Yan · Bencheng Liao · Tianyi Yan · Kaixin Xiong · +8 at arXiv

A unified vision-language-action system that enhances autonomous driving by decoupling spatial perception and semantic reasoning.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain A unified vision-language-action system that enhances autonomous driving by decoupling spatial perception and semantic reasoning.

Evidence 0 refs | 0 sources | 67% coverage

Blocker Evidence unverified

PROBLEM

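Vision-Language-Action (VLA) models promise to bring rich world knowledge to driving systems. However, adapting such models for driving tasks currently faces a critical dilemma between spatial perception and semantic reasoning: directly adopting 2D Vision-Language Models yields limited spatial perception, whereas adding 3D spatial representations often impairs the native reasoning capacity of VLMs.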

METHOD

Full abstract

Vision-Language-Action (VLA) models have recently emerged in autonomous driving, with the promise of leveraging rich world knowledge to improve the cognitive capabilities of driving systems. However, adapting such models for driving tasks currently faces a critical dilemma between spatial perception and semantic reasoning. Consequently, existing VLA systems are forced into suboptimal compromises: directly adopting 2D Vision-Language Models yields limited spatial perception, whereas enhancing them with 3D spatial representations often impairs the native reasoning capacity of VLMs. We argue that this dilemma largely stems from the coupled optimization of spatial perception and semantic reasoning within shared model parameters. To overcome this, we propose UniDriveVLA, a Unified Driving Vision-Language-Action model based on Mixture-of-Transformers that addresses the perception-reasoning conflict via expert decoupling. Specifically, it comprises three experts for driving understanding, scene perception, and action planning, which are coordinated through masked joint attention. In addition, we combine a sparse perception paradigm with a three-stage progressive training strategy to improve spatial perception while maintaining semantic reasoning capability. Extensive experiments show that UniDriveVLA achieves state-of-the-art performance in open-loop evaluation on nuScenes and closed-loop evaluation on Bench2Drive. Moreover, it demonstrates strong performance across a broad range of perception, prediction, and understanding tasks, including 3D detection, online mapping, motion forecasting, and driving-oriented VQA, highlighting its broad applicability as a unified model for autonomous driving. Code and model have been released at https://github.com/xiaomi-research/unidrivevla
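The expert coordination described in the abstract can be illustrated with a minimal sketch. The abstract states only that three experts (driving understanding, scene perception, action planning) are coordinated through masked joint attention. The visibility pattern below (understanding tokens see only themselves, perception tokens also see understanding, action tokens see everything), the toy token counts, and the names `VISIBLE`, `build_joint_mask`, and `masked_attention` are all assumptions made for illustration, not the released implementation.

```python
import math

# Hypothetical visibility rules: which expert's tokens each group may attend to.
# This pattern is an assumption for illustration; the paper's actual mask may differ.
VISIBLE = {
    "understanding": {"understanding"},
    "perception": {"understanding", "perception"},
    "action": {"understanding", "perception", "action"},
}


def build_joint_mask(group_sizes):
    """Return (labels, mask); mask[i][j] is True if token i may attend to token j."""
    labels = [name for name, n in group_sizes.items() for _ in range(n)]
    mask = [[k in VISIBLE[q] for k in labels] for q in labels]
    return labels, mask


def masked_attention(scores, mask):
    """Row-wise softmax over raw scores; masked-out positions get zero weight."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        allowed = [s for s, m in zip(row_scores, row_mask) if m]
        peak = max(allowed)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if m else 0.0
                for s, m in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


if __name__ == "__main__":
    sizes = {"understanding": 2, "perception": 2, "action": 1}  # toy token counts
    labels, mask = build_joint_mask(sizes)
    uniform_scores = [[0.0] * len(labels) for _ in labels]
    for label, row in zip(labels, masked_attention(uniform_scores, mask)):
        print(f"{label:13s}", [round(w, 2) for w in row])
```

The sketch covers only the shared attention mask. In a Mixture-of-Transformers design, each token group would typically also pass through its own expert parameters, which is the decoupling the abstract credits with resolving the perception-reasoning conflict.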

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. The abstract reports state-of-the-art performance in open-loop evaluation on nuScenes and closed-loop evaluation on Bench2Drive.

WHY NOW

AI for Autonomous Vehicles moved forward this cycle; last verified April 2026. Public score 7.0/10. Implementation evidence is present through a linked repository (https://github.com/xiaomi-research/unidrivevla).

Opportunity summary

Score 7.0

Pain A unified vision-language-action system that enhances autonomous driving by decoupling spatial perception and semantic reasoning.

Evidence 0 refs | 0 sources | 67% coverage

Blocker no shell-level blocker reported

Analysis summary

A unified vision-language-action system that enhances autonomous driving by decoupling spatial perception and semantic reasoning.

Competitive landscape

A unified vision-language-action system that enhances autonomous driving by decoupling spatial perception and semantic reasoning.

Segment

AI for Autonomous Vehicles

Adoption evidence

Public code linked for build inspection

Commercial read

7.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

{ "contract_version": "paper-r2", "paper_id": "f5d13c44-f4c8-49cb-acbf-fbf3cd095c2c", "arxiv_id": "2604.02190", "canonical_route": "/paper/unidrivevla-unifying-understanding-perception-and-action-planning-for-autonomous-driving", "active_tab": "synced from current hash by the drawer client", "selected_artifact": "unidrivevla-unifying-understanding-perception-and-action-planning-for-autonomous-driving", "endpoints": { "paper_pack": "/api/v1/paper/unidrivevla-unifying-understanding-perception-and-action-planning-for-autonomous-driving/paper-pack", "build_passport": "/api/v1/paper/unidrivevla-unifying-understanding-perception-and-action-planning-for-autonomous-driving/build-passport", "mcp_resource": "sciencetostartup://surfaces/paper-workspace" } }

{ "@context": "https://schema.org", "@graph": [ { "@type": "WebPage", "@id": "https://sciencetostartup.com/paper/unidrivevla-unifying-understanding-perception-and-action-planning-for-autonomous-driving#webpage", "url": "https://sciencetostartup.com/paper/unidrivevla-unifying-understanding-perception-and-action-planning-for-autonomous-driving", "name": "UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving", "description": "A unified vision-language-action system that enhances autonomous driving by decoupling spatial perception and semantic reasoning.", "isPartOf": { "@id": "https://sciencetostartup.com/#website" } }, { "@type": "ScholarlyArticle", "@id": "https://sciencetostartup.com/paper/unidrivevla-unifying-understanding-perception-and-action-planning-for-autonomous-driving#scholarlyArticle", "headline": "UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving", "description": "A unified vision-language-action system that enhances autonomous driving by decoupling spatial perception and semantic reasoning.", "url": "https://sciencetostartup.com/paper/unidrivevla-unifying-understanding-perception-and-action-planning-for-autonomous-driving", "sameAs": "https://arxiv.org/abs/2604.02190", "identifier": { "@type": "PropertyValue", "propertyID": "arXiv", "value": "2604.02190" }, "isAccessibleForFree": true, "isPartOf": { "@id": "https://sciencetostartup.com/#website" }, "datePublished": "2026-04-02T15:48:45.000Z", "author": [ { "@type": "Person", "name": "Yongkang Li", "affiliation": { "@type": "Organization", "name": "Huazhong University of Science and Technology" } }, { "@type": "Person", "name": "Haiyang Sun", "affiliation": { "@type": "Organization", "name": "Xiaomi EV" } }, { "@type": "Person", "name": "Xinggang Wang", "affiliation": { "@type": "Organization", "name": "Huazhong University of Science and Technology" } }, { "@type": "Person", "name": "Lijun Zhou", "affiliation": { "@type": 
"Organization", "name": "Xiaomi EV" } } ], "codeRepository": "https://github.com/xiaomi-research/unidrivevla", "additionalProperty": [ { "@type": "PropertyValue", "propertyID": "viabilityScore", "value": 7 }, { "@type": "PropertyValue", "propertyID": "researchDomain", "value": "AI for Autonomous Vehicles" }, { "@type": "PropertyValue", "propertyID": "commercialReadiness", "value": "code, repo url" } ] }, { "@type": "SoftwareSourceCode", "@id": "https://sciencetostartup.com/paper/unidrivevla-unifying-understanding-perception-and-action-planning-for-autonomous-driving#software", "name": "UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving - Source Code", "description": "A unified vision-language-action system that enhances autonomous driving by decoupling spatial perception and semantic reasoning.", "codeRepository": "https://github.com/xiaomi-research/unidrivevla", "url": "https://github.com/xiaomi-research/unidrivevla" }, { "@type": "BreadcrumbList", "itemListElement": [ { "@type": "ListItem", "position": 1, "name": "Home", "item": "https://sciencetostartup.com" }, { "@type": "ListItem", "position": 2, "name": "AI for Autonomous Vehicles", "item": "https://sciencetostartup.com/topics" }, { "@type": "ListItem", "position": 3, "name": "UniDriveVLA: Unifying Understanding, Perception, and Action ", "item": "https://sciencetostartup.com/paper/unidrivevla-unifying-understanding-perception-and-action-planning-for-autonomous-driving" } ] }, { "@type": "FAQPage", "mainEntity": [ { "@type": "Question", "name": "What is the startup potential of \"UniDriveVLA: Unifying Understanding, Perception, and Action \"?", "acceptedAnswer": { "@type": "Answer", "text": "A unified vision-language-action system that enhances autonomous driving by decoupling spatial perception and semantic reasoning." 
} }, { "@type": "Question", "name": "What products could be built from this research?", "acceptedAnswer": { "@type": "Answer", "text": "This technology could be integrated into self-driving car systems or offered as a middleware for autonomous vehicle manufacturers to enhance spatial awareness and decision-making capabilities." } }, { "@type": "Question", "name": "What are the practical use cases?", "acceptedAnswer": { "@type": "Answer", "text": "Develop a robust autonomous driving system that can operate efficiently in complex urban environments by leveraging the enhanced perception and reasoning capabilities of UniDriveVLA." } }, { "@type": "Question", "name": "What industries could this research disrupt?", "acceptedAnswer": { "@type": "Answer", "text": "This solution could replace existing autonomous systems that struggle with integrating 2D and 3D perception data, offering more reliable and intelligent driving decisions." } } ] } ] }
