ARXIV:2604.19728 · UNIFIED TRAINING FRAMEWORK · SUBMITTED 22 APR · 20:32 UTC · FRESHNESS STALE

Verified Source: PDF linked · Verified PaperPack: citation fields available

VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

Jean Mercat · Sedrick Keh · Kushal Arora · Isabella Huang · Paarth Shah · Haruki Nishimura · +2 at arXiv

Develop a unified framework for streamlined training of vision-language-action models in robotics.

Ship in 2-4 weeks · Score 6.0 · Evidence verified

Opportunity summary

Pain Develop a unified framework for streamlined training of vision-language-action models in robotics.

Evidence 0 refs | 4 sources | 83% coverage

Blocker Evidence verified

PROBLEM

Develop a unified framework for streamlined training of vision-language-action models in robotics. Most open-source VLA efforts specialize on the action training stage, often stitching together incompatible pretraining pipelines.

METHOD

Full abstract

We present VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. Most open-source VLA efforts specialize on the action training stage, often stitching together incompatible pretraining pipelines. VLA Foundry instead provides a shared training stack with end-to-end control, from language pretraining to action-expert fine-tuning. VLA Foundry supports both from-scratch training and pretrained backbones from Hugging Face. To demonstrate the utility of our framework, we train and release two types of models: the first trained fully from scratch through our LLM→VLM→VLA pipeline and the second built on the pretrained Qwen3-VL backbone. We evaluate closed-loop policy performance of both models on LBM Eval, an open-data, open-source simulator. We also contribute usability improvements to the simulator and the STEP analysis tools for easier public use. In the nominal evaluation setting, our fully-open from-scratch model is on par with our prior closed-source work, and substituting in the Qwen3-VL backbone leads to a strong multi-task tabletop manipulation policy outperforming our baseline by a wide margin. The VLA Foundry codebase is available at https://github.com/TRI-ML/vla_foundry and all multi-task model weights are released on https://huggingface.co/collections/TRI-ML/vla-foundry. Additional qualitative videos are available on the project website https://tri-ml.github.io/vla_foundry.
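The two training routes described in the abstract (fully from scratch through LLM→VLM→VLA, or starting from a pretrained backbone) can be pictured as a staged plan in which each stage initializes from the previous stage's checkpoint. The following is a minimal sketch of that idea only; it does not use or reproduce the VLA Foundry API. The names `StageRun` and `plan_pipeline`, the `checkpoints/...` paths, and the `"Qwen3-VL"` string (a placeholder label, not a real model hub ID) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Stage names follow the abstract; everything else here is hypothetical.
STAGES = ["llm", "vlm", "vla"]


@dataclass
class StageRun:
    stage: str                # which stage is trained
    init_from: Optional[str]  # checkpoint or backbone the stage starts from
    output: str               # checkpoint the stage produces


def plan_pipeline(pretrained_backbone: Optional[str] = None,
                  backbone_stage: str = "vlm") -> List[StageRun]:
    """Plan which stages to train and what each one initializes from.

    With no backbone, all three stages run in order, starting from scratch.
    With a backbone, stages up to and including `backbone_stage` are skipped
    and the next stage starts from the backbone.
    """
    if backbone_stage not in STAGES:
        raise ValueError(f"unknown stage: {backbone_stage}")
    if pretrained_backbone is None:
        start, prev = 0, None
    else:
        start, prev = STAGES.index(backbone_stage) + 1, pretrained_backbone
    runs = []
    for stage in STAGES[start:]:
        out = f"checkpoints/{stage}"  # hypothetical output location
        runs.append(StageRun(stage, prev, out))
        prev = out  # the next stage initializes from this checkpoint
    return runs


# Route 1: from scratch. Route 2: pretrained VLM backbone (placeholder label).
scratch_plan = plan_pipeline()
backbone_plan = plan_pipeline(pretrained_backbone="Qwen3-VL")
for run in scratch_plan + backbone_plan:
    print(run)
```

In this sketch the from-scratch route yields three chained stages, while the backbone route yields a single action-training stage that starts from the backbone. How the real framework configures and chains stages should be checked against the linked repository.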

RESULT

ScienceToStartup currently rates this 6.0/10 on the public viability pass. VLA Foundry supports both from-scratch training and pretrained backbones from Hugging Face. A public repository is linked, so build verification can inspect implementation evidence…

WHY NOW

Unified Training Framework moved forward this cycle; last verified April 2026. Public score 6.0/10. Implementation evidence is present through a linked repository.

Opportunity summary

Score 6.0

Pain Develop a unified framework for streamlined training of vision-language-action models in robotics.

Evidence 0 refs | 4 sources | 83% coverage

Blocker no shell-level blocker reported

Competitive landscape

Segment: Unified Training Framework

Adoption evidence: Public code linked for build inspection

Commercial read: 6.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

Published: 2026-04-21 17:51:51 UTC (arXiv:2604.19728, https://arxiv.org/abs/2604.19728)

Authors: Jean Mercat · Sedrick Keh · Kushal Arora · Isabella Huang · Paarth Shah · Haruki Nishimura · Shun Iwase · Katherine Liu

Code repository: https://github.com/TRI-ML/vla_foundry

Commercial readiness: code, repo url

FAQ

What is the startup potential of "VLA Foundry: A Unified Framework for Training Vision-Language-Action Models"?

Develop a unified framework for streamlined training of vision-language-action models in robotics.

What products could be built from this research?

VLA Foundry can be productized as a comprehensive framework for robotics companies to develop and optimize integrated VLA models easily without needing disparate systems.

What are the practical use cases?

A robotics company could use the framework to rapidly prototype and improve integrated perception and action models for autonomous robots.

What industries could this research disrupt?

Replaces existing fragmented training processes in robotics that rely on separate handling of vision, language, and action data, simplifying research and development pipelines.
