ARXIV:2603.10422 · AGENTS · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

Source: PDF linked (verified)
PaperPack: 3 of 4 citation fields filled (partial)
Missing fields: authors
Proof: unverified proof status (partial)

World2Act: Latent Action Post-Training via Skill-Compositional World Models

arXiv

World2Act enhances Vision-Language-Action policies by aligning actions with video-dynamics latents for improved robustness and generalization.

Blocked on: Code · Score: 8.0 · Evidence: unverified

Opportunity summary

Pain: World2Act enhances Vision-Language-Action policies by aligning actions with video-dynamics latents for improved robustness and generalization.

Evidence: 0 refs | 0 sources | 17% coverage

Blocker: Evidence unverified

PROBLEM

Most WM-based post-training methods rely on pixel-space supervision, making policies sensitive to pixel-level artifacts and hallucination from imperfect…

METHOD

Full abstract

World Models (WMs) have emerged as a promising approach for post-training Vision-Language-Action (VLA) policies to improve robustness and generalization under environmental changes. However, most WM-based post-training methods rely on pixel-space supervision, making policies sensitive to pixel-level artifacts and hallucination from imperfect WM rollouts. We introduce World2Act, a post-training framework that aligns VLA actions directly with WM video-dynamics latents using a contrastive matching objective, reducing dependence on pixels. Post-training performance is tied to rollout quality, yet current WMs struggle with arbitrary-length video generation as they are mostly trained on fixed-length clips while robotic execution durations vary widely. To address this, we propose an automatic LLM-based skill-decomposition pipeline that segments high-level instructions into low-level prompts. Our pipeline produces RoboCasa-Skill and LIBERO-Skill, supporting skill-compositional WMs that remain temporally consistent across diverse task horizons. Empirically, applying World2Act to VLAs like GR00T-N1.6 and Cosmos Policy achieves state-of-the-art results on RoboCasa and LIBERO, and improves real-world performance by 6.7%, enhancing embodied agent generalization.
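The abstract names a contrastive matching objective between VLA actions and WM video-dynamics latents but does not give its exact form. As a minimal sketch only, the block below assumes a symmetric InfoNCE-style loss over cosine similarities, with both sides already projected into a shared embedding space. The function names, the `temperature` value, and the pairing convention (row i of the actions matches row i of the latents) are illustrative assumptions, not the paper's implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_logits(action_embs, latent_embs, temperature=0.1):
    """Pairwise cosine similarities between actions and latents, divided by temperature."""
    actions = [l2_normalize(v) for v in action_embs]
    latents = [l2_normalize(v) for v in latent_embs]
    return [
        [sum(a * z for a, z in zip(act, lat)) / temperature for lat in latents]
        for act in actions
    ]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_matching_loss(action_embs, latent_embs, temperature=0.1):
    """Symmetric InfoNCE-style loss (assumed form): action-to-latent plus latent-to-action."""
    logits = cosine_logits(action_embs, latent_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


# Toy check: matched pairs should score a much lower loss than swapped pairs.
actions = [[1.0, 0.0], [0.0, 1.0]]
matched_latents = [[1.0, 0.0], [0.0, 1.0]]
swapped_latents = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = contrastive_matching_loss(actions, matched_latents)
loss_swapped = contrastive_matching_loss(actions, swapped_latents)
```

In this toy setup the loss is near zero when each action embedding sits next to its own latent and large when the pairs are swapped. That is the sense in which such an objective would supervise the policy in latent space instead of pixel space.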

RESULT

ScienceToStartup currently rates this 8.0/10 on the public viability pass. Per the abstract, applying World2Act to VLAs like GR00T-N1.6 and Cosmos Policy achieves state-of-the-art results on RoboCasa and LIBERO and improves real-world performance by 6.7%.

WHY NOW

Agents moved forward this cycle; last verified April 2026. Public score 8.0/10.

Competitive landscape

Segment: Agents
Adoption evidence: No public code link in the paper record yet
Commercial read: 8.0/10 public viability
Direct: not classified
Adjacent: not classified
Substitute: not classified
Unknown: not classified

Published: 2026-03-11 · arXiv: https://arxiv.org/abs/2603.10422
