ARXIV:2604.02289 · 3D GENERATION · SUBMITTED 03 APR · 20:50 UTC · FRESHNESS STALE

Verified · Source: PDF linked | Verified · PaperPack: citation fields available | Partial · Proof: unverified proof status

Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation

Chongjie Ye · Cheng Cao · Chuanyu Pan · Yiming Hao · Yihao Zhi · Yuanming Hu · Xiaoguang Han · arXiv

Omni123 is a 3D-native foundation model that unifies text-to-2D and text-to-3D generation, leveraging 2D data as a prior to overcome 3D data scarcity.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain Omni123 is a 3D-native foundation model that unifies text-to-2D and text-to-3D generation, leveraging 2D data as a prior to overcome 3D data scarcity.

Evidence 0 refs | 0 sources | 33% coverage

Blocker Evidence unverified

PROBLEM

Omni123 is a 3D-native foundation model that unifies text-to-2D and text-to-3D generation, leveraging 2D data as a prior to overcome 3D data scarcity. Compared to abundant 2D imagery, high-quality 3D assets are scarce, making…

METHOD

Full abstract

Recent multimodal large language models have achieved strong performance in unified text and image understanding and generation, yet extending such native capability to 3D remains challenging due to limited data. Compared to abundant 2D imagery, high-quality 3D assets are scarce, making 3D synthesis under-constrained. Existing methods often rely on indirect pipelines that edit in 2D and lift results into 3D via optimization, sacrificing geometric consistency. We present Omni123, a 3D-native foundation model that unifies text-to-2D and text-to-3D generation within a single autoregressive framework. Our key insight is that cross-modal consistency between images and 3D can serve as an implicit structural constraint. By representing text, images, and 3D as discrete tokens in a shared sequence space, the model leverages abundant 2D data as a geometric prior to improve 3D representations. We introduce an interleaved X-to-X training paradigm that coordinates diverse cross-modal tasks over heterogeneous paired datasets without requiring fully aligned text-image-3D triplets. By traversing semantic-visual-geometric cycles (e.g., text to image to 3D to image) within autoregressive sequences, the model jointly enforces semantic alignment, appearance fidelity, and multi-view geometric consistency. Experiments show that Omni123 significantly improves text-guided 3D generation and editing, demonstrating a scalable path toward multimodal 3D world models.
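
The shared sequence space described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the vocabulary sizes, the boundary tokens, the `build_sequence` helper, and the choice to supervise every segment after the first are all assumptions made for illustration, since the abstract does not specify the tokenizers or the training loss. The sketch only shows how text, image, and 3D tokens could occupy disjoint ID ranges in one vocabulary, so that pairwise data (text-image, image-3D) and longer cycles (text to image to 3D to image) share a single autoregressive format.

```python
from typing import List, Tuple

# Hypothetical per-modality codebook sizes (not taken from the paper).
VOCAB = {"text": 1000, "image": 512, "3d": 256}
MODALITIES = list(VOCAB)

# Hypothetical boundary tokens: one begin and one end marker per modality.
SPECIAL = {}
for i, m in enumerate(MODALITIES):
    SPECIAL[f"<{m}>"] = 2 * i
    SPECIAL[f"</{m}>"] = 2 * i + 1

# Give each modality a disjoint ID range after the special tokens.
OFFSET = {}
_next_id = len(SPECIAL)
for m in MODALITIES:
    OFFSET[m] = _next_id
    _next_id += VOCAB[m]
TOTAL_VOCAB = _next_id

Segment = Tuple[str, List[int]]  # (modality name, modality-local token ids)


def build_sequence(segments: List[Segment]) -> Tuple[List[int], List[bool]]:
    """Flatten modality segments into one token sequence.

    Returns shared-vocabulary ids and a mask marking supervised positions.
    Assumption: the first segment is the condition and is not supervised;
    every later segment (including its boundary tokens) is a target.
    """
    ids: List[int] = []
    mask: List[bool] = []
    for k, (modality, local_ids) in enumerate(segments):
        if any(not 0 <= t < VOCAB[modality] for t in local_ids):
            raise ValueError(f"token id out of range for {modality}")
        chunk = (
            [SPECIAL[f"<{modality}>"]]
            + [OFFSET[modality] + t for t in local_ids]
            + [SPECIAL[f"</{modality}>"]]
        )
        ids.extend(chunk)
        mask.extend([k > 0] * len(chunk))
    return ids, mask


# A text-image pair and a text -> image -> 3D -> image cycle use the same format.
pair_ids, pair_mask = build_sequence([("text", [5, 7]), ("image", [1])])
cycle_ids, cycle_mask = build_sequence(
    [("text", [5, 7]), ("image", [1]), ("3d", [0, 3]), ("image", [2])]
)
```

Because each example is just a list of (modality, tokens) segments, datasets that contain only two of the three modalities can be mixed in one training stream without fully aligned text-image-3D triplets, which is the property the abstract attributes to the interleaved X-to-X paradigm.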

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Existing methods often rely on indirect pipelines that edit in 2D and lift results into 3D via optimization, sacrificing geometric consistency. Code availability is…

WHY NOW

3D Generation moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.

Opportunity summary

Score 7.0

Pain Omni123 is a 3D-native foundation model that unifies text-to-2D and text-to-3D generation, leveraging 2D data as a prior to overcome 3D data scarcity.

Evidence 0 refs | 0 sources | 33% coverage

Blocker no shell-level blocker reported

Competitive landscape

Segment: 3D Generation

Adoption evidence: No public code link in the paper record yet

Commercial read: 7.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

Record: https://sciencetostartup.com/paper/omni123-exploring-3d-native-foundation-models-with-limited-3d-data-by-unifying-text-to-2d-and-3d-generation · arXiv: https://arxiv.org/abs/2604.02289 · Published: 2026-04-02T17:29:38.000Z · Accessible for free: yes

Authors: Chongjie Ye, Cheng Cao, Chuanyu Pan, Yiming Hao, Yihao Zhi, Yuanming Hu, Xiaoguang Han

Research domain: 3D Generation · Viability score: 7 · Commercial readiness: code
