ARXIV:2603.21986 · AUDIO-VIDEO AI · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model

SII-GAIR · Sand.ai · Ethan Chern · Hansi Teng · Hanwen Sun · +40 at arXiv

A streamlined architecture that speeds up audio-video generative models with state-of-the-art performance.

Blocked on Code · Score 6.0 · Evidence unverified

Opportunity summary

Pain A streamlined architecture that speeds up audio-video generative models with state-of-the-art performance.

Evidence 0 refs | 0 sources | 17% coverage

Blocker Evidence unverified

PROBLEM

A streamlined architecture that speeds up audio-video generative models with state-of-the-art performance. daVinci-MagiHuman jointly generates synchronized video and audio using a single-stream Transformer that processes text, video, and audio within a unified token sequence…

METHOD

Full abstract

We present daVinci-MagiHuman, an open-source audio-video generative foundation model for human-centric generation. daVinci-MagiHuman jointly generates synchronized video and audio using a single-stream Transformer that processes text, video, and audio within a unified token sequence via self-attention only. This single-stream design avoids the complexity of multi-stream or cross-attention architectures while remaining easy to optimize with standard training and inference infrastructure. The model is particularly strong in human-centric scenarios, producing expressive facial performance, natural speech-expression coordination, realistic body motion, and precise audio-video synchronization. It supports multilingual spoken generation across Chinese (Mandarin and Cantonese), English, Japanese, Korean, German, and French. For efficient inference, we combine the single-stream backbone with model distillation, latent-space super-resolution, and a Turbo VAE decoder, enabling generation of a 5-second 256p video in 2 seconds on a single H100 GPU. In automatic evaluation, daVinci-MagiHuman achieves the highest visual quality and text alignment among leading open models, along with the lowest word error rate (14.60%) for speech intelligibility. In pairwise human evaluation, it achieves win rates of 80.0% against Ovi 1.1 and 60.9% against LTX 2.3 over 2000 comparisons. We open-source the complete model stack, including the base model, the distilled model, the super-resolution model, and the inference codebase.
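To make the single-stream idea concrete, here is a minimal sketch in plain Python; it is not the released daVinci-MagiHuman code. It concatenates text, video, and audio tokens into one sequence and mixes them with a single self-attention pass, so every token can attend to every modality without a separate cross-attention module. The function names (`self_attention`, `single_stream_step`), the 4-dimensional toy embeddings, and the identity query/key/value projections are all assumptions for illustration; a real model would use learned projections, multiple heads, and many layers.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections), purely to keep the sketch short.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        mixed = [sum(w * value[i] for w, value in zip(weights, tokens)) for i in range(dim)]
        outputs.append(mixed)
    return outputs


def single_stream_step(
    text: List[Vector], video: List[Vector], audio: List[Vector]
) -> Tuple[List[Vector], List[Vector], List[Vector]]:
    """Run one attention pass over the unified [text | video | audio] sequence."""
    sequence = text + video + audio          # one token stream for all modalities
    mixed = self_attention(sequence)         # self-attention only, no cross-attention
    n_text, n_video = len(text), len(video)
    return (
        mixed[:n_text],
        mixed[n_text:n_text + n_video],
        mixed[n_text + n_video:],
    )


# Hypothetical toy inputs: one text token, two video tokens, one audio token.
text = [[1.0, 0.0, 0.0, 0.0]]
video = [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
audio = [[0.0, 0.0, 0.0, 1.0]]

text_out, video_out, audio_out = single_stream_step(text, video, audio)
```

Because all modalities share one sequence, the only modality-specific bookkeeping in this sketch is the slicing at the end; the audio token's output already contains contributions from the text and video tokens.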

RESULT

ScienceToStartup currently rates this paper 6.0/10 on the public viability pass. The model supports multilingual spoken generation across Chinese (Mandarin and Cantonese), English, Japanese, Korean, German, and French.

WHY NOW

Audio-Video AI moved forward this cycle; last verified April 2026. Public score 6.0/10.

Opportunity summary

Score 6.0

Pain A streamlined architecture that speeds up audio-video generative models with state-of-the-art performance.

Evidence 0 refs | 0 sources | 17% coverage

Blocker no shell-level blocker reported

Competitive landscape

A streamlined architecture that speeds up audio-video generative models with state-of-the-art performance.

Segment

Audio-Video AI

Adoption evidence

No public code link in the paper record yet

Commercial read

6.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

Paper record

arXiv https://arxiv.org/abs/2603.21986

Published 2026-03-23 13:49:06 UTC

Access Free

Authors listed Ethan Chern (SII-GAIR) · Hansi Teng (Sand.ai) · Hao Wang (SII-GAIR)

FAQ

What is the startup potential of "Speed by Simplicity: A Single-Stream Architecture for Fast A"?

A streamlined architecture that speeds up audio-video generative models with state-of-the-art performance.

What products could be built from this research?

Develop an API or SaaS platform for media companies to use in their audio-video editing tools, allowing for faster and more efficient generative effects application.

What are the practical use cases?

A faster video editing tool that applies high-quality audio effects and video transitions in real-time.

What industries could this research disrupt?

This could replace current high-latency generative models used in professional video editing software.
