ARXIV:2604.04348 · AUDIO GENERATION · SUBMITTED 07 APR · 20:15 UTC · FRESHNESS UNKNOWN

Verified Source: PDF linked | Verified PaperPack: citation fields available | Partial Proof: unverified proof status

OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text

Weiguo Pian · Saksham Singh Kushwaha · Zhimin Chen · Shijian Deng · Kai Wang · Yunhui Guo · Yapeng Tian

OmniSonic generates comprehensive audio scenes from video and text, capturing both on-screen and off-screen sounds.

Ship in 2-4 weeks · Score 8.0 · Evidence unverified

Opportunity summary

Pain OmniSonic generates comprehensive audio scenes from video and text, capturing both on-screen and off-screen sounds.

Evidence 0 refs | 0 sources | 0% coverage

Blocker Evidence unverified

PROBLEM

OmniSonic generates comprehensive audio scenes from video and text, capturing both on-screen and off-screen sounds. Prior video-conditioned audio generation models typically focus on producing on-screen environmental sounds that correspond to visible sounding events, neglecting…

METHOD

Full abstract

In this paper, we propose Universal Holistic Audio Generation (UniHAGen), a task for synthesizing comprehensive auditory scenes that include both on-screen and off-screen sounds across diverse domains (e.g., ambient events, musical instruments, and human speech). Prior video-conditioned audio generation models typically focus on producing on-screen environmental sounds that correspond to visible sounding events, neglecting off-screen auditory events. Recent holistic joint text-video-to-audio generation models aim to produce auditory scenes with both on- and off-screen sounds, but they are limited to non-speech sounds and cannot generate or integrate human speech. To overcome these limitations, we introduce OmniSonic, a flow-matching-based diffusion framework jointly conditioned on video and text. It features a TriAttn-DiT architecture that performs three cross-attention operations to process on-screen environmental sound, off-screen environmental sound, and speech conditions simultaneously, with a Mixture-of-Experts (MoE) gating mechanism that adaptively balances their contributions during generation. Furthermore, we construct UniHAGen-Bench, a new benchmark with over one thousand samples covering three representative on/off-screen speech-environment scenarios. Extensive experiments show that OmniSonic consistently outperforms state-of-the-art approaches on both objective metrics and human evaluations, establishing a strong baseline for universal and holistic audio generation. Project page: https://weiguopian.github.io/OmniSonic_webpage/
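
The three-branch attention with MoE gating described in the abstract can be illustrated with a toy sketch. This is not the authors' TriAttn-DiT: it assumes a single latent query vector, single-head attention without learned projections, and a plain linear gate followed by a softmax. The names `cross_attention`, `tri_attention_moe`, and `gate_weights` are hypothetical and chosen only for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention(query, tokens):
    """Scaled dot-product attention of one query over condition tokens.

    Assumption: the raw tokens serve as both keys and values
    (no learned projections), which keeps the sketch short.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, t) / scale for t in tokens])
    return [sum(w * t[i] for w, t in zip(weights, tokens))
            for i in range(len(query))]


def tri_attention_moe(query, on_screen, off_screen, speech, gate_weights):
    """Run three cross-attention branches and mix them with a softmax gate.

    gate_weights holds one (hypothetical) weight vector per branch; the
    gate logits are linear in the query, so the mix adapts to the input.
    """
    branches = [cross_attention(query, cond)
                for cond in (on_screen, off_screen, speech)]
    gates = softmax([dot(w, query) for w in gate_weights])
    mixed = [sum(g * b[i] for g, b in zip(gates, branches))
             for i in range(len(query))]
    return mixed, gates


# Toy inputs: 2-dimensional tokens for each condition stream.
query = [1.0, 0.0]
on_screen = [[1.0, 0.0], [0.0, 1.0]]
off_screen = [[0.5, 0.5]]
speech = [[0.0, 2.0], [2.0, 0.0]]
gate_weights = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # zero gate: uniform mix

mixed, gates = tri_attention_moe(query, on_screen, off_screen, speech,
                                 gate_weights)
```

With all-zero gate weights the three branches contribute equally; in a trained model the gate would be learned so that, for example, the speech branch dominates when the text condition contains a transcript.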

RESULT

ScienceToStartup currently rates this 8.0/10 on the public viability pass. Extensive experiments show that OmniSonic consistently outperforms state-of-the-art approaches on both objective metrics and human evaluations, establishing a strong baseline for universal and holistic…

WHY NOW

Audio Generation moved forward this cycle; last verified April 2026. Public score 8.0/10. Production flags indicate code availability.

Competitive landscape

Segment: Audio Generation

Adoption evidence: No public code link in the paper record yet

Commercial read: 8.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

Paper record

arXiv: https://arxiv.org/abs/2604.04348
Page: https://sciencetostartup.com/paper/omnisonic-towards-universal-and-holistic-audio-generation-from-video-and-text
Published: 2026-04-06T01:43:00.000Z
Free to access: yes
Research domain: Audio Generation
Viability score: 8
Commercial readiness: code

Authors and affiliations

Weiguo Pian (The University of Texas at Dallas)
Saksham Singh Kushwaha (The University of Texas at Dallas)
Zhimin Chen (Clemson University)
Shijian Deng (The University of Texas at Dallas)
Kai Wang (University of Toronto)
Yunhui Guo (The University of Texas at Dallas)
Yapeng Tian (The University of Texas at Dallas)

FAQ

What is the startup potential of "OmniSonic: Towards Universal and Holistic Audio Generation f"?

OmniSonic generates comprehensive audio scenes from video and text, capturing both on-screen and off-screen sounds.

What products could be built from this research?

Transform the technology into a plugin for video editing software or streaming platforms, allowing creators to enhance their content with realistic audio automatically.

What are the practical use cases?

Commercial application as a feature in video editing software to automatically add realistic and contextually correct audio to silent videos.

What industries could this research disrupt?

This approach could potentially replace or complement existing Foley sound techniques traditionally used in film and video production for creating background and environmental audio.
