ARXIV:2604.04229 · AUDIO-VISUAL REPRESENTATION LEARNING · SUBMITTED 07 APR · 20:12 UTC · FRESHNESS UNKNOWN

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

Hierarchical Semantic Correlation-Aware Masked Autoencoder for Unsupervised Audio-Visual Representation Learning

Donghuo Zeng · Hao Niu · Masato Taya · arXiv

HSC-MAE learns aligned audio-visual embeddings from label-free data by enforcing semantic consistency across global, local, and sample levels.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain HSC-MAE learns aligned audio-visual embeddings from label-free data by enforcing semantic consistency across global, local, and sample levels.

Evidence 0 refs | 0 sources | 0% coverage

Blocker Evidence unverified

PROBLEM

HSC-MAE learns aligned audio-visual embeddings from label-free data by enforcing semantic consistency across global, local, and sample levels. We propose HSC-MAE (Hierarchical Semantic Correlation-Aware Masked Autoencoder), a dual-path teacher-student framework that enforces semantic consistency…

METHOD

Full abstract

Learning aligned multimodal embeddings from weakly paired, label-free corpora is challenging: pipelines often provide only pre-extracted features, clips contain multiple events, and co-occurrences can be spurious. We propose HSC-MAE (Hierarchical Semantic Correlation-Aware Masked Autoencoder), a dual-path teacher-student framework that enforces semantic consistency across three complementary levels of representation, from coarse to fine: (i) global-level canonical-geometry correlation via DCCA, which aligns audio and visual embeddings within a shared modality-invariant subspace; (ii) local-level neighborhood-semantics correlation via teacher-mined soft top-k affinities, which preserves multi-positive relational structure among semantically similar instances; and (iii) sample-level conditional-sufficiency correlation via masked autoencoding, which ensures individual embeddings retain discriminative semantic content under partial observation. Concretely, a student MAE path is trained with masked feature reconstruction and affinity-weighted soft top-k InfoNCE; an EMA teacher operating on unmasked inputs via the CCA path supplies stable canonical geometry and soft positives. Learnable multi-task weights reconcile competing objectives, and an optional distillation loss transfers teacher geometry into the student. Experiments on AVE and VEGAS demonstrate substantial mAP improvements over strong unsupervised baselines, validating that HSC-MAE yields robust and well-structured audio-visual representations.
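The local-level objective and the EMA teacher described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the use of cosine similarity, the temperatures `tau_t` and `tau`, the value of `k`, and the momentum are all assumptions chosen for illustration, and the teacher here is just a list of floats rather than a network. The sketch shows one plausible reading of "affinity-weighted soft top-k InfoNCE": the teacher turns each anchor's top-k cross-modal similarities into soft positive weights, and the student is trained with a cross-entropy against those weights.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (epsilon avoids division by zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)

def soft_topk_affinities(teacher_a, teacher_v, k=2, tau_t=0.1):
    """For each audio anchor, softmax over its k most similar visual items; all others get 0."""
    weights = []
    for a in teacher_a:
        sims = [cosine(a, v) for v in teacher_v]
        top = sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)[:k]
        m = max(sims[j] for j in top)  # subtract the max for numerical stability
        exps = {j: math.exp((sims[j] - m) / tau_t) for j in top}
        z = sum(exps.values())
        weights.append([exps.get(j, 0.0) / z for j in range(len(sims))])
    return weights

def soft_infonce(student_a, student_v, weights, tau=0.07):
    """Cross-entropy between the soft positive weights and the student's softmax over candidates."""
    total = 0.0
    for a, w_row in zip(student_a, weights):
        logits = [cosine(a, v) / tau for v in student_v]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total -= sum(w * (l - log_z) for w, l in zip(w_row, logits))
    return total / len(student_a)

def ema_update(teacher_params, student_params, momentum=0.99):
    """Exponential moving average: the teacher drifts slowly toward the student."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

# Toy data (hypothetical): three paired audio/visual embeddings, two of them near-duplicates.
audio = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
video = [[1.0, 0.1], [0.8, 0.0], [0.1, 1.0]]
weights = soft_topk_affinities(audio, video, k=2)
loss = soft_infonce(audio, video, weights)
print(weights, loss)
```

With `k=1` the soft targets collapse to a single positive per anchor and the loss reduces to the standard one-positive InfoNCE form, which is one way to see what the multi-positive weighting adds.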

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Experiments on AVE and VEGAS demonstrate substantial mAP improvements over strong unsupervised baselines, validating that HSC-MAE yields robust and well-structured audio-visual representations. Code availability…
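For readers unfamiliar with the metric, the following sketch shows how mAP is commonly computed for cross-modal retrieval. It is an assumption that relevance is defined by a shared class label between query and gallery items and that ranking uses cosine similarity; this page does not state the paper's exact evaluation protocol, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (epsilon avoids division by zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)

def average_precision(ranked_relevance):
    """AP for one query: mean of precision@rank taken at each rank that holds a relevant item."""
    hits, precisions = 0, []
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def retrieval_map(query_emb, query_labels, gallery_emb, gallery_labels):
    """Mean AP over queries; a gallery item counts as relevant if it shares the query's label (assumed)."""
    aps = []
    for q, q_label in zip(query_emb, query_labels):
        scores = [cosine(q, g) for g in gallery_emb]
        order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        aps.append(average_precision([gallery_labels[j] == q_label for j in order]))
    return sum(aps) / len(aps)

# Toy audio-to-visual retrieval (hypothetical embeddings and labels).
audio_queries = [[1.0, 0.0], [0.0, 1.0]]
visual_gallery = [[0.9, 0.1], [0.1, 0.9]]
print(retrieval_map(audio_queries, [0, 1], visual_gallery, [0, 1]))
```

Running the same function with the roles of the two modalities swapped gives the visual-to-audio direction; retrieval results are often reported for both directions.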

WHY NOW

Audio-Visual Representation Learning moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.

Opportunity summary

Score 7.0

Pain HSC-MAE learns aligned audio-visual embeddings from label-free data by enforcing semantic consistency across global, local, and sample levels.

Evidence 0 refs | 0 sources | 0% coverage

Blocker no shell-level blocker reported

Analysis summary

HSC-MAE learns aligned audio-visual embeddings from label-free data by enforcing semantic consistency across global, local, and sample levels.

Competitive landscape

HSC-MAE learns aligned audio-visual embeddings from label-free data by enforcing semantic consistency across global, local, and sample levels.

Segment

Audio-Visual Representation Learning

Adoption evidence

No public code link in the paper record yet

Commercial read

7.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified
