ARXIV:2603.23684 · MOTION-TEXT RETRIEVAL · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

Verified · Source: PDF linked
Verified · PaperPack: citation fields available
Partial · Proof: unverified proof status

MoCHA: Denoising Caption Supervision for Motion-Text Retrieval

Nikolai Warner · Cameron Ethan Taylor · Irfan Essa · Apaar Sadhwani · arXiv

MoCHA enhances motion-text retrieval by canonicalizing captions to their motion-recoverable content, significantly improving accuracy and cross-dataset transfer.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain: MoCHA enhances motion-text retrieval by canonicalizing captions to their motion-recoverable content, significantly improving accuracy and cross-dataset transfer.

Evidence: 0 refs | 0 sources | 17% coverage

Blocker: Evidence unverified

PROBLEM

MoCHA enhances motion-text retrieval by canonicalizing captions to their motion-recoverable content, significantly improving accuracy and cross-dataset transfer. However, each caption is not a deterministic label but a sample from a distribution of valid descriptions:…

METHOD

Full abstract

Text-motion retrieval systems learn shared embedding spaces from motion-caption pairs via contrastive objectives. However, each caption is not a deterministic label but a sample from a distribution of valid descriptions: different annotators produce different text for the same motion, mixing motion-recoverable semantics (action type, body parts, directionality) with annotator-specific style and inferred context that cannot be determined from 3D joint coordinates alone. Standard contrastive training treats each caption as the single positive target, overlooking this distributional structure and inducing within-motion embedding variance that weakens alignment. We propose MoCHA, a text canonicalization framework that reduces this variance by projecting each caption onto its motion-recoverable content prior to encoding, producing tighter positive clusters and better-separated embeddings. Canonicalization is a general principle: even deterministic rule-based methods improve cross-dataset transfer, though learned canonicalizers provide substantially larger gains. We present two learned variants: an LLM-based approach (GPT-5.2) and a distilled FlanT5 model requiring no LLM at inference time. MoCHA operates as a preprocessing step compatible with any retrieval architecture. Applied to MoPa (MotionPatches), MoCHA sets a new state of the art on both HumanML3D (H) and KIT-ML (K): the LLM variant achieves 13.9% T2M R@1 on H (+3.1pp) and 24.3% on K (+10.3pp), while the LLM-free T5 variant achieves gains of +2.5pp and +8.1pp. Canonicalization reduces within-motion text-embedding variance by 11-19% and improves cross-dataset transfer substantially, with H to K improving by 94% and K to H by 52%, demonstrating that standardizing the language space yields more transferable motion-language representations.
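The abstract quotes two quantities: text-to-motion R@1 and within-motion text-embedding variance. The sketch below shows one way each could be computed. It is an illustration under stated assumptions, not the paper's evaluation code: the function names are hypothetical, R@1 assumes a strict one-to-one pairing by index with pre-normalized embeddings, and the variance is taken to be the mean squared distance of each caption embedding to its per-motion centroid, averaged over motions. The paper may define or normalize either quantity differently.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

Vector = Sequence[float]


def _dot(a: Vector, b: Vector) -> float:
    """Dot product; equals cosine similarity for L2-normalized inputs."""
    return sum(x * y for x, y in zip(a, b))


def recall_at_1(text_embs: Sequence[Vector], motion_embs: Sequence[Vector]) -> float:
    """Fraction of captions whose top-scoring motion is the paired one.

    Assumption: text_embs[i] is paired with motion_embs[i].
    """
    if not text_embs or len(text_embs) != len(motion_embs):
        raise ValueError("need equally sized, non-empty embedding lists")
    hits = 0
    for i, text_vec in enumerate(text_embs):
        scores = [_dot(text_vec, motion_vec) for motion_vec in motion_embs]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == i)
    return hits / len(text_embs)


def within_motion_variance(
    caption_embs: Sequence[Vector], motion_ids: Sequence[str]
) -> float:
    """Mean squared distance to the per-motion centroid, averaged over motions.

    Assumed definition for illustration only.
    """
    if not caption_embs or len(caption_embs) != len(motion_ids):
        raise ValueError("need equally sized, non-empty inputs")

    # Group caption embeddings by the motion they describe.
    groups: Dict[str, List[Vector]] = defaultdict(list)
    for vec, motion_id in zip(caption_embs, motion_ids):
        groups[motion_id].append(vec)

    per_motion = []
    for vecs in groups.values():
        dim = len(vecs[0])
        centroid = [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
        sq_dists = [sum((v[d] - centroid[d]) ** 2 for d in range(dim)) for v in vecs]
        per_motion.append(sum(sq_dists) / len(vecs))
    return sum(per_motion) / len(per_motion)
```

Under these assumptions, a relative reduction such as the reported 11-19% would correspond to `1 - var_canonical / var_raw`, where both values come from `within_motion_variance` applied to the same captions encoded before and after canonicalization.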

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Canonicalization is a general principle: even deterministic rule-based methods improve cross-dataset transfer, though learned canonicalizers provide substantially larger gains. Code availability is flagged in…
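To make the idea of a deterministic rule-based canonicalizer concrete, here is a minimal sketch. Every rule in it (the subject patterns, the style-word list, the inferred-context markers) is a hypothetical example chosen for illustration; the paper's actual rule set is not given in this record and is likely different.

```python
import re

# Hypothetical rules, not taken from the paper.
# Collapse annotator-specific subject phrasing to a single token.
_SUBJECT = re.compile(r"\b(a|the)\s+(person|man|woman|figure|human)\b|\bsomeone\b")
# Adverbs that describe style rather than joint-level motion.
_STYLE_WORDS = {"casually", "happily", "gracefully", "nervously", "confidently"}
# Trailing clauses that state inferred intent or context.
_INFERRED = re.compile(r"\b(as if|like they|because)\b.*$")


def canonicalize(caption: str) -> str:
    """Reduce a caption to a normalized, motion-focused form."""
    text = caption.lower().strip()
    text = _INFERRED.sub("", text)            # drop inferred-context clause
    text = _SUBJECT.sub("person", text)       # unify the subject
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # strip punctuation
    tokens = [tok for tok in text.split() if tok not in _STYLE_WORDS]
    return " ".join(tokens)


if __name__ == "__main__":
    a = canonicalize("A man casually walks forward, as if he is late.")
    b = canonicalize("Someone walks forward.")
    print(a)       # person walks forward
    print(a == b)  # True
```

Two differently worded captions for the same motion map to one string here, which is the kind of tighter positive cluster the abstract describes. A learned canonicalizer would replace `canonicalize` with a model call while keeping the same preprocessing position ahead of the text encoder.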

WHY NOW

Motion-Text Retrieval moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.

Competitive landscape

Segment: Motion-Text Retrieval
Adoption evidence: No public code link in the paper record yet
Commercial read: 7.0/10 public viability
Direct: not classified
Adjacent: not classified
Substitute: not classified
Unknown: not classified

Published 2026-03-24 · arXiv: https://arxiv.org/abs/2603.23684 · Page: https://sciencetostartup.com/paper/mocha-denoising-caption-supervision-for-motion-text-retrieval
