ARXIV:2603.07966 · MULTIMODAL UNDERSTANDING · SUBMITTED 19 MAR · 18:48 UTC · FRESHNESS STALE

Verified (Source: PDF linked) · Partial (PaperPack: 3 of 4 citation fields filled) · Missing (Missing fields: authors) · Partial (Proof: unverified proof status)

Listening with the Eyes: Benchmarking Egocentric Co-Speech Grounding across Space and Time

arXiv

EcoG-Bench is a new benchmark for evaluating multimodal models on their ability to ground speech with co-speech gestures in egocentric videos, revealing a significant performance gap compared to human accuracy…

Blocked on: Code · Score: 7.0 · Evidence: unverified

Opportunity summary

Pain: EcoG-Bench is a new benchmark that evaluates how well multimodal models ground underspecified speech using co-speech pointing gestures in egocentric video. It reveals a large gap between the best native video-audio setting (Gemini-3-Pro, 17.0% strict Eco-Accuracy) and human subjects (96.9%), and highlights the importance of temporal alignment cues.

Evidence: 0 refs | 0 sources | 33% coverage

Blocker: Evidence unverified

PROBLEM

METHOD

Full abstract

In situated collaboration, speakers often use intentionally underspecified deictic commands (e.g., "pass me that"), whose referent becomes identifiable only by aligning speech with a brief co-speech pointing stroke. However, many embodied benchmarks admit language-only shortcuts, allowing MLLMs to perform well without learning the audio-visual alignment required by deictic interaction. To bridge this gap, we introduce Egocentric Co-Speech Grounding (EcoG), where grounding is executable only if an agent jointly predicts What, Where, and When. To operationalize this, we present EcoG-Bench, an evaluation-only bilingual (EN/ZH) diagnostic benchmark of 811 egocentric clips with dense spatial annotations and millisecond-level stroke supervision. It is organized under a Progressive Cognitive Evaluation protocol. Benchmarking state-of-the-art MLLMs reveals a severe executability gap: while human subjects achieve near-ceiling performance on EcoG-Bench (96.9% strict Eco-Accuracy), the best native video-audio setting remains low (Gemini-3-Pro: 17.0%). Moreover, in a diagnostic ablation, replacing the native video-audio interface with timestamped frame samples and externally verified ASR (with word-level timing) substantially improves the same model (17.0% → 42.9%). Overall, EcoG-Bench provides a strict, executable testbed for event-level speech-gesture binding, and suggests that multimodal interfaces may bottleneck the observability of temporal alignment cues, independently of model reasoning.
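The abstract's requirement that a prediction counts only when What, Where, and When are all correct can be illustrated with a small scoring sketch. This is not the paper's Eco-Accuracy definition, which this page does not give. The `Grounding` record, the exact-match label check, the IoU threshold of 0.5, and the 500 ms stroke tolerance are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grounding:
    """One prediction or annotation (hypothetical schema, not from the paper)."""
    label: str                               # What: name of the referent
    box: tuple[float, float, float, float]   # Where: (x1, y1, x2, y2) in pixels
    stroke_ms: float                         # When: pointing-stroke time in ms


def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def strict_hit(pred, gold, iou_thresh=0.5, tol_ms=500.0):
    """True only if all three components are correct at once.

    iou_thresh and tol_ms are assumed values, not taken from the paper.
    """
    what = pred.label.strip().lower() == gold.label.strip().lower()
    where = iou(pred.box, gold.box) >= iou_thresh
    when = abs(pred.stroke_ms - gold.stroke_ms) <= tol_ms
    return what and where and when


def strict_accuracy(preds, golds, **kwargs):
    """Fraction of clips where the joint (What, Where, When) prediction hits."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(strict_hit(p, g, **kwargs) for p, g in zip(preds, golds))
    return hits / len(golds)
```

Under a conjunctive rule like this, a model that names and localizes the object correctly but misplaces the stroke in time scores zero for that clip, which is one way a strict metric can expose missing temporal alignment.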

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Benchmarking state-of-the-art MLLMs reveals a severe executability gap: while human subjects achieve near-ceiling performance on EcoG-Bench (96.9% strict Eco-Accuracy), the best native video-audio setting…

WHY NOW

Multimodal Understanding moved forward this cycle; last verified April 2026. Public score 7.0/10.

Competitive landscape

Segment: Multimodal Understanding

Adoption evidence: No public code link in the paper record yet

Commercial read: 7.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

{ "@context": "https://schema.org", "@graph": [ { "@type": "WebPage", "@id": "https://sciencetostartup.com/paper/listening-with-the-eyes-benchmarking-egocentric-co-speech-grounding-across-space-and-time#webpage", "url": "https://sciencetostartup.com/paper/listening-with-the-eyes-benchmarking-egocentric-co-speech-grounding-across-space-and-time", "name": "Listening with the Eyes: Benchmarking Egocentric Co-Speech Grounding across Space and Time", "description": "EcoG-Bench is a new benchmark for evaluating multimodal models on their ability to ground speech with co-speech gestures in egocentric videos, revealing a significant performance gap compared to human accuracy and highlighting the importance of temporal alignment cues.", "isPartOf": { "@id": "https://sciencetostartup.com/#website" } }, { "@type": "ScholarlyArticle", "@id": "https://sciencetostartup.com/paper/listening-with-the-eyes-benchmarking-egocentric-co-speech-grounding-across-space-and-time#scholarlyArticle", "headline": "Listening with the Eyes: Benchmarking Egocentric Co-Speech Grounding across Space and Time", "description": "EcoG-Bench is a new benchmark for evaluating multimodal models on their ability to ground speech with co-speech gestures in egocentric videos, revealing a significant performance gap compared to human accuracy and highlighting the importance of temporal alignment cues.", "url": "https://sciencetostartup.com/paper/listening-with-the-eyes-benchmarking-egocentric-co-speech-grounding-across-space-and-time", "sameAs": "https://arxiv.org/abs/2603.07966", "identifier": { "@type": "PropertyValue", "propertyID": "arXiv", "value": "2603.07966" }, "isAccessibleForFree": true, "isPartOf": { "@id": "https://sciencetostartup.com/#website" }, "datePublished": "2026-03-09T05:08:02.000Z", "additionalProperty": [ { "@type": "PropertyValue", "propertyID": "viabilityScore", "value": 7 }, { "@type": "PropertyValue", "propertyID": "researchDomain", "value": "Multimodal Understanding" } ] }, { 
"@type": "BreadcrumbList", "itemListElement": [ { "@type": "ListItem", "position": 1, "name": "Home", "item": "https://sciencetostartup.com" }, { "@type": "ListItem", "position": 2, "name": "Multimodal Understanding", "item": "https://sciencetostartup.com/topics" }, { "@type": "ListItem", "position": 3, "name": "Listening with the Eyes: Benchmarking Egocentric Co-Speech G", "item": "https://sciencetostartup.com/paper/listening-with-the-eyes-benchmarking-egocentric-co-speech-grounding-across-space-and-time" } ] } ] }
