ARXIV:2603.26515 · VOICE AI AGENTS · SUBMITTED 30 MAR · 22:20 UTC · FRESHNESS STALE

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

JAL-Turn: Joint Acoustic-Linguistic Modeling for Real-Time and Robust Turn-Taking Detection in Full-Duplex Spoken Dialogue Systems

Guangzhao Yang · Yu Pan · Shi Qiu · Ningjie Bai · arXiv

A lightweight, speech-only framework for real-time, robust turn-taking detection in voice AI agents, integrating acoustic and linguistic cues without adding latency.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain A lightweight, speech-only framework for real-time, robust turn-taking detection in voice AI agents, integrating acoustic and linguistic cues without adding latency.

Evidence 0 refs | 6 sources | 33% coverage

Blocker Evidence unverified

PROBLEM

A lightweight, speech-only framework for real-time, robust turn-taking detection in voice AI agents, integrating acoustic and linguistic cues without adding latency. Many existing systems rely solely on acoustic or semantic cues, leading to suboptimal…

METHOD

Full abstract

Despite recent advances, efficient and robust turn-taking detection remains a significant challenge in industrial-grade Voice AI agent deployments. Many existing systems rely solely on acoustic or semantic cues, leading to suboptimal accuracy and stability, while recent attempts to endow large language models with full-duplex capabilities require costly full-duplex data and incur substantial training and deployment overheads, limiting real-time performance. In this paper, we propose JAL-Turn, a lightweight and efficient speech-only turn-taking framework that adopts a joint acoustic-linguistic modeling paradigm, in which a cross-attention module adaptively integrates pre-trained acoustic representations with linguistic features to support low-latency prediction of hold vs shift states. By sharing a frozen ASR encoder, JAL-Turn enables turn-taking prediction to run fully in parallel with speech recognition, introducing no additional end-to-end latency or computational overhead. In addition, we introduce a scalable data construction pipeline that automatically derives reliable turn-taking labels from large-scale real-world dialogue corpora. Extensive experiments on public multilingual benchmarks and an in-house Japanese customer-service dataset show that JAL-Turn consistently outperforms strong state-of-the-art baselines in detection accuracy while maintaining superior real-time performance.
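
The cross-attention fusion described in the abstract can be sketched roughly as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the abstract does not say which modality supplies the queries, how frames are pooled, or what the feature sizes are. Here acoustic frames act as queries over linguistic features, the fused frames are mean-pooled, and a linear layer produces hold vs shift probabilities. All names (`cross_attention`, `predict_turn_state`, `random_matrix`), the dimensions, and the random weights and inputs are hypothetical stand-ins for the frozen ASR encoder outputs and trained parameters.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rng, rows, cols, scale=0.1):
    """Hypothetical helper: Gaussian matrix standing in for weights or features."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def cross_attention(acoustic, linguistic, w_q, w_k, w_v):
    """Acoustic frames (queries) attend over linguistic features (keys, values)."""
    q = matmul(acoustic, w_q)      # (T_a, d)
    k = matmul(linguistic, w_k)    # (T_l, d)
    v = matmul(linguistic, w_v)    # (T_l, d)
    d_model = len(q[0])
    k_t = [list(col) for col in zip(*k)]          # (d, T_l)
    scores = matmul(q, k_t)                       # (T_a, T_l)
    weights = [softmax([s / math.sqrt(d_model) for s in row]) for row in scores]
    return matmul(weights, v), weights            # (T_a, d), (T_a, T_l)


def predict_turn_state(acoustic, linguistic, params):
    """Fuse both streams, mean-pool over time, and score hold vs shift."""
    fused, weights = cross_attention(
        acoustic, linguistic, params["w_q"], params["w_k"], params["w_v"]
    )
    pooled = [sum(col) / len(fused) for col in zip(*fused)]   # (d,)
    logits = matmul([pooled], params["w_out"])[0]             # (2,)
    hold, shift = softmax(logits)
    return {"hold": hold, "shift": shift}, weights


# Assumed sizes: 20 acoustic frames, 5 linguistic tokens, model width 8.
rng = random.Random(0)
params = {
    "w_q": random_matrix(rng, 16, 8),
    "w_k": random_matrix(rng, 12, 8),
    "w_v": random_matrix(rng, 12, 8),
    "w_out": random_matrix(rng, 8, 2),
}
acoustic = random_matrix(rng, 20, 16, scale=1.0)    # stand-in for encoder frames
linguistic = random_matrix(rng, 5, 12, scale=1.0)   # stand-in for linguistic features
probs, attn = predict_turn_state(acoustic, linguistic, params)
print(probs)
```

With untrained random weights the two probabilities carry no meaning; the sketch only shows the data flow. In a setup like the one the abstract describes, both inputs would come from the shared frozen ASR encoder, so this head could run alongside recognition rather than after it.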

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. In this paper, we propose JAL-Turn, a lightweight and efficient speech-only turn-taking framework that adopts a joint acoustic-linguistic modeling paradigm, in which a cross-attention…

WHY NOW

Voice AI Agents moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.

Opportunity summary

Score 7.0

Pain A lightweight, speech-only framework for real-time, robust turn-taking detection in voice AI agents, integrating acoustic and linguistic cues without adding latency.

Evidence 0 refs | 6 sources | 33% coverage

Blocker No shell-level blocker reported

Analysis summary

A lightweight, speech-only framework for real-time, robust turn-taking detection in voice AI agents, integrating acoustic and linguistic cues without adding latency.

Competitive landscape

A lightweight, speech-only framework for real-time, robust turn-taking detection in voice AI agents, integrating acoustic and linguistic cues without adding latency.

Segment

Voice AI Agents

Adoption evidence

No public code link in the paper record yet

Commercial read

7.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified
