Speech AI

Proof pending

19 papers

5.4 viability

+50% 30d

Proof pending. Core topic summary fields are still materializing.

State of the Field

Recent work in speech AI focuses on three areas: broader multilingual coverage, robustness to contextual exposure bias, and higher-quality speech synthesis. New benchmarks target Korean, and self-supervised models target multi-dialectal Arabic, so that speech language models can be built and evaluated beyond English. Methods for detecting hallucinations in speech models at inference time and frameworks for detecting speaker drift in synthesized speech are also emerging. Unified models that combine speech generation and understanding are being explored, alongside tools for identifying spurious correlations in speech datasets. Together, these developments help builders create more reliable speech applications across languages and contexts.

Last updated May 31, 2026

Topic-linked question coverage is still building for this proof surface.

Topic trend

Topic-specific paper and score movement from the daily diff ledger.

Papers

1-10 of 17

Research Paper·Jun 2, 2026

Efficient ASR Training with Conversations that Never Happened

Conversational ASR for lower-resource languages and niche domains is limited by the scarcity of domain-matched multi-speaker training data. We propose an augmentation pipeline that generates scenario-...

8.0 viability

Research Paper·May 27, 2026

KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs

Speech language models (SpeechLMs) have achieved substantial progress by extending large language models (LLMs) to the speech modality. However, SpeechLM evaluation remains heavily centered on English...

7.0 viability

Research Paper·Mar 25, 2026

From Oracle to Noisy Context: Mitigating Contextual Exposure Bias in Speech-LLMs

Contextual automatic speech recognition (ASR) with Speech-LLMs is typically trained with oracle conversation history, but relies on error-prone history at inference, causing a train-test mismatch in t...

7.0 viability·Has code

Research Paper·Mar 23, 2026

Ara-Best-RQ: Multi Dialectal Arabic SSL

We present Ara-BEST-RQ, a family of self-supervised learning (SSL) models specifically designed for multi-dialectal Arabic speech processing. Leveraging 5,640 hours of crawled Creative Commons speech ...

7.0 viability

Research Paper·Apr 7, 2026

A Novel Automatic Framework for Speaker Drift Detection in Synthesized Speech

Recent diffusion-based text-to-speech (TTS) models achieve high naturalness and expressiveness, yet often suffer from speaker drift, a subtle, gradual shift in perceived speaker identity within a sing...

7.0 viability

Research Paper·May 28, 2026·B2B·Media & Entertainment

HoliTok: A Continuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding

Unified speech foundation models require a holistic tokenization space that is both learnable by language models and decodable into high-quality waveforms. Existing speech tokenizers, however, often f...

7.0 viability·Has code

Research Paper·Apr 21, 2026

Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps

Hallucinations in Speech Large Language Models (SpeechLLMs) pose significant risks, yet existing detection methods typically rely on gold-standard outputs that are costly or impractical to obtain. Mor...

6.0 viability

Research Paper·Apr 1, 2026

Adapting Text LLMs to Speech via Multimodal Depth Up-Scaling

Adapting pre-trained text Large Language Models (LLMs) into Speech Language Models (Speech LMs) via continual pretraining on speech data is promising, but often degrades the original text capabilities...

5.0 viability

Research Paper·Apr 29, 2026

A Toolkit for Detecting Spurious Correlations in Speech Datasets

We introduce a toolkit for uncovering spurious correlations between recording characteristics and target class in speech datasets. Spurious correlations may arise due to heterogeneous recording condit...

5.0 viability

Research Paper·Mar 26, 2026

Goodness-of-pronunciation without phoneme time alignment

In speech evaluation, an Automatic Speech Recognition (ASR) model often computes time boundaries and phoneme posteriors for input features. However, limited data for ASR training hinders expansion of ...

5.0 viability
