Recent advances in speech processing focus on tasks such as phoneme discovery, target speech extraction, and voice style conversion, with significant implications for multilingual communication and real-time audio applications. Benchmarks like DiscoPhon enable evaluation of unsupervised phoneme discovery across diverse languages, while frameworks like Mask2Flow-TSE and VorTEX improve target speaker extraction from overlapping speech, a capability central to customer service and media transcription. Innovations in voice style conversion, exemplified by StyleStream, enable real-time transformation of speech attributes, with potential uses in personalized voice assistants and entertainment. Finally, the shift toward unified models that handle multiple tasks at once points to more efficient and versatile speech technologies, potentially reducing costs and improving user experience across commercial sectors.
We introduce DiscoPhon, a multilingual benchmark for evaluating unsupervised phoneme discovery from discrete speech units. DiscoPhon covers 6 dev and 6 test languages, chosen to span a wide range of p...
Target speech extraction (TSE) aims to recover a target speaker's voice from a mixture. While recent text-prompted approaches have shown promise, most assume fully overlapped mixtures, limi...
Speech foundation models trained with self-supervised learning produce generic speech representations that support a wide range of speech processing tasks. When further adapted with supervised learnin...
Voice style conversion aims to transform an input utterance to match a target speaker's timbre, accent, and emotion, with a central challenge being the disentanglement of linguistic content from style...
Syllable-level units offer compact and linguistically meaningful representations for spoken language modeling and unsupervised word discovery, but research on syllabification remains fragmented across...
We propose to model parallel streams of data, such as overlapped speech, using shuffles. Specifically, this paper shows how the shuffle product and partial order finite-state automata (FSAs) can be us...
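The shuffle product mentioned here is a standard operation on sequences: the set of all interleavings of two sequences that preserve the internal order of each. As a minimal illustration of that operation alone (not the paper's partial-order FSA construction, which is not detailed in this excerpt), a recursive enumeration in Python:

```python
def shuffles(a: str, b: str) -> set[str]:
    """Shuffle product of two sequences: every interleaving that
    keeps the symbols of `a` in order and the symbols of `b` in order."""
    if not a:
        return {b}
    if not b:
        return {a}
    # Either the next output symbol comes from `a` or from `b`.
    return {a[0] + s for s in shuffles(a[1:], b)} | \
           {b[0] + s for s in shuffles(a, b[1:])}

# "ab" shuffled with "cd" yields C(4, 2) = 6 interleavings.
print(sorted(shuffles("ab", "cd")))
# → ['abcd', 'acbd', 'acdb', 'cabd', 'cadb', 'cdab']
```

In the overlapped-speech setting sketched by the abstract, each input sequence would stand for one speaker's token stream, and the shuffle product describes the space of ways those streams can interleave in the mixture.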
Target speaker extraction (TSE) extracts the target speaker's voice from overlapping speech mixtures given a reference utterance. Existing approaches typically fall into two categories: discriminative...
Text-speech joint spoken language modeling (SLM) aims at natural and intelligent speech-based interactions, but developing such a system may suffer from modality mismatch: speech unit sequences are mu...
Current speech LLMs largely perform implicit ASR: on tasks solvable from a transcript, they are behaviorally and mechanistically equivalent to simple Whisper$\to$LLM cascades. We show this through mat...
In target speaker extraction (TSE), we aim to recover target speech from a multi-talker mixture using a short enrollment utterance as reference. Recent studies on diffusion and flow-matching generator...