Recent advances in speech processing focus on tasks such as phoneme discovery, target speech extraction, and voice style conversion, with significant implications for multilingual communication and real-time audio applications. Benchmarks like DiscoPhon enable evaluation of unsupervised phoneme discovery across diverse languages, while frameworks like Mask2Flow-TSE and VorTEX improve target speaker extraction from overlapping speech, a capability central to customer service and media transcription. Innovations in voice style conversion, exemplified by StyleStream, enable real-time transformation of speech attributes, with potential uses in personalized voice assistants and entertainment. Finally, the shift toward unified models that handle multiple tasks at once points to more efficient and versatile speech technologies, potentially reducing costs and improving user experience across commercial sectors.
We introduce DiscoPhon, a multilingual benchmark for evaluating unsupervised phoneme discovery from discrete speech units. DiscoPhon covers 6 dev and 6 test languages, chosen to span a wide range of p...
Target speech extraction (TSE) aims to recover a target speaker's voice from a mixture. While recent text-prompted approaches have shown promise, most assume fully overlapped mixtures, limi...
Speech foundation models trained with self-supervised learning produce generic speech representations that support a wide range of speech processing tasks. When further adapted with supervised learnin...
Voice style conversion aims to transform an input utterance to match a target speaker's timbre, accent, and emotion, with a central challenge being the disentanglement of linguistic content from style...
Syllable-level units offer compact and linguistically meaningful representations for spoken language modeling and unsupervised word discovery, but research on syllabification remains fragmented across...
We propose to model parallel streams of data, such as overlapped speech, using shuffles. Specifically, this paper shows how the shuffle product and partial order finite-state automata (FSAs) can be us...
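The shuffle product mentioned here is a standard operation on sequences: the set of all interleavings of two sequences that preserve the internal order of each. As a minimal illustration of that operation alone (not the paper's partial-order FSA construction, which is not detailed in this excerpt), a recursive enumeration in Python:

```python
def shuffles(a: str, b: str) -> set[str]:
    """Shuffle product of two sequences: every interleaving that
    keeps the symbols of `a` in order and the symbols of `b` in order."""
    if not a:
        return {b}
    if not b:
        return {a}
    # Either the next output symbol comes from `a` or from `b`.
    return {a[0] + s for s in shuffles(a[1:], b)} | \
           {b[0] + s for s in shuffles(a, b[1:])}

# "ab" shuffled with "cd" yields C(4, 2) = 6 interleavings.
print(sorted(shuffles("ab", "cd")))
# → ['abcd', 'acbd', 'acdb', 'cabd', 'cadb', 'cdab']
```

In the overlapped-speech setting sketched by the abstract, each input sequence would stand for one speaker's token stream, and the shuffle product describes the space of ways those streams can interleave in the mixture.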
Target speaker extraction (TSE) extracts the target speaker's voice from overlapping speech mixtures given a reference utterance. Existing approaches typically fall into two categories: discriminative...
Text-speech joint spoken language modeling (SLM) aims at natural and intelligent speech-based interactions, but developing such a system may suffer from modality mismatch: speech unit sequences are mu...
Current speech LLMs largely perform implicit ASR: on tasks solvable from a transcript, they are behaviorally and mechanistically equivalent to simple Whisper$\to$LLM cascades. We show this through mat...
In target speaker extraction (TSE), we aim to recover target speech from a multi-talker mixture using a short enrollment utterance as reference. Recent studies on diffusion and flow-matching generator...