Recent advances in speech recognition focus on improving accuracy and accessibility across diverse languages and contexts. New methodologies, such as contrastive decoding frameworks, are addressing long-form transcription errors, significantly reducing word error rates while improving processing speed. Efforts to create high-quality, large-scale datasets for under-resourced languages, like the newly curated corpus for Nepal Bhasha and the extensive Portuguese podcast dataset, are also gaining momentum, enabling better model training and performance. In parallel, dialect-aware modeling approaches are being developed to tackle the challenges posed by linguistic variation in low-resource languages such as Taiwanese Hakka. The field is also increasingly attentive to real-world applicability, as evidenced by research aimed at improving transcription accuracy in high-stakes scenarios, particularly for non-English speakers. Together, these developments indicate a shift toward more inclusive and robust speech technologies that can serve a broader range of users and applications.
Long-form speech recognition with large encoder-decoder models such as Whisper often exhibits hallucinations, repetition loops, and content omissions. These errors can accumulate and be further amplifi...
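The overview above mentions contrastive decoding as a remedy for such errors. As an illustration only (the paper's specific framework is not detailed here), the generic contrastive-decoding idea scores each candidate token by an expert model's log-probability minus a penalty from a weaker "amateur" model that is more prone to degenerate patterns like repetition:

```python
# Hypothetical sketch of generic contrastive decoding, not the paper's
# actual method: rank tokens by log p_expert - alpha * log p_amateur.
import math

def contrastive_score(expert_logprobs, amateur_logprobs, alpha=1.0):
    """Per-token contrastive scores: expert log-prob minus scaled amateur log-prob."""
    return {tok: expert_logprobs[tok] - alpha * amateur_logprobs[tok]
            for tok in expert_logprobs}

# Toy example: the amateur model over-prefers the repeated token "the",
# so the contrastive score demotes it in favour of "cat".
expert = {"the": math.log(0.6), "cat": math.log(0.4)}
amateur = {"the": math.log(0.9), "cat": math.log(0.1)}
scores = contrastive_score(expert, amateur)
best = max(scores, key=scores.get)  # "cat"
```

The intuition is that tokens both models assign high probability to (often degenerate continuations) are penalized, while tokens the stronger model uniquely prefers are promoted.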
We present Ethio-ASR, a suite of multilingual CTC-based automatic speech recognition (ASR) models jointly trained on five Ethiopian languages: Amharic, Tigrinya, Oromo, Sidaama, and Wolaytta. These la...
Chinese Mandarin visual speech recognition (VSR) is a task that has advanced in recent years, yet still lags behind the performance on non-tonal languages such as English. One primary challenge arises...
Taiwanese Hakka is a low-resource, endangered language that poses significant challenges for automatic speech recognition (ASR), including high dialectal variability and the presence of two distinct w...
Model merging is a scalable alternative to multi-task training that combines the capabilities of multiple specialised models into a single model. This is particularly attractive for large speech found...
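The abstract does not specify which merging method is used, but the simplest baseline in this family is parameter averaging: combining several specialised checkpoints by a (possibly weighted) mean of their weights. A minimal sketch, with models represented as plain dicts of parameter lists:

```python
# Minimal sketch of model merging by parameter averaging (a common
# baseline; the specific method in the abstract is not described here).
# Each "model" is a dict mapping parameter names to lists of floats.

def merge_models(models, weights=None):
    """Average parameter values across models, optionally weighted."""
    if weights is None:
        weights = [1.0 / len(models)] * len(models)
    merged = {}
    for name in models[0]:
        merged[name] = [
            sum(w * m[name][i] for m, w in zip(models, weights))
            for i in range(len(models[0][name]))
        ]
    return merged

# Two hypothetical specialised checkpoints merged into one.
asr_en = {"proj.weight": [1.0, 2.0]}
asr_fr = {"proj.weight": [3.0, 4.0]}
merged = merge_models([asr_en, asr_fr])
# merged["proj.weight"] == [2.0, 3.0]
```

The appeal for speech foundation models is that no joint training run is needed: each specialised model is fine-tuned independently and the merge is a cheap post-hoc operation.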
Nepal Bhasha (Newari), an endangered language of the Kathmandu Valley, remains digitally marginalized due to the severe scarcity of annotated speech resources. In this work, we introduce Nwāchā Munā, ...
Despite significant advances in speech processing, Portuguese remains under-resourced due to the scarcity of public, large-scale, and high-quality datasets. To address this gap, we present a new datas...
Automatic Speech Recognition (ASR) performance is heavily dependent on the availability of large-scale, high-quality datasets. For low-resource languages, existing open-source ASR datasets often suffe...
Despite speech recognition systems achieving low word error rates on standard benchmarks, they often fail on short, high-stakes utterances in real-world deployments. Here, we study this failure mode i...
Automatic Speech Recognition (ASR) in dialect-heavy settings remains challenging due to strong regional variation and limited labeled data. We propose GLoRIA, a parameter-efficient adaptation framewor...
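GLoRIA's exact formulation is not given in the abstract, but parameter-efficient adaptation frameworks of this kind commonly build on low-rank (LoRA-style) updates: the base weight W stays frozen and only a small low-rank correction B @ A is trained, e.g. one adapter per dialect. A hedged sketch of that general pattern:

```python
# Sketch of a LoRA-style low-rank update, the family of parameter-efficient
# adaptation methods GLoRIA appears to belong to (its exact formulation is
# an assumption here). The frozen weight W is adapted as
# W + (alpha / r) * B @ A, training only the small matrices A and B.
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                          # model dim and low rank (r << d)
W = rng.normal(size=(d, d))          # frozen base weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-init

def adapted_forward(x, alpha=4.0):
    """Apply the base layer plus the low-rank dialect adapter."""
    return x @ (W + (alpha / r) * B @ A).T

x = rng.normal(size=(1, d))
# With B zero-initialised, the adapter starts as an exact no-op.
assert np.allclose(adapted_forward(x), x @ W.T)
```

Only 2·d·r parameters per adapter are trained instead of d², which is what makes maintaining many dialect-specific adapters on top of one shared backbone tractable in low-label settings.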