WESR-Bench

Gold definition
Updated Apr 2, 2026

Definition

WESR-Bench is an expert-annotated evaluation set (900+ utterances) designed for precise localization of 21 non-verbal vocal events. It features a novel position-aware protocol that separates ASR errors from event detection, enabling accurate measurement for both discrete and continuous events.

At a glance

Executive summary

WESR-Bench is an expert-annotated dataset (900+ utterances) and evaluation protocol for locating non-verbal vocal events, such as laughter or crying, within speech. It lets researchers measure how well a system detects and positions these events, even when they are interleaved with spoken words, by scoring event detection separately from speech recognition errors.

TL;DR

WESR-Bench is an evaluation set and scoring protocol for measuring how accurately AI systems locate non-verbal sounds in speech, such as laughs or cries, using a defined taxonomy of 21 event types and a position-aware protocol.

Key points

An expert-annotated evaluation set with a novel position-aware protocol for disentangling ASR errors from non-verbal event detection.
Addresses three gaps in non-verbal vocal event localization: insufficient task definitions, limited category coverage, and the lack of a standardized evaluation framework.
Used by researchers and ML engineers in speech processing, audio event detection, and multimodal AI.
Unlike previous approaches with ambiguous granularity and limited categories, WESR-Bench offers a refined taxonomy and localization scoring that is disentangled from ASR errors.
Reflects a research trend toward understanding vocal communication beyond its linguistic content, and toward robust evaluation of complex audio tasks.
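
The idea of separating ASR errors from event detection can be illustrated with a small sketch. This is not the WESR-Bench scoring code: the inline tag format (`[laughter]`), the function names, and the alignment strategy (word alignment via Python's `difflib`) are all assumptions made for illustration, and the sketch covers only point-like (discrete) events, not continuous ones. It strips event tags from both transcripts, aligns the remaining words, projects each predicted event position onto the reference word sequence, and only then compares events.

```python
from difflib import SequenceMatcher
import re

# Hypothetical inline tag format, e.g. "[laughter]"; the real annotation
# format used by WESR-Bench may differ.
TAG = re.compile(r"^\[(\w+)\]$")

def split_words_and_events(transcript):
    """Return (words, events); each event is (label, number_of_words_before_it)."""
    words, events = [], []
    for token in transcript.split():
        match = TAG.match(token)
        if match:
            events.append((match.group(1), len(words)))
        else:
            words.append(token.lower())
    return words, events

def map_boundaries(ref_words, hyp_words):
    """Map each hypothesis word boundary to a reference word boundary."""
    mapping = [0] * (len(hyp_words) + 1)
    matcher = SequenceMatcher(None, ref_words, hyp_words, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        for j in range(j1, j2):
            # Equal blocks map one-to-one; substituted or inserted words are
            # clamped so they never run past the end of the reference block.
            mapping[j] = i1 + min(j - j1, i2 - i1)
    mapping[len(hyp_words)] = len(ref_words)
    return mapping

def score_events(reference, hypothesis, tolerance=0):
    """Event precision/recall/F1, computed after projecting onto reference words."""
    ref_words, ref_events = split_words_and_events(reference)
    hyp_words, hyp_events = split_words_and_events(hypothesis)
    mapping = map_boundaries(ref_words, hyp_words)

    unmatched = list(ref_events)
    hits = 0
    for label, position in hyp_events:
        projected = mapping[position]
        for candidate in unmatched:
            if candidate[0] == label and abs(candidate[1] - projected) <= tolerance:
                unmatched.remove(candidate)  # each reference event matches once
                hits += 1
                break

    precision = hits / len(hyp_events) if hyp_events else 0.0
    recall = hits / len(ref_events) if ref_events else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

if __name__ == "__main__":
    ref = "hello [laughter] how are you"
    # A misrecognized word does not penalize the correctly placed event.
    print(score_events(ref, "hallo [laughter] how are you"))
    # The same event placed at the end of the utterance is counted as a miss.
    print(score_events(ref, "hello how are you [laughter]"))
```

In this sketch a word substitution or insertion in the hypothesis shifts the word error rate but leaves the event score untouched, while a correctly labeled event in the wrong position is still penalized. The `tolerance` parameter (also an assumption) relaxes the position check by a given number of word boundaries.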

Use cases

Developing AI assistants that understand emotional cues (e.g., detecting frustration or amusement) from non-verbal vocalizations.
Improving content moderation systems to identify harmful non-verbal sounds or vocal events in user-generated content.
Enhancing accessibility tools for individuals with hearing impairments by transcribing or signaling non-verbal events.
Analyzing customer service calls for sentiment and engagement beyond spoken words, using detected non-verbal cues.
Creating more naturalistic virtual characters or avatars that react appropriately to non-verbal vocalizations in real time.

Also known as

WESR