ErrEval is an error-aware evaluation framework for assessing Automatic Question Generation (QG) systems. It frames evaluation as a two-stage process: an explicit error-diagnosis stage followed by an informed scoring stage. At its core is a lightweight, plug-and-play Error Identifier that detects and categorizes common defects in generated questions, such as factual hallucinations, answer mismatches, and structural or linguistic errors. These diagnostic signals are then passed explicitly to large language model (LLM) evaluators, guiding them toward more fine-grained and grounded judgments. ErrEval matters because it addresses a limitation of black-box, holistic evaluation methods, which often overlook critical QG defects and consequently overestimate question quality. By surfacing explicit error diagnostics, it improves the alignment of automated evaluations with human judgments. The framework is aimed at researchers and ML engineers working on natural language generation, particularly the development and evaluation of QG systems, and more broadly at applications requiring robust, transparent LLM-based evaluation.
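The two-stage pipeline described above can be sketched in code. This is a minimal illustration, not ErrEval's actual implementation: the error checks here are toy heuristics, the error-category names are assumptions modeled on the defect types mentioned above, and the scoring function is a stub standing in for an LLM evaluator that receives the diagnostic signals.

```python
from dataclasses import dataclass, field

@dataclass
class Diagnosis:
    """Output of stage 1: a question plus its detected error labels."""
    question: str
    errors: list = field(default_factory=list)

def identify_errors(question: str, context: str, answer: str) -> Diagnosis:
    """Stage 1: a toy, rule-based stand-in for the Error Identifier.

    The real component would detect defects such as factual hallucinations;
    here we only check two cheap proxies.
    """
    errors = []
    if answer.lower() not in context.lower():
        # Expected answer is not grounded in the source passage.
        errors.append("answer_mismatch")
    if not question.strip().endswith("?"):
        # Malformed question surface form.
        errors.append("structural_error")
    return Diagnosis(question, errors)

def score_with_diagnostics(diagnosis: Diagnosis, evaluator) -> int:
    """Stage 2: pass the explicit error signals to the evaluator."""
    return evaluator(diagnosis.question, diagnosis.errors)

def stub_llm_evaluator(question: str, errors: list) -> int:
    """Stub LLM judge: start at 5, deduct one point per detected error."""
    return max(1, 5 - len(errors))

context = "Marie Curie won the Nobel Prize in Physics in 1903."
diag = identify_errors("Who won the Nobel Prize in 1903", context, "Marie Curie")
print(diag.errors)                                  # ['structural_error']
print(score_with_diagnostics(diag, stub_llm_evaluator))  # 4
```

The point of the sketch is the separation of concerns: the identifier produces explicit, inspectable error labels, and the scorer conditions on them rather than judging the question holistically in one opaque step.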
ErrEval is a new way to evaluate automatically generated questions, checking that they are high quality and free of errors like made-up facts. It works by first finding specific mistakes in the questions, then using that information to give a more accurate and detailed score, which matches human judgment better than older methods.
Error-aware Evaluation framework