ErrEval is an error-aware evaluation framework for assessing Automatic Question Generation (QG) systems. It frames evaluation as a two-stage process: an explicit error-diagnosis stage followed by an informed scoring stage. At its core is a lightweight, plug-and-play Error Identifier that detects and categorizes common defects in generated questions, such as factual hallucinations, answer mismatches, and structural or linguistic errors. These diagnostic signals are then passed explicitly to large language model (LLM) evaluators, guiding them toward more fine-grained and grounded judgments. ErrEval matters because it addresses a limitation of black-box, holistic evaluation methods, which often overlook critical QG defects and consequently overestimate question quality. By surfacing explicit error diagnostics, it improves the alignment of automated evaluations with human judgments. The framework is aimed at researchers and ML engineers working on natural language generation, particularly the development and evaluation of QG systems, and more broadly at applications requiring robust, transparent LLM-based evaluation.
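The two-stage pipeline described above can be sketched in code. This is a minimal illustration, not ErrEval's actual implementation: the error checks here are toy heuristics, the error-category names are assumptions modeled on the defect types mentioned above, and the scoring function is a stub standing in for an LLM evaluator that receives the diagnostic signals.

```python
from dataclasses import dataclass, field

@dataclass
class Diagnosis:
    """Output of stage 1: a question plus its detected error labels."""
    question: str
    errors: list = field(default_factory=list)

def identify_errors(question: str, context: str, answer: str) -> Diagnosis:
    """Stage 1: a toy, rule-based stand-in for the Error Identifier.

    The real component would detect defects such as factual hallucinations;
    here we only check two cheap proxies.
    """
    errors = []
    if answer.lower() not in context.lower():
        # Expected answer is not grounded in the source passage.
        errors.append("answer_mismatch")
    if not question.strip().endswith("?"):
        # Malformed question surface form.
        errors.append("structural_error")
    return Diagnosis(question, errors)

def score_with_diagnostics(diagnosis: Diagnosis, evaluator) -> int:
    """Stage 2: pass the explicit error signals to the evaluator."""
    return evaluator(diagnosis.question, diagnosis.errors)

def stub_llm_evaluator(question: str, errors: list) -> int:
    """Stub LLM judge: start at 5, deduct one point per detected error."""
    return max(1, 5 - len(errors))

context = "Marie Curie won the Nobel Prize in Physics in 1903."
diag = identify_errors("Who won the Nobel Prize in 1903", context, "Marie Curie")
print(diag.errors)                                  # ['structural_error']
print(score_with_diagnostics(diag, stub_llm_evaluator))  # 4
```

The point of the sketch is the separation of concerns: the identifier produces explicit, inspectable error labels, and the scorer conditions on them rather than judging the question holistically in one opaque step.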
ErrEval is a new way to evaluate automatically generated questions, checking that they are high quality and free of errors like made-up facts. It works by first finding specific mistakes in the questions, then using that information to give a more accurate and detailed score, which matches human judgment better than older methods.
Error-aware Evaluation framework