Recent advances in large language model (LLM) evaluation focus on improving the reliability and interpretability of automated assessments across diverse applications. A notable trend is the shift from isolated scoring to collaborative ranking, which improves the robustness of scholarly judgment by leveraging comparative evaluations rather than absolute scores. This approach addresses the variability in scoring systems across different contexts, making it particularly relevant for academic and professional settings. Additionally, frameworks like AutoChecklist and One-Eval are streamlining the evaluation process, allowing for customizable and traceable workflows that reduce manual effort and enhance reproducibility. Meanwhile, benchmarks such as GAIN and FaithSteer-BENCH are probing LLM decision-making under real-world constraints, revealing insights into how models navigate complex norm-goal conflicts. As these methodologies evolve, they promise to address critical practical challenges, including bias in automated grading and the need for consistent evaluation standards across languages and domains, ultimately paving the way for more equitable and effective LLM applications.
Checklists have emerged as a popular approach for interpretable and fine-grained evaluation, particularly with LLM-as-a-Judge. Beyond evaluation, these structured criteria can serve as signals for mod...
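The checklist idea above can be illustrated with a minimal sketch of scoring a response against a set of binary criteria. All names here (`Criterion`, `score_with_checklist`, the stub judge) are illustrative assumptions, not the abstract's actual framework; in practice each criterion would be judged by an LLM-as-a-Judge call.

```python
# Hedged sketch: aggregating checklist-style judgments into a per-response score.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    question: str   # e.g. "Does the response cite at least one source?"
    weight: float = 1.0


def score_with_checklist(
    response: str,
    checklist: List[Criterion],
    judge: Callable[[str, str], bool],
) -> float:
    """Return the weighted fraction of checklist items the response satisfies.

    `judge(response, question)` is assumed to return True/False; in practice it
    would wrap an LLM-as-a-Judge call with a yes/no prompt per criterion.
    """
    total = sum(c.weight for c in checklist)
    passed = sum(c.weight for c in checklist if judge(response, c.question))
    return passed / total if total else 0.0


if __name__ == "__main__":
    checklist = [
        Criterion("Is the answer factually consistent with the prompt?", 2.0),
        Criterion("Does the answer address every part of the question?"),
        Criterion("Is the answer free of unsupported claims?"),
    ]
    # Stub judge for demonstration only; replace with a real LLM call.
    dummy_judge = lambda response, question: "every part" not in question
    print(score_with_checklist("example response", checklist, dummy_judge))
```

One appeal of this structure, as the abstract notes, is that the per-criterion verdicts are interpretable on their own and can double as training or reward signals rather than only an evaluation score.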
Reliable evaluation is essential for developing and deploying large language models, yet in practice it often requires substantial manual effort: practitioners must identify appropriate benchmarks, re...
Large language models (LLMs) are currently applied to scientific paper evaluation by assigning an absolute score to each paper independently. However, since score scales vary across conferences, time ...
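To make the comparative-evaluation idea concrete, the sketch below turns pairwise "paper A vs. paper B" judgments into a ranking with a Bradley-Terry model. This is a generic illustration of ranking from comparisons, not the specific method of the abstract above, and the hard-coded judgments stand in for what would be LLM-generated pairwise preferences.

```python
# Hedged sketch: fitting a Bradley-Terry model to pairwise preferences via MM updates.
from collections import defaultdict


def bradley_terry(pairs, n_items, iters=200):
    """pairs: list of (winner, loser) index tuples. Returns per-item strengths."""
    wins = defaultdict(int)        # wins[i] = number of comparisons item i won
    matches = defaultdict(int)     # matches[(i, j)] = times items i and j were compared
    for w, l in pairs:
        wins[w] += 1
        matches[(min(w, l), max(w, l))] += 1

    strength = [1.0] * n_items
    for _ in range(iters):         # standard minorize-maximize updates
        new = []
        for i in range(n_items):
            denom = 0.0
            for (a, b), m in matches.items():
                if i in (a, b):
                    j = b if i == a else a
                    denom += m / (strength[i] + strength[j])
            new.append(wins[i] / denom if denom else strength[i])
        total = sum(new)
        strength = [s * n_items / total for s in new]   # renormalize for stability
    return strength


if __name__ == "__main__":
    # Indices 0..2 are three papers; each tuple is (preferred, dispreferred).
    judgments = [(0, 1), (0, 2), (1, 2), (0, 1)]
    print(bradley_terry(judgments, n_items=3))
```

Because the fitted strengths are only defined relative to each other, the resulting ranking does not depend on any absolute score scale, which is exactly the property the abstract motivates.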
As AI models progress beyond simple chatbots into more complex workflows, we draw ever closer to the event horizon beyond which AI systems will be utilized in autonomous, self-maintaining feedback loo...
Large language model (LLM)-based judges are widely adopted for automated evaluation and reward modeling, yet their judgments are often affected by judgment biases. Accurately evaluating these biases i...
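One widely discussed judge bias is position bias, and a simple way to probe it is to swap the order of the two candidates and count verdict flips. The sketch below is a generic illustration with a hypothetical `ask_judge` stand-in for an LLM-as-a-Judge call; the benchmark in the abstract above may measure a different or broader set of biases.

```python
# Hedged sketch: measuring position bias by checking verdict consistency under order swaps.
from typing import Callable, List, Tuple


def position_bias_rate(
    pairs: List[Tuple[str, str]],
    ask_judge: Callable[[str, str], str],  # returns "first" or "second"
) -> float:
    """Fraction of pairs whose verdict is inconsistent when the order is swapped."""
    inconsistent = 0
    for a, b in pairs:
        verdict_ab = ask_judge(a, b)          # a shown first
        verdict_ba = ask_judge(b, a)          # b shown first
        # Consistent iff the same underlying answer wins in both orders.
        winner_ab = a if verdict_ab == "first" else b
        winner_ba = b if verdict_ba == "first" else a
        inconsistent += winner_ab != winner_ba
    return inconsistent / len(pairs) if pairs else 0.0


if __name__ == "__main__":
    # Stub judge that always prefers whichever answer is shown first,
    # i.e. a maximally position-biased judge.
    biased_judge = lambda first, second: "first"
    data = [("answer A", "answer B"), ("answer C", "answer D")]
    print(position_bias_rate(data, biased_judge))   # -> 1.0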
Evaluating whether explanations faithfully reflect a model's reasoning remains an open problem. Existing benchmarks use single interventions without statistical testing, making it impossible to distin...
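The statistical-testing point above can be illustrated with a generic paired test over many interventions: instead of one intervention, collect the change in a faithfulness metric across repeated interventions and ask whether the mean change is distinguishable from noise. The sign-flip permutation test below is one common choice, not necessarily the test used by the benchmark, and the numbers are made up.

```python
# Hedged sketch: paired sign-flip permutation test over many interventions.
import random


def paired_permutation_pvalue(deltas, n_permutations=10_000, seed=0):
    """Two-sided p-value for the null hypothesis that the mean paired difference is zero.

    `deltas[i]` = metric(after intervention i) - metric(before intervention i).
    """
    rng = random.Random(seed)
    observed = abs(sum(deltas) / len(deltas))
    hits = 0
    for _ in range(n_permutations):
        flipped = [d if rng.random() < 0.5 else -d for d in deltas]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Illustrative paired differences across 12 interventions (made-up numbers).
    deltas = [0.08, 0.05, 0.11, -0.02, 0.07, 0.09, 0.04, 0.06, -0.01, 0.10, 0.03, 0.05]
    print(f"p ~= {paired_permutation_pvalue(deltas):.4f}")
```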
Standard evaluation of LLM confidence relies on calibration metrics (ECE, Brier score) that conflate two distinct capacities: how much a model knows (Type-1 sensitivity) and how well it knows what it ...
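For reference, the two calibration metrics named above are straightforward to compute; the sketch below shows standard implementations. Note that the abstract's point is precisely that these aggregate numbers conflate "how much the model knows" with "how well it knows what it knows", a distinction this code does not capture.

```python
# Standard calibration metrics: Brier score and expected calibration error (ECE).
def brier_score(confidences, correct):
    """Mean squared error between stated confidence and 0/1 correctness."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(correct)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between per-bin accuracy and per-bin mean confidence."""
    n = len(correct)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if (lo < c <= hi) or (b == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(acc - conf)
    return ece


if __name__ == "__main__":
    confidences = [0.9, 0.8, 0.95, 0.6, 0.55, 0.7]
    correct = [1, 1, 0, 1, 0, 1]
    print("Brier:", round(brier_score(confidences, correct), 3))
    print("ECE:  ", round(expected_calibration_error(confidences, correct, n_bins=5), 3))
```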
We address hallucination detection in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) by framing the problem through the lens of dynamical systems stability theory. Rather th...
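As a heavily hedged illustration of the stability framing only (the abstract does not specify this procedure), one way to operationalize stability for a sequence of hidden states is to perturb the starting state slightly and measure how fast the perturbed and unperturbed trajectories diverge, a crude Lyapunov-style estimate.

```python
# Hedged sketch: divergence rate of a perturbed trajectory as a crude stability measure.
import numpy as np


def divergence_rate(states: np.ndarray, perturbed: np.ndarray) -> float:
    """Average per-step log growth of the gap between two state trajectories.

    `states`, `perturbed`: arrays of shape (T, d); positive values suggest the
    dynamics amplify small perturbations (less stable).
    """
    gaps = np.linalg.norm(states - perturbed, axis=1) + 1e-12
    log_growth = np.diff(np.log(gaps))
    return float(log_growth.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d = 20, 8
    # Toy linear dynamics: contractive (stable) vs. expansive (unstable).
    for name, scale in [("stable", 0.9), ("unstable", 1.1)]:
        A = scale * np.eye(d)
        x = rng.normal(size=d)
        x_perturbed = x + 1e-3 * rng.normal(size=d)
        traj, traj_p = [x], [x_perturbed]
        for _ in range(T - 1):
            traj.append(A @ traj[-1])
            traj_p.append(A @ traj_p[-1])
        rate = divergence_rate(np.array(traj), np.array(traj_p))
        print(f"{name}: divergence rate ~= {rate:+.3f}")
```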
Large language models (LLMs) are increasingly deployed as autonomous agents, yet evaluations focus primarily on task success rather than cultural appropriateness or evaluator reliability. We introduce...
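Evaluator reliability of the kind mentioned above is often quantified with inter-rater agreement statistics; the sketch below computes Cohen's kappa for two raters as one common option. The benchmark introduced in the abstract may use different measures, and the labels here are made up.

```python
# Hedged sketch: Cohen's kappa as a chance-corrected agreement measure between two raters.
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0


if __name__ == "__main__":
    judge_1 = ["appropriate", "appropriate", "inappropriate", "appropriate"]
    judge_2 = ["appropriate", "inappropriate", "inappropriate", "appropriate"]
    print(round(cohens_kappa(judge_1, judge_2), 3))   # -> 0.5
```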
What happens when a storyteller forgets its own story? Large Language Models (LLMs) can now generate narratives spanning tens of thousands of words, but they often fail to maintain consistency through...