Current research in AI evaluation increasingly focuses on making benchmarks more robust and relevant across domains. Recent work introduces frameworks such as SpatialBench-UC and TSAQA, which address specific challenges in evaluating text-to-image generation and time series analysis, respectively; both emphasize the need for nuanced assessments that account for uncertainty and diverse task coverage. Frameworks such as Implicit Intelligence and RIFT highlight the limitations of existing models in understanding implicit instructions and maintaining task flow in complex workflows. Judge-aware models such as BT-sigma reveal inconsistencies in LLM-as-judge evaluations, suggesting that current systems may not be reliable for comparative assessment. Collectively, this body of work signals a shift toward evaluation methodologies that not only measure performance but also diagnose underlying cognitive limitations, paving the way for AI systems that better align with human reasoning and contextual understanding.
Evaluating whether text-to-image models follow explicit spatial instructions is difficult to automate. Object detectors may miss targets or return multiple plausible detections, and simple geometric t...
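To make the failure mode concrete, here is a minimal sketch of the kind of naive geometric test the excerpt alludes to. The detection format and function names are illustrative assumptions, not SpatialBench-UC's actual protocol: the check picks the single highest-confidence detection per object and compares bounding-box centers, which breaks down exactly when detections are missing or duplicated.

```python
def center_x(box):
    """x-coordinate of a bounding-box center; box = (x0, y0, x1, y1)."""
    x0, _, x1, _ = box
    return (x0 + x1) / 2.0

def naive_left_of(detections_a, detections_b, conf_threshold=0.5):
    """Return True/False/None for 'object A is left of object B'.

    detections_* are lists of (confidence, box) from an object detector.
    Returns None when either object is undetected, illustrating the
    brittleness described above: misses and duplicate detections make
    a hard True/False verdict unreliable.
    """
    a = [d for d in detections_a if d[0] >= conf_threshold]
    b = [d for d in detections_b if d[0] >= conf_threshold]
    if not a or not b:
        return None  # detector missed a target: no verdict possible
    # Naively keep only the highest-confidence detection of each object,
    # silently discarding other plausible candidates.
    _, box_a = max(a, key=lambda d: d[0])
    _, box_b = max(b, key=lambda d: d[0])
    return center_x(box_a) < center_x(box_b)

# Example: "a cat to the left of a dog"; the second cat box is a duplicate.
cats = [(0.9, (10, 40, 60, 90)), (0.6, (200, 40, 250, 90))]
dogs = [(0.8, (120, 30, 180, 100))]
print(naive_left_of(cats, dogs))  # True, but only because one candidate was dropped
```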
The validity of assessments, from large-scale AI benchmarks to human classrooms, depends on the quality of individual items, yet modern evaluation instruments often contain thousands of items with min...
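The excerpt does not specify the paper's method, but as background, one classical item-quality statistic is the corrected item-total (point-biserial) discrimination: an item whose correctness barely correlates with overall ability is a candidate for review or removal. A minimal sketch, with toy data:

```python
import numpy as np

def point_biserial(item_scores, total_scores):
    """Correlation between a 0/1 item score and examinees' total scores.

    item_scores: array of 0/1 responses to one item, one per examinee.
    total_scores: array of each examinee's total score on the instrument.
    """
    item = np.asarray(item_scores, dtype=float)
    total = np.asarray(total_scores, dtype=float)
    # Exclude the item itself from the total to avoid inflating the
    # correlation (the "corrected" item-total correlation).
    rest = total - item
    return float(np.corrcoef(item, rest)[0, 1])

# Toy instrument: 2 items, 6 examinees (rows are items).
items = np.array([[1, 1, 0, 0, 1, 0],
                  [1, 0, 1, 0, 1, 0]])
totals = items.sum(axis=0)
print([round(point_biserial(items[i], totals), 2) for i in range(2)])
```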
LLM-as-judge systems promise scalable, consistent evaluation. We find the opposite: judges are consistent with themselves, but not with each other. Across 3,240 evaluations (9 jud...
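The distinction between self-consistency and inter-judge agreement can be made precise with a small sketch. The data layout below (each judge labels every item in several repeated runs) and all names are illustrative assumptions, not the paper's setup:

```python
from itertools import combinations

def agreement(labels_x, labels_y):
    """Fraction of items on which two label sequences match."""
    return sum(a == b for a, b in zip(labels_x, labels_y)) / len(labels_x)

def self_consistency(runs):
    """Mean pairwise agreement among one judge's repeated runs."""
    pairs = list(combinations(runs, 2))
    return sum(agreement(x, y) for x, y in pairs) / len(pairs)

def inter_judge(judge_runs):
    """Mean agreement between different judges' first runs."""
    firsts = [runs[0] for runs in judge_runs.values()]
    pairs = list(combinations(firsts, 2))
    return sum(agreement(x, y) for x, y in pairs) / len(pairs)

# Toy data: two judges, each run twice over the same 6 items ("A" beats "B").
judge_runs = {
    "judge_1": [["A", "A", "B", "A", "B", "A"],
                ["A", "A", "B", "A", "B", "A"]],  # perfectly self-consistent
    "judge_2": [["B", "A", "A", "A", "B", "B"],
                ["B", "A", "A", "A", "B", "B"]],
}
print({j: self_consistency(r) for j, r in judge_runs.items()})  # both 1.0
print(inter_judge(judge_runs))  # 0.5: consistent, but not with each other
```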
Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment, often using pairwise comparative judgements. Existing approaches typically rely...
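For readers unfamiliar with pairwise comparative judgement, the standard way to aggregate such comparisons into per-system scores is the Bradley-Terry model; the judge-aware BT-sigma mentioned above extends this family. The sketch below is the plain model fit with Hunter's MM iteration, shown only as background, not the paper's method:

```python
from collections import defaultdict

def bradley_terry(comparisons, iters=200):
    """comparisons: list of (winner, loser) pairs. Returns {item: strength}."""
    wins = defaultdict(int)   # total wins per item
    n = defaultdict(int)      # comparison count per unordered pair
    items = set()
    for w, l in comparisons:
        wins[w] += 1
        n[frozenset((w, l))] += 1
        items.update((w, l))
    p = {i: 1.0 for i in items}
    for _ in range(iters):
        # MM update: p_i <- wins_i / sum_j n_ij / (p_i + p_j)
        new_p = {}
        for i in items:
            denom = sum(n[frozenset((i, j))] / (p[i] + p[j])
                        for j in items if j != i and n[frozenset((i, j))])
            new_p[i] = wins[i] / denom if denom else p[i]
        total = sum(new_p.values())  # rescale to keep strengths bounded
        p = {i: v * len(items) / total for i, v in new_p.items()}
    return p

# Toy usage: A usually beats B, B usually beats C.
data = [("A", "B")] * 3 + [("B", "A")] + [("B", "C")] * 3 + [("C", "B")]
print(sorted(bradley_terry(data).items(), key=lambda kv: -kv[1]))
```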
Real-world requests to AI agents are fundamentally underspecified. Natural human communication relies on shared context and unstated constraints that speakers expect listeners to infer. Current agenti...
Time series data are integral to critical applications across domains such as finance, healthcare, transportation, and environmental science. While recent work has begun to explore multi-task time ser...
Large Language Models (LLMs) are increasingly relied upon for complex workflows, yet their ability to maintain the flow of instructions remains underexplored. Existing benchmarks conflate task complexity ...
Large language models (LLMs) have demonstrated strong performance on formal language tasks, yet whether this reflects genuine symbolic reasoning or pattern matching on familiar constructions remains u...
Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis. However, current evaluation benchmarks...
Unified Multimodal Models (UMMs) have shown remarkable progress in visual generation. Yet, existing benchmarks predominantly assess $\textit{Crystallized Intelligence}$, which relies on recalling accu...