Recent work in AI benchmarking focuses on evaluating large language models (LLMs) across a wider range of applications, closing gaps in how performance is assessed. New benchmarks such as AgentDrive and DSAEval provide structured, real-world scenarios for autonomous systems and data science agents, respectively, testing reasoning that depends on multimodal perception and iterative interaction with open-ended problems. Frameworks such as Gaia2 and ConstraintBench extend evaluation to dynamic, asynchronous environments and constrained optimization tasks, where current models still fall well short of optimal performance. The ARC Prize 2025 reflects growing interest in few-shot generalization on novel tasks and in iterative refinement, pointing toward more nuanced evaluation metrics. Collectively, these efforts aim to establish robust standards that can guide the development of AI systems able to operate reliably in real-world settings.
The rapid advancement of large language models (LLMs) has sparked growing interest in their integration into autonomous systems for reasoning-driven perception, planning, and decision-making. However,...
Recent LLM-based data agents aim to automate data science tasks ranging from data analysis to deep learning. However, the open-ended nature of real-world data science problems, which often span multip...
We introduce Gaia2, a benchmark for evaluating large language model agents in realistic, asynchronous environments. Unlike prior static or synchronous evaluations, Gaia2 introduces scenarios where env...
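The abstract is truncated here, but the core idea of asynchronous evaluation is that the environment keeps changing while the agent deliberates, so slow or purely turn-based agents fall behind. A minimal asyncio sketch of that dynamic follows; this is not Gaia2's actual harness, and the event names and timings are invented purely for illustration.

```python
import asyncio

async def environment(events: asyncio.Queue):
    """Emits events on its own clock, regardless of what the agent is doing."""
    for delay, event in [(0.1, "email arrives"), (0.3, "meeting moved"), (0.5, "deadline passes")]:
        await asyncio.sleep(delay)
        await events.put(event)
    await events.put(None)  # sentinel: scenario over

async def agent(events: asyncio.Queue):
    """A slow agent: events that arrive while it 'thinks' pile up unhandled."""
    while True:
        event = await events.get()
        if event is None:
            break
        print(f"agent observes: {event}")
        await asyncio.sleep(0.4)  # deliberation time; the world moves on meanwhile

async def main():
    events: asyncio.Queue = asyncio.Queue()
    await asyncio.gather(environment(events), agent(events))

asyncio.run(main())
```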
Modern foundational Multimodal Large Language Models (MLLMs) and video world models have advanced significantly in mathematical, common-sense, and visual reasoning, but their grasp of the underlying p...
The ARC-AGI benchmark series serves as a critical measure of few-shot generalization on novel tasks, a core aspect of intelligence. The ARC Prize 2025 global competition targeted the newly released AR...
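ARC-style tasks are distributed as JSON objects containing a few demonstration input/output grid pairs plus held-out test inputs, and scoring is exact match on the predicted output grid. The sketch below shows that evaluation loop on a toy task; the grids and the mirror-row transformation are invented for illustration and are not an actual ARC-AGI-2 task.

```python
# Toy task in the ARC JSON layout: demonstration pairs plus test pairs.
task = {
    "train": [
        {"input": [[1, 0], [2, 3]], "output": [[0, 1], [3, 2]]},
        {"input": [[5, 6], [7, 8]], "output": [[6, 5], [8, 7]]},
    ],
    "test": [
        {"input": [[4, 9], [0, 2]], "output": [[9, 4], [2, 0]]},
    ],
}

def solve(grid):
    """Stand-in for a few-shot solver; here it simply mirrors each row."""
    return [list(reversed(row)) for row in grid]

# ARC scoring is exact match on the full predicted output grid.
correct = sum(solve(pair["input"]) == pair["output"] for pair in task["test"])
print(f"solved {correct}/{len(task['test'])} test pairs")
```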
Large language models are increasingly applied to operational decision-making where the underlying structure is constrained optimization. Existing benchmarks evaluate whether LLMs can formulate optimi...
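Tasks of this kind require translating a natural-language operational scenario into decision variables, an objective, and constraints. As a hedged illustration of what such a formulation looks like (not ConstraintBench's actual format, which the abstract does not specify), here is a small production-planning linear program solved with scipy; the numbers are invented.

```python
from scipy.optimize import linprog

# Toy scenario: choose production quantities x1, x2 to maximize profit
# 3*x1 + 5*x2 subject to machine-hour and material limits.
# linprog minimizes, so the objective is negated.
c = [-3.0, -5.0]                 # objective coefficients (negated for maximization)
A_ub = [[1.0, 2.0],              # machine hours:  x1 + 2*x2 <= 14
        [3.0, 1.0]]              # material units: 3*x1 +  x2 <= 18
b_ub = [14.0, 18.0]
bounds = [(0, None), (0, None)]  # non-negative production quantities

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print("optimal plan:", res.x, "profit:", -res.fun)
```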
Despite strong performance on existing benchmarks, it remains unclear whether large language models can reason over genuinely novel scientific information. Most evaluations score end-to-end RAG pipeli...
True self-evolution requires agents to act as lifelong learners that internalize novel experiences to solve future problems. However, rigorously measuring this foundational capability is hindered by t...
AI algorithms for imperfect-information games are typically compared using performance metrics on individual games, making it difficult to assess robustness across game choices. Card games are a natur...
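One simple way to compare algorithms across a suite of games, rather than on individual titles, is to report both the mean and the worst-case per-game score. The sketch below shows that aggregation; the algorithms, games, and win rates are invented for illustration.

```python
# Hypothetical win rates of two algorithms across several card games.
results = {
    "algo_A": {"poker": 0.62, "bridge": 0.48, "hearts": 0.55},
    "algo_B": {"poker": 0.70, "bridge": 0.35, "hearts": 0.58},
}

def robustness_summary(scores):
    values = list(scores.values())
    return {
        "mean": sum(values) / len(values),  # average performance across games
        "worst_case": min(values),          # robustness to the choice of game
    }

for name, scores in results.items():
    print(name, robustness_summary(scores))
```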
Large language models (LLMs) have become widely adopted as automated judges for evaluating AI-generated content. Despite their success, aligning LLM-based evaluations with human judgments remains chal...
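A common way to quantify that alignment is rank correlation and exact agreement between judge scores and human ratings. A minimal sketch follows; the 1-to-5 rating arrays are made up for illustration.

```python
from scipy.stats import spearmanr

# Hypothetical 1-5 quality ratings for ten responses.
human_scores = [5, 3, 4, 2, 5, 1, 3, 4, 2, 5]
judge_scores = [4, 3, 5, 2, 5, 2, 3, 4, 1, 4]

# Rank correlation captures whether the judge orders responses like humans do.
rho, p_value = spearmanr(human_scores, judge_scores)

# Exact agreement is a cruder but easy-to-read complement.
agreement = sum(h == j for h, j in zip(human_scores, judge_scores)) / len(human_scores)

print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f}), exact agreement = {agreement:.0%}")
```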