Recent advancements in benchmarking methodologies are reshaping how researchers evaluate large language models (LLMs) across diverse applications. New benchmarks like SommBench and TrustMH-Bench specifically assess multilingual sommelier expertise and the trustworthiness of LLMs in mental health contexts, respectively, highlighting the need for nuanced evaluation criteria that go beyond traditional metrics. Meanwhile, frameworks such as LakeMLB and OctoBench target machine learning performance in data lake environments and scaffold-aware coding, addressing real-world complexities that prior benchmarks overlooked. The emergence of critique-resilient benchmarking techniques reflects a growing recognition of the challenges posed by increasingly sophisticated models, suggesting a shift toward adversarial evaluation methods that prioritize robustness over mere correctness. Collectively, these developments signal a concerted effort to make benchmarks more relevant and applicable, bridging the gap between theoretical performance and practical utility in applied domains, from healthcare to software development.
With the rapid advances of large language models, it becomes increasingly important to systematically evaluate their multilingual and multicultural capabilities. Previous cultural evaluation benchmarks...
Modern data lakes have emerged as foundational platforms for large-scale machine learning, enabling flexible storage of heterogeneous data and structured analytics through table-oriented abstractions....
Modern coding scaffolds turn LLMs into capable software agents, but their ability to follow scaffold-specified instructions remains under-examined, especially when constraints are heterogeneous...
While Large Language Models (LLMs) demonstrate significant potential in providing accessible mental health support, their practical deployment raises critical trustworthiness concerns due to the domain...
The advancement of large language models (LLMs) has significantly accelerated the development of search agents capable of autonomously gathering information through multi-turn web interactions. Various...
Scientific reasoning relies not only on logical inference but also on activating prior knowledge and experiential structures. Memory can efficiently reuse knowledge and enhance reasoning consistency...
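The abstract above describes reusing stored knowledge to make reasoning more consistent. A minimal sketch of what such an experience memory could look like, assuming a simple token-overlap retrieval; all names and the retrieval scheme here are illustrative assumptions, not the paper's actual design:

```python
from collections import Counter

class ExperienceMemory:
    """Toy memory store: saves (question, reasoning) pairs and
    retrieves the most lexically similar past entry for reuse."""

    def __init__(self):
        self.entries = []  # list of (question token counts, reasoning)

    def add(self, question: str, reasoning: str) -> None:
        self.entries.append((Counter(question.lower().split()), reasoning))

    def retrieve(self, question: str):
        """Return the stored reasoning whose question shares the most tokens."""
        query = Counter(question.lower().split())
        best, best_score = None, 0
        for tokens, reasoning in self.entries:
            score = sum((query & tokens).values())  # shared-token count
            if score > best_score:
                best, best_score = reasoning, score
        return best

mem = ExperienceMemory()
mem.add("why does ice float on water", "ice is less dense than liquid water")
mem.add("why is the sky blue", "Rayleigh scattering favors shorter wavelengths")
print(mem.retrieve("why does ice float"))  # reuses the most overlapping entry
```

A real system would replace token overlap with embedding similarity, but the fixed point is the same: past reasoning is retrieved and conditioned on rather than re-derived.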
While Large Language Models (LLMs) show significant potential in hardware engineering, current benchmarks suffer from saturation and limited task diversity, failing to reflect LLMs' performance in real-world...
The safe deployment of large language models (LLMs) in high-stakes fields like biomedicine requires that they be able to reason about cause and effect. We investigate this ability by testing 13 open-source...
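The kind of evaluation described above can be pictured as a harness that scores a model's answers on cause-and-effect items. A minimal sketch, assuming a multiple-choice format; the items, labels, and the stub model are illustrative assumptions, not the paper's protocol:

```python
# Hedged sketch: scoring causal-reasoning multiple-choice items against gold labels.
ITEMS = [
    {"question": "Smoking -> lung cancer: which is the cause?",
     "options": ["smoking", "lung cancer"], "gold": "smoking"},
    {"question": "Rain -> wet streets: which is the effect?",
     "options": ["rain", "wet streets"], "gold": "wet streets"},
]

def stub_model(question, options):
    """Placeholder for an LLM call; always picks the first option."""
    return options[0]

def causal_accuracy(model, items):
    """Fraction of items where the model's choice matches the gold label."""
    correct = sum(model(it["question"], it["options"]) == it["gold"] for it in items)
    return correct / len(items)

print(causal_accuracy(stub_model, ITEMS))  # stub answers 1 of 2 correctly -> 0.5
```

Swapping `stub_model` for a real LLM call (and expanding the item set) turns this into the standard accuracy-based probe such studies report.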
Recent advances in agentic Large Language Models (LLMs) have positioned them as generalist planners capable of reasoning and acting across diverse tasks. However, existing agent benchmarks largely focus...
The intersection of Mean Field Games (MFGs) and Reinforcement Learning (RL) has fostered a growing family of algorithms designed to solve large-scale multi-agent systems. However, the field currently ...
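For readers unfamiliar with the setup, the equilibrium these MFG-RL algorithms target can be stated as a fixed point between an individual best response and the population distribution it induces; the notation below is the conventional textbook formulation, not taken from the paper:

```latex
% Mean field game equilibrium as a fixed point:
% (1) a representative agent best-responds to a fixed mean-field flow \mu_t;
% (2) consistency: the flow induced by that policy must reproduce \mu_t.
\begin{align}
  \pi^{*} &\in \operatorname*{arg\,max}_{\pi}\;
    \mathbb{E}\!\left[\sum_{t=0}^{T} r\big(s_t, a_t, \mu_t\big)\right],
    \qquad a_t \sim \pi(\cdot \mid s_t), \\
  \mu_t &= \mathcal{L}\big(s_t^{\pi^{*}}\big) \quad \text{for all } t,
\end{align}
```

where \(\mathcal{L}(s_t^{\pi^{*}})\) denotes the state distribution at time \(t\) when every agent plays \(\pi^{*}\). RL-for-MFG methods approximate this fixed point iteratively, e.g. via fictitious play or online mirror descent over candidate policies.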