Recent developments in benchmarking tools for artificial intelligence are addressing critical gaps in evaluating model performance across diverse applications. The introduction of benchmarks like TML-Bench and BIRD-Python highlights the need for reliable assessments in data science and programming tasks, emphasizing the importance of end-to-end correctness and contextual understanding. Meanwhile, BEHELM aims to unify evaluation metrics for large language models in software engineering, tackling issues of robustness and interpretability. TSRBench expands the scope to time series reasoning, revealing the limitations of current models in integrating multimodal data. Additionally, DRACO and DSH-Bench focus on complex research tasks and subject-driven image generation, respectively, offering structured frameworks for assessing accuracy and model capabilities. AdaptEval specifically targets code snippet adaptation, providing insights into the practical utility of language models in real-world coding scenarios. Together, these initiatives are refining the landscape of AI evaluation, paving the way for more effective and contextually aware applications in commercial settings.
While Text-to-SQL remains the dominant approach for database interaction, real-world analytics increasingly require the flexibility of a general-purpose language such as Python, together with libraries like Pandas, to...
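To make the contrast concrete, here is a minimal sketch, assuming a hypothetical orders table, of the same analytics question posed as SQL and as Pandas code; the table name, columns, and values are invented purely for illustration.

    import pandas as pd

    # Hypothetical data standing in for a database table.
    orders = pd.DataFrame({
        "region": ["EU", "EU", "US", "US"],
        "amount": [120.0, 80.0, 200.0, 50.0],
        "returned": [False, True, False, False],
    })

    # Text-to-SQL target for "revenue per region, excluding returns":
    #   SELECT region, SUM(amount) AS revenue
    #   FROM orders
    #   WHERE returned = FALSE
    #   GROUP BY region;

    # Equivalent Pandas pipeline; further Python steps (plots, statistical
    # tests, custom metrics) can follow in the same script, which a single
    # SQL query cannot easily express.
    revenue = (
        orders[~orders["returned"]]
        .groupby("region", as_index=False)["amount"]
        .sum()
        .rename(columns={"amount": "revenue"})
    )
    print(revenue)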
Large language models for code are advancing fast, yet our ability to evaluate them lags behind. Current benchmarks focus on narrow tasks and single metrics, which hide critical gaps in robustness, interpretability, ...
Time series data is ubiquitous in real-world scenarios and crucial for critical applications ranging from energy management to traffic control. Consequently, the ability to reason over time series is ...
We present DRACO (Deep Research Accuracy, Completeness, and Objectivity), a benchmark of complex deep research tasks. These tasks, which span 10 domains and draw on information sources from 40 countries...
Recent advancements in large language models (LLMs) have automated various software engineering tasks, with benchmarks emerging to evaluate their capabilities. However, for adaptation, a critical activity...
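To illustrate what adaptation involves, here is a minimal, hypothetical sketch (not drawn from AdaptEval itself): a generic snippet, as it might be copied from a Q&A site, is reworked to fit a surrounding project's naming, parameterization, and error-handling conventions.

    from collections import Counter
    from pathlib import Path

    # Original snippet as copied, with a hard-coded path and manual counting:
    #   data = open("input.txt").read().splitlines()
    #   counts = {}
    #   for line in data:
    #       counts[line] = counts.get(line, 0) + 1

    def count_log_levels(log_path: Path) -> Counter:
        # Adapted version: parameterized path, context-managed file handle,
        # and an idiomatic Counter, matching the host project's conventions.
        with log_path.open(encoding="utf-8") as handle:
            return Counter(line.strip() for line in handle if line.strip())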