Benchmark Development

Proof pending

4 papers

5.0 viability

Proof pending. This topic has not reached the minimum paper threshold yet.

Topic-linked question coverage is still building for this proof surface.

Papers

Research Paper·Feb 16, 2026

MATEO: A Multimodal Benchmark for Temporal Reasoning and Planning in LVLMs

AI agents need to plan to achieve complex goals that involve orchestrating perception, sub-goal decomposition, and execution. These plans consist of ordered steps structured according to a Temporal Ex...

5.0 viability

Research Paper·Feb 26, 2026

SPM-Bench: Benchmarking Large Language Models for Scanning Probe Microscopy

As LLMs achieved breakthroughs in general reasoning, their proficiency in specialized scientific domains reveals pronounced gaps in existing benchmarks due to data contamination, insufficient complexi...

5.0 viability

Research Paper·Mar 4, 2026

Towards Realistic Personalization: Evaluating Long-Horizon Preference Following in Personalized User-LLM Interactions

Large Language Models (LLMs) are increasingly serving as personal assistants, where users share complex and diverse preferences over extended interactions. However, assessing how well LLMs can follow ...

5.0 viability

Research Paper·Feb 23, 2026

Watson & Holmes: A Naturalistic Benchmark for Comparing Human and LLM Reasoning

Existing benchmarks for AI reasoning provide limited insight into how closely these capabilities resemble human reasoning in naturalistic contexts. We present an adaptation of the Watson & Holmes dete...

5.0 viability
