CogToM

Gold definition
Updated Apr 2, 2026

Definition

CogToM is a comprehensive, theoretically grounded benchmark designed to evaluate the Theory of Mind (ToM) capabilities of Large Language Models (LLMs). It comprises over 8000 bilingual instances across 46 diverse paradigms, moving beyond narrow false belief tasks to capture the full spectrum of human cognitive mechanisms.

At a glance

Executive summary

CogToM is an extensive test of whether AI models can reason about other people's mental states the way humans do. It uses thousands of varied scenarios to check whether models genuinely handle complex social cognition, revealing where they succeed and where they still fall short of human thinking.

TL;DR

CogToM is a large test of whether AI models can understand what others are thinking, with much broader coverage than older, simpler tests.

Key points

A comprehensive, theoretically grounded benchmark with 8000+ bilingual instances across 46 cognitive paradigms.
Addresses the narrow scope of existing Theory of Mind (ToM) benchmarks for Large Language Models (LLMs).
Used by researchers and ML engineers to evaluate and develop LLMs with advanced cognitive capabilities.
Offers a broader and more theoretically grounded assessment of ToM compared to restricted false belief tasks.
Aids in investigating LLM cognitive boundaries and understanding potential divergences from human cognition.

Use cases

Benchmarking new LLM architectures for their Theory of Mind capabilities before deployment.
Identifying specific cognitive weaknesses in current LLMs to guide future model development and fine-tuning.
Comparing the ToM performance of different frontier models (e.g., GPT-5.1 vs. Qwen3-Max) across diverse scenarios.
Researching the fundamental differences between artificial and human intelligence in social cognition.
Developing more robust and human-aligned AI systems by understanding their cognitive limitations.
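
To make the benchmarking and weakness-finding use cases above concrete, here is a minimal Python sketch of per-paradigm scoring. It is an illustration under stated assumptions, not CogToM's actual data format or evaluation code: the record fields (`paradigm`, `answer`, `prediction`), the helper names, and the toy records are all hypothetical, and exact-match accuracy is assumed as the metric.

```python
from collections import defaultdict


def accuracy_by(records, key="paradigm"):
    """Group scored records by `key` and return {group: exact-match accuracy}.

    Each record is assumed (hypothetically) to be a dict holding the gold
    `answer`, the model's `prediction`, and the grouping field.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = record[key]
        total[group] += 1
        if record["prediction"] == record["answer"]:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


def weakest_groups(scores, k=3):
    """Return the k groups with the lowest accuracy, worst first."""
    return sorted(scores, key=scores.get)[:k]


# Toy records for illustration only; these are not real benchmark items or results.
records = [
    {"paradigm": "false_belief", "answer": "A", "prediction": "A"},
    {"paradigm": "false_belief", "answer": "B", "prediction": "C"},
    {"paradigm": "placeholder_paradigm", "answer": "D", "prediction": "D"},
]

scores = accuracy_by(records)
print(scores)
print(weakest_groups(scores, k=1))
```

Running the same function over the outputs of two models gives a paradigm-by-paradigm comparison. Reporting per-paradigm scores instead of a single aggregate keeps a strong result on one task family from masking weaknesses elsewhere.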