ARXIV:2604.10825 · LLM BENCHMARKING · SUBMITTED 14 APR · 20:29 UTC · FRESHNESS STALE

Source: PDF linked (Verified) · PaperPack: citation fields available (Verified) · Proof: unverified proof status (Partial)

CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms

Zacharie Bugaud · arXiv

CheeseBench evaluates LLMs on rodent behavioral neuroscience paradigms, revealing current models lag behind animal baselines and are sensitive to interface parameters.

Ship in 2-4 weeks · Score 5.0 · Evidence unverified

Opportunity summary

Pain: CheeseBench evaluates LLMs on rodent behavioral neuroscience paradigms, revealing current models lag behind animal baselines and are sensitive to interface parameters.

Evidence: 0 refs | 3 sources | 50% coverage

Blocker: Evidence unverified

PROBLEM

CheeseBench tests whether LLM agents can solve classical rodent behavioral neuroscience paradigms. Current open-weight models fall well short of approximate rodent baselines, and their scores shift substantially with interface parameters alone. Each task is grounded in peer-reviewed rodent protocols with approximate animal baselines.

METHOD

Full abstract

We introduce CheeseBench, a benchmark that evaluates large language models (LLMs) on nine classical behavioral neuroscience paradigms (Morris water maze, Barnes maze, T-maze, radial arm maze, star maze, operant chamber, shuttle box, conditioned place preference, and delayed non-match to sample), spanning six cognitive dimensions. Each task is grounded in peer-reviewed rodent protocols with approximate animal baselines. The agent receives a unified system prompt with no task-specific instructions and must discover goals purely from ASCII text observations and reward signals, much like a rodent placed into an unfamiliar apparatus. We evaluate six open-weight LLMs (3B to 72B parameters) on text-based ASCII renderings and compare against both a random baseline and a graph-based reinforcement learning agent. Our best model (Qwen2.5-VL-7B) reaches 52.6% average success on ASCII input, compared to 32.1% for random agents and 78.9% for approximate rodent baselines. We find that (1) scaling beyond 7B yields diminishing returns, (2) longer context history degrades performance, (3) chain-of-thought prompting hurts rather than helps, and (4) a vision-language architecture provides an advantage at 7B but hurts at 32B. Because the same model's performance ranges from 20% to 57% depending on interface parameters alone, these results characterize the agent-plus-interface system, not the model in isolation. Under this unified zero-shot ASCII protocol, current open-weight LLM agents remain well below approximate rodent reference values, particularly on tasks requiring spatial navigation and within-trial state tracking.
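The interaction protocol described in the abstract (ASCII observation in, action out, reward as the only goal signal) can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the benchmark's code: `ToyGrid`, `random_policy`, `run_episode`, and `success_rate` are hypothetical names, the 5x5 arena is not one of the nine CheeseBench tasks, and the action set and reward values are invented for the example.

```python
import random
from dataclasses import dataclass

# Hypothetical action set; the paper's actual action space is not given in the abstract.
ACTIONS = ("up", "down", "left", "right")
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


@dataclass
class ToyGrid:
    """Hypothetical open arena with a hidden goal (not a CheeseBench task)."""
    size: int = 5
    goal: tuple = (4, 4)
    pos: tuple = (0, 0)

    def reset(self):
        self.pos = (0, 0)
        return self.render()

    def render(self):
        # The goal is deliberately not drawn: the agent only sees itself ("A")
        # and must infer the goal from the reward signal.
        return "\n".join(
            "".join("A" if (r, c) == self.pos else "." for c in range(self.size))
            for r in range(self.size)
        )

    def step(self, action):
        dr, dc = MOVES[action]
        r = min(max(self.pos[0] + dr, 0), self.size - 1)
        c = min(max(self.pos[1] + dc, 0), self.size - 1)
        self.pos = (r, c)
        done = self.pos == self.goal
        return self.render(), (1.0 if done else 0.0), done


def random_policy(observation, last_reward, rng):
    """Random baseline: ignores the observation and reward entirely."""
    return rng.choice(ACTIONS)


def run_episode(env, policy, max_steps=50, seed=0):
    """Run one episode; return (success, steps_taken)."""
    rng = random.Random(seed)
    obs, reward = env.reset(), 0.0
    for t in range(1, max_steps + 1):
        action = policy(obs, reward, rng)
        obs, reward, done = env.step(action)
        if done:
            return True, t
    return False, max_steps


def success_rate(policy, episodes=200, max_steps=50):
    """Fraction of episodes in which the policy reaches the hidden goal."""
    wins = sum(
        run_episode(ToyGrid(), policy, max_steps, seed=s)[0] for s in range(episodes)
    )
    return wins / episodes
```

An LLM agent would slot in as another `policy` callable that formats the ASCII observation and last reward into a prompt and parses an action from the reply; the loop and the success-rate accounting stay the same, which is one way to read the abstract's point that results describe the agent-plus-interface system.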

RESULT

ScienceToStartup currently rates this 5.0/10 on the public viability pass. Because the same model's performance ranges from 20% to 57% depending on interface parameters alone, these results characterize the agent-plus-interface system, not the model…
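To put the headline figures on one scale, the short calculation below expresses the best model's score as the fraction of the random-to-rodent gap it closes. The input numbers are taken from the abstract; the normalization itself and the name `normalized_score` are assumptions made for illustration, not a metric the paper is known to report. Applying the same scale to the 20% to 57% interface range further assumes those figures are comparable to the benchmark-wide averages, which the abstract does not confirm.

```python
def normalized_score(score, random_baseline, reference):
    """Fraction of the random-to-reference gap closed by `score` (inputs in percent)."""
    return (score - random_baseline) / (reference - random_baseline)


# Average success rates quoted in the abstract (percent).
BEST_MODEL = 52.6        # Qwen2.5-VL-7B on ASCII input
RANDOM_AGENT = 32.1
RODENT_REFERENCE = 78.9  # approximate rodent baselines

gap_closed = normalized_score(BEST_MODEL, RANDOM_AGENT, RODENT_REFERENCE)

# Range reported for one model under different interface parameters (percent).
# Assumption: treated here as comparable to the averages above.
INTERFACE_LOW, INTERFACE_HIGH = 20.0, 57.0
interface_spread = INTERFACE_HIGH - INTERFACE_LOW
low_norm = normalized_score(INTERFACE_LOW, RANDOM_AGENT, RODENT_REFERENCE)
high_norm = normalized_score(INTERFACE_HIGH, RANDOM_AGENT, RODENT_REFERENCE)

if __name__ == "__main__":
    print(f"gap closed by best model: {gap_closed:.3f}")
    print(f"interface spread: {interface_spread:.1f} points")
    print(f"normalized range across interfaces: {low_norm:.3f} to {high_norm:.3f}")
```

On this scale the best model closes a bit under half of the gap between random and rodent performance, while the interface-driven spread of 37 points is larger than the 20.5-point margin separating the best model from the random agent.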

WHY NOW

LLM Benchmarking moved forward this cycle; last verified April 2026. Public score 5.0/10. Production flags indicate code availability.

Competitive landscape

Segment: LLM Benchmarking

Adoption evidence: No public code link in the paper record yet

Commercial read: 5.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

