ARXIV:2602.11354 · AI FOR RESEARCH AUTOMATION · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

Verified (Source: PDF linked) | Partial (PaperPack: 3 of 4 citation fields filled) | Missing (Missing fields: authors) | Partial (Proof: unverified proof status)

ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences

arXiv

ReplicatorBench offers a benchmark for evaluating LLM agents' ability to replicate scientific research in social sciences.

Blocked on Code › Score 7.0 | Evidence unverified

Opportunity summary

Pain: ReplicatorBench offers a benchmark for evaluating LLM agents' ability to replicate scientific research in social sciences.

Evidence: 0 refs | 0 sources | 17% coverage

Blocker: Evidence unverified

PROBLEM

ReplicatorBench offers a benchmark for evaluating LLM agents' ability to replicate scientific research in social sciences. Existing benchmarks focus primarily on the computational aspect of this task, testing agents' ability to reproduce or replicate…

METHOD

Full abstract

The literature has witnessed an emerging interest in AI agents for automated assessment of scientific papers. Existing benchmarks focus primarily on the computational aspect of this task, testing agents' ability to reproduce or replicate research outcomes when given access to the code and data. This setting, while foundational, (1) fails to capture the inconsistent availability of new data for replication as opposed to reproduction, and (2) lacks ground-truth diversity by focusing only on reproducible papers, thereby failing to evaluate an agent's ability to identify non-replicable research. Furthermore, most benchmarks only evaluate outcomes rather than the replication process. In response, we introduce ReplicatorBench, an end-to-end benchmark, including human-verified replicable and non-replicable research claims in social and behavioral sciences for evaluating AI agents in research replication across three stages: (1) extraction and retrieval of replication data; (2) design and execution of computational experiments; and (3) interpretation of results, allowing a test of AI agents' capability to mimic the activities of human replicators in the real world. To set a baseline of AI agents' capability, we develop ReplicatorAgent, an agentic framework equipped with necessary tools, like web search and iterative interaction with sandboxed environments, to accomplish tasks in ReplicatorBench. We evaluate ReplicatorAgent across four underlying large language models (LLMs), as well as different design choices of programming language and levels of code access. Our findings reveal that while current LLM agents are capable of effectively designing and executing computational experiments, they struggle with retrieving resources, such as new data, necessary to replicate a claim. All code and data are publicly available at https://github.com/CenterForOpenScience/llm-benchmarking.
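One way to picture the three-stage evaluation described in the abstract is a per-stage scoring record. The sketch below is illustrative only: the stage names, the `AgentRun` fields, and both metrics are assumptions made for this example, not the schema or metrics used by ReplicatorBench or the linked repository. It shows why scoring stages separately, and splitting verdict accuracy by ground-truth label, can expose a weak retrieval stage or a bias toward calling every claim replicable.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical stage names mirroring the three stages listed in the abstract.
STAGES = ("extraction_retrieval", "design_execution", "interpretation")


@dataclass(frozen=True)
class AgentRun:
    """One agent attempt at one claim (all fields are illustrative assumptions)."""
    claim_id: str
    truth_replicable: bool               # human-verified ground-truth label
    stage_ok: Dict[str, bool]            # per-stage pass/fail, keyed by STAGES
    verdict_replicable: Optional[bool]   # agent's final call; None if no verdict


def stage_pass_rates(runs: List[AgentRun]) -> Dict[str, float]:
    """Fraction of runs that passed each stage, scored independently."""
    if not runs:
        return {stage: 0.0 for stage in STAGES}
    return {
        stage: sum(run.stage_ok.get(stage, False) for run in runs) / len(runs)
        for stage in STAGES
    }


def verdict_accuracy_by_label(runs: List[AgentRun]) -> Dict[bool, float]:
    """Verdict accuracy split by ground truth, so non-replicable claims count too."""
    accuracy: Dict[bool, float] = {}
    for label in (True, False):
        subset = [run for run in runs if run.truth_replicable is label]
        if subset:
            # A missing verdict (None) never equals the label, so it counts as wrong.
            correct = sum(run.verdict_replicable == label for run in subset)
            accuracy[label] = correct / len(subset)
    return accuracy
```

Under these assumptions, an agent that executes experiments well but fails to retrieve data would show a high `design_execution` rate next to a low `extraction_retrieval` rate, which is the pattern the abstract reports qualitatively.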

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. The paper introduces ReplicatorBench, an end-to-end benchmark, including human-verified replicable and non-replicable research claims in social and behavioral sciences for evaluating AI agents…

WHY NOW

AI for Research Automation moved forward this cycle; last verified April 2026. Public score 7.0/10.

Opportunity summary

Score: 7.0

Pain: ReplicatorBench offers a benchmark for evaluating LLM agents' ability to replicate scientific research in social sciences.

Evidence: 0 refs | 0 sources | 17% coverage

Blocker: missing authors

Analysis summary

ReplicatorBench offers a benchmark for evaluating LLM agents' ability to replicate scientific research in social sciences.

Competitive landscape

ReplicatorBench offers a benchmark for evaluating LLM agents' ability to replicate scientific research in social sciences.

Segment

AI for Research Automation

Adoption evidence

Public code link stated in the abstract: https://github.com/CenterForOpenScience/llm-benchmarking

Commercial read

7.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

Authors

Bang Nguyen (University of Notre Dame), Dominik Soós (Old Dominion University), Qian Ma (Pennsylvania State University), Rochana R. Obadage (Old Dominion University), Zack Ranjan (Pennsylvania State University), Sai Koneru (Pennsylvania State University), Timothy M. Errington (Center for Open Science), Shakhlo Nematova (Center for Open Science), Sarah Rajtmajer (Pennsylvania State University), Jian Wu (Old Dominion University), Meng Jiang (University of Notre Dame)

Published

2026-02-11 | arXiv: https://arxiv.org/abs/2602.11354

What products could be built from this research?

Developing ReplicatorBench as an easy-to-use tool for academic institutions, letting researchers input a paper to initiate an AI-driven replicability assessment, providing detailed reports and insights.

What are the practical use cases?

A SaaS platform for research institutions to automatically check and verify the replicability of published studies using AI agents, saving time and resources in academic validation processes.

What industries could this research disrupt?

This tool can replace existing manual processes used for replication checks, which are often costly and time-consuming, streamlining the validation of social science research.
