ARXIV:2605.00803 · AI AGENTS · SUBMITTED 04 MAY · 20:24 UTC · FRESHNESS STALE

Verified · Source: PDF linked | Verified · PaperPack: citation fields available | Partial · Proof: unverified proof status

Can Coding Agents Reproduce Findings in Computational Materials Science?

Ziyang Huang · Yi Cao · Ali K. Shargh · Jing Luo · Ruidong Mei · Mohd Zaki · +12 at arXiv

A new benchmark, AutoMat, evaluates the ability of LLM-based coding agents to reproduce findings in computational materials science, revealing current limitations.

Ship in 2-4 weeks · Score 4.0 · Evidence unverified

Opportunity summary

Pain: A new benchmark, AutoMat, evaluates the ability of LLM-based coding agents to reproduce findings in computational materials science, revealing current limitations.

Evidence: 0 refs | 4 sources | 50% coverage

Blocker: Evidence unverified

PROBLEM

AutoMat, a new benchmark, evaluates the ability of LLM-based coding agents to reproduce findings in computational materials science, revealing current limitations. Such agents perform strongly on software engineering benchmarks; however, it is unclear whether such success transfers to computational scientific workflows, where…

METHOD

Full abstract

Large language models are increasingly deployed as autonomous coding agents and have achieved remarkably strong performance on software engineering benchmarks. However, it is unclear whether such success transfers to computational scientific workflows, where tasks require not only strong coding ability, but also the ability to navigate complex, domain-specific procedures and to interpret results in the context of scientific claims. To address this question, we present AutoMat, a benchmark for evaluating LLM-based agents' ability to reproduce claims from computational materials science. AutoMat poses three interrelated challenges: recovering underspecified computational procedures, navigating specialized toolchains, and determining whether the resulting evidence supports a claim. By working closely with subject matter experts, we curate a set of claims from real materials science papers to test whether coding agents can recover and execute the end-to-end workflow needed to support (or undermine) such claims. We then evaluate multiple representative coding agent settings across several foundation models. Our results show that current LLM-based agents obtain low overall success rates on AutoMat, with the best-performing setting achieving a success rate of only 54.1%. Error analysis further reveals that agents perform worst when workflows must be reconstructed from paper text alone and that they fail primarily due to incomplete procedures, methodological deviations, and execution fragility. Taken together, these findings position AutoMat as both a benchmark for computational scientific reproducibility and a tool for diagnosing the current limitations of agentic systems in AI-for-science settings.

RESULT

The best-performing agent setting achieves a success rate of only 54.1% on AutoMat, and agents perform worst when workflows must be reconstructed from paper text alone. ScienceToStartup currently rates this 4.0/10 on the public viability pass.

WHY NOW

AI Agents moved forward this cycle; last verified May 2026. Public score 4.0/10. Implementation evidence is present through a linked repository.

Shell-level blocker: none reported

Paper Pack

DOI: 10.48550/arXiv.2605.00803

Source availability

PDF linked

The paper record includes a public PDF URL.

Extraction status

Parse run linked

A document parse run is attached to this paper.

Proof status

unverified

0 refs; 4 sources; 50% coverage.

What was readable

linked · on file · 14 anchors · derived fallback · 38 indexed · not indexed

Derived fallback: Estimated from adjacent evidence; not verified from source.

Viability

4.0

Time to MVP

MVP estimate missing

Commercial

code repo URL

Claim map

Abstract-backed public claims while anchored extraction refreshes.

Strong 0 · Mixed 0 · Weak 4

Evidence (partial): A new benchmark, AutoMat, evaluates the ability of LLM-based coding agents to reproduce findings in computational materials science, revealing current limitations. However, it is unclear whether such success transfers to computational scientific workflows, where tasks require not only strong coding ability, but also the ability to navigate complex, domain-specific procedures and to interpret results in the context of scientific claims.
Implication (partial): Abstract-backed fallback claim; anchored extraction has not materialized a public claim row yet.
Verification: partial

Evidence (partial): Large language models are increasingly deployed as autonomous coding agents and have achieved remarkably strong performance on software engineering benchmarks. However, it is unclear whether such success transfers to computational scientific workflows, where tasks require not only strong coding ability, but also the ability to navigate complex, domain-specific procedures and to interpret results in the context of scientific claims.
Implication (partial): Abstract-backed fallback claim; anchored extraction has not materialized a public claim row yet.
Verification: partial

Evidence (partial): ScienceToStartup currently rates this 4.0/10 on the public viability pass. However, it is unclear whether such success transfers to computational scientific workflows, where tasks require not only strong coding ability, but also the ability to navigate complex, domain-specific procedures and to interpret results in the context of scientific claims. A public repository is linked, so build verification can inspect implementation evidence instead of treating the paper as PDF-only.
Implication (partial): Abstract-backed fallback claim; anchored extraction has not materialized a public claim row yet.
Verification: partial

Evidence (partial): AI Agents moved forward this cycle; last verified May 2026. Public score 4.0/10. Implementation evidence is present through a linked repository.
Implication (partial): Abstract-backed fallback claim; anchored extraction has not materialized a public claim row yet.
Verification: partial

Constellation map

Paper-native neighborhood for concepts, methods, materials, markets, and competitors. Missing lanes stay labeled instead of disappearing behind commercialization gates.

Concepts

not indexed

Methods

Materials

PDF linked · Document parse run

Markets

AI Agents

Competitors

not indexed

Competitive landscape

Segment: AI Agents
Adoption evidence: Public code linked for build inspection
Commercial read: 4.0/10 public viability
Direct: not classified
Adjacent: not classified
Substitute: not classified
Unknown: not classified

Buzz

No indexed public discussion is attached to 2605.00803 yet. That is a visibility signal, not a blank module: the monitor is watching the public channels below.

Hacker News

Not indexed yet

Bluesky

Not indexed yet

CITED BY

No citing papers are indexed in the public S2S graph yet. This is an explicit zero-signal state, not a hidden lookup.

Foundation

Prior Work: AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation (4.0)

Extension

none indexed

Commercially relevant

Higher Viability: AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery (7.0)
Higher Viability: ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences (7.0)
Higher Viability: Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction (7.0)
Higher Viability: AI-for-Science Low-code Platform with Bayesian Adversarial Multi-Agent Framework (7.0)
Higher Viability: Towards grounded autonomous research: an end-to-end LLM mini research loop on published computational physics (7.0)
Higher Viability: A collaborative agent with two lightweight synergistic models for autonomous crystal materials research (7.0)
Higher Viability: EmboCoach-Bench: Benchmarking AI Agents on Developing Embodied Robots (6.0)
Higher Viability: Agentic Design of Compositional Descriptors via Autoresearch for Materials Science Applications (8.0)

Conflicting

Competing Approach: AI scientists produce results without reasoning scientifically (1.0)

Agent drawer

5 surfaces preserved for agents. Humans can ignore.

Developer contracts, payload previews, evidence maps, and run controls stay here instead of the Read, Build, and Track workspace.

Run context

Paper: 2605.00803
Route: /paper/can-coding-agents-reproduce-findings-in-computational-materials-science
Active tab: read
Artifact: can-coding-agents-reproduce-findings-in-computational-materials-science

Available agents

Read extractor
Build planner
Track monitor
Competitive mapper
Related-paper scout

API/MCP endpoints

REST paper pack API: /api/v1/paper/can-coding-agents-reproduce-findings-in-computational-materials-science/paper-pack
REST build passport API: /api/v1/paper/can-coding-agents-reproduce-findings-in-computational-materials-science/build-passport
REST OpenAPI: /api/openapi.json
MCP descriptor: /api/mcp
MCP resource: sciencetostartup://surfaces/paper-workspace

Tool contracts

paper_pack, build_passport, opportunity_kernel, foresight, source_proof, evidence_state

Payload preview

{
  "contract_version": "paper-r2",
  "paper_id": "ef753902-f5e9-48a0-9c57-148f06cd43cd",
  "arxiv_id": "2605.00803",
  "canonical_route": "/paper/can-coding-agents-reproduce-findings-in-computational-materials-science",
  "active_tab": "synced from current hash by the drawer client",
  "selected_artifact": "can-coding-agents-reproduce-findings-in-computational-materials-science",
  "endpoints": {
    "paper_pack": "/api/v1/paper/can-coding-agents-reproduce-findings-in-computational-materials-science/paper-pack",
    "build_passport": "/api/v1/paper/can-coding-agents-reproduce-findings-in-computational-materials-science/build-passport",
    "mcp_resource": "sciencetostartup://surfaces/paper-workspace"
  }
}
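
The payload lists its endpoints as relative paths plus one custom-scheme MCP resource. A minimal sketch of how a client might turn them into absolute URLs, assuming the same origin as the curl example on this page (`https://sciencetostartup.com`). The helper name `resolve_endpoints` is hypothetical, and the payload literal is trimmed to the fields used here.

```python
import json
from urllib.parse import urljoin

# Assumption: same origin as the curl example on this page.
BASE_URL = "https://sciencetostartup.com"

# Trimmed copy of the payload preview (only the fields used below).
PAYLOAD = json.loads("""
{
  "contract_version": "paper-r2",
  "arxiv_id": "2605.00803",
  "endpoints": {
    "paper_pack": "/api/v1/paper/can-coding-agents-reproduce-findings-in-computational-materials-science/paper-pack",
    "build_passport": "/api/v1/paper/can-coding-agents-reproduce-findings-in-computational-materials-science/build-passport",
    "mcp_resource": "sciencetostartup://surfaces/paper-workspace"
  }
}
""")


def resolve_endpoints(payload, base_url=BASE_URL):
    """Return absolute targets for every endpoint in the payload.

    Relative paths are joined onto base_url; entries that already carry
    their own scheme (such as the MCP resource URI) are returned unchanged.
    """
    return {
        name: urljoin(base_url, target)
        for name, target in payload["endpoints"].items()
    }
```

Calling `resolve_endpoints(PAYLOAD)` yields a dictionary keyed by the same endpoint names, which keeps the client independent of how the paths are laid out.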

Schema validation

paper-r2 contract: present
JSON-LD twin: SSR emitted
OpenAPI path parity: /api/openapi.json
MCP resource parity: paper-workspace

Job trace

queued: drawer opened by user action
running: inspect or copy payload
succeeded: payload available in SSR
failed: route errors appear in evidence cards

Evidence map

sources used: page freshness, source proof anchors, JSON-LD
missing sources: exposed by PaperPack and EvidenceState chips
derived fallbacks: marked unverified before handoff

Page Freshness

Canonical route, proof status, last verified, refs, sources, and coverage.

Paper proof surface

Canonical route: /paper/can-coding-agents-reproduce-findings-in-computational-materials-science

Proof freshness: stale
Proof status: unverified
Display score: 4/10
Last proof check: 2026-05-04
Score updated: 2026-05-04
Score fresh until: 2026-06-03
References: 0
Source count: 4
Coverage: 50%

This page is showing the last landed evidence receipt and score bundle because the latest proof data is outside the freshness window.
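
The dates listed above imply a 30-day score window (2026-05-04 to 2026-06-03). A small sketch of that arithmetic follows. The staleness rule in `is_stale` is an assumption, since the page does not state how the freshness window is evaluated, and both function names are hypothetical.

```python
from datetime import date

# Values copied from the freshness fields on this page.
SCORE_UPDATED = date(2026, 5, 4)
SCORE_FRESH_UNTIL = date(2026, 6, 3)


def window_days(updated, fresh_until):
    """Length of the freshness window in whole days."""
    return (fresh_until - updated).days


def is_stale(today, fresh_until=SCORE_FRESH_UNTIL):
    """Assumed rule: the score counts as stale once today is past fresh_until."""
    return today > fresh_until
```

Under this assumed rule, the score is still fresh on 2026-06-03 itself and becomes stale the following day. The page reports proof freshness separately, so this check may not match how the "stale" badge is actually computed.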

OpenAlex: pending — this preprint is not yet indexed by OpenAlex.

Agent Handoff

Endpoint list, payload shape, route context, and copyable handoff data.

Can Coding Agents Reproduce Findings in Computational Materials Science?

Canonical ID can-coding-agents-reproduce-findings-in-computational-materials-science | Route /paper/can-coding-agents-reproduce-findings-in-computational-materials-science

REST example

curl https://sciencetostartup.com/api/v1/agent-handoff/paper/can-coding-agents-reproduce-findings-in-computational-materials-science
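
For callers that prefer Python over curl, the same request can be sketched with the standard library. This assumes the endpoint answers a plain GET with a JSON body, which the page does not confirm; `handoff_url` and `fetch_handoff` are hypothetical helper names.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "https://sciencetostartup.com"
SLUG = "can-coding-agents-reproduce-findings-in-computational-materials-science"


def handoff_url(slug, base_url=BASE_URL):
    """Build the agent-handoff URL shown in the curl example."""
    return f"{base_url}/api/v1/agent-handoff/paper/{slug}"


def fetch_handoff(slug, timeout=10.0):
    """GET the handoff document and decode it.

    Assumption: the response body is JSON. Network and HTTP errors are
    left to propagate so the caller can decide how to handle them.
    """
    request = Request(handoff_url(slug), headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

`fetch_handoff(SLUG)` performs the network call; `handoff_url(SLUG)` can be used on its own to log or test the target without touching the network.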

MCP example

{
  "tool": "get_paper",
  "arguments": {
    "arxiv_id": "2605.00803"
  }
}
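
The MCP example above is a shorthand with `tool` and `arguments` keys. If the server follows the standard MCP request shape, a client would wrap it in a JSON-RPC 2.0 `tools/call` message, roughly as sketched below. The page does not document the transport or the exact envelope it expects, so treat this as an assumption; `to_tools_call` is a hypothetical helper.

```python
import json

# Shorthand copied from the MCP example on this page.
EXAMPLE = {"tool": "get_paper", "arguments": {"arxiv_id": "2605.00803"}}


def to_tools_call(example, request_id=1):
    """Wrap the page's shorthand in a JSON-RPC 2.0 tools/call request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": example["tool"],
            "arguments": example["arguments"],
        },
    }


def to_wire(example, request_id=1):
    """Serialize the request as a JSON string ready to send."""
    return json.dumps(to_tools_call(example, request_id))
```

Keeping the shorthand and the envelope separate makes it easy to swap the envelope if the server turns out to expect a different shape.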

source_context

{
  "surface": "paper",
  "mode": "paper",
  "query": "Can Coding Agents Reproduce Findings in Computational Materials Science?",
  "normalized_query": "2605.00803",
  "route": "/paper/can-coding-agents-reproduce-findings-in-computational-materials-science",
  "paper_ref": "can-coding-agents-reproduce-findings-in-computational-materials-science",
  "topic_slug": null,
  "benchmark_ref": null,
  "dataset_ref": null
}
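
The `paper_ref` and `route` values look like a slug derived from the paper title in `query`. The site's actual slug rule is not documented; the simple rule sketched here (lowercase, collapse every non-alphanumeric run into a hyphen, trim leading and trailing hyphens) happens to reproduce the value shown for this paper. `slugify` is a hypothetical helper.

```python
import re


def slugify(title):
    """Lowercase the title and replace non-alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def paper_route(title):
    """Build the /paper/<slug> route from a title (assumed convention)."""
    return f"/paper/{slugify(title)}"
```

Titles containing non-ASCII characters or unusual punctuation may be handled differently by the site, so the rule should be verified against real routes before relying on it.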

Buildability Receipt

Verdict, compute envelope, blockers, signature state, and receipt links.

Paper proof page receipt window

Not build-ready: Can Coding Agents Reproduce Findings in Computational Materials Science?

/buildability/can-coding-agents-reproduce-findings-in-computational-materials-science