ARXIV:2604.01527 · AI CODING AGENTS · SUBMITTED 03 APR · 20:50 UTC · FRESHNESS STALE

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

ProdCodeBench: A Production-Derived Benchmark for Evaluating AI Coding Agents

Smriti Jha · Matteo Paltenghi · Chandra Maddila · Vijayaraghavan Murali · Shubham Ugare · Satish Chandra · arXiv

A production-derived benchmark for evaluating AI coding agents, enabling more realistic performance assessment and driving improvements in agent capabilities.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain A production-derived benchmark for evaluating AI coding agents, enabling more realistic performance assessment and driving improvements in agent capabilities.

Evidence 0 refs | 0 sources | 33% coverage

Blocker Evidence unverified

PROBLEM

A production-derived benchmark for evaluating AI coding agents, enabling more realistic performance assessment and driving improvements in agent capabilities. This paper presents a methodology for curating production-derived benchmarks, illustrated through ProdCodeBench - a benchmark…

METHOD

Full abstract

Benchmarks that reflect production workloads are better for evaluating AI coding agents in industrial settings, yet existing benchmarks differ from real usage in programming language distribution, prompt style and codebase structure. This paper presents a methodology for curating production-derived benchmarks, illustrated through ProdCodeBench - a benchmark built from real sessions with a production AI coding assistant. We detail our data collection and curation practices, including LLM-based task classification, test relevance validation, and multi-run stability checks, which address challenges in constructing reliable evaluation signals from monorepo environments. Each curated sample consists of a verbatim prompt, a committed code change and fail-to-pass tests spanning seven programming languages. Our systematic analysis of four foundation models yields solve rates from 53.2% to 72.2%, revealing that models making greater use of work validation tools, such as executing tests and invoking static analysis, achieve higher solve rates. This suggests that iterative verification helps achieve effective agent behavior and that exposing codebase-specific verification mechanisms may significantly improve the performance of externally trained agents operating in unfamiliar environments. We share our methodology and lessons learned to enable other organizations to construct similar production-derived benchmarks.
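The abstract names fail-to-pass tests and multi-run stability checks without giving implementation details on this page. The sketch below is one plausible reading, not the authors' pipeline: a sample is kept only if its test fails on every run against the pre-change code and passes on every run against the committed change. The `Sample` class, the `run_test` hook, the `make_runner` helper, the `"before"`/`"after"` labels and the default of three runs are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical hook: given "before" or "after", run the test on that
# code state and return True if it passes.
TestRunner = Callable[[str], bool]


@dataclass
class Sample:
    prompt: str           # verbatim user prompt
    test_id: str          # hypothetical identifier for the candidate test
    run_test: TestRunner  # hypothetical test-execution hook


def is_stable_fail_to_pass(sample: Sample, runs: int = 3) -> bool:
    """True if the test fails on every 'before' run and passes on every 'after' run."""
    before = [sample.run_test("before") for _ in range(runs)]
    after = [sample.run_test("after") for _ in range(runs)]
    return not any(before) and all(after)


def solve_rate(outcomes: List[bool]) -> float:
    """Percentage of tasks whose fail-to-pass tests passed after the agent's change."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return 100.0 * sum(outcomes) / len(outcomes)


def make_runner(before: Iterable[bool], after: Iterable[bool]) -> TestRunner:
    """Scripted runner that replays fixed results (stand-in for a real test harness)."""
    queues = {"before": iter(before), "after": iter(after)}
    return lambda state: next(queues[state])


# One stable sample and one whose test flakes on the committed change.
good = Sample("fix the bug", "t1", make_runner([False] * 3, [True] * 3))
flaky = Sample("add a flag", "t2", make_runner([False] * 3, [True, False, True]))

kept = [s.test_id for s in (good, flaky) if is_stable_fail_to_pass(s)]
print(kept)                                   # only the stable sample survives
print(solve_rate([True, True, False, True]))  # 3 of 4 tasks solved
```

Requiring agreement across every run is the strictest possible filter. A real pipeline might instead tolerate a small flake rate, at the cost of a noisier evaluation signal.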

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Our systematic analysis of four foundation models yields solve rates from 53.2% to 72.2% revealing that models making greater use of work validation tools,…

WHY NOW

AI Coding Agents moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.

Opportunity summary

Score 7.0

Pain A production-derived benchmark for evaluating AI coding agents, enabling more realistic performance assessment and driving improvements in agent capabilities.

Evidence 0 refs | 0 sources | 33% coverage

Blocker no shell-level blocker reported

Analysis summary

A production-derived benchmark for evaluating AI coding agents, enabling more realistic performance assessment and driving improvements in agent capabilities.

Competitive landscape

A production-derived benchmark for evaluating AI coding agents, enabling more realistic performance assessment and driving improvements in agent capabilities.

Segment

AI Coding Agents

Adoption evidence

No public code link in the paper record yet

Commercial read

7.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

{ "@context": "https://schema.org", "@graph": [ { "@type": "WebPage", "@id": "https://sciencetostartup.com/paper/prodcodebench-a-production-derived-benchmark-for-evaluating-ai-coding-agents#webpage", "url": "https://sciencetostartup.com/paper/prodcodebench-a-production-derived-benchmark-for-evaluating-ai-coding-agents", "name": "ProdCodeBench: A Production-Derived Benchmark for Evaluating AI Coding Agents", "description": "A production-derived benchmark for evaluating AI coding agents, enabling more realistic performance assessment and driving improvements in agent capabilities.", "isPartOf": { "@id": "https://sciencetostartup.com/#website" } }, { "@type": "ScholarlyArticle", "@id": "https://sciencetostartup.com/paper/prodcodebench-a-production-derived-benchmark-for-evaluating-ai-coding-agents#scholarlyArticle", "headline": "ProdCodeBench: A Production-Derived Benchmark for Evaluating AI Coding Agents", "description": "A production-derived benchmark for evaluating AI coding agents, enabling more realistic performance assessment and driving improvements in agent capabilities.", "url": "https://sciencetostartup.com/paper/prodcodebench-a-production-derived-benchmark-for-evaluating-ai-coding-agents", "sameAs": "https://arxiv.org/abs/2604.01527", "identifier": { "@type": "PropertyValue", "propertyID": "arXiv", "value": "2604.01527" }, "isAccessibleForFree": true, "isPartOf": { "@id": "https://sciencetostartup.com/#website" }, "datePublished": "2026-04-02T01:52:55.000Z", "author": [ { "@type": "Person", "name": "Smriti Jha" }, { "@type": "Person", "name": "Matteo Paltenghi" }, { "@type": "Person", "name": "Chandra Maddila" }, { "@type": "Person", "name": "Vijayaraghavan Murali" }, { "@type": "Person", "name": "Shubham Ugare" }, { "@type": "Person", "name": "Satish Chandra" } ], "additionalProperty": [ { "@type": "PropertyValue", "propertyID": "viabilityScore", "value": 7 }, { "@type": "PropertyValue", "propertyID": "researchDomain", "value": "AI Coding Agents" }, { 
"@type": "PropertyValue", "propertyID": "commercialReadiness", "value": "code" } ] }, { "@type": "BreadcrumbList", "itemListElement": [ { "@type": "ListItem", "position": 1, "name": "Home", "item": "https://sciencetostartup.com" }, { "@type": "ListItem", "position": 2, "name": "AI Coding Agents", "item": "https://sciencetostartup.com/topics" }, { "@type": "ListItem", "position": 3, "name": "ProdCodeBench: A Production-Derived Benchmark for Evaluating", "item": "https://sciencetostartup.com/paper/prodcodebench-a-production-derived-benchmark-for-evaluating-ai-coding-agents" } ] } ] }
