ARXIV:2605.03546 · AI AGENTS & CODE GENERATION · SUBMITTED 06 MAY · 20:24 UTC · FRESHNESS STALE

Verified · Source: PDF linked | Verified · PaperPack: citation fields available | Partial · Proof: unverified proof status

ProgramBench: Can Language Models Rebuild Programs From Scratch?

John Yang · Kilian Lieret · Jeffrey Ma · Parth Thakkar · Dmitrii Pedchenko · Sten Sootla · +6 at arXiv

ProgramBench: A new benchmark to evaluate language models' ability to architect and build full software projects from scratch, revealing current limitations.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain: ProgramBench is a new benchmark that evaluates language models' ability to architect and build full software projects from scratch, revealing current limitations.

Evidence: 0 refs | 4 sources | 67% coverage

Blocker: Evidence unverified


PROBLEM

ProgramBench: A new benchmark to evaluate language models' ability to architect and build full software projects from scratch, revealing current limitations. Agents are being deployed to seed, maintain, and grow codebases over extended periods…

METHOD

Full abstract

Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with minimal human oversight. Such settings require models to make high-level software architecture decisions. However, existing benchmarks measure focused, limited tasks such as fixing a single bug or developing a single, specified feature. We therefore introduce ProgramBench to measure the ability of software engineering agents to develop software holistically. In ProgramBench, given only a program and its documentation, agents must architect and implement a codebase that matches the reference executable's behavior. End-to-end behavioral tests are generated via agent-driven fuzzing, enabling evaluation without prescribing implementation structure. Our 200 tasks range from compact CLI tools to widely used software such as FFmpeg, SQLite, and the PHP interpreter. We evaluate 9 LMs and find that none fully resolve any task, with the best model passing 95% of tests on only 3% of tasks. Models favor monolithic, single-file implementations that diverge sharply from human-written code.
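The abstract describes evaluation by behavior rather than by code structure: a candidate passes a test when it behaves like the reference executable on that input. The sketch below illustrates that idea under several assumptions that are not taken from the paper: "behavior" is reduced to stdout plus exit code, the inputs are a fixed list rather than agent-generated fuzz cases, and the names `Observation`, `run`, `pass_rate`, and `fraction_of_tasks_at_or_above` are hypothetical. The last function is one possible reading of the "95% of tests on only 3% of tasks" statistic, not the authors' scoring code.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Observable behavior of one invocation (assumption: stdout + exit code only)."""
    stdout: bytes
    exit_code: int


def run(command: list[str], stdin: bytes = b"") -> Observation:
    """Run a command with the given stdin and record what it produced."""
    proc = subprocess.run(command, input=stdin, capture_output=True, timeout=10)
    return Observation(proc.stdout, proc.returncode)


def pass_rate(reference: list[str], candidate: list[str], inputs: list[bytes]) -> float:
    """Fraction of inputs on which the candidate behaves like the reference."""
    matches = sum(run(reference, data) == run(candidate, data) for data in inputs)
    return matches / len(inputs)


def fraction_of_tasks_at_or_above(rates: list[float], threshold: float) -> float:
    """Share of tasks whose per-task pass rate reaches the threshold."""
    return sum(rate >= threshold for rate in rates) / len(rates)


if __name__ == "__main__":
    # Toy stand-ins for a reference executable and a rebuilt candidate.
    reference = [sys.executable, "-c",
                 "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    candidate = [sys.executable, "-c",
                 "import sys; sys.stdout.write(sys.stdin.read())"]
    rate = pass_rate(reference, candidate, [b"abc", b"ABC", b"x1"])
    print(f"candidate pass rate: {rate:.2f}")
```

Because only external behavior is compared, the candidate is free to use any internal structure, which is the property the abstract highlights. A real harness would likely also need to compare stderr, produced files, and timing-sensitive behavior.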

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Models favor monolithic, single-file implementations that diverge sharply from human-written code. A public repository is linked, so build verification can inspect implementation evidence instead…

WHY NOW

AI Agents & Code Generation moved forward this cycle; last verified May 2026. Public score 7.0/10. Implementation evidence is present through a linked repository.



Competitive landscape

Segment: AI Agents & Code Generation

Adoption evidence: Public code linked for build inspection

Commercial read: 7.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

Paper metadata

arXiv: https://arxiv.org/abs/2605.03546

Published: 2026-05-05T09:17:02Z

Authors: John Yang, Kilian Lieret, Jeffrey Ma, Parth Thakkar, Dmitrii Pedchenko, Sten Sootla, Emily McMilin, Pengcheng Yin, Rui Hou, Gabriel Synnaeve, Diyi Yang, Ofir Press

Code repository: https://github.com/SWE-bench/SWE-bench
