ARXIV:2603.16011 · CODE OPTIMIZATION · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

VerifiedSource: PDF linkedPartialPaperPack: 3 of 4 citation fields filledMissingMissing fields: authorsPartialProof: unverified proof status

Evaluating Agentic Optimization on Large Codebases

arXiv

FormulaCode is a benchmark for evaluating the optimization capabilities of LLM coding agents on real-world codebases.

Blocked on Code›Score7.0Evidence unverified

Opportunity summary

Pain FormulaCode is a benchmark for evaluating the optimization capabilities of LLM coding agents on real-world codebases.

Evidence 0 refs | 0 sources | 17% coverage

Blocker Evidence unverified

Open Build Read PDF Signal Canvas Track

PROBLEM

FormulaCode is a benchmark for evaluating the optimization capabilities of LLM coding agents on real-world codebases. Existing code benchmarks largely rely on synthetic tasks, binary correctness signals, or single-objective evaluation, limiting their ability to…

METHOD

Full abstract

Large language model (LLM) coding agents increasingly operate at the repository level, motivating benchmarks that evaluate their ability to optimize entire codebases under realistic constraints. Existing code benchmarks largely rely on synthetic tasks, binary correctness signals, or single-objective evaluation, limiting their ability to assess holistic optimization behavior. We introduce FormulaCode, a benchmark for evaluating agentic optimization on large, real-world codebases with fine-grained, multi-objective performance metrics. FormulaCode comprises 957 performance bottlenecks mined from scientific Python repositories on GitHub, each paired with expert-authored patches and, on average, 264.6 community-maintained performance workloads per task, enabling the holistic ability of LLM agents to optimize codebases under realistic correctness and performance constraints. Our evaluations reveal that repository-scale, multi-objective optimization remains a major challenge for frontier LLM agents. Project website at: https://formula-code.github.io

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Project website at: https://formula-code.github.io

WHY NOW

Code Optimization moved forward this cycle; last verified April 2026. Public score 7.0/10.

Continue into Read for claims, analysis, references, and neighboring papers.

Opportunity summary

Score7.0

PainFormulaCode is a benchmark for evaluating the optimization capabilities of LLM coding agents on real-world codebases.

Evidence0 refs | 0 sources | 17% coverage

Blockermissing authors

Analysis summary

FormulaCode is a benchmark for evaluating the optimization capabilities of LLM coding agents on real-world codebases.

VerifiedSource: PDF linkedPartialPaperPack: 3 of 4 citation fields filledMissingMissing fields: authorsPartialProof: unverified proof status

Competitive landscape

FormulaCode is a benchmark for evaluating the optimization capabilities of LLM coding agents on real-world codebases.

Segment

Code Optimization

Adoption evidence

No public code link in the paper record yet

Commercial read

7.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

{ "contract_version": "paper-r2", "paper_id": "69c4c097-15c5-49e5-9bbf-36a274aba4db", "arxiv_id": "2603.16011", "canonical_route": "/paper/evaluating-agentic-optimization-on-large-codebases", "active_tab": "synced from current hash by the drawer client", "selected_artifact": "evaluating-agentic-optimization-on-large-codebases", "endpoints": { "paper_pack": "/api/v1/paper/evaluating-agentic-optimization-on-large-codebases/paper-pack", "build_passport": "/api/v1/paper/evaluating-agentic-optimization-on-large-codebases/build-passport", "mcp_resource": "sciencetostartup://surfaces/paper-workspace" } }

{ "surface": "paper", "mode": "paper", "query": "Evaluating Agentic Optimization on Large Codebases", "normalized_query": "2603.16011", "route": "/paper/evaluating-agentic-optimization-on-large-codebases", "paper_ref": "evaluating-agentic-optimization-on-large-codebases", "topic_slug": null, "benchmark_ref": null, "dataset_ref": null }

{ "@context": "https://schema.org", "@graph": [ { "@type": "WebPage", "@id": "https://sciencetostartup.com/paper/evaluating-agentic-optimization-on-large-codebases#webpage", "url": "https://sciencetostartup.com/paper/evaluating-agentic-optimization-on-large-codebases", "name": "Evaluating Agentic Optimization on Large Codebases", "description": "FormulaCode is a benchmark for evaluating the optimization capabilities of LLM coding agents on real-world codebases.", "isPartOf": { "@id": "https://sciencetostartup.com/#website" } }, { "@type": "ScholarlyArticle", "@id": "https://sciencetostartup.com/paper/evaluating-agentic-optimization-on-large-codebases#scholarlyArticle", "headline": "Evaluating Agentic Optimization on Large Codebases", "description": "FormulaCode is a benchmark for evaluating the optimization capabilities of LLM coding agents on real-world codebases.", "url": "https://sciencetostartup.com/paper/evaluating-agentic-optimization-on-large-codebases", "sameAs": "https://arxiv.org/abs/2603.16011", "identifier": { "@type": "PropertyValue", "propertyID": "arXiv", "value": "2603.16011" }, "isAccessibleForFree": true, "isPartOf": { "@id": "https://sciencetostartup.com/#website" }, "datePublished": "2026-03-16T23:40:19.000Z", "additionalProperty": [ { "@type": "PropertyValue", "propertyID": "viabilityScore", "value": 7 }, { "@type": "PropertyValue", "propertyID": "researchDomain", "value": "Code Optimization" } ] }, { "@type": "BreadcrumbList", "itemListElement": [ { "@type": "ListItem", "position": 1, "name": "Home", "item": "https://sciencetostartup.com" }, { "@type": "ListItem", "position": 2, "name": "Code Optimization", "item": "https://sciencetostartup.com/topics" }, { "@type": "ListItem", "position": 3, "name": "Evaluating Agentic Optimization on Large Codebases", "item": "https://sciencetostartup.com/paper/evaluating-agentic-optimization-on-large-codebases" } ] }, { "@type": "FAQPage", "mainEntity": [ { "@type": "Question", "name": "What products could be built from this research?", "acceptedAnswer": { "@type": "Answer", "text": "Why now — the timing is ripe due to the rapid adoption of LLM coding assistants like GitHub Copilot, increasing cloud costs driving demand for optimization, and the growing complexity of software repositories that require scalable AI solutions. Market conditions favor tools that can demonstrate measurable performance gains in real-world scenarios, making this benchmark a key differentiator." } }, { "@type": "Question", "name": "What are the practical use cases?", "acceptedAnswer": { "@type": "Answer", "text": "A cloud service provider integrates this benchmark into their AI-assisted development platform to automatically identify and fix performance bottlenecks in customer codebases, offering it as a premium feature that reduces compute costs and improves application speed for enterprise clients." } } ] } ] }

Competitive landscape

FormulaCode is a benchmark for evaluating the optimization capabilities of LLM coding agents on real-world codebases.

Segment

Code Optimization

Adoption evidence

No public code link in the paper record yet

Commercial read

7.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

Evaluating Agentic Optimization on Large Codebases

Evaluating Agentic Optimization on Large Codebases

Claim map

Constellation map

Competitive landscape

Buzz

PDF

REFERENCES

Related Papers

Subscribe to the weekly brief

Build artifacts

Brief

Experiment plan

Validation checklist

Scientific founder

Translational engineer

Domain operator

GTM lead

Regulatory/clinical advisor

Timeline

Claim map

Constellation map

Competitive landscape

Buzz

PDF

REFERENCES

Related Papers

Subscribe to the weekly brief

Build artifacts

Brief

Experiment plan

Validation checklist

Scientific founder

Translational engineer

Domain operator

GTM lead

Regulatory/clinical advisor

Timeline