ARXIV:2604.12290 · AGENTS · SUBMITTED 15 APR · 16:59 UTC · FRESHNESS STALE

Source: PDF linked (Verified) · PaperPack: citation fields available (Verified) · Proof: unverified proof status (Partial)

Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

Yizhe Chi · Deyao Hong · Dapeng Jiang · Tianwei Luo · Kaisen Yang · Boshi Zhang · +15 at arXiv

A new benchmark for self-evolving AI agents that iteratively optimize engineering designs using simulators and verifiers, pushing the boundaries of real-world problem-solving.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain A new benchmark for self-evolving AI agents that iteratively optimize engineering designs using simulators and verifiers, pushing the boundaries of real-world problem-solving.

Evidence 0 refs | 3 sources | 50% coverage

Blocker Evidence unverified

PROBLEM

Current LLM agent benchmarks predominantly focus on binary pass/fail tasks such as code generation or search-based question answering, and often neglect the value of real-world engineering, which is captured through the iterative optimization of feasible designs. To this end, we introduce Frontier-Eng, a human-verified benchmark for generative optimization…

METHOD

Full abstract

Current LLM agent benchmarks, which predominantly focus on binary pass/fail tasks such as code generation or search-based question answering, often neglect the value of real-world engineering that is often captured through the iterative optimization of feasible designs. To this end, we introduce Frontier-Eng, a human-verified benchmark for generative optimization -- an iterative propose-execute-evaluate loop in which an agent generates candidate artifacts, receives executable verifier feedback, and revises them under a fixed interaction budget -- spanning $47$ tasks across five broad engineering categories. Unlike previous suites, Frontier-Eng tasks are grounded in industrial-grade simulators and verifiers that provide continuous reward signals and enforce hard feasibility constraints under constrained budgets. We evaluate eight frontier language models using representative search frameworks, finding that while Claude 4.6 Opus achieves the most robust performance, the benchmark remains challenging for all models. Our analysis suggests a dual power-law decay in improvement frequency ($\sim$ 1/iteration) and magnitude ($\sim$ 1/improvement count). We further show that although width improves parallelism and diversity, depth remains crucial for hard-won improvements under a fixed budget. Frontier-Eng establishes a new standard for assessing the capacity of AI agents to integrate domain knowledge with executable feedback to solve complex, open-ended engineering problems.
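The propose-execute-evaluate loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the benchmark's actual harness: the one-dimensional design variable, the quadratic objective, the feasible interval, and the names `verify`, `propose`, and `optimize` are all hypothetical stand-ins for a real simulator, verifier, and LLM agent.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Feedback:
    feasible: bool  # hard feasibility constraint reported by the verifier
    score: float    # continuous reward; only meaningful when feasible


def verify(x: float) -> Feedback:
    """Toy verifier (hypothetical): feasible on [0, 4], best score at x = 3."""
    feasible = 0.0 <= x <= 4.0
    score = -(x - 3.0) ** 2 if feasible else float("-inf")
    return Feedback(feasible, score)


def propose(best: Optional[float], rng: random.Random) -> float:
    """Toy agent (hypothetical): revise the incumbent, or start from scratch."""
    if best is None:
        return rng.uniform(-1.0, 5.0)
    return best + rng.gauss(0.0, 0.5)


def optimize(budget: int, seed: int = 0) -> Tuple[Optional[float], float, List[int]]:
    """Run the loop for a fixed interaction budget.

    Returns the best feasible design, its score, and the iteration
    indices at which the incumbent improved.
    """
    rng = random.Random(seed)
    best_x: Optional[float] = None
    best_score = float("-inf")
    improvements: List[int] = []
    for iteration in range(1, budget + 1):
        candidate = propose(best_x, rng)       # propose
        feedback = verify(candidate)           # execute and evaluate
        if feedback.feasible and feedback.score > best_score:
            best_x, best_score = candidate, feedback.score  # revise incumbent
            improvements.append(iteration)
    return best_x, best_score, improvements


if __name__ == "__main__":
    x, score, steps = optimize(budget=50)
    print(f"best design: {x}, score: {score}, improved at iterations: {steps}")
```

The `improvements` list records when the incumbent changed, which is the raw quantity an improvement-frequency analysis would start from. The toy objective is not expected to reproduce the paper's reported decay pattern.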

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. We evaluate eight frontier language models using representative search frameworks, finding that while Claude 4.6 Opus achieves the most robust performance, the benchmark remains…

WHY NOW

Agents moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.

Opportunity summary

Score 7.0

Pain A new benchmark for self-evolving AI agents that iteratively optimize engineering designs using simulators and verifiers, pushing the boundaries of real-world problem-solving.

Evidence 0 refs | 3 sources | 50% coverage

Blocker no shell-level blocker reported

Competitive landscape

Segment: Agents

Adoption evidence: No public code link in the paper record yet

Commercial read: 7.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

Paper record: https://arxiv.org/abs/2604.12290 · Published 2026-04-14 · Free to access

Authors: Yizhe Chi · Deyao Hong · Dapeng Jiang · Tianwei Luo · Kaisen Yang · Boshi Zhang · Zhe Cao · Xiaoyan Fan · Bingxiang He · Han Hao · Weiyang Jin · Dianqiao Lei · Qingle Liu · Houde Qian · Bowen Wang · Situ Wang · Youjie Zheng · Yifan Zhou · Calvin Xiao · Eren Cai · Qinhuai Na
