ARXIV:2603.26137 · SOFTWARE ENGINEERING AI · SUBMITTED 30 MAR · 21:58 UTC · FRESHNESS STALE

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

A Time-Consistent Benchmark for Repository-Level Software Engineering Evaluation

Xianpeng · Sun · Haonan Sun · Tian Yu · Sheng Ma · Qincheng Zhang · Lifei Rao · Chen Tian

A time-consistent benchmark and methodology for evaluating repository-aware software engineering AI agents, improving prompt construction and temporal validity.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain: A time-consistent benchmark and methodology for evaluating repository-aware software engineering AI agents, improving prompt construction and temporal validity.

Evidence: 9 refs | 3 sources | 50% coverage

Blocker: Evidence unverified



METHOD

Full abstract

Evaluation of repository-aware software engineering systems is often confounded by synthetic task design, prompt leakage, and temporal contamination between repository knowledge and future code changes. We present a time-consistent benchmark methodology that snapshots a repository at time T0, constructs repository-derived code knowledge using only artifacts available before T0, and evaluates on engineering tasks derived from pull requests merged in the future interval (T0, T1]. Each historical pull request is transformed into a natural-language task through an LLM-assisted prompt-generation pipeline, and the benchmark is formalized as a matched A/B comparison in which the same software engineering agent is evaluated with and without repository-derived code knowledge while all other variables are held constant. We also report a baseline characterization study on two open-source repositories, DragonFly and React, using three Claude-family models and four prompt granularities. Across both repositories, file-level F1 increases monotonically from minimal to guided prompts, reaching 0.8081 on DragonFly and 0.8078 on React for the strongest tested model. These results show that prompt construction is a first-order benchmark variable. More broadly, the benchmark highlights that temporal consistency and prompt control are core validity requirements for repository-aware software engineering evaluation.
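The two mechanical pieces of the abstract, selecting tasks from the future interval (T0, T1] and scoring at file level, can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the `PullRequest` record, the helper names `tasks_in_window` and `file_level_f1`, the toy dates, and in particular the F1 definition (set overlap between predicted and actually changed file paths), which the abstract does not spell out.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class PullRequest:
    """Hypothetical record of a merged pull request (not the paper's schema)."""
    number: int
    merged_at: datetime
    changed_files: frozenset[str]


def tasks_in_window(prs, t0, t1):
    """Keep only PRs merged in the half-open interval (T0, T1].

    Anything merged at or before T0 is excluded from the evaluation set,
    since it would already be visible in the repository snapshot.
    """
    return [pr for pr in prs if t0 < pr.merged_at <= t1]


def file_level_f1(predicted, actual):
    """Set-based F1 over file paths (an assumed definition of file-level F1)."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0  # also covers empty predictions or empty ground truth
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Toy data, purely illustrative.
t0 = datetime(2025, 1, 1, tzinfo=timezone.utc)
t1 = datetime(2025, 4, 1, tzinfo=timezone.utc)
prs = [
    PullRequest(1, datetime(2024, 12, 31, tzinfo=timezone.utc), frozenset({"a.py"})),
    PullRequest(2, t0, frozenset({"b.py"})),  # merged exactly at T0: excluded
    PullRequest(3, datetime(2025, 2, 15, tzinfo=timezone.utc),
                frozenset({"src/a.py", "src/b.py"})),
    PullRequest(4, t1, frozenset({"c.py"})),  # merged exactly at T1: included
]

eval_prs = tasks_in_window(prs, t0, t1)
score = file_level_f1({"src/a.py", "src/c.py"}, eval_prs[0].changed_files)
```

In a matched A/B setup as described in the abstract, the same scoring function would be applied to the agent's output with and without repository-derived knowledge on the same task, so that the difference in score is attributable to that one variable.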

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. These results show that prompt construction is a first-order benchmark variable. Code availability is flagged in the production record; the public repository link still…

WHY NOW

Software Engineering AI moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.



Competitive landscape

A time-consistent benchmark and methodology for evaluating repository-aware software engineering AI agents, improving prompt construction and temporal validity.

Segment: Software Engineering AI

Adoption evidence: No public code link in the paper record yet

Commercial read: 7.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified


{ "@context": "https://schema.org", "@graph": [ { "@type": "WebPage", "@id": "https://sciencetostartup.com/paper/atime-consistent-benchmark-for-repository-level-software-engineering-evaluation#webpage", "url": "https://sciencetostartup.com/paper/atime-consistent-benchmark-for-repository-level-software-engineering-evaluation", "name": "ATime-Consistent Benchmark for Repository-Level Software Engineering Evaluation", "description": "A time-consistent benchmark and methodology for evaluating repository-aware software engineering AI agents, improving prompt construction and temporal validity.", "isPartOf": { "@id": "https://sciencetostartup.com/#website" } }, { "@type": "ScholarlyArticle", "@id": "https://sciencetostartup.com/paper/atime-consistent-benchmark-for-repository-level-software-engineering-evaluation#scholarlyArticle", "headline": "ATime-Consistent Benchmark for Repository-Level Software Engineering Evaluation", "description": "A time-consistent benchmark and methodology for evaluating repository-aware software engineering AI agents, improving prompt construction and temporal validity.", "url": "https://sciencetostartup.com/paper/atime-consistent-benchmark-for-repository-level-software-engineering-evaluation", "sameAs": "https://arxiv.org/abs/2603.26137", "identifier": { "@type": "PropertyValue", "propertyID": "arXiv", "value": "2603.26137" }, "isAccessibleForFree": true, "isPartOf": { "@id": "https://sciencetostartup.com/#website" }, "datePublished": "2026-03-27T07:46:18.000Z", "author": [ { "@type": "Person", "name": "Xianpeng" }, { "@type": "Person", "name": "Sun" }, { "@type": "Person", "name": "Haonan Sun" }, { "@type": "Person", "name": "Tian Yu" }, { "@type": "Person", "name": "Sheng Ma" }, { "@type": "Person", "name": "Qincheng Zhang" }, { "@type": "Person", "name": "Lifei Rao" }, { "@type": "Person", "name": "Chen Tian" } ], "additionalProperty": [ { "@type": "PropertyValue", "propertyID": "viabilityScore", "value": 7 }, { "@type": "PropertyValue", 
"propertyID": "researchDomain", "value": "Software Engineering AI" }, { "@type": "PropertyValue", "propertyID": "commercialReadiness", "value": "code" } ] }, { "@type": "BreadcrumbList", "itemListElement": [ { "@type": "ListItem", "position": 1, "name": "Home", "item": "https://sciencetostartup.com" }, { "@type": "ListItem", "position": 2, "name": "Software Engineering AI", "item": "https://sciencetostartup.com/topics" }, { "@type": "ListItem", "position": 3, "name": "ATime-Consistent Benchmark for Repository-Level Software Eng", "item": "https://sciencetostartup.com/paper/atime-consistent-benchmark-for-repository-level-software-engineering-evaluation" } ] } ] }
