ARXIV:2603.17104 · CODING AGENTS · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

Source: PDF linked (Verified) · PaperPack: 3 of 4 citation fields filled (Partial) · Missing fields: authors · Proof: unverified proof status (Partial)

When the Specification Emerges: Benchmarking Faithfulness Loss in Long-Horizon Coding Agents

arXiv

A benchmark and tool for improving faithfulness in long-horizon coding agents through specification tracking.

Blocked on Code › Score 7.0 · Evidence unverified

Opportunity summary

Pain A benchmark and tool for improving faithfulness in long-horizon coding agents through specification tracking.

Evidence 0 refs | 0 sources | 17% coverage

Blocker Evidence unverified


PROBLEM

Current coding-agent benchmarks usually provide the full task specification upfront. Real research coding often does not: the intended system is progressively disclosed through interaction, requiring the agent to track…

METHOD

Full abstract

Current coding-agent benchmarks usually provide the full task specification upfront. Real research coding often does not: the intended system is progressively disclosed through interaction, requiring the agent to track durable design commitments across a long session. We introduce a benchmark for this setting and study faithfulneSs Loss Under eMergent sPecification (SLUMP), defined as the reduction in final implementation faithfulness under emergent specification relative to a single-shot specification control. The benchmark contains 20 recent ML papers (10 ICML 2025, 10 NeurIPS 2025), 371 atomic verifiable components, and interaction scripts of approximately 60 coding requests that progressively disclose the target design without revealing the paper itself. Final repositories are scored with a five-level component-faithfulness rubric and accompanied by an exposure audit to verify that scored components are recoverable from the visible interaction. Evaluated on Claude Code and Codex, the single-shot specification control achieves higher overall implementation fidelity on 16/20 and 14/20 papers, respectively. Structural integration degrades under emergent specification on both platforms, while semantic faithfulness loss is substantial on Claude Code and small on Codex. As a mitigation case study, we introduce ProjectGuard, an external project-state layer for specification tracking. On Claude Code, ProjectGuard recovers 90% of the faithfulness gap, increases fully faithful components from 118 to 181, and reduces severe failures from 72 to 49. These results identify specification tracking as a distinct evaluation target for long-horizon coding agents.
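The abstract defines SLUMP as a difference between two faithfulness measurements and reports ProjectGuard's effect as a share of that gap. The Python sketch below illustrates that arithmetic only. It assumes the five rubric levels map to integers 0 to 4 and that faithfulness is the normalized mean component score; neither detail is stated in the abstract, and the function names and sample scores are hypothetical, not taken from the paper.

```python
from statistics import mean

# ASSUMPTION: the five-level rubric is encoded as integers 0..4.
# The abstract does not give the numeric scale or the aggregation rule.
MAX_LEVEL = 4


def faithfulness(scores):
    """Mean component score, normalized to the range [0, 1]."""
    return mean(scores) / MAX_LEVEL


def slump(single_shot_scores, emergent_scores):
    """Faithfulness loss: single-shot control minus emergent-specification run."""
    return faithfulness(single_shot_scores) - faithfulness(emergent_scores)


def gap_recovered(single_shot_scores, emergent_scores, mitigated_scores):
    """Fraction of the single-shot vs. emergent gap closed by a mitigation."""
    gap = slump(single_shot_scores, emergent_scores)
    if gap == 0:
        raise ValueError("no faithfulness gap to recover")
    gain = faithfulness(mitigated_scores) - faithfulness(emergent_scores)
    return gain / gap


# Made-up per-component scores for one hypothetical paper (not benchmark data).
single_shot = [4, 4, 3, 4, 2]
emergent = [4, 2, 1, 3, 0]
mitigated = [4, 3, 3, 4, 1]

print(f"SLUMP: {slump(single_shot, emergent):.2f}")
print(f"Gap recovered: {gap_recovered(single_shot, emergent, mitigated):.0%}")
```

Under this reading, a reported 90% recovery would mean the mitigated run sits nine tenths of the way from the emergent-specification score back to the single-shot control score.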

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Evaluated on Claude Code and Codex, the single-shot specification control achieves higher overall implementation fidelity on 16/20 and 14/20 papers, respectively.

WHY NOW

Coding Agents moved forward this cycle; last verified April 2026. Public score 7.0/10.


Opportunity summary

Score 7.0

Pain A benchmark and tool for improving faithfulness in long-horizon coding agents through specification tracking.

Evidence 0 refs | 0 sources | 17% coverage

Blocker missing authors

Analysis summary

A benchmark and tool for improving faithfulness in long-horizon coding agents through specification tracking.


Competitive landscape

A benchmark and tool for improving faithfulness in long-horizon coding agents through specification tracking.

Segment

Coding Agents

Adoption evidence

No public code link in the paper record yet

Commercial read

7.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified


Published: 2026-03-17T19:53:35.000Z

arXiv: https://arxiv.org/abs/2603.17104

Page: https://sciencetostartup.com/paper/when-the-specification-emerges-benchmarking-faithfulness-loss-in-long-horizon-coding-agents

