ARXIV:2604.05955 · AGENTS · SUBMITTED 08 APR · 03:21 UTC · FRESHNESS UNKNOWN

Verified · Source: PDF linked | Verified · PaperPack: citation fields available | Partial · Proof: unverified proof status

Does Pass Rate Tell the Whole Story? Evaluating Design Constraint Compliance in LLM-based Issue Resolution

Kai Yu · Zhenhao Zhou · Junhao Zeng · Ying Wang · Xueying Du · Zhiqiang Yuan · +5 at arXiv

A benchmark and LLM-based verifier to evaluate code patch quality beyond test pass rates, revealing significant design constraint violations in current AI agents.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain: A benchmark and LLM-based verifier to evaluate code patch quality beyond test pass rates, revealing significant design constraint violations in current AI agents.

Evidence: 0 refs | 0 sources | 0% coverage

Blocker: Evidence unverified


PROBLEM

A benchmark and LLM-based verifier to evaluate code patch quality beyond test pass rates, revealing significant design constraint violations in current AI agents. In practice, however, acceptable patches must also comply with project-specific design…

METHOD

Full abstract

Repository-level issue resolution benchmarks have become a standard testbed for evaluating LLM-based agents, yet success is still predominantly measured by test pass rates. In practice, however, acceptable patches must also comply with project-specific design constraints, such as architectural conventions, error-handling policies, and maintainability requirements, which are rarely encoded in tests and are often documented only implicitly in code review discussions. This paper introduces design-aware issue resolution and presents \bench{}, a benchmark that makes such implicit design constraints explicit and measurable. \bench{} is constructed by mining and validating design constraints from real-world pull requests, linking them to issue instances, and automatically checking patch compliance using an LLM-based verifier, yielding 495 issues and 1,787 validated constraints across six repositories, aligned with SWE-bench-Verified and SWE-bench-Pro. Experiments with state-of-the-art agents show that test-based correctness substantially overestimates patch quality: fewer than half of resolved issues are fully design-satisfying, design violations are widespread, and functional correctness exhibits negligible statistical association with design satisfaction. While providing issue-specific design guidance reduces violations, substantial non-compliance remains, highlighting a fundamental gap in current agent capabilities and motivating design-aware evaluation beyond functional correctness.
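To make the two headline measurements in the abstract concrete, here is a minimal sketch of how one might compute them. Everything in it is an assumption made for illustration: the `PatchResult` record layout, the reading of "fully design-satisfying" as "no linked constraint violated", and the use of the phi coefficient as the association statistic (the abstract does not say which statistic the authors use). The four records are toy data, not results from the paper.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class PatchResult:
    """One agent patch for one issue (hypothetical record layout)."""
    issue_id: str
    tests_pass: bool            # functional correctness from the test suite
    constraint_verdicts: tuple  # one bool per linked design constraint

    @property
    def design_satisfying(self) -> bool:
        # Assumed reading: a patch is fully design-satisfying
        # when none of its linked constraints is violated.
        return all(self.constraint_verdicts)


def design_satisfaction_rate(results):
    """Share of test-passing patches that also satisfy every constraint."""
    resolved = [r for r in results if r.tests_pass]
    if not resolved:
        return 0.0
    return sum(r.design_satisfying for r in resolved) / len(resolved)


def phi_coefficient(results):
    """Phi coefficient between passing tests and design satisfaction.

    Phi is the Pearson correlation of two binary variables; it is a
    stand-in here, not necessarily the statistic used in the paper.
    """
    a = sum(r.tests_pass and r.design_satisfying for r in results)
    b = sum(r.tests_pass and not r.design_satisfying for r in results)
    c = sum((not r.tests_pass) and r.design_satisfying for r in results)
    d = sum((not r.tests_pass) and not r.design_satisfying for r in results)
    denom = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return 0.0 if denom == 0 else (a * d - b * c) / denom


# Toy data only: one patch in each cell of the 2x2 table.
results = [
    PatchResult("issue-1", True, (True, True)),
    PatchResult("issue-2", True, (True, False)),
    PatchResult("issue-3", False, (True,)),
    PatchResult("issue-4", False, (False, True)),
]

print(design_satisfaction_rate(results))  # 0.5
print(phi_coefficient(results))           # 0.0
```

With one patch in each cell, half of the test-passing patches violate a constraint and phi is zero, which is the shape of outcome the abstract describes: passing tests says little about design compliance.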

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Experiments with state-of-the-art agents show that test-based correctness substantially overestimates patch quality: fewer than half of resolved issues are fully design-satisfying, design violations are…

WHY NOW

Agents moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.


Opportunity summary

Score: 7.0

Pain: A benchmark and LLM-based verifier to evaluate code patch quality beyond test pass rates, revealing significant design constraint violations in current AI agents.

Evidence: 0 refs | 0 sources | 0% coverage

Blocker: no shell-level blocker reported

Analysis summary

A benchmark and LLM-based verifier to evaluate code patch quality beyond test pass rates, revealing significant design constraint violations in current AI agents.


Competitive landscape

A benchmark and LLM-based verifier to evaluate code patch quality beyond test pass rates, revealing significant design constraint violations in current AI agents.

Segment

Agents

Adoption evidence

No public code link in the paper record yet

Commercial read

7.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified


Paper record

arXiv: 2604.05955 (https://arxiv.org/abs/2604.05955)

Published: 2026-04-07T14:47:27.000Z

Authors: Kai Yu, Zhenhao Zhou, Junhao Zeng, Ying Wang, Xueying Du, Zhiqiang Yuan, Junwei Liu, Ziyu Zhou, Yujia Wang, Chong Wang, Xin Peng

Research domain: Agents

Viability score: 7

Commercial readiness: code

Free to access: yes

Page: https://sciencetostartup.com/paper/does-pass-rate-tell-the-whole-story-evaluating-design-constraint-compliance-in-llm-based-issue-resolution
