ARXIV:2603.26130 · AI CODE REVIEW · SUBMITTED 30 MAR · 21:54 UTC · FRESHNESS STALE

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

SWE-PRBench: Benchmarking AI Code Review Quality Against Pull Request Feedback

Deepak Kumar · arXiv

A benchmark and evaluation framework to measure and improve AI code review quality, revealing current AI limitations and guiding future development.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain: A benchmark and evaluation framework to measure and improve AI code review quality, revealing current AI limitations and guiding future development.

Evidence: 24 refs | 4 sources | 50% coverage

Blocker: Evidence unverified

PROBLEM

A benchmark and evaluation framework to measure and improve AI code review quality, revealing current AI limitations and guiding future development. Evaluated against an LLM-as-judge framework validated at kappa=0.75, 8 frontier models detect only…

METHOD

Full abstract

We introduce SWE-PRBench, a benchmark of 350 pull requests with human-annotated ground truth for evaluating AI code review quality. Evaluated against an LLM-as-judge framework validated at kappa=0.75, 8 frontier models detect only 15-31% of human-flagged issues on the diff-only configuration, demonstrating that AI code review remains far below human expert performance despite strong results on code generation benchmarks. Pull requests are drawn from active open-source repositories, filtered from 700 candidates using a Repository Quality Score, and evaluated under three frozen context configurations: diff only (config_A), diff with file content (config_B), and full context (config_C), enabling systematic ablation of context provision strategies. All 8 models degrade monotonically from config_A to config_C, even when context is provided via structured semantic layers including AST-extracted function context and import graph resolution. The dominant mechanism is a collapse of Type2_Contextual issue detection at config_B, consistent with attention dilution in long contexts: a structured 2,000-token diff-with-summary prompt outperforms a 2,500-token full-context prompt enriched with execution context, behaviour mapping, and test signatures across all 8 models. The top four models are statistically indistinguishable (mean score 0.147-0.153) while a clear tier gap separates them from the remaining four (mean score <= 0.113). Dataset, contexts, annotations, and evaluation harness are released publicly.
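The headline metric and the context ablation described in the abstract can be illustrated with a small sketch. It assumes, hypothetically, that every human-flagged issue has been judged once per configuration as detected or not detected. The `JudgedIssue` record, the helper names, and the toy data are illustrative assumptions and are not taken from the released harness; only the configuration names come from the abstract.

```python
from dataclasses import dataclass

# Configuration names as given in the abstract, ordered from least to most context.
CONFIGS = ("config_A", "config_B", "config_C")


@dataclass(frozen=True)
class JudgedIssue:
    """One human-flagged issue and the judge's verdict (hypothetical schema)."""
    pr_id: str
    config: str
    detected: bool


def detection_rate(issues, config):
    """Fraction of human-flagged issues that the model detected under one config."""
    relevant = [i for i in issues if i.config == config]
    if not relevant:
        raise ValueError(f"no judged issues for {config}")
    return sum(i.detected for i in relevant) / len(relevant)


def degrades_monotonically(rates):
    """True if the rate never increases from config_A to config_B to config_C."""
    ordered = [rates[c] for c in CONFIGS]
    return all(a >= b for a, b in zip(ordered, ordered[1:]))


def make(config, flags):
    """Build toy records; the PR ids are placeholders."""
    return [JudgedIssue(f"pr-{n}", config, f) for n, f in enumerate(flags)]


# Toy data invented for illustration only; these are not results from the paper.
toy = (
    make("config_A", [True, True, False, False])
    + make("config_B", [True, False, False, False])
    + make("config_C", [False, False, False, False])
)

rates = {c: detection_rate(toy, c) for c in CONFIGS}
print(rates)
print(degrades_monotonically(rates))
```

The monotonicity check here is non-strict (ties are allowed); whether the paper uses a strict or non-strict reading is not stated in the abstract.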

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Evaluated against an LLM-as-judge framework validated at kappa=0.75, 8 frontier models detect only 15-31% of human-flagged issues on the diff-only configuration, demonstrating that AI…
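For readers unfamiliar with the agreement statistic cited above, the following is a minimal sketch of how a kappa value can be computed between a human annotator and an LLM judge. It assumes Cohen's kappa for two raters; the abstract says only "kappa" and does not state which variant was used. The function name and the toy labels are illustrative and do not come from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance from each rater's label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    # Counter returns 0 for labels a rater never used, so the union is safe.
    expected = sum(
        count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()
    ) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; treat as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy verdicts (1 = "review comment matches the human-flagged issue").
# Invented for illustration; not data from the paper.
human = [1, 1, 1, 1, 0, 0, 0, 0]
judge = [1, 1, 1, 0, 0, 0, 0, 1]

kappa = cohens_kappa(human, judge)
print(kappa)
```

In this toy case the raters agree on 6 of 8 items (p_o = 0.75) and chance agreement is 0.5, so kappa is 0.5.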

WHY NOW

AI Code Review moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.

Opportunity summary

Score: 7.0

Pain: A benchmark and evaluation framework to measure and improve AI code review quality, revealing current AI limitations and guiding future development.

Evidence: 24 refs | 4 sources | 50% coverage

Blocker: no shell-level blocker reported

Analysis summary

A benchmark and evaluation framework to measure and improve AI code review quality, revealing current AI limitations and guiding future development.

Competitive landscape

A benchmark and evaluation framework to measure and improve AI code review quality, revealing current AI limitations and guiding future development.

Segment: AI Code Review

Adoption evidence: No public code link in the paper record yet

Commercial read: 7.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified


Published: 2026-03-27 (UTC) · arXiv: https://arxiv.org/abs/2603.26130
