SWE-PRBench: Benchmarking AI Code Review Quality Against Pull Request Feedback

Stale | 70d ago | 24 refs / 4 sources | Verification pending


Use this Signal Canvas via API or MCP

Route this paper proof surface into REST, MCP, or developer workflows while preserving the same evidence receipt and related-resource context.


Page Freshness

Signal Canvas proof surface

Canonical route: /signal-canvas/swe-prbench-benchmarking-ai-code-review-quality-against-pull-request-feedback

Proof freshness: stale
Proof status: unverified
Display score: 7/10
Last proof check: 2026-03-30
Score updated: 2026-04-02
Score fresh until: 2026-05-02
References: 24
Source count: 4
Coverage: 50%

This page is showing the last landed evidence receipt and score bundle because the latest proof data is outside the freshness window.
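
The freshness fields above can be checked mechanically. The following is a minimal sketch, assuming the score counts as fresh up to and including the "Score fresh until" date; the function name `is_fresh` and the inclusive boundary are illustrative assumptions, not the site's documented logic.

```python
from datetime import date

# Dates copied from the Page Freshness block.
SCORE_UPDATED = date(2026, 4, 2)
SCORE_FRESH_UNTIL = date(2026, 5, 2)


def is_fresh(as_of: date, fresh_until: date = SCORE_FRESH_UNTIL) -> bool:
    """Return True while `as_of` is inside the freshness window.

    Assumption: the boundary date itself still counts as fresh.
    """
    return as_of <= fresh_until


# Length of the freshness window implied by the two score dates.
window_days = (SCORE_FRESH_UNTIL - SCORE_UPDATED).days
print(window_days)
print(is_fresh(date(2026, 5, 3)))
```

Any date after 2026-05-02 falls outside the window under this rule, which matches the page reporting the proof as stale.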

Agent Handoff

SWE-PRBench: Benchmarking AI Code Review Quality Against Pull Request Feedback

Canonical ID swe-prbench-benchmarking-ai-code-review-quality-against-pull-request-feedback | Route /signal-canvas/swe-prbench-benchmarking-ai-code-review-quality-against-pull-request-feedback

REST example

curl https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas/swe-prbench-benchmarking-ai-code-review-quality-against-pull-request-feedback

MCP example

{
  "tool": "search_signal_canvas",
  "arguments": {
    "mode": "paper",
    "paper_ref": "swe-prbench-benchmarking-ai-code-review-quality-against-pull-request-feedback",
    "query_text": "Summarize SWE-PRBench: Benchmarking AI Code Review Quality Against Pull Request Feedback"
  }
}
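
Both examples above are keyed on the same canonical ID, so the REST URL and the MCP request can be built from one value. The sketch below assumes only what the examples show: the base URL from the curl command and the `search_signal_canvas` tool arguments. The `slugify` helper is a hypothetical rule that happens to reproduce this page's canonical ID from its title; the site's actual slug rule is not documented here.

```python
import json
import re

# Base URL taken from the REST example above.
BASE_URL = "https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas"


def slugify(title: str) -> str:
    """Hypothetical slug rule: lowercase, runs of non-alphanumerics become '-'."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def rest_url(paper_ref: str) -> str:
    """Build the agent-handoff URL for a canonical paper ID."""
    return f"{BASE_URL}/{paper_ref}"


def mcp_request(paper_ref: str, title: str) -> dict:
    """Build the MCP tool call shown in the MCP example."""
    return {
        "tool": "search_signal_canvas",
        "arguments": {
            "mode": "paper",
            "paper_ref": paper_ref,
            "query_text": f"Summarize {title}",
        },
    }


TITLE = "SWE-PRBench: Benchmarking AI Code Review Quality Against Pull Request Feedback"
ref = slugify(TITLE)
print(rest_url(ref))
print(json.dumps(mcp_request(ref, TITLE), indent=2))
```

The script only constructs the request values; sending them is left to whichever HTTP or MCP client is in use.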

source_context

{
  "surface": "signal_canvas",
  "mode": "paper",
  "query": "SWE-PRBench: Benchmarking AI Code Review Quality Against Pull Request Feedback",
  "normalized_query": "2603.26130",
  "route": "/signal-canvas/swe-prbench-benchmarking-ai-code-review-quality-against-pull-request-feedback",
  "paper_ref": "swe-prbench-benchmarking-ai-code-review-quality-against-pull-request-feedback",
  "topic_slug": null,
  "benchmark_ref": null,
  "dataset_ref": null
}

Paper mode · single-doc scope · scope: swe-prbench-benchmarking-ai-code-review-quality-against-pull-request-feedback

GitHub Code Pulse

No public code linked for this paper yet.

Claim map

Strong 12 | Mixed 0 | Weak 0

Evidence (partial): We introduce SWE-PRBench, a benchmark of 350 pull requests with human-annotated ground truth for evaluating AI code review quality.
Implication (partial): Directly stated in the abstract and introduction.
Verification: partial
Evidence (partial): 8 frontier models detect only 15–31% of human-flagged issues on the diff-only configuration
Implication (partial): Explicitly stated in the abstract with a numerical range.
Verification: partial
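
A detection percentage of this kind can be read as recall against the human-flagged issues. The sketch below is an illustration under stated assumptions, not the paper's scoring procedure: it assumes each issue has already been matched to a shared identifier, and the issue labels are invented for the example.

```python
def detection_rate(human_issues: set[str], model_issues: set[str]) -> float:
    """Fraction of human-flagged issues that the model also flagged (recall)."""
    if not human_issues:
        raise ValueError("need at least one human-flagged issue")
    return len(human_issues & model_issues) / len(human_issues)


# Hypothetical issue identifiers, for illustration only.
human = {"null-check", "race-condition", "missing-test", "naming"}
model = {"naming", "style-nit"}

rate = detection_rate(human, model)
print(f"{rate:.0%}")
```

Model comments that match no human-flagged issue (here `style-nit`) do not affect this number; they would only matter for a precision-style metric.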
Evidence (partial): demonstrating that AI code review remains far below human expert performance
Implication (partial): Stated in the abstract as a conclusion drawn from the benchmark results.
Verification: partial
Evidence (partial): All 8 models degrade monotonically from config_A to config_C, even when context is provided via structured semantic layers
Implication (partial): Directly stated in the abstract and supported by the analysis of context configurations.
Verification: partial
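
"Degrade monotonically" can be made concrete as a check that each configuration scores no better than the one before it. The following is a minimal sketch; the per-config scores are made-up placeholders rather than values from the paper, and treating the claim as a strict decrease is an assumption.

```python
def degrades_monotonically(scores: list[float], strict: bool = True) -> bool:
    """True if every score is lower than (or, if not strict, no higher than) the previous one."""
    pairs = list(zip(scores, scores[1:]))
    if strict:
        return all(a > b for a, b in pairs)
    return all(a >= b for a, b in pairs)


# Placeholder scores for one hypothetical model, ordered config_A -> config_C.
scores_by_config = {"config_A": 0.20, "config_B": 0.15, "config_C": 0.12}
ordered = [scores_by_config[c] for c in ("config_A", "config_B", "config_C")]

print(degrades_monotonically(ordered))
```

Running the same check per model and requiring it to hold for all of them would correspond to the "all 8 models" form of the claim.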
Evidence (partial): The dominant mechanism is a collapse of Type2_Contextual issue detection at config_B, consistent with attention dilution in long contexts
Implication (partial): The abstract identifies this as the 'dominant mechanism' and provides a supporting explanation.
Verification: partial
Evidence (partial): a structured 2,000-token diff-with-summary prompt outperforms a 2,500-token full-context prompt enriched with execution context, behaviour mapping, and test signatures across all 8 models.
Implication (partial): Explicitly stated in the abstract as a key finding regarding prompt engineering.
Verification: partial
Evidence (partial): The top four models are statistically indistinguishable (mean score 0.147-0.153) while a clear tier gap separates them from the remaining four (mean score <= 0.113).
Implication (partial): The abstract describes a 'tier gap' and provides mean scores for the top and bottom groups.
Verification: partial
Evidence (partial): Repository Quality Score (RQS) components.
Component | What it measures | Max pts
Review culture | Share of substantive human review comments | 30
PR recency | Merged PRs in last 90 days | 25
Test quality | Test files, CI presence, coverage tooling | 20
Implication (partial): The abstract mentions filtering using RQS, and the provided text details its components and weighting.
Verification: partial
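
The RQS excerpt lists three components with maximum points of 30, 25, and 20. A possible way to combine them is sketched below, assuming each component is scored as a fraction of its maximum and the points are summed; that linear rule, the key names, and the function name `rqs_partial` are assumptions. Only the components quoted on this page are covered, so the result is a partial score with a ceiling of 75.

```python
# Maximum points per component, as listed in the RQS excerpt.
RQS_MAX_POINTS = {
    "review_culture": 30,  # share of substantive human review comments
    "pr_recency": 25,      # merged PRs in last 90 days
    "test_quality": 20,    # test files, CI presence, coverage tooling
}


def rqs_partial(fractions: dict[str, float]) -> float:
    """Sum component points, where each fraction in [0, 1] scales its max points.

    Assumption: linear scaling per component; missing components score 0.
    """
    total = 0.0
    for name, max_pts in RQS_MAX_POINTS.items():
        frac = min(max(fractions.get(name, 0.0), 0.0), 1.0)  # clamp to [0, 1]
        total += frac * max_pts
    return total


print(rqs_partial({"review_culture": 0.5, "pr_recency": 1.0, "test_quality": 0.0}))
```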
Evidence (partial): We introduce SWE-PRBench, a benchmark of 350 pull requests with human-annotated ground truth for evaluating AI code review quality.
Implication (partial): Directly stated in the abstract and introduction.
Verification: partial
Evidence (partial): 8 frontier models detect only 15–31% of human-flagged issues on the diff-only configuration
Implication (partial): Directly stated in the abstract and supported by baseline results.
Verification: partial
Evidence (partial): All 8 models degrade monotonically from config_A to config_C, even when context is provided via structured semantic layers
Implication (partial): Explicitly stated in the abstract and supported by the description of configurations.
Verification: partial
Evidence (partial): The dominant mechanism is a collapse of Type2_Contextual issue detection at config_B, consistent with attention dilution in long contexts
Implication (partial): Stated as the dominant mechanism in the abstract, explaining the observed degradation.
Verification: partial
