SWE-PRBench: Benchmarking AI Code Review Quality Against Pull Request Feedback
Use this Signal Canvas via API or MCP
Route this paper proof surface into REST, MCP, or developer workflows while preserving the same evidence receipt and related-resource context.
Page Freshness
Signal Canvas proof surface
Canonical route: /signal-canvas/swe-prbench-benchmarking-ai-code-review-quality-against-pull-request-feedback
- Proof freshness: stale
- Proof status: unverified
- Display score: 7/10
- Last proof check: 2026-03-30
- Score updated: 2026-04-02
- Score fresh until: 2026-05-02
- References: 24
- Source count: 4
- Coverage: 50%
This page is showing the last landed evidence receipt and score bundle because the latest proof data is outside the freshness window.
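The freshness fields above suggest a simple date comparison. The following is a minimal Python sketch, not the site's actual logic: it assumes a score counts as fresh through its "Score fresh until" date inclusive, and the function name `is_score_fresh` is hypothetical. The rule that marks the proof itself as stale is not stated on the page and is not modeled here.

```python
from datetime import date


def is_score_fresh(fresh_until: str, today: date) -> bool:
    """Return True while `today` is on or before the ISO date `fresh_until`.

    Assumption: the boundary date itself still counts as fresh.
    """
    return today <= date.fromisoformat(fresh_until)


# Dates copied from the page; the window length is derived, not documented.
score_updated = date.fromisoformat("2026-04-02")
fresh_until = date.fromisoformat("2026-05-02")
window_days = (fresh_until - score_updated).days
```

Passing `today` explicitly rather than calling `date.today()` inside the function keeps the check deterministic and easy to test.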
Agent Handoff
SWE-PRBench: Benchmarking AI Code Review Quality Against Pull Request Feedback
Canonical ID: swe-prbench-benchmarking-ai-code-review-quality-against-pull-request-feedback
Route: /signal-canvas/swe-prbench-benchmarking-ai-code-review-quality-against-pull-request-feedback
REST example
curl https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas/swe-prbench-benchmarking-ai-code-review-quality-against-pull-request-feedback
MCP example
{
  "tool": "search_signal_canvas",
  "arguments": {
    "mode": "paper",
    "paper_ref": "swe-prbench-benchmarking-ai-code-review-quality-against-pull-request-feedback",
    "query_text": "Summarize SWE-PRBench: Benchmarking AI Code Review Quality Against Pull Request Feedback"
  }
}
source_context
{
  "surface": "signal_canvas",
  "mode": "paper",
  "query": "SWE-PRBench: Benchmarking AI Code Review Quality Against Pull Request Feedback",
  "normalized_query": "2603.26130",
  "route": "/signal-canvas/swe-prbench-benchmarking-ai-code-review-quality-against-pull-request-feedback",
  "paper_ref": "swe-prbench-benchmarking-ai-code-review-quality-against-pull-request-feedback",
  "topic_slug": null,
  "benchmark_ref": null,
  "dataset_ref": null
}
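The REST URL and the MCP tool call above are both derived from the same `paper_ref` slug. As a rough sketch of how a client might assemble them, the Python below builds both without making any network call. The helper names `rest_url` and `mcp_request` are hypothetical, and the "Summarize <title>" query pattern is inferred from the single example on this page rather than from any documented API contract.

```python
import json

# Base path copied from the REST example on this page.
BASE_URL = "https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas/"


def rest_url(paper_ref: str) -> str:
    """Build the agent-handoff URL for a paper slug."""
    return BASE_URL + paper_ref


def mcp_request(paper_ref: str, title: str) -> dict:
    """Build a tool-call payload shaped like the MCP example on this page."""
    return {
        "tool": "search_signal_canvas",
        "arguments": {
            "mode": "paper",
            "paper_ref": paper_ref,
            "query_text": f"Summarize {title}",
        },
    }


paper_ref = "swe-prbench-benchmarking-ai-code-review-quality-against-pull-request-feedback"
title = "SWE-PRBench: Benchmarking AI Code Review Quality Against Pull Request Feedback"

print(rest_url(paper_ref))
print(json.dumps(mcp_request(paper_ref, title), indent=2))
```

How the payload is actually delivered depends on the MCP client in use, so transport is deliberately left out of the sketch.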
Dimensions overall score: 7.0
GitHub Code Pulse
No public code linked for this paper yet.
Claim map
- Evidence (partial): We introduce SWE-PRBench, a benchmark of 350 pull requests with human-annotated ground truth for evaluating AI code review quality.
  Implication (partial): Directly stated in the abstract and introduction.
  Verification: partial
- Evidence (partial): 8 frontier models detect only 15–31% of human-flagged issues on the diff-only configuration
  Implication (partial): Explicitly stated in the abstract with a numerical range.
  Verification: partial
- Evidence (partial): demonstrating that AI code review remains far below human expert performance
  Implication (partial): Stated in the abstract as a conclusion drawn from the benchmark results.
  Verification: partial
- Evidence (partial): All 8 models degrade monotonically from config_A to config_C, even when context is provided via structured semantic layers
  Implication (partial): Directly stated in the abstract and supported by the analysis of context configurations.
  Verification: partial
- Evidence (partial): The dominant mechanism is a collapse of Type2_Contextual issue detection at config_B, consistent with attention dilution in long contexts
  Implication (partial): The abstract identifies this as the 'dominant mechanism' and provides a supporting explanation.
  Verification: partial
- Evidence (partial): a structured 2,000-token diff-with-summary prompt outperforms a 2,500-token full-context prompt enriched with execution context, behaviour mapping, and test signatures across all 8 models.
  Implication (partial): Explicitly stated in the abstract as a key finding regarding prompt engineering.
  Verification: partial
- Evidence (partial): The top four models are statistically indistinguishable (mean score 0.147-0.153) while a clear tier gap separates them from the remaining four (mean score <= 0.113).
  Implication (partial): The abstract describes a 'tier gap' and provides mean scores for the top and bottom groups.
  Verification: partial
- Evidence (partial): Repository Quality Score (RQS) components.
    Component | What it measures | Max pts
    Review culture | Share of substantive human review comments | 30
    PR recency | Merged PRs in last 90 days | 25
    Test quality | Test files, CI presence, coverage tooling | 20
  Implication (partial): The abstract mentions filtering using RQS, and the provided text details its components and weighting.
  Verification: partial
- Evidence (partial): 8 frontier models detect only 15–31% of human-flagged issues on the diff-only configuration
  Implication (partial): Directly stated in the abstract and supported by baseline results.
  Verification: partial
- Evidence (partial): All 8 models degrade monotonically from config_A to config_C, even when context is provided via structured semantic layers
  Implication (partial): Explicitly stated in the abstract and supported by the description of configurations.
  Verification: partial
- Evidence (partial): The dominant mechanism is a collapse of Type2_Contextual issue detection at config_B, consistent with attention dilution in long contexts
  Implication (partial): Stated as the dominant mechanism in the abstract, explaining the observed degradation.
  Verification: partial
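The detection percentages quoted in the claims above can be read as a recall-style ratio: the share of human-flagged issues that a model also flags. The Python below is a minimal sketch of that reading only. It assumes exact matching on issue identifiers, whereas the paper's actual matching procedure is not described on this page, and the function name and issue IDs are made up for illustration.

```python
def detection_rate(human_flagged: set[str], model_flagged: set[str]) -> float:
    """Fraction of human-flagged issues that the model also flagged.

    Assumption: issues match only when their identifiers are equal.
    """
    if not human_flagged:
        raise ValueError("no human-flagged issues to compare against")
    return len(human_flagged & model_flagged) / len(human_flagged)


# Hypothetical issue IDs, not taken from the benchmark.
human = {"issue-1", "issue-2", "issue-3", "issue-4"}
model = {"issue-2", "issue-9"}
rate = detection_rate(human, model)  # one of four human-flagged issues matched
```

Extra model findings that no human flagged (here `issue-9`) do not affect this ratio; measuring them would need a separate precision-style metric.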