Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering


Page Freshness

Signal Canvas proof surface

Canonical route: /signal-canvas/beyond-code-snippets-benchmarking-llms-on-repository-level-question-answering


Proof freshness: stale
Proof status: unverified
Display score: 5/10
Last proof check: 2026-03-30
Score updated: 2026-04-02
Score fresh until: 2026-05-02
References: 78
Source count: 3
Coverage: 50%

This page is showing the last landed evidence receipt and score bundle because the latest proof data is outside the freshness window.

Agent Handoff

Canonical ID beyond-code-snippets-benchmarking-llms-on-repository-level-question-answering | Route /signal-canvas/beyond-code-snippets-benchmarking-llms-on-repository-level-question-answering

REST example

curl https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas/beyond-code-snippets-benchmarking-llms-on-repository-level-question-answering

MCP example

{
  "tool": "search_signal_canvas",
  "arguments": {
    "mode": "paper",
    "paper_ref": "beyond-code-snippets-benchmarking-llms-on-repository-level-question-answering",
    "query_text": "Summarize Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering"
  }
}

source_context

{
  "surface": "signal_canvas",
  "mode": "paper",
  "query": "Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering",
  "normalized_query": "2603.26567",
  "route": "/signal-canvas/beyond-code-snippets-benchmarking-llms-on-repository-level-question-answering",
  "paper_ref": "beyond-code-snippets-benchmarking-llms-on-repository-level-question-answering",
  "topic_slug": null,
  "benchmark_ref": null,
  "dataset_ref": null
}

Evidence Receipt

Route status: building

Claims: 12

References: 78

Proof: Verification pending

Freshness state: computing

Source paper: Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering

PDF: https://arxiv.org/pdf/2603.26567v1

Source count: 3

Coverage: 50%

Last proof check: 2026-03-30T21:57:08.942Z

Signal Canvas receipt window

Watch and verify: Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering

/buildability/beyond-code-snippets-benchmarking-llms-on-repository-level-question-answering


Subject: Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering

Verdict: Watch

Verdict is Watch because viability or proof quality is intermediate and should be re-evaluated before execution.


GitHub Code Pulse

No public code linked for this paper yet.

Claim map

Strong: 12 | Mixed: 0 | Weak: 0

Evidence (partial): In this work, we introduce StackRepoQA, the first multi-project, repository-level question answering dataset constructed from 1,318 real developer questions and accepted answers across 134 open-source Java projects.
Implication (partial): This is explicitly stated in the abstract and introduction as a novel contribution.
Verification: partial

Evidence (partial): Our findings indicate that while LLMs achieve moderate accuracy (around 58%) on repository-level QA, much of this success can be attributed to memorization of previously seen Stack Overflow content rather than genuine reasoning over source code.
Implication (partial): This is a direct result reported in the abstract and further elaborated in the introduction.
Verification: partial

Evidence (partial): RAG provided measurable improvements, with graph-based retrieval yielding the largest gains; however, even the best configuration only increased accuracy to approximately 64%.
Implication (partial): The abstract and introduction highlight the effectiveness of graph-based RAG.
Verification: partial

Evidence (partial): The analysis reveals that high scores often result from verbatim reproduction of Stack Overflow answers rather than genuine reasoning.
Implication (partial): This is a key finding explicitly stated in the abstract and supported by the ablation study description.
Verification: partial

Evidence (partial): We discuss key limitations of LLMs on repository-level tasks, such as performance degradation on unseen questions and sensitivity to irrelevant context.
Implication (partial): This limitation is mentioned in the abstract and further detailed in the 'Insights and Resources' section.
Verification: partial

Evidence (partial): In source-code RAG setups, retrieval solely based on flat textual similarity is insufficient, because code inherently encodes structural and semantic relationships—such as function calls, class hierarchies, and data-flow dependencies—that extend beyond what text similarity can capture [59].
Implication (partial): This is a technical justification for the proposed graph-based approach, stated in the introduction.
Verification: partial

Evidence (partial): RAG provided measurable improvements, with graph-based retrieval yielding the largest gains; however, even the best configuration only increased accuracy to approximately 64%.
Implication (partial): This is a specific quantitative result reported in the abstract.
Verification: partial

Evidence (partial): In this work, we introduce StackRepoQA, the first multi-project, repository-level question answering dataset constructed from 1,318 real developer questions and accepted answers across 134 open-source Java projects.
Implication (partial): This is explicitly stated in the abstract and introduction, highlighting its novelty and scope.
Verification: partial

Evidence (partial): Our findings indicate that while LLMs achieve moderate accuracy (around 58%) on repository-level QA, much of this success can be attributed to memorization of previously seen Stack Overflow content rather than genuine reasoning over source code.
Implication (partial): The abstract and analysis excerpt provide a specific percentage for baseline LLM accuracy.
Verification: partial

Evidence (partial): RAG provided measurable improvements, with graph-based retrieval yielding the largest gains; however, even the best configuration only increased accuracy to approximately 64%.
Implication (partial): The abstract and analysis excerpt state that RAG provides improvements, with graph-based retrieval yielding the largest gains.
Verification: partial

Evidence (partial): RAG provided measurable improvements, with graph-based retrieval yielding the largest gains; however, even the best configuration only increased accuracy to approximately 64%.
Implication (partial): The analysis excerpt provides a specific upper bound for accuracy with augmentation.
Verification: partial

Evidence (partial): Our results show that LLMs achieve moderate accuracy at baseline, with performance improving when structural signals are incorporated. Nonetheless, overall accuracy remains limited for repository-scale comprehension. The analysis reveals that high scores often result from verbatim reproduction of Stack Overflow answers rather than genuine reasoning.
Implication (partial): This is a key finding explicitly stated in the abstract and analysis.
Verification: partial
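
One claim above says that flat textual similarity misses structural relationships such as function calls. The following is a minimal sketch of that idea, not the paper's retrieval pipeline: the toy Java snippets, the `CALL_GRAPH` mapping, and the function names `flat_retrieve` and `graph_retrieve` are all hypothetical, and bag-of-words cosine similarity stands in for whatever embedding model a real setup would use.

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical toy corpus: code unit name -> Java source text.
CODE_UNITS = {
    "OrderService.placeOrder": "public void placeOrder(Order order) { validator.validate(order); repository.save(order); }",
    "OrderValidator.validate": "public void validate(Order order) { if (order.getItems().isEmpty()) throw new IllegalArgumentException(); }",
    "OrderRepository.save": "public void save(Order order) { jdbcTemplate.update(sql, order.getId()); }",
    "ReportPrinter.print": "public void print(Report report) { System.out.println(report.render()); }",
}

# Hypothetical call graph (caller -> callees); a real one would come from static analysis.
CALL_GRAPH = {
    "OrderService.placeOrder": ["OrderValidator.validate", "OrderRepository.save"],
}


def tokenize(text):
    """Split on non-letters and camelCase boundaries, then lowercase."""
    return [part.lower() for part in re.findall(r"[A-Za-z][a-z]*", text)]


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def flat_retrieve(question, k=1):
    """Rank code units by text similarity alone and return the top k names."""
    q = Counter(tokenize(question))
    ranked = sorted(
        CODE_UNITS,
        key=lambda name: cosine(q, Counter(tokenize(name + " " + CODE_UNITS[name]))),
        reverse=True,
    )
    return ranked[:k]


def graph_retrieve(question, k=1, hops=1):
    """Start from the flat top-k, then add callees reachable within `hops` call edges."""
    selected = flat_retrieve(question, k)
    frontier = list(selected)
    for _ in range(hops):
        next_frontier = []
        for name in frontier:
            for callee in CALL_GRAPH.get(name, []):
                if callee not in selected:
                    selected.append(callee)
                    next_frontier.append(callee)
        frontier = next_frontier
    return selected


question = "Why does placeOrder in OrderService throw an exception?"
print(flat_retrieve(question))
print(graph_retrieve(question))
```

In this toy setup the flat top-1 result is the method named in the question, while the exception is actually thrown in a callee. Following one hop of call edges brings that callee into the context, which is the kind of structural signal the claim refers to.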

Author intelligence and commercialization panels stay hidden until the proof receipt is verified, cites at least 3 references, includes at least 2 sources, and clears 50% coverage. The paper narrative and citation surfaces remain public while verification is pending.
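
The visibility rule above can be read as a simple predicate. The sketch below is an illustration only: the `Receipt` class and `panels_visible` function are hypothetical names, not part of any documented API, and treating "clears 50% coverage" as "coverage of at least 50%" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Receipt:
    """Hypothetical container for the proof receipt fields named in the rule."""
    verified: bool
    references: int
    sources: int
    coverage: float  # fraction between 0.0 and 1.0


def panels_visible(receipt):
    """Return True only when every condition in the stated rule holds.

    Assumption: "clears 50% coverage" is interpreted as coverage >= 0.5.
    """
    return (
        receipt.verified
        and receipt.references >= 3
        and receipt.sources >= 2
        and receipt.coverage >= 0.5
    )


# Values shown on this page: unverified, 78 references, 3 sources, 50% coverage.
current = Receipt(verified=False, references=78, sources=3, coverage=0.5)
print(panels_visible(current))
```

With the values shown on this page, the reference, source, and coverage conditions would pass under this reading, so the unverified proof status is the condition that keeps the panels hidden.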
