ARXIV:2603.26567 · LLM EVALUATION · SUBMITTED 30 MAR · 21:57 UTC · FRESHNESS STALE

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering

Yoseph Berhanu Alebachew · Hunter Leary · Swanand Vaishampayan · Chris Brown · arXiv

A new dataset and evaluation framework to benchmark LLMs on understanding entire code repositories, revealing limitations in current reasoning capabilities.

Ship in 2-4 weeks · Score 5.0 · Evidence unverified

Opportunity summary

Pain: A new dataset and evaluation framework to benchmark LLMs on understanding entire code repositories, revealing limitations in current reasoning capabilities.

Evidence: 78 refs | 3 sources | 50% coverage

Blocker: Evidence unverified


PROBLEM

A new dataset and evaluation framework to benchmark LLMs on understanding entire code repositories, revealing limitations in current reasoning capabilities. However, most studies and benchmarks focus on isolated functions or single-file snippets, overlooking the…

METHOD

Full abstract

Large Language Models (LLMs) have shown impressive capabilities across software engineering tasks, including question answering (QA). However, most studies and benchmarks focus on isolated functions or single-file snippets, overlooking the challenges of real-world program comprehension, which often spans multiple files and system-level dependencies. In this work, we introduce StackRepoQA, the first multi-project, repository-level question answering dataset constructed from 1,318 real developer questions and accepted answers across 134 open-source Java projects. Using this dataset, we systematically evaluate two widely used LLMs (Claude 3.5 Sonnet and GPT-4o) under both direct prompting and agentic configurations. We compare baseline performance with retrieval-augmented generation methods that leverage file-level retrieval and graph-based representations of structural dependencies. Our results show that LLMs achieve moderate accuracy at baseline, with performance improving when structural signals are incorporated. Nonetheless, overall accuracy remains limited for repository-scale comprehension. The analysis reveals that high scores often result from verbatim reproduction of Stack Overflow answers rather than genuine reasoning. To our knowledge, this is the first empirical study to provide such evidence in repository-level QA. We release StackRepoQA to encourage further research into benchmarks, evaluation protocols, and augmentation strategies that disentangle memorization from reasoning, advancing LLMs as reliable tools for repository-scale program comprehension.
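The abstract reports that high scores often come from verbatim reproduction of Stack Overflow answers, but this page does not say how that overlap was measured. As a rough illustration only, one common way to flag copied text is the share of a model answer's word n-grams that also appear in the accepted answer. The function names, the tokenizer, and the choice of n=5 below are assumptions for this sketch, not the authors' protocol.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def ngrams(tokens, n):
    """Return the set of contiguous n-token tuples in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def verbatim_overlap(model_answer, reference_answer, n=5):
    """Fraction of the model answer's n-grams that also occur in the reference.

    Hypothetical metric: values near 1.0 suggest copying, values near 0.0
    suggest the answer was phrased independently.
    """
    model_grams = ngrams(tokenize(model_answer), n)
    if not model_grams:
        # Answer is shorter than n tokens, so there is nothing to compare.
        return 0.0
    ref_grams = ngrams(tokenize(reference_answer), n)
    return len(model_grams & ref_grams) / len(model_grams)
```

A score like this says nothing about correctness on its own. It would only be useful next to an accuracy metric, to separate answers that score well because they match the accepted answer word for word from answers that reach the same content in different words.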

RESULT

ScienceToStartup currently rates this 5.0/10 on the public viability pass. Our results show that LLMs achieve moderate accuracy at baseline, with performance improving when structural signals are incorporated. Code availability is flagged in the…
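The abstract attributes the improvement to file-level retrieval combined with graph-based representations of structural dependencies, without describing the pipeline on this page. A minimal sketch of the general idea, assuming a toy lexical scorer and a hand-written dependency map (both hypothetical, as are all names below), is to retrieve the best-matching files and then add the files they directly depend on:

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def rank_files(question, files, k=2):
    """Return up to k file paths ranked by a toy lexical score.

    The score is the number of occurrences of question tokens in the file.
    Files with a score of zero are dropped.
    """
    question_tokens = set(tokens(question))
    scores = {}
    for path, content in files.items():
        counts = Counter(tokens(content))
        scores[path] = sum(counts[t] for t in question_tokens)
    ranked = sorted(files, key=lambda p: (-scores[p], p))
    return [p for p in ranked[:k] if scores[p] > 0]


def expand_with_dependencies(seeds, deps):
    """Append files that the seed files directly depend on (one hop)."""
    context = list(seeds)
    for path in seeds:
        for neighbor in deps.get(path, []):
            if neighbor not in context:
                context.append(neighbor)
    return context


# Hypothetical two-class Java project and its dependency edges.
files = {
    "OrderService.java": "class OrderService { PaymentClient client; void placeOrder() {} }",
    "PaymentClient.java": "class PaymentClient { void charge() {} }",
    "Util.java": "class Util {}",
}
deps = {"OrderService.java": ["PaymentClient.java"]}

seeds = rank_files("Why does placeOrder fail in OrderService?", files, k=1)
context_files = expand_with_dependencies(seeds, deps)
```

Here the question matches only `OrderService.java` lexically, and the dependency edge pulls in `PaymentClient.java`, which a purely file-level retriever would have missed. A real system would derive the edges from the repository (imports, call graphs) instead of a hand-written map.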

WHY NOW

LLM Evaluation moved forward this cycle; last verified April 2026. Public score 5.0/10. Production flags indicate code availability.


Opportunity summary

Score: 5.0

Pain: A new dataset and evaluation framework to benchmark LLMs on understanding entire code repositories, revealing limitations in current reasoning capabilities.

Evidence: 78 refs | 3 sources | 50% coverage

Blocker: no shell-level blocker reported

Analysis summary

A new dataset and evaluation framework to benchmark LLMs on understanding entire code repositories, revealing limitations in current reasoning capabilities.


Competitive landscape

A new dataset and evaluation framework to benchmark LLMs on understanding entire code repositories, revealing limitations in current reasoning capabilities.

Segment

LLM Evaluation

Adoption evidence

No public code link in the paper record yet

Commercial read

5.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

Published 2026-03-27T16:30:54Z · https://arxiv.org/abs/2603.26567 · Free to access
