A Time-Consistent Benchmark for Repository-Level Software Engineering Evaluation

A Time-Consistent Benchmark for Repository-Level Software Engineering Evaluation | Signal Canvas | ScienceToStartup

Page Freshness

Signal Canvas proof surface

Canonical route: /signal-canvas/atime-consistent-benchmark-for-repository-level-software-engineering-evaluation


Proof freshness: stale
Proof status: unverified
Display score: 7/10
Last proof check: 2026-03-30
Score updated: 2026-04-02
Score fresh until: 2026-05-02
References: 9
Source count: 3
Coverage: 50%

This page is showing the last landed evidence receipt and score bundle because the latest proof data is outside the freshness window.

Agent Handoff

Canonical ID atime-consistent-benchmark-for-repository-level-software-engineering-evaluation | Route /signal-canvas/atime-consistent-benchmark-for-repository-level-software-engineering-evaluation

REST example

curl https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas/atime-consistent-benchmark-for-repository-level-software-engineering-evaluation
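
The same request can be issued from Python. This is a minimal sketch using only the standard library; the helper names handoff_url and fetch_handoff are illustrative, and because the response format is not documented on this page the body is returned as plain text without parsing.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base path taken from the curl example above.
BASE_URL = "https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas"


def handoff_url(canonical_id: str) -> str:
    """Build the agent-handoff URL for a Signal Canvas canonical ID."""
    return f"{BASE_URL}/{quote(canonical_id, safe='')}"


def fetch_handoff(canonical_id: str, timeout: float = 10.0) -> str:
    """Fetch the handoff document and return the raw response body as text.

    Assumption: the body is UTF-8 text; no schema is assumed or parsed.
    """
    with urlopen(handoff_url(canonical_id), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling fetch_handoff with the canonical ID listed under Agent Handoff performs the same GET as the curl command.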

MCP example

{
  "tool": "search_signal_canvas",
  "arguments": {
    "mode": "paper",
    "paper_ref": "atime-consistent-benchmark-for-repository-level-software-engineering-evaluation",
    "query_text": "Summarize ATime-Consistent Benchmark for Repository-Level Software Engineering Evaluation"
  }
}

source_context

{
  "surface": "signal_canvas",
  "mode": "paper",
  "query": "ATime-Consistent Benchmark for Repository-Level Software Engineering Evaluation",
  "normalized_query": "2603.26137",
  "route": "/signal-canvas/atime-consistent-benchmark-for-repository-level-software-engineering-evaluation",
  "paper_ref": "atime-consistent-benchmark-for-repository-level-software-engineering-evaluation",
  "topic_slug": null,
  "benchmark_ref": null,
  "dataset_ref": null
}

Evidence Receipt

Route status: building

Claims: 12

References: 9

Proof: Verification pending

Freshness state: computing

Source paper: A Time-Consistent Benchmark for Repository-Level Software Engineering Evaluation

PDF: https://arxiv.org/pdf/2603.26137v1

Source count: 3

Coverage: 50%

Last proof check: 2026-03-30T21:58:57.202Z

Signal Canvas receipt window

Watch and verify: A Time-Consistent Benchmark for Repository-Level Software Engineering Evaluation

/buildability/atime-consistent-benchmark-for-repository-level-software-engineering-evaluation


Subject: A Time-Consistent Benchmark for Repository-Level Software Engineering Evaluation

Verdict

Watch

Verdict is Watch because viability or proof quality is intermediate and should be re-evaluated before execution.


GitHub Code Pulse

No public code linked for this paper yet.

Claim map

Strong: 12, Mixed: 0, Weak: 0

Evidence (partial): We present a time-consistent benchmark methodology that snapshots a repository at time T0, constructs repository-derived code knowledge using only artifacts available before T0, and evaluates on engineering tasks derived from pull requests merged in the future interval (T0, T1].
Implication (partial): This is a core methodological contribution explicitly stated in the abstract and elaborated on in the introduction.
Verification: partial
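
The temporal split in this claim can be sketched as follows. This is an illustration, not the paper's implementation: the PullRequest type and split_by_time helper are hypothetical, and reading "before T0" as strictly earlier than T0 is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class PullRequest:
    # Hypothetical minimal record; real PR metadata carries far more fields.
    number: int
    merged_at: datetime


def split_by_time(prs, t0, t1):
    """Partition merged PRs around the snapshot time T0.

    Returns (knowledge_prs, task_prs):
      knowledge_prs: merged strictly before T0, usable to build repository knowledge.
      task_prs: merged in the half-open interval (T0, T1], used as evaluation tasks.
    PRs merged exactly at T0 or after T1 land in neither list (strict-reading assumption).
    """
    if not t0 < t1:
        raise ValueError("T0 must be earlier than T1")
    knowledge_prs = [pr for pr in prs if pr.merged_at < t0]
    task_prs = [pr for pr in prs if t0 < pr.merged_at <= t1]
    return knowledge_prs, task_prs
```

Keeping the two lists disjoint by construction is what prevents information from the evaluation window leaking into the knowledge base.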
Evidence (partial): and the benchmark is formalized as a matched A/B comparison in which the same software engineering agent is evaluated with and without repository-derived code knowledge while all other variables are held constant.
Implication (partial): This describes the experimental design for evaluating the impact of repository knowledge, as stated in the abstract.
Verification: partial
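
A matched A/B comparison of this kind might look like the sketch below. The agent and score callables, the task tuple layout, and the returned keys are all assumptions made for illustration; the paper's harness is not shown on this page.

```python
from statistics import mean


def matched_ab(tasks, agent, knowledge, score):
    """Run the same agent on every task twice: without and with knowledge.

    tasks: sequence of (prompt, ground_truth) pairs.
    agent: callable (prompt, knowledge_or_None) -> prediction.
    score: callable (prediction, ground_truth) -> float.
    Only the knowledge argument differs between the two arms.
    """
    if not tasks:
        raise ValueError("at least one task is required")
    rows = []
    for prompt, truth in tasks:
        without = score(agent(prompt, None), truth)    # arm A: no repository knowledge
        with_k = score(agent(prompt, knowledge), truth)  # arm B: knowledge supplied
        rows.append((without, with_k))
    return {
        "mean_without": mean(a for a, _ in rows),
        "mean_with": mean(b for _, b in rows),
        "mean_delta": mean(b - a for a, b in rows),
    }
```

Pairing the two runs per task lets the per-task difference be attributed to the knowledge input alone, provided the model, prompt, and task set are unchanged between arms.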
Evidence (partial): Across both repositories, file-level F1 increases monotonically from minimal to guided prompts, reaching 0.8081 on DragonFly and 0.8078 on React for the strongest tested model.
Implication (partial): This is a specific quantitative result reported in the abstract and detailed in the results section and figures.
Verification: partial
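
For readers unfamiliar with the metric, file-level F1 can be computed as in the sketch below. It assumes the standard set-based definition over the files a prediction names versus the files the merged PR actually touched; the paper's exact definition is not quoted on this page, and treating two empty sets as a perfect match is a convention chosen here.

```python
def file_f1(predicted_files, actual_files):
    """Set-based F1 between predicted and actually changed file paths."""
    predicted = set(predicted_files)
    actual = set(actual_files)
    if not predicted and not actual:
        return 1.0  # assumption: nothing to change and nothing predicted counts as a match
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting {"a.py", "b.py"} when the PR changed {"b.py", "c.py"} gives precision 0.5, recall 0.5, and F1 0.5.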
Evidence (partial): These results show that prompt construction is a first-order benchmark variable.
Implication (partial): This is a direct conclusion drawn from the experimental results regarding prompt granularity.
Verification: partial
Evidence (partial): More broadly, the benchmark highlights that temporal consistency and prompt control are core validity requirements for repository-aware software engineering evaluation.
Implication (partial): This is a broader conclusion about the implications of the benchmark methodology and findings.
Verification: partial
Evidence (partial): We also report a baseline characterization study on two open-source repositories, DragonFly and React, using three Claude-family models and four prompt granularities.
Implication (partial): The repositories used for the baseline study are explicitly listed in the abstract and the 'Category Setting' table.
Verification: partial
Evidence (partial): Task source: Historical PRs merged in (T0, T1]
Implication (partial): The source of tasks for the benchmark is clearly defined in the abstract and the 'Category Setting' table.
Verification: partial
Evidence (partial): The distribution does not simply shift upward smoothly; rather, prompt strengthening moves substantial probability mass out of the zero-performance bin and into the high-F1 and exact-match bins.
Implication (partial): This observation is made from the F1 distribution figures for both repositories, indicating the impact of prompt quality on task solvability.
Verification: partial
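
The bin-level view described in this claim can be reproduced from per-task F1 scores with a sketch like the one below. The bin names and the 0.8 cutoff for "high" are assumptions chosen for illustration; the page does not state the bin edges used in the paper's figures.

```python
def f1_bin_shares(scores, high_threshold=0.8):
    """Return the share of tasks falling in each coarse F1 bin.

    Bins: zero (F1 == 0), partial (0 < F1 < threshold),
    high (threshold <= F1 < 1), exact (F1 == 1).
    """
    if not scores:
        raise ValueError("at least one score is required")
    counts = {"zero": 0, "partial": 0, "high": 0, "exact": 0}
    for s in scores:
        if s == 0.0:
            counts["zero"] += 1
        elif s == 1.0:
            counts["exact"] += 1
        elif s >= high_threshold:
            counts["high"] += 1
        else:
            counts["partial"] += 1
    total = len(scores)
    return {name: count / total for name, count in counts.items()}
```

Comparing these shares between a minimal and a guided prompt shows whether the mean improved because many tasks moved from zero to exact, rather than because every task improved slightly.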
Evidence (partial): We present a time-consistent benchmark methodology that snapshots a repository at time T0, constructs repository-derived code knowledge using only artifacts available before T0, and evaluates on engineering tasks derived from pull requests merged in the future interval (T0, T1].
Implication (partial): This is a core methodological contribution explicitly described in the abstract and introduction.
Verification: partial
Evidence (partial): and the benchmark is formalized as a matched A/B comparison in which the same software engineering agent is evaluated with and without repository-derived code knowledge while all other variables are held constant.
Implication (partial): This describes the experimental setup for evaluating the impact of repository knowledge, as stated in the abstract.
Verification: partial
Evidence (partial): Across both repositories, file-level F1 increases monotonically from minimal to guided prompts, reaching 0.8081 on DragonFly and 0.8078 on React for the strongest tested model.
Implication (partial): This is a key result reported in the abstract and supported by figures and tables showing F1 scores across different prompt granularities.
Verification: partial
Evidence (partial): Across both repositories, file-level F1 increases monotonically from minimal to guided prompts, reaching 0.8081 on DragonFly and 0.8078 on React for the strongest tested model.
Implication (partial): Specific numerical results are provided for the highest performing models and prompts on the tested repositories.
Verification: partial

Author intelligence and commercialization panels stay hidden until the proof receipt is verified, cites at least 3 references, includes at least 2 sources, and clears 50% coverage. The paper narrative and citation surfaces remain public while verification is pending.
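
The visibility rule above can be read as a simple predicate. The sketch below is one interpretation: the ProofReceipt type and its field names are hypothetical, and treating "clears 50% coverage" as "at least 50%" is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProofReceipt:
    # Hypothetical record mirroring the fields shown in the Evidence Receipt.
    verified: bool
    references: int
    sources: int
    coverage: float  # fraction between 0 and 1


def panels_visible(receipt: ProofReceipt) -> bool:
    """True when author intelligence and commercialization panels may be shown."""
    return (
        receipt.verified
        and receipt.references >= 3
        and receipt.sources >= 2
        and receipt.coverage >= 0.5  # assumption: "clears" means at least 50%
    )
```

With the values on this page (unverified, 9 references, 3 sources, 50% coverage) the predicate is false because verification is still pending, which matches the panels being hidden.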
