ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety

Page Freshness

Signal Canvas proof surface

Canonical route: /signal-canvas/atbench-a-diverse-and-realistic-trajectory-benchmark-for-long-horizon-agent-safety

Proof freshness: stale
Proof status: unverified
Display score: 7/10
Last proof check: 2026-04-03
Score updated: 2026-04-03
Score fresh until: 2026-05-03
References: 0
Source count: 0
Coverage: 33%

This page is showing the last landed evidence receipt and score bundle because the latest proof data is outside the freshness window.
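
The freshness fields above can be read as a simple date comparison. The following is a minimal sketch, not the site's actual logic: the function name `is_score_fresh` is hypothetical, and it assumes the window includes the "Score fresh until" day, which the page does not state.

```python
from datetime import date


def is_score_fresh(today: date, fresh_until: date) -> bool:
    """Return True while `today` is on or before `fresh_until`.

    Assumption: the final day counts as fresh; the page does not say.
    """
    return today <= fresh_until


# Dates copied from the "Score updated" and "Score fresh until" fields.
score_updated = date(2026, 4, 3)
fresh_until = date(2026, 5, 3)

# Length of the freshness window implied by those two fields.
window_days = (fresh_until - score_updated).days
```

Under this reading, any view of the page after 2026-05-03 would report the score bundle as stale.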

Agent Handoff

Canonical ID atbench-a-diverse-and-realistic-trajectory-benchmark-for-long-horizon-agent-safety | Route /signal-canvas/atbench-a-diverse-and-realistic-trajectory-benchmark-for-long-horizon-agent-safety

REST example

curl https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas/atbench-a-diverse-and-realistic-trajectory-benchmark-for-long-horizon-agent-safety
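
The same request can be sketched in Python with the standard library. This is an illustration only: the helper names `handoff_url` and `fetch_handoff` are hypothetical, and because the page does not document the response format, the body is returned as raw text rather than parsed.

```python
import urllib.request

# Base path taken from the curl example above.
BASE_URL = "https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas"


def handoff_url(canonical_id: str) -> str:
    """Build the agent-handoff URL for a canonical ID."""
    return f"{BASE_URL}/{canonical_id}"


def fetch_handoff(canonical_id: str, timeout: float = 10.0) -> str:
    """GET the handoff document and return the raw response body.

    Assumption: the endpoint answers a plain GET, as the curl example implies.
    """
    with urllib.request.urlopen(handoff_url(canonical_id), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


CANONICAL_ID = (
    "atbench-a-diverse-and-realistic-trajectory-benchmark-"
    "for-long-horizon-agent-safety"
)
url = handoff_url(CANONICAL_ID)
```

Calling `fetch_handoff(CANONICAL_ID)` performs the network request; building the URL does not.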

MCP example

{
  "tool": "search_signal_canvas",
  "arguments": {
    "mode": "paper",
    "paper_ref": "atbench-a-diverse-and-realistic-trajectory-benchmark-for-long-horizon-agent-safety",
    "query_text": "Summarize ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety"
  }
}
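
A payload of this shape can be assembled programmatically. The sketch below only builds and serializes the JSON shown above; the helper name `build_search_call` is hypothetical, and how the payload is delivered to an MCP server is not described on this page, so it is left out.

```python
import json


def build_search_call(paper_ref: str, title: str) -> dict:
    """Build a `search_signal_canvas` tool call in paper mode.

    Field names mirror the MCP example above; nothing else is assumed.
    """
    return {
        "tool": "search_signal_canvas",
        "arguments": {
            "mode": "paper",
            "paper_ref": paper_ref,
            "query_text": f"Summarize {title}",
        },
    }


call = build_search_call(
    "atbench-a-diverse-and-realistic-trajectory-benchmark-for-long-horizon-agent-safety",
    "ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety",
)
payload = json.dumps(call, indent=2)  # JSON text ready to hand to a client
```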

source_context

{
  "surface": "signal_canvas",
  "mode": "paper",
  "query": "ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety",
  "normalized_query": "2604.02022",
  "route": "/signal-canvas/atbench-a-diverse-and-realistic-trajectory-benchmark-for-long-horizon-agent-safety",
  "paper_ref": "atbench-a-diverse-and-realistic-trajectory-benchmark-for-long-horizon-agent-safety",
  "topic_slug": null,
  "benchmark_ref": null,
  "dataset_ref": null
}

Evidence Receipt

Route status: building

Claims: 8

References: Pending verification

Proof: Verification pending

Freshness state: computing

Source paper: ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety

PDF: https://arxiv.org/pdf/2604.02022v1

Source count: Pending verification

Coverage: 33%

Last proof check: 2026-04-03T20:50:40.576Z

Signal Canvas receipt window

Watch and verify: ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety

/buildability/atbench-a-diverse-and-realistic-trajectory-benchmark-for-long-horizon-agent-safety

Subject: ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety

Verdict

Watch

The verdict is Watch because viability or proof quality is intermediate; re-evaluate before execution.

GitHub Code Pulse

No public code linked for this paper yet.

Claim map

Strong: 8 | Mixed: 0 | Weak: 0

Evidence (partial): with 1,954 invoked tools drawn from pools spanning 2,084 available tools
Implication (partial): Directly stated numeric evidence provided in the abstract.
Verification: partial

Evidence (partial): ATBench organizes agentic risk along three dimensions: risk source, failure mode, and real-world harm.
Implication (partial): Explicitly stated in the abstract as a core methodological contribution.
Verification: partial

Evidence (partial): The benchmark contains 1,000 trajectories (503 safe and 497 unsafe)
Implication (partial): Directly stated numeric evidence provided in the abstract.
Verification: partial

Evidence (partial): averaging 9.01 turns and 3.95k tokens
Implication (partial): Directly stated numeric evidence provided in the abstract.
Verification: partial

Evidence (partial): Experiments on frontier LLMs, open-source models, and specialized guard systems show that ATBench is challenging even for strong evaluators
Implication (partial): Strongly supported by the statement about experiments, though specific performance metrics are not provided in the abstract.
Verification: partial

Evidence (partial): Existing trajectory-level benchmarks remain limited by insufficient interaction diversity, coarse observability of safety failures, and weak long-horizon realism.
Implication (partial): Explicitly stated as a limitation of prior work, forming the motivation for ATBench.
Verification: partial

Evidence (partial): while enabling taxonomy-stratified analysis, cross-benchmark comparison, and diagnosis of long-horizon failure patterns.
Implication (partial): Directly stated as a capability enabled by ATBench, though the evidence for this enabling is implied by the benchmark's design.
Verification: partial

Evidence (partial): Data quality is supported by rule-based and LLM-based filtering plus full human audit.
Implication (partial): Explicitly stated as a method used to ensure data quality.
Verification: partial
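
The numeric claims in the map above can be cross-checked with simple arithmetic. This is a sketch using only the figures quoted from the abstract; treating invoked tools divided by available tools as a coverage share is an interpretation, not a metric the paper is known to report.

```python
# Figures quoted in the claim map.
safe, unsafe, total = 503, 497, 1000
invoked_tools, available_tools = 1954, 2084


def share(part: int, whole: int) -> float:
    """Return part / whole as a fraction."""
    return part / whole


# The safe and unsafe counts should add up to the stated total.
counts_consistent = (safe + unsafe == total)

unsafe_share = share(unsafe, total)                 # fraction of unsafe trajectories
tool_share = share(invoked_tools, available_tools)  # assumed reading: share of the tool pool invoked

print(counts_consistent, round(unsafe_share, 3), round(tool_share, 3))
```

The split is close to balanced, and under the assumed reading roughly 94% of the available tools appear in at least one trajectory.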

Author intelligence and commercialization panels stay hidden until the proof receipt is verified, cites at least 3 references, includes at least 2 sources, and clears 50% coverage. The paper narrative and citation surfaces remain public while verification is pending.
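
The unlock rule in the paragraph above amounts to four conditions that must all hold. A minimal sketch follows, assuming "at least" and "clears" are inclusive thresholds; the `ProofReceipt` type and `panels_unlocked` function are hypothetical names, not part of any Signal Canvas API.

```python
from dataclasses import dataclass


@dataclass
class ProofReceipt:
    proof_status: str    # e.g. "verified" or "unverified"
    references: int      # number of cited references
    sources: int         # number of sources
    coverage_pct: float  # coverage as a percentage, 0 to 100


def panels_unlocked(receipt: ProofReceipt) -> bool:
    """Return True only when every stated condition holds.

    Assumption: "clears 50% coverage" is read as coverage >= 50.
    """
    return (
        receipt.proof_status == "verified"
        and receipt.references >= 3
        and receipt.sources >= 2
        and receipt.coverage_pct >= 50.0
    )


# Values reported under Page Freshness on this page.
current = ProofReceipt(proof_status="unverified", references=0, sources=0, coverage_pct=33.0)
unlocked = panels_unlocked(current)
```

With the values currently reported (unverified, 0 references, 0 sources, 33% coverage), every condition fails, which is consistent with the panels being hidden.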
