ARXIV:2604.02022 · AGENT SAFETY · SUBMITTED 03 APR · 20:50 UTC · FRESHNESS STALE

Source: PDF linked (Verified) · PaperPack: citation fields available (Verified) · Proof: unverified proof status (Partial)

ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety

Yu Li · Haoyu Luo · Yuejin Xie · Yuqian Fu · Zhonghao Yang · Shuai Shao · +7 at arXiv

ATBench provides a realistic and diverse benchmark for evaluating the long-horizon safety of LLM-based agents, enabling better risk assessment and mitigation.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain ATBench provides a realistic and diverse benchmark for evaluating the long-horizon safety of LLM-based agents, enabling better risk assessment and mitigation.

Evidence 0 refs | 0 sources | 33% coverage

Blocker Evidence unverified

PROBLEM

ATBench provides a realistic and diverse benchmark for evaluating the long-horizon safety of LLM-based agents, enabling better risk assessment and mitigation. Existing trajectory-level benchmarks remain limited by insufficient interaction diversity, coarse observability of safety…

METHOD

Full abstract

Evaluating the safety of LLM-based agents is increasingly important because risks in realistic deployments often emerge over multi-step interactions rather than isolated prompts or final responses. Existing trajectory-level benchmarks remain limited by insufficient interaction diversity, coarse observability of safety failures, and weak long-horizon realism. We introduce ATBench, a trajectory-level benchmark for structured, diverse, and realistic evaluation of agent safety. ATBench organizes agentic risk along three dimensions: risk source, failure mode, and real-world harm. Based on this taxonomy, we construct trajectories with heterogeneous tool pools and a long-context delayed-trigger protocol that captures realistic risk emergence across multiple stages. The benchmark contains 1,000 trajectories (503 safe and 497 unsafe), averaging 9.01 turns and 3.95k tokens, with 1,954 invoked tools drawn from pools spanning 2,084 available tools. Data quality is supported by rule-based and LLM-based filtering plus full human audit. Experiments on frontier LLMs, open-source models, and specialized guard systems show that ATBench is challenging even for strong evaluators, while enabling taxonomy-stratified analysis, cross-benchmark comparison, and diagnosis of long-horizon failure patterns.
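The taxonomy-stratified analysis mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Trajectory` record, its field names, the placeholder category values (`source_a`, `mode_x`, and so on), and the helpers `stratified_accuracy` and `majority_baseline` are hypothetical and do not come from the ATBench release. Only the three taxonomy dimensions and the 503/497 split are taken from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical record; field names are illustrative, not ATBench's schema."""
    traj_id: str
    unsafe: bool          # ground-truth safe/unsafe label
    risk_source: str      # taxonomy dimension 1 (placeholder values below)
    failure_mode: str     # taxonomy dimension 2
    real_world_harm: str  # taxonomy dimension 3


def stratified_accuracy(
    trajectories: Iterable[Trajectory],
    predict_unsafe: Callable[[Trajectory], bool],
    dimension: str,
) -> Dict[str, float]:
    """Accuracy of safe/unsafe verdicts, grouped by one taxonomy dimension."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for traj in trajectories:
        bucket = getattr(traj, dimension)
        total[bucket] += 1
        if predict_unsafe(traj) == traj.unsafe:
            correct[bucket] += 1
    return {bucket: correct[bucket] / total[bucket] for bucket in total}


def majority_baseline(n_safe: int, n_unsafe: int) -> float:
    """Accuracy of always predicting the more frequent class."""
    return max(n_safe, n_unsafe) / (n_safe + n_unsafe)


# Toy data with placeholder category names (not real ATBench categories).
toy = [
    Trajectory("t1", True, "source_a", "mode_x", "harm_p"),
    Trajectory("t2", False, "source_a", "mode_y", "harm_p"),
    Trajectory("t3", True, "source_b", "mode_x", "harm_q"),
]


def always_unsafe(traj: Trajectory) -> bool:
    """Trivial evaluator that flags every trajectory as unsafe."""
    return True


per_source = stratified_accuracy(toy, always_unsafe, "risk_source")
baseline = majority_baseline(503, 497)  # split reported in the abstract
```

With a near-balanced split such as 503 safe versus 497 unsafe, a trivial majority-class evaluator sits close to 50% accuracy, so per-bucket scores are more informative than a single aggregate number.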

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Experiments on frontier LLMs, open-source models, and specialized guard systems show that ATBench is challenging even for strong evaluators, while enabling taxonomy-stratified analysis, cross-benchmark…

WHY NOW

Agent Safety moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.

Opportunity summary

Score 7.0

Pain ATBench provides a realistic and diverse benchmark for evaluating the long-horizon safety of LLM-based agents, enabling better risk assessment and mitigation.

Evidence 0 refs | 0 sources | 33% coverage

Blocker no shell-level blocker reported

Competitive landscape

Segment

Agent Safety

Adoption evidence

No public code link in the paper record yet

Commercial read

7.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

Authors: Yu Li · Haoyu Luo · Yuejin Xie · Yuqian Fu · Zhonghao Yang · Shuai Shao · Qihan Ren · Wanying Qu · Yanwei Fu · Yujiu Yang · Jing Shao · Xia Hu · Dongrui Liu

Published: 2026-04-02 13:26:20 UTC · arXiv: https://arxiv.org/abs/2604.02022 · Research domain: Agent Safety · Viability score: 7 · Commercial readiness: code
