ARXIV:2605.12673 · AI AGENT BENCHMARKING · SUBMITTED 14 MAY · 20:10 UTC · FRESHNESS FRESH

Verified · Source: PDF linked | Verified · PaperPack: citation fields available | Partial · Proof: unverified proof status

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

Hao Wang · Hanchen Li · Qiuyang Mang · Alvin Cheung · Koushik Sen · Dawn Song · arXiv

An automated red-teaming system that audits AI agent benchmarks to identify and patch reward-hacking exploits, ensuring more robust evaluations.

Ship in 2-4 weeks · Score 8.0 · Evidence unverified

Opportunity summary

Pain An automated red-teaming system that audits AI agent benchmarks to identify and patch reward-hacking exploits, ensuring more robust evaluations.

Evidence 0 refs | 0 sources | 0% coverage

Blocker Evidence unverified

PROBLEM

Agent benchmarks have become the de facto measure of frontier AI competence, guiding model selection, investment, and deployment. However, reward hacking, where agents maximize a score without performing the intended task, emerges spontaneously…

METHOD

Full abstract

Agent benchmarks have become the de facto measure of frontier AI competence, guiding model selection, investment, and deployment. However, reward hacking, where agents maximize a score without performing the intended task, emerges spontaneously in frontier models without overfitting. We argue that benchmarks must be secure by design. From past incidents of reward hacks, we derive a taxonomy of eight recurring flaw patterns and compile them into the Agent-Eval Checklist for benchmark designers. We condense the insights into BenchJack, an automated red-teaming system that drives coding agents to audit benchmarks and identify possible reward-hacking exploits in a clairvoyant manner. Moreover, we extend BenchJack to an iterative generative-adversarial pipeline that discovers new flaws and patches them iteratively to improve benchmark robustness. We apply BenchJack to 10 popular agent benchmarks spanning software engineering, web navigation, desktop computing, and terminal operations. BenchJack synthesizes reward-hacking exploits that achieve near-perfect scores on most of the benchmarks without solving a single task, surfacing 219 distinct flaws across the eight classes. Moreover, BenchJack's extended pipeline reduces the hackable-task ratio from near 100% to under 10% on four benchmarks without fatal design flaws, fully patching WebArena and OSWorld within three iterations. Our results show that evaluation pipelines have not internalized an adversarial mindset, and that proactive auditing could help close the security gap for the fast-paced benchmarking space.
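The audit-then-patch loop and the hackable-task ratio mentioned in the abstract can be illustrated with a toy sketch. This is not BenchJack's implementation: the `Task` class, the fixed `EXPLOITS` list, and the blacklist-style `patch` function are all hypothetical stand-ins, assuming a task counts as hackable when some non-solution payload passes its checker.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Task:
    """Hypothetical benchmark task: a name plus a pass/fail checker."""
    name: str
    checker: Callable[[str], bool]

# Hypothetical payloads standing in for exploits an auditing agent might try.
EXPLOITS: List[str] = ["", "PASS", "exit 0"]

def find_exploit(task: Task) -> Optional[str]:
    """Return the first payload that passes the checker without solving the task."""
    for payload in EXPLOITS:
        if task.checker(payload):
            return payload
    return None

def hackable_ratio(tasks: List[Task]) -> float:
    """Fraction of tasks for which at least one exploit payload passes."""
    if not tasks:
        return 0.0
    return sum(find_exploit(t) is not None for t in tasks) / len(tasks)

def patch(task: Task, payload: str) -> Task:
    """Toy patch: reject the discovered payload, otherwise defer to the old checker."""
    old = task.checker
    def patched(submission: str) -> bool:
        return submission != payload and old(submission)
    return Task(task.name, patched)

def audit_and_patch(tasks: List[Task], max_iters: int = 3) -> Tuple[List[Task], List[float]]:
    """Alternate between finding exploits and patching, recording the ratio each round."""
    history = [hackable_ratio(tasks)]
    for _ in range(max_iters):
        if history[-1] == 0.0:
            break
        patched_tasks = []
        for t in tasks:
            payload = find_exploit(t)
            patched_tasks.append(patch(t, payload) if payload is not None else t)
        tasks = patched_tasks
        history.append(hackable_ratio(tasks))
    return tasks, history

def demo_tasks() -> List[Task]:
    """Four made-up tasks; three have deliberately weak checkers."""
    return [
        Task("accepts-anything", lambda s: True),
        Task("substring-match", lambda s: "PASS" in s),
        Task("exact-answer", lambda s: s == "42"),
        Task("blank-is-ok", lambda s: s.strip() == ""),
    ]

if __name__ == "__main__":
    _, history = audit_and_patch(demo_tasks())
    print(history)  # [0.75, 0.25, 0.25, 0.0]
```

In this toy run the ratio stalls at 0.25 for one round because the checker that accepts anything yields a new exploit after each blacklist entry, which is why a real patch would need to fix the evaluator itself rather than enumerate bad payloads.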

RESULT

ScienceToStartup currently rates this 8.0/10 on the public viability pass. The paper extends BenchJack to an iterative generative-adversarial pipeline that discovers new flaws and patches them iteratively to improve benchmark robustness. Code availability is…

WHY NOW

AI Agent Benchmarking moved forward this cycle; last verified May 2026. Public score 8.0/10. Production flags indicate code availability.

Opportunity summary

Score 8.0

Pain An automated red-teaming system that audits AI agent benchmarks to identify and patch reward-hacking exploits, ensuring more robust evaluations.

Evidence 0 refs | 0 sources | 0% coverage

Blocker no shell-level blocker reported

Analysis summary

An automated red-teaming system that audits AI agent benchmarks to identify and patch reward-hacking exploits, ensuring more robust evaluations.

Competitive landscape

An automated red-teaming system that audits AI agent benchmarks to identify and patch reward-hacking exploits, ensuring more robust evaluations.

Segment: AI Agent Benchmarking

Adoption evidence: No public code link in the paper record yet

Commercial read: 8.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

arXiv: https://arxiv.org/abs/2605.12673 · Published 2026-05-12T19:22:45.000Z
