ARXIV:2604.28093 · LLM EVALUATION · SUBMITTED 01 MAY · 15:05 UTC · FRESHNESS STALE

VerifiedSource: PDF linkedVerifiedPaperPack: citation fields availablePartialProof: unverified proof status

What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design

Ivan Bercovich · arXiv

Guidelines for creating adversarial, difficult, and legible benchmark tasks for terminal-agent evaluations to improve LLM capability assessment.

Ship in 2-4 weeks›Score4.0Evidence unverified

Opportunity summary

Pain Guidelines for creating adversarial, difficult, and legible benchmark tasks for terminal-agent evaluations to improve LLM capability assessment.

Evidence 0 refs | 3 sources | 50% coverage

Blocker Evidence unverified

Open Build Read PDF Signal Canvas Track

PROBLEM

Guidelines for creating adversarial, difficult, and legible benchmark tasks for terminal-agent evaluations to improve LLM capability assessment. As the market for evaluation environments grows, so does the pressure to ship tasks quickly, often without…

METHOD

Full abstract

Terminal-agent benchmarks have become a primary signal for measuring the coding and system-administration capabilities of large language models. As the market for evaluation environments grows, so does the pressure to ship tasks quickly, often without thorough adversarial review of the verification logic. This paper is a guideline for writing good benchmark tasks, drawn from over a year of contributing to and reviewing tasks for Terminal Bench. Most people write benchmark tasks the way they write prompts. They shouldn't. A prompt is designed to help the agent succeed; a benchmark is designed to find out if it can. We argue that good tasks are adversarial, difficult, and legible, and that a large class of common failure modes -- AI-generated instructions, over-prescriptive specifications, clerical difficulty, oracle solutions that assume hidden knowledge, tests that validate the wrong things, and reward-hackable environments -- are predictable consequences of treating task authoring as prompt authoring. We catalog these failure modes, argue that real difficulty is conceptual rather than environmental, and discuss recent empirical evidence that over 15% of tasks in popular terminal-agent benchmarks are reward-hackable. We hope this serves as a useful reference for benchmark maintainers, task contributors, and researchers using benchmark scores as evidence.

RESULT

ScienceToStartup currently rates this 4.0/10 on the public viability pass. We hope this serves as a useful reference for benchmark maintainers, task contributors, and researchers using benchmark scores as evidence. Code availability is flagged…

WHY NOW

LLM Evaluation moved forward this cycle; last verified May 2026. Public score 4.0/10. Production flags indicate code availability.

Continue into Read for claims, analysis, references, and neighboring papers.

Opportunity summary

Score4.0

PainGuidelines for creating adversarial, difficult, and legible benchmark tasks for terminal-agent evaluations to improve LLM capability assessment.

Evidence0 refs | 3 sources | 50% coverage

Blockerno shell-level blocker reported

Analysis summary

Guidelines for creating adversarial, difficult, and legible benchmark tasks for terminal-agent evaluations to improve LLM capability assessment.

VerifiedSource: PDF linkedVerifiedPaperPack: citation fields availablePartialProof: unverified proof status

Competitive landscape

Guidelines for creating adversarial, difficult, and legible benchmark tasks for terminal-agent evaluations to improve LLM capability assessment.

Segment

LLM Evaluation

Adoption evidence

No public code link in the paper record yet

Commercial read

4.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

{ "contract_version": "paper-r2", "paper_id": "5edf6129-c338-4129-8356-174c7e7a82b6", "arxiv_id": "2604.28093", "canonical_route": "/paper/what-makes-a-good-terminal-agent-benchmark-task-a-guideline-for-adversarial-difficult-and-legible-evaluation-design", "active_tab": "synced from current hash by the drawer client", "selected_artifact": "what-makes-a-good-terminal-agent-benchmark-task-a-guideline-for-adversarial-difficult-and-legible-evaluation-design", "endpoints": { "paper_pack": "/api/v1/paper/what-makes-a-good-terminal-agent-benchmark-task-a-guideline-for-adversarial-difficult-and-legible-evaluation-design/paper-pack", "build_passport": "/api/v1/paper/what-makes-a-good-terminal-agent-benchmark-task-a-guideline-for-adversarial-difficult-and-legible-evaluation-design/build-passport", "mcp_resource": "sciencetostartup://surfaces/paper-workspace" } }

{ "surface": "paper", "mode": "paper", "query": "What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design", "normalized_query": "2604.28093", "route": "/paper/what-makes-a-good-terminal-agent-benchmark-task-a-guideline-for-adversarial-difficult-and-legible-evaluation-design", "paper_ref": "what-makes-a-good-terminal-agent-benchmark-task-a-guideline-for-adversarial-difficult-and-legible-evaluation-design", "topic_slug": null, "benchmark_ref": null, "dataset_ref": null }

{ "@context": "https://schema.org", "@graph": [ { "@type": "WebPage", "@id": "https://sciencetostartup.com/paper/what-makes-a-good-terminal-agent-benchmark-task-a-guideline-for-adversarial-difficult-and-legible-evaluation-design#webpage", "url": "https://sciencetostartup.com/paper/what-makes-a-good-terminal-agent-benchmark-task-a-guideline-for-adversarial-difficult-and-legible-evaluation-design", "name": "What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design", "description": "Guidelines for creating adversarial, difficult, and legible benchmark tasks for terminal-agent evaluations to improve LLM capability assessment.", "isPartOf": { "@id": "https://sciencetostartup.com/#website" } }, { "@type": "ScholarlyArticle", "@id": "https://sciencetostartup.com/paper/what-makes-a-good-terminal-agent-benchmark-task-a-guideline-for-adversarial-difficult-and-legible-evaluation-design#scholarlyArticle", "headline": "What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design", "description": "Guidelines for creating adversarial, difficult, and legible benchmark tasks for terminal-agent evaluations to improve LLM capability assessment.", "url": "https://sciencetostartup.com/paper/what-makes-a-good-terminal-agent-benchmark-task-a-guideline-for-adversarial-difficult-and-legible-evaluation-design", "sameAs": "https://arxiv.org/abs/2604.28093", "identifier": { "@type": "PropertyValue", "propertyID": "arXiv", "value": "2604.28093" }, "isAccessibleForFree": true, "isPartOf": { "@id": "https://sciencetostartup.com/#website" }, "datePublished": "2026-04-30T16:37:37.000Z", "author": [ { "@type": "Person", "name": "Ivan Bercovich" } ], "additionalProperty": [ { "@type": "PropertyValue", "propertyID": "viabilityScore", "value": 4 }, { "@type": "PropertyValue", "propertyID": "researchDomain", "value": "LLM Evaluation" }, { "@type": "PropertyValue", "propertyID": "commercialReadiness", "value": "code" } ] }, { "@type": "BreadcrumbList", "itemListElement": [ { "@type": "ListItem", "position": 1, "name": "Home", "item": "https://sciencetostartup.com" }, { "@type": "ListItem", "position": 2, "name": "LLM Evaluation", "item": "https://sciencetostartup.com/topics" }, { "@type": "ListItem", "position": 3, "name": "What Makes a Good Terminal-Agent Benchmark Task: A Guideline", "item": "https://sciencetostartup.com/paper/what-makes-a-good-terminal-agent-benchmark-task-a-guideline-for-adversarial-difficult-and-legible-evaluation-design" } ] } ] }

Competitive landscape

Guidelines for creating adversarial, difficult, and legible benchmark tasks for terminal-agent evaluations to improve LLM capability assessment.

Segment

LLM Evaluation

Adoption evidence

No public code link in the paper record yet

Commercial read

4.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design

What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design

Claim map

Constellation map

Competitive landscape

Buzz

PDF

REFERENCES

Related Papers

Related Resources

Subscribe to the weekly brief

Build artifacts

Brief

Experiment plan

Validation checklist

Scientific founder

Translational engineer

Domain operator

GTM lead

Regulatory/clinical advisor

Timeline

Claim map

Constellation map

Competitive landscape

Buzz

PDF

REFERENCES

Related Papers

Related Resources

Subscribe to the weekly brief

Build artifacts

Brief

Experiment plan

Validation checklist

Scientific founder

Translational engineer

Domain operator

GTM lead

Regulatory/clinical advisor

Timeline