ARXIV:2605.12233 · AGENTS · SUBMITTED 13 MAY · 20:36 UTC · FRESHNESS STALE

Source (verified): PDF linked
PaperPack (verified): citation fields available
Proof (partial): partial proof status

No More, No Less: Task Alignment in Terminal Agents

Sina Mavali · David Pape · Jonathan Evertz · Samira Abedini · Devansh Srivastav · Thorsten Eisenhofer · Sahar Abdelnabi · Lea Schönherr

Introduces TAB, a benchmark for evaluating task alignment in terminal agents, revealing a gap between task completion and selective instruction following.

Ship in 2-4 weeks · Score 5.0 · Evidence partial

Opportunity summary

Pain: Introduces TAB, a benchmark for evaluating task alignment in terminal agents, revealing a gap between task completion and selective instruction following.

Evidence: 0 refs | 5 sources | 83% coverage

Blocker: Evidence partial


PROBLEM

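Terminal agents must interpret instructions encountered in the environment (e.g., README files, code comments, stack traces) and determine their relevance to the task: relevant cues must be followed, whereas irrelevant or misleading ones must be ignored. Existing benchmarks do not capture this ability, because an agent may appear capable by blindly following all instructions, or appear robust by ignoring them altogether.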

METHOD

Full abstract

Terminal agents are increasingly capable of executing complex, long-horizon tasks autonomously from a single user prompt. To do so, they must interpret instructions encountered in the environment (e.g., README files, code comments, stack traces) and determine their relevance to the task. This creates a fundamental challenge: relevant cues must be followed to complete a task, whereas irrelevant or misleading ones must be ignored. Existing benchmarks do not capture this ability. An agent may appear capable by blindly following all instructions, or appear robust by ignoring them altogether. We introduce TAB (Task Alignment Benchmark), a suite of 89 terminal tasks derived from Terminal-Bench 2.1. Each task is intentionally underspecified, with missing information provided as a necessary cue embedded in a natural environmental artifact, alongside a plausible but irrelevant distractor. Solving these tasks requires selectively using the cue while ignoring the distractor. Applying TAB to ten frontier agents reveals a systematic gap between task capability and task alignment. The strongest Terminal-Bench agent achieves high task completion but low task alignment on TAB. Evaluating six prompt-injection defenses further shows that suppressing distractor execution also suppresses the cues required for task completion. These results demonstrate that task-aligned agents require selective use of environmental instructions rather than blanket acceptance or rejection.
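The distinction the abstract draws between task completion and task alignment can be illustrated with a small scoring sketch. This is not the paper's metric: the outcome labels, the `Episode` fields, and the `summarize` function are hypothetical names chosen for illustration, and the sketch assumes an episode counts as aligned only when the task is completed and the distractor is not executed.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    """Hypothetical outcome labels; not taken from the paper."""
    ALIGNED = "aligned"                  # task solved, distractor ignored
    OVER_COMPLIANT = "over_compliant"    # distractor executed
    UNDER_COMPLIANT = "under_compliant"  # distractor ignored, task not solved


@dataclass(frozen=True)
class Episode:
    """One agent run on one task (assumed fields)."""
    task_completed: bool       # the final state passes the task's checks
    distractor_executed: bool  # the agent acted on the irrelevant instruction


def classify(ep: Episode) -> Outcome:
    # Executing the distractor is treated as misaligned even if the task passes.
    if ep.distractor_executed:
        return Outcome.OVER_COMPLIANT
    if ep.task_completed:
        return Outcome.ALIGNED
    return Outcome.UNDER_COMPLIANT


def summarize(episodes: list[Episode]) -> dict[str, float]:
    """Return completion and alignment rates over a non-empty list of episodes."""
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    completed = sum(ep.task_completed for ep in episodes)
    aligned = sum(classify(ep) is Outcome.ALIGNED for ep in episodes)
    return {"completion_rate": completed / n, "alignment_rate": aligned / n}


if __name__ == "__main__":
    # Three idealized policies on four tasks each (made-up data, for illustration).
    follow_all = [Episode(task_completed=True, distractor_executed=True)] * 4
    ignore_all = [Episode(task_completed=False, distractor_executed=False)] * 4
    selective = [Episode(task_completed=True, distractor_executed=False)] * 4
    for name, runs in [("follow_all", follow_all),
                       ("ignore_all", ignore_all),
                       ("selective", selective)]:
        print(name, summarize(runs))
```

Under these assumptions, an agent that follows every instruction scores full completion but zero alignment, an agent that ignores every instruction scores zero on both, and only the selective agent scores well on both, which mirrors the abstract's argument against blanket acceptance or rejection.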

RESULT

ScienceToStartup currently rates this 5.0/10 on the public viability pass. The strongest Terminal-Bench agent achieves high task completion but low task alignment on TAB. A public repository is linked, so build verification can inspect…

WHY NOW

Agents moved forward this cycle; last verified May 2026. Public score 5.0/10. Implementation evidence is present through a linked repository.


Competitive landscape

Segment: Agents
Adoption evidence: Public code linked for build inspection
Commercial read: 5.0/10 public viability
Direct: not classified
Adjacent: not classified
Substitute: not classified
Unknown: not classified

Page metadata

arXiv: https://arxiv.org/abs/2605.12233
Date published: 2026-05-12
Code repository: https://github.com/Dormant-Neurons/tab
Page: https://sciencetostartup.com/paper/no-more-no-less-task-alignment-in-terminal-agents
Research domain: Agents
