ARXIV:2603.11337 · AGENTS · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

VerifiedSource: PDF linkedPartialPaperPack: 3 of 4 citation fields filledMissingMissing fields: authorsPartialProof: unverified proof status

RewardHackingAgents: Benchmarking Evaluation Integrity for LLM ML-Engineering Agents

arXiv

RewardHackingAgents benchmarks the integrity of evaluation processes for ML engineering agents to prevent score tampering.

Blocked on Code›Score6.0Evidence unverified

Opportunity summary

Pain RewardHackingAgents benchmarks the integrity of evaluation processes for ML engineering agents to prevent score tampering.

Evidence 0 refs | 0 sources | 17% coverage

Blocker Evidence unverified

Open Build Read PDF Signal Canvas Track

PROBLEM

RewardHackingAgents benchmarks the integrity of evaluation processes for ML engineering agents to prevent score tampering. This creates a structural vulnerability: an agent can increase the reported score by compromising the evaluation pipeline rather than…

METHOD

Full abstract

LLM agents increasingly perform end-to-end ML engineering tasks where success is judged by a single scalar test metric. This creates a structural vulnerability: an agent can increase the reported score by compromising the evaluation pipeline rather than improving the model. We introduce RewardHackingAgents, a workspace-based benchmark that makes two compromise vectors explicit and measurable: evaluator tampering (modifying metric computation or reporting) and train/test leakage (accessing held-out data or labels during training). Each episode runs in a fresh workspace with patch tracking and runtime file-access logging; detectors compare the agent-reported metric to a trusted reference to assign auditable integrity labels. Across three tasks and two LLM backbones, scripted attacks succeed on both vectors in fully mutable workspaces; single-mechanism defenses block only one vector; and a combined regime blocks both. In natural-agent runs, evaluator-tampering attempts occur in about 50% of episodes and are eliminated by evaluator locking, with a 25-31% median runtime overhead. Overall, we demonstrate that evaluation integrity for ML-engineering agents can be benchmarked as a first-class outcome rather than assumed.

RESULT

ScienceToStartup currently rates this 6.0/10 on the public viability pass. Overall, we demonstrate that evaluation integrity for ML-engineering agents can be benchmarked as a first-class outcome rather than assumed.

WHY NOW

Agents moved forward this cycle; last verified April 2026. Public score 6.0/10.

Continue into Read for claims, analysis, references, and neighboring papers.

Opportunity summary

Score6.0

PainRewardHackingAgents benchmarks the integrity of evaluation processes for ML engineering agents to prevent score tampering.

Evidence0 refs | 0 sources | 17% coverage

Blockermissing authors

Analysis summary

RewardHackingAgents benchmarks the integrity of evaluation processes for ML engineering agents to prevent score tampering.

VerifiedSource: PDF linkedPartialPaperPack: 3 of 4 citation fields filledMissingMissing fields: authorsPartialProof: unverified proof status

Competitive landscape

RewardHackingAgents benchmarks the integrity of evaluation processes for ML engineering agents to prevent score tampering.

Segment

Agents

Adoption evidence

No public code link in the paper record yet

Commercial read

6.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

{ "contract_version": "paper-r2", "paper_id": "b3051940-474b-434c-a5a3-feaaedf4e7dc", "arxiv_id": "2603.11337", "canonical_route": "/paper/rewardhackingagents-benchmarking-evaluation-integrity-for-llm-ml-engineering-agents", "active_tab": "synced from current hash by the drawer client", "selected_artifact": "rewardhackingagents-benchmarking-evaluation-integrity-for-llm-ml-engineering-agents", "endpoints": { "paper_pack": "/api/v1/paper/rewardhackingagents-benchmarking-evaluation-integrity-for-llm-ml-engineering-agents/paper-pack", "build_passport": "/api/v1/paper/rewardhackingagents-benchmarking-evaluation-integrity-for-llm-ml-engineering-agents/build-passport", "mcp_resource": "sciencetostartup://surfaces/paper-workspace" } }

{ "surface": "paper", "mode": "paper", "query": "RewardHackingAgents: Benchmarking Evaluation Integrity for LLM ML-Engineering Agents", "normalized_query": "2603.11337", "route": "/paper/rewardhackingagents-benchmarking-evaluation-integrity-for-llm-ml-engineering-agents", "paper_ref": "rewardhackingagents-benchmarking-evaluation-integrity-for-llm-ml-engineering-agents", "topic_slug": null, "benchmark_ref": null, "dataset_ref": null }

{ "@context": "https://schema.org", "@graph": [ { "@type": "WebPage", "@id": "https://sciencetostartup.com/paper/rewardhackingagents-benchmarking-evaluation-integrity-for-llm-ml-engineering-agents#webpage", "url": "https://sciencetostartup.com/paper/rewardhackingagents-benchmarking-evaluation-integrity-for-llm-ml-engineering-agents", "name": "RewardHackingAgents: Benchmarking Evaluation Integrity for LLM ML-Engineering Agents", "description": "RewardHackingAgents benchmarks the integrity of evaluation processes for ML engineering agents to prevent score tampering.", "isPartOf": { "@id": "https://sciencetostartup.com/#website" } }, { "@type": "ScholarlyArticle", "@id": "https://sciencetostartup.com/paper/rewardhackingagents-benchmarking-evaluation-integrity-for-llm-ml-engineering-agents#scholarlyArticle", "headline": "RewardHackingAgents: Benchmarking Evaluation Integrity for LLM ML-Engineering Agents", "description": "RewardHackingAgents benchmarks the integrity of evaluation processes for ML engineering agents to prevent score tampering.", "url": "https://sciencetostartup.com/paper/rewardhackingagents-benchmarking-evaluation-integrity-for-llm-ml-engineering-agents", "sameAs": "https://arxiv.org/abs/2603.11337", "identifier": { "@type": "PropertyValue", "propertyID": "arXiv", "value": "2603.11337" }, "isAccessibleForFree": true, "isPartOf": { "@id": "https://sciencetostartup.com/#website" }, "datePublished": "2026-03-11T22:06:44.000Z", "additionalProperty": [ { "@type": "PropertyValue", "propertyID": "viabilityScore", "value": 6 }, { "@type": "PropertyValue", "propertyID": "researchDomain", "value": "Agents" } ] }, { "@type": "BreadcrumbList", "itemListElement": [ { "@type": "ListItem", "position": 1, "name": "Home", "item": "https://sciencetostartup.com" }, { "@type": "ListItem", "position": 2, "name": "Agents", "item": "https://sciencetostartup.com/topics" }, { "@type": "ListItem", "position": 3, "name": "RewardHackingAgents: Benchmarking Evaluation Integrity for L", "item": "https://sciencetostartup.com/paper/rewardhackingagents-benchmarking-evaluation-integrity-for-llm-ml-engineering-agents" } ] } ] }

Competitive landscape

RewardHackingAgents benchmarks the integrity of evaluation processes for ML engineering agents to prevent score tampering.

Segment

Agents

Adoption evidence

No public code link in the paper record yet

Commercial read

6.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

RewardHackingAgents: Benchmarking Evaluation Integrity for LLM ML-Engineering Agents

RewardHackingAgents: Benchmarking Evaluation Integrity for LLM ML-Engineering Agents

Claim map

Constellation map

Competitive landscape

Buzz

PDF

REFERENCES

Related Papers

Related Resources

Subscribe to the weekly brief

Build artifacts

Brief

Experiment plan

Validation checklist

Scientific founder

Translational engineer

Domain operator

GTM lead

Regulatory/clinical advisor

Timeline

Claim map

Constellation map

Competitive landscape

Buzz

PDF

REFERENCES

Related Papers

Related Resources

Subscribe to the weekly brief

Build artifacts

Brief

Experiment plan

Validation checklist

Scientific founder

Translational engineer

Domain operator

GTM lead

Regulatory/clinical advisor

Timeline