ARXIV:2603.28063 · AI ALIGNMENT THEORY · SUBMITTED 31 MAR · 20:24 UTC · FRESHNESS STALE

Source: PDF linked (Verified) · PaperPack: citation fields available (Verified) · Proof: unverified proof status (Partial)

Reward Hacking as Equilibrium under Finite Evaluation

Jiacheng Wang · Jinbin Huang · arXiv

This paper theoretically proves that reward hacking is an inherent structural equilibrium in AI systems, not a bug, and proposes a computable index to predict its severity.

Blocked on Code · Score 3.0 · Evidence unverified

Opportunity summary

Pain: This paper theoretically proves that reward hacking is an inherent structural equilibrium in AI systems, not a bug, and proposes a computable index to predict its severity.

Evidence: 20 refs | 3 sources | 50% coverage

Blocker: Evidence unverified

PROBLEM

This paper theoretically proves that reward hacking is an inherent structural equilibrium in AI systems, not a bug, and proposes a computable index to predict its severity. This result establishes reward hacking as a…

METHOD

Full abstract

We prove that under five minimal axioms -- multi-dimensional quality, finite evaluation, effective optimization, resource finiteness, and combinatorial interaction -- any optimized AI agent will systematically under-invest effort in quality dimensions not covered by its evaluation system. This result establishes reward hacking as a structural equilibrium, not a correctable bug, and holds regardless of the specific alignment method (RLHF, DPO, Constitutional AI, or others) or evaluation architecture employed. Our framework instantiates the multi-task principal-agent model of Holmstrom and Milgrom (1991) in the AI alignment setting, but exploits a structural feature unique to AI systems -- the known, differentiable architecture of reward models -- to derive a computable distortion index that predicts both the direction and severity of hacking on each quality dimension prior to deployment. We further prove that the transition from closed reasoning to agentic systems causes evaluation coverage to decline toward zero as tool count grows -- because quality dimensions expand combinatorially while evaluation costs grow at most linearly per tool -- so that hacking severity increases structurally and without bound. Our results unify the explanation of sycophancy, length gaming, and specification gaming under a single theoretical structure and yield an actionable vulnerability assessment procedure. We further conjecture -- with partial formal analysis -- the existence of a capability threshold beyond which agents transition from gaming within the evaluation system (Goodhart regime) to actively degrading the evaluation system itself (Campbell regime), providing the first economic formalization of Bostrom's (2014) "treacherous turn."
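The coverage argument in the abstract (quality dimensions expanding combinatorially while evaluation cost grows at most linearly per tool) can be illustrated with a small numerical sketch. This is not the paper's model: the counting rule (one quality dimension per tool plus one per pair of tools) and the `checks_per_tool` budget are assumptions chosen only to show the shape of the effect, and all function names are hypothetical.

```python
from math import comb


def quality_dimensions(n_tools: int, max_interaction: int = 2) -> int:
    """Assumed counting rule: one quality dimension per non-empty
    tool subset of size <= max_interaction (pairwise by default)."""
    return sum(comb(n_tools, k) for k in range(1, max_interaction + 1))


def evaluated_dimensions(n_tools: int, checks_per_tool: int = 3) -> int:
    """Assumed evaluation budget: a fixed number of checks per tool,
    so the number of evaluated dimensions grows linearly in n_tools."""
    return checks_per_tool * n_tools


def coverage(n_tools: int, max_interaction: int = 2, checks_per_tool: int = 3) -> float:
    """Fraction of quality dimensions the evaluator can cover, capped at 1."""
    total = quality_dimensions(n_tools, max_interaction)
    return min(1.0, evaluated_dimensions(n_tools, checks_per_tool) / total)


if __name__ == "__main__":
    # Coverage stays at 1.0 for small tool counts, then falls as pairs dominate.
    for n in (1, 5, 10, 50, 100):
        print(f"tools={n:4d}  dims={quality_dimensions(n):6d}  coverage={coverage(n):.3f}")
```

Under these assumptions the dimension count is n(n+1)/2 and coverage reduces to 6/(n+1) once the budget is exceeded, so it tends to zero as the tool count grows. Allowing higher-order interactions (a larger `max_interaction`) makes the decline faster; the paper's own axioms and distortion index are not reproduced here.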

RESULT

ScienceToStartup currently rates this 3.0/10 on the public viability pass. This result establishes reward hacking as a structural equilibrium, not a correctable bug, and holds regardless of the specific alignment method (RLHF, DPO, Constitutional…

WHY NOW

AI Alignment Theory moved forward this cycle; last verified April 2026. Public score 3.0/10.

Competitive landscape

Segment: AI Alignment Theory

Adoption evidence: No public code link in the paper record yet

Commercial read: 3.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

