Reward Hacking as Equilibrium under Finite Evaluation
Use This Via API or MCP
Route this paper proof surface into REST, MCP, or developer workflows while preserving the same evidence receipt and related-resource context.
Page Freshness
Signal Canvas proof surface
Canonical route: /signal-canvas/reward-hacking-as-equilibrium-under-finite-evaluation
- Proof freshness: stale
- Proof status: unverified
- Display score: 3/10
- Last proof check: 2026-03-31
- Score updated: 2026-04-02
- Score fresh until: 2026-05-02
- References: 20
- Source count: 3
- Coverage: 50%
This page is showing the last landed evidence receipt and score bundle because the latest proof data is outside the freshness window.
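As a rough illustration of a freshness-window check, the sketch below compares a date against the "Score fresh until" value listed above. The function name `is_within_window` and the inclusive boundary are assumptions made for this example; the page does not document how the service computes freshness, and proof freshness may follow a different rule than the score window.

```python
from datetime import date


def is_within_window(fresh_until: str, today: date) -> bool:
    """Return True if `today` is on or before the ISO date `fresh_until`.

    Hypothetical helper: the inclusive boundary is an assumption,
    not documented behavior of the service.
    """
    return today <= date.fromisoformat(fresh_until)


if __name__ == "__main__":
    # "Score fresh until" value taken from the list above.
    score_fresh_until = "2026-05-02"
    print(is_within_window(score_fresh_until, date(2026, 4, 15)))
    print(is_within_window(score_fresh_until, date(2026, 5, 3)))
```

A client could run a check like this before deciding whether to trust a cached score bundle or request a new one.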
Agent Handoff
Canonical ID reward-hacking-as-equilibrium-under-finite-evaluation | Route /signal-canvas/reward-hacking-as-equilibrium-under-finite-evaluation
REST example
curl https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas/reward-hacking-as-equilibrium-under-finite-evaluation
MCP example
{
  "tool": "search_signal_canvas",
  "arguments": {
    "mode": "paper",
    "paper_ref": "reward-hacking-as-equilibrium-under-finite-evaluation",
    "query_text": "Summarize Reward Hacking as Equilibrium under Finite Evaluation"
  }
}
source_context
{
  "surface": "signal_canvas",
  "mode": "paper",
  "query": "Reward Hacking as Equilibrium under Finite Evaluation",
  "normalized_query": "2603.28063",
  "route": "/signal-canvas/reward-hacking-as-equilibrium-under-finite-evaluation",
  "paper_ref": "reward-hacking-as-equilibrium-under-finite-evaluation",
  "topic_slug": null,
  "benchmark_ref": null,
  "dataset_ref": null
}
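The REST and MCP examples above can be assembled programmatically. The following is a minimal sketch that only builds the URL and the MCP tool payload from a `paper_ref`; it makes no network call. The helper names `handoff_url` and `mcp_request` are hypothetical, and the base URL and field names are copied from the examples on this page rather than from any published API reference.

```python
import json
from urllib.parse import quote

# Base path copied from the REST example on this page.
BASE_URL = "https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas"


def handoff_url(paper_ref: str) -> str:
    """Build the agent-handoff URL for a paper reference (hypothetical helper)."""
    return f"{BASE_URL}/{quote(paper_ref, safe='')}"


def mcp_request(paper_ref: str, title: str) -> dict:
    """Build an MCP tool payload shaped like the example above (hypothetical helper)."""
    return {
        "tool": "search_signal_canvas",
        "arguments": {
            "mode": "paper",
            "paper_ref": paper_ref,
            "query_text": f"Summarize {title}",
        },
    }


if __name__ == "__main__":
    ref = "reward-hacking-as-equilibrium-under-finite-evaluation"
    title = "Reward Hacking as Equilibrium under Finite Evaluation"
    print(handoff_url(ref))
    print(json.dumps(mcp_request(ref, title), indent=2))
```

Keeping the payload construction separate from transport lets the same `paper_ref` drive either the REST call or the MCP tool invocation.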
Dimensions: overall score 3.0
GitHub Code Pulse
No public code linked for this paper yet.
Claim map
- Evidence (partial): any optimized AI agent will systematically under-invest effort in quality dimensions not covered by its evaluation system
  Implication (partial): Directly stated as the main theorem in the abstract, with an explicit proof claim.
  Verification: partial
- Evidence (partial): This result establishes reward hacking as a structural equilibrium, not a correctable bug, and holds regardless of the specific alignment method (RLHF, DPO, Constitutional AI, or others) or evaluation architecture employed
  Implication (partial): Explicitly stated in the abstract as a conclusion from the main theorem.
  Verification: partial
- Evidence (partial): the transition from closed reasoning to agentic systems causes evaluation coverage to decline toward zero as tool count grows
  Implication (partial): Directly stated in the abstract as a proven result, with an explanation of the causal mechanism.
  Verification: partial
- Evidence (partial): hacking severity increases structurally and without bound
  Implication (partial): Directly stated in the abstract, with an explanation of the causal mechanism.
  Verification: partial
- Evidence (partial): Our results unify the explanation of sycophancy, length gaming, and specification gaming under a single theoretical structure
  Implication (partial): Explicitly stated in the abstract as a key contribution.
  Verification: partial
- Evidence (partial): to derive a computable distortion index that predicts both the direction and severity of hacking on each quality dimension prior to deployment
  Implication (partial): Directly stated in the abstract as a methodological contribution.
  Verification: partial
- Evidence (partial): We further conjecture -- with partial formal analysis -- the existence of a capability threshold beyond which agents transition from gaming within the evaluation system (Goodhart regime) to actively degrading the evaluation system itself (Campbell regime)
  Implication (partial): Presented as a conjecture with partial formal analysis rather than as a proven theorem.
  Verification: partial
- Evidence (partial): providing the first economic formalization of Bostrom's (2014) 'treacherous turn'
  Implication (partial): Explicitly claimed as a contribution, but based on a conjecture rather than a proven theorem.
  Verification: partial
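To make the first claim in the claim map concrete, here is a toy illustration of under-investment on unevaluated quality dimensions. It is not the paper's model and does not compute the paper's distortion index. It assumes, purely for illustration, that an agent picks effort e_i on each dimension to maximize sum(w_i * e_i) - 0.5 * sum(e_i^2), where w_i is the evaluator's weight, so the chosen effort equals w_i, while the ideal effort under true values v_i equals v_i. The function names are hypothetical.

```python
def agent_effort(eval_weights):
    """Effort chosen under the assumed quadratic-cost toy objective: e_i = w_i (floored at 0)."""
    return [max(w, 0.0) for w in eval_weights]


def ideal_effort(true_values):
    """Effort that would be ideal if the true values were rewarded directly: e_i = v_i."""
    return [max(v, 0.0) for v in true_values]


def effort_ratio(eval_weights, true_values):
    """Ratio of chosen to ideal effort per dimension (toy measure, assumes every v_i > 0)."""
    chosen = agent_effort(eval_weights)
    ideal = ideal_effort(true_values)
    return [c / i for c, i in zip(chosen, ideal)]


if __name__ == "__main__":
    # Three equally valuable dimensions: fully, partly, and not evaluated.
    true_values = [1.0, 1.0, 1.0]
    eval_weights = [1.0, 0.5, 0.0]
    print(effort_ratio(eval_weights, true_values))  # prints [1.0, 0.5, 0.0]
```

In this toy setup, a dimension with zero evaluation weight receives zero effort no matter how valuable it is, which mirrors the direction of the claimed result without reproducing its proof.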