When RLHF Fails: A Mechanistic Taxonomy of Reward Hacking, Collapse, and Evaluator Gaming | Signal Canvas | ScienceToStartup