When RLHF Fails: A Mechanistic Taxonomy of Reward Hacking, Collapse, and Evaluator Gaming | Buildability Receipt | ScienceToStartup