Reasoning over mathematical objects: on-policy reward modeling and test time aggregation | ScienceToStartup