Post-Training with Policy Gradients: Optimality and the Base Model Barrier | Signal Canvas | ScienceToStartup