Post-Training with Policy Gradients: Optimality and the Base Model Barrier | ScienceToStartup | ScienceToStartup