Data-efficient pre-training by scaling synthetic megadocs | ScienceToStartup