How can I use vision foundation models for image captioning? | ScienceToStartup