How can I use vision foundation models for image captioning with high semantic accuracy?

Answer not yet generated.