How can I use vision foundation models for visual question answering on complex datasets?Answer not yet generated.