What are the challenges in achieving true multimodal understanding with vision language models?

Answer not yet generated.