What is the shift towards specialized multimodal models for fine-grained understanding?Answer not yet generated.