Recent advances in multimodal AI are addressing critical challenges in model reliability and efficiency, particularly hallucinations and data selection. New benchmarks and tuning methods, such as FINER and ScalSelect, improve the accuracy of multimodal large language models (MLLMs) by targeting fine-grained queries and optimizing data usage, respectively. Meanwhile, frameworks like FiLoRA and Cheers enable finer control over feature reliance and unify visual comprehension with generation, which could significantly improve applications such as sentiment analysis and creative content generation. Models like MoST, which integrates speech and text within one architecture, highlight a shift toward specialized designs that exploit modality-specific learning. These developments promise not only better performance across benchmarks but also progress on practical problems of data efficiency and model interpretability, making multimodal systems more viable for commercial, real-world applications.
Multimodal foundation models integrate heterogeneous signals across modalities, yet it remains poorly understood how their predictions depend on specific internal feature groups and whether such relia...
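This snippet asks how a multimodal model's predictions depend on specific internal feature groups. One common way to probe such reliance (not necessarily the method this paper proposes) is to ablate a feature group at an intermediate layer and measure how much the output distribution shifts. The sketch below is a minimal illustration under that assumption; `model`, `layer`, and `feature_idx` are hypothetical placeholders, not names from the paper.

```python
# Hypothetical sketch: quantify reliance on a group of internal features by
# zero-ablating them and measuring the KL shift in the model's predictions.
# The model interface, hook location, and feature indices are illustrative
# assumptions, not the paper's actual protocol.
import torch
import torch.nn.functional as F


def feature_group_reliance(model, layer, feature_idx, inputs):
    """Return mean KL(p_original || p_ablated) over a batch of inputs."""

    def ablate_hook(module, inp, out):
        out = out.clone()
        out[..., feature_idx] = 0.0  # zero out the chosen feature group
        return out

    with torch.no_grad():
        # Assumes model(inputs) returns class logits of shape (batch, classes).
        p_orig = F.log_softmax(model(inputs), dim=-1)
        handle = layer.register_forward_hook(ablate_hook)
        try:
            p_ablt = F.log_softmax(model(inputs), dim=-1)
        finally:
            handle.remove()

    # KL divergence between the original and ablated predictive distributions.
    return F.kl_div(p_ablt, p_orig, log_target=True, reduction="batchmean").item()
```

A large divergence suggests the predictions lean heavily on the ablated features; the paper itself may instead use activation patching, probing classifiers, or gradient-based attributions.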
This report presents our winning solution to the 5th PVUW MeViS-Text Challenge. The track studies referring video object segmentation under motion-centric language expressions, where the model must jo...
Multimodal Large Language Models (MLLMs) have demonstrated impressive progress in single-image grounding and general multi-image understanding. Recently, some methods have begun to address multi-image grou...
Large-scale Visual Instruction Tuning (VIT) has become a key paradigm for advancing the performance of vision-language models (VLMs) across various multimodal tasks. However, training on the large-sca...
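The abstract is cut off before it describes the selection criterion, so the following is only a generic illustration of loss-based subset selection for visual instruction tuning, under assumed HuggingFace-style model and dataset interfaces; it is not this paper's actual method.

```python
# Hypothetical sketch: pick a compact visual-instruction-tuning subset by
# scoring each sample with a small reference model's loss and keeping the
# highest-scoring ones. The scoring rule, interfaces, and budget are
# illustrative assumptions only.
import heapq
import torch


@torch.no_grad()
def select_subset(reference_model, dataset, budget: int):
    """Return indices of the `budget` samples with the highest reference loss."""
    scored = []
    for idx, sample in enumerate(dataset):
        # Assumes each sample provides tokenized inputs and labels accepted by
        # the reference model; a higher loss is treated as "harder / less redundant".
        loss = reference_model(**sample["inputs"], labels=sample["labels"]).loss
        scored.append((loss.item(), idx))
    return [idx for _, idx in heapq.nlargest(budget, scored)]
```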
A recent cutting-edge direction in multimodal modeling is unifying visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual represent...
We present MoST (Mixture of Speech and Text), a novel multimodal large language model that seamlessly integrates speech and text processing through our proposed Modality-Aware Mixture of Experts (MAMo...
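The abstract names a Modality-Aware Mixture of Experts but does not spell out the routing rule, so the sketch below is a hypothetical PyTorch illustration of one plausible design: each token is routed only among experts reserved for its modality, selected by a per-modality top-1 gate. Every name, expert count, and routing choice here is an assumption, not the MoST implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityAwareMoE(nn.Module):
    """Hypothetical modality-aware expert routing: each token is routed only
    among the experts reserved for its modality. Illustrative only."""

    def __init__(self, d_model: int, experts_per_modality: int = 4, n_modalities: int = 2):
        super().__init__()
        self.experts_per_modality = experts_per_modality
        # A separate pool of feed-forward experts for each modality.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_modalities * experts_per_modality)
        ])
        # One router per modality, scoring only that modality's experts.
        self.routers = nn.ModuleList([
            nn.Linear(d_model, experts_per_modality) for _ in range(n_modalities)
        ])

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model); modality_ids: (n_tokens,) in [0, n_modalities)
        out = torch.zeros_like(x)
        for m, router in enumerate(self.routers):
            mask = modality_ids == m
            if not mask.any():
                continue
            tokens = x[mask]
            # Top-1 gating within this modality's expert pool (kept simple here).
            gate = F.softmax(router(tokens), dim=-1)
            weight, local_idx = gate.max(dim=-1)
            expert_idx = m * self.experts_per_modality + local_idx
            routed = torch.zeros_like(tokens)
            for e in expert_idx.unique().tolist():
                sel = expert_idx == e
                routed[sel] = self.experts[e](tokens[sel])
            out[mask] = weight.unsqueeze(-1) * routed
        return out
```

In practice such a layer would replace the feed-forward block inside each Transformer layer and take a per-token modality tag alongside the hidden states; MoST's actual expert counts, gating, and any load-balancing objectives may differ.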
Multimodal large language models (MLLMs) struggle with hallucinations, particularly with fine-grained queries, a challenge underrepresented by existing benchmarks that focus on coarse image-related qu...
Multimodal Large Language Models (MLLMs) have demonstrated remarkable reasoning capabilities across modalities such as images and text. However, tabular data, despite being a critical real-world modal...
While Omni-modal Large Language Models have made strides in joint sensory processing, they fundamentally struggle with a cornerstone of human interaction: deciphering complex, multi-person conversatio...
As large language models (LLMs) continue to advance, there is increasing interest in their ability to infer human mental states and demonstrate a human-like Theory of Mind (ToM). Most existing ToM eva...