Current research in multimodal learning increasingly focuses on making large multimodal models (LMMs) more adaptable and efficient in real-world applications. Recent work introduces frameworks that address challenges such as modality incompleteness and the cost of training. For instance, High-Fidelity In-Context Learning improves task adaptation by optimizing how demonstration configurations are used, while Dual Decomposed Low-Rank Experts Collaboration tackles continual missing modality learning with an architecture designed to minimize cross-task interference. Integrating vision-language models with knowledge graphs is also improving the alignment of diverse modalities and enabling more robust reasoning. Meanwhile, self-improvement frameworks that train on unlabeled data are gaining traction, signaling a shift away from reliance on costly human-annotated datasets. Collectively, these developments point to a concerted effort to build more versatile and efficient multimodal systems for complex, real-world problems.
In-Context Learning (ICL) is a significant paradigm for Large Multimodal Models (LMMs), using a few in-context demonstrations (ICDs) to adapt to new tasks. However, its performance is sensitive to demonstration configurations...
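To make the ICD mechanism concrete, the sketch below assembles a multimodal few-shot prompt from retrieved demonstrations. It is a minimal sketch of generic multimodal ICL, assuming precomputed image embeddings and an interleaved `<image>` prompt template; the function names are illustrative placeholders, not the paper's method or any particular model's API.

```python
import numpy as np

def select_icds(query_emb, demo_embs, k=4):
    """Pick the k demonstrations whose embeddings are most similar to the query (cosine)."""
    sims = demo_embs @ query_emb / (
        np.linalg.norm(demo_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8
    )
    return np.argsort(-sims)[:k]

def build_prompt(demos, query_question):
    """Interleave demonstration Q/A pairs before the query; images are passed alongside."""
    parts = [f"<image>\nQ: {d['question']}\nA: {d['answer']}" for d in demos]
    parts.append(f"<image>\nQ: {query_question}\nA:")
    return "\n\n".join(parts)
```

Which demonstrations are selected and how they are ordered is exactly the configuration the abstract flags as a source of sensitivity.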
Adapting Large Multimodal Models (LMMs) to real-world scenarios poses the dual challenges of learning from sequential data streams and handling frequent modality incompleteness, a task known as Continual Missing Modality Learning...
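As a point of reference for the low-rank experts idea, here is a minimal LoRA-style sketch in PyTorch: each task gets its own pair of low-rank matrices added to a frozen shared projection, so task-specific updates live in a small subspace and interfere less with earlier tasks. This is a generic construction under those assumptions, not the paper's dual decomposed architecture.

```python
import torch
import torch.nn as nn

class LowRankExpert(nn.Module):
    """Rank-r update up(down(x)) added on top of a frozen base projection."""
    def __init__(self, dim, rank=8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # adapter initially contributes nothing

    def forward(self, x):
        return self.up(self.down(x))

class AdaptedProjection(nn.Module):
    """Frozen shared projection plus one low-rank expert per task."""
    def __init__(self, dim, num_tasks, rank=8):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.weight.requires_grad_(False)  # shared weights stay fixed across tasks
        self.base.bias.requires_grad_(False)
        self.experts = nn.ModuleList([LowRankExpert(dim, rank) for _ in range(num_tasks)])

    def forward(self, x, task_id):
        return self.base(x) + self.experts[task_id](x)
```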
Multimodal large language models (MLLMs) exhibit strong visual-language reasoning, yet remain confined to their native modalities and cannot directly process structured, non-visual data such as human ...
Recent advances in multimodal learning have significantly enhanced the reasoning capabilities of vision-language models (VLMs). However, state-of-the-art approaches rely heavily on large-scale human-annotated data...
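The self-improvement idea mentioned in the overview typically amounts to a pseudo-labeling loop over unlabeled data. The sketch below shows one generic round under assumed interfaces (`generate_with_confidence` is a hypothetical method standing in for whatever scoring or reward a given paper uses); it is illustrative, not any specific framework.

```python
def self_improvement_round(model, unlabeled_pairs, confidence_threshold=0.9):
    """Label unlabeled (image, question) pairs with the current model and keep
    only high-confidence outputs as new fine-tuning examples."""
    new_training_data = []
    for image, question in unlabeled_pairs:
        answer, confidence = model.generate_with_confidence(image, question)
        if confidence >= confidence_threshold:
            new_training_data.append(
                {"image": image, "question": question, "answer": answer}
            )
    return new_training_data
```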
Real-world multimodal knowledge graphs (MKGs) are inherently heterogeneous, modeling entities that are associated with diverse modalities. Traditional knowledge graph embedding (KGE) methods excel at ...
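For readers unfamiliar with the traditional KGE methods the abstract contrasts against, a classic example is TransE, which scores a triple (head, relation, tail) by how closely head + relation lands on tail in embedding space. The snippet below is only that textbook baseline, with randomly initialized embeddings for illustration; it models graph structure alone and says nothing about the multimodal features at issue.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_entities, n_relations = 64, 1000, 20
entity_emb = rng.normal(size=(n_entities, dim))     # one vector per entity
relation_emb = rng.normal(size=(n_relations, dim))  # one vector per relation

def transe_score(h, r, t):
    """Higher (less negative) score = more plausible triple: ||e_h + e_r - e_t|| is small."""
    return -np.linalg.norm(entity_emb[h] + relation_emb[r] - entity_emb[t])
```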
Image-text contrastive pretraining has become a dominant paradigm for visual representation learning, yet existing methods often yield representations that remain partially organized by modality. We p...
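The dominant paradigm referenced here is CLIP-style contrastive pretraining, which a symmetric InfoNCE loss summarizes well. The sketch below is that standard objective (matched image-text pairs on the diagonal are pulled together, all other in-batch pairs pushed apart); it is background for the abstract, not the modality-organization remedy the paper goes on to propose.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched image/text embeddings of shape (B, D)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature                   # (B, B) similarities
    targets = torch.arange(image_emb.size(0), device=logits.device)   # i-th image matches i-th text
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```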
Multimodal large language models (MLLMs) are typically trained in multiple stages, with video-based supervised fine-tuning (Video-SFT) serving as a key step for improving visual understanding. Yet its...
Supervised and unsupervised homography estimation methods depend on image pairs tailored to specific modalities to achieve high accuracy. However, their performance deteriorates substantially when app...
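For context on what homography estimation computes, the classical baseline is the four-point direct linear transform (DLT) below: learned methods regress the same 3x3 matrix (or its four-corner offsets) directly from an image pair instead of from hand-matched correspondences. This is the textbook geometric routine, not any of the learning-based methods the abstract discusses.

```python
import numpy as np

def dlt_homography(src, dst):
    """Estimate H (3x3, dst ~ H @ src in homogeneous coords) from >=4 point pairs (Nx2)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)   # null-space vector = flattened homography
    return H / H[2, 2]         # fix the scale ambiguity
```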
Multimodal Large Language Models (MLLMs) are increasingly applied to pixel-level vision tasks, yet their intrinsic capacity for spatial understanding remains poorly understood. We investigate segmenta...
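As a concrete anchor for "pixel-level", segmentation quality is usually reported as mask IoU; the small helper below computes it and is included only to make the evaluation target explicit, not as part of the paper's probing setup.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two boolean masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: count as perfect agreement
    return float(np.logical_and(pred, gt).sum() / union)
```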