Recent advances in multimodal models target key challenges in visual reasoning and semantic understanding, particularly computational efficiency and spatial awareness. Techniques such as frequency-modulated visual restoration let models maintain high accuracy while substantially reducing computational load, which matters for resource-constrained deployments. Cognitively inspired tokens are improving spatial reasoning, helping models interpret visual perspectives and interact with complex environments. Distance-invariant position encoding mitigates visual fading in long-context scenarios, keeping visual signals salient regardless of text length. Together, these innovations aim to balance understanding and generation, enabling more robust applications in areas such as scientific discovery and diagram comprehension.
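The overview names distance-invariant position encoding only at a high level. As a loose illustration of the idea, and not the cited work's actual mechanism, the sketch below clamps the relative positional distance between text queries and visual keys to a constant, so attention to visual tokens need not decay as the text grows; the function name and the clamping scheme are assumptions for illustration.

```python
import torch

def relative_distances(q_pos, k_pos, visual_mask, visual_distance=1.0):
    # Pairwise |query - key| distances, as used by relative/rotary schemes
    # where attention tends to decay with positional distance.
    dist = (q_pos[:, None] - k_pos[None, :]).abs().float()
    # Assumption for illustration: pin the distance to every visual key at a
    # small constant, so visual tokens stay equally "close" to late queries
    # no matter how much text precedes or follows them.
    dist[:, visual_mask] = visual_distance
    return dist

# Example: 4 late text queries over a long context whose first 576 keys
# are image patch tokens.
q_pos = torch.arange(8000, 8004)
k_pos = torch.arange(0, 8004)
visual_mask = torch.zeros(8004, dtype=torch.bool)
visual_mask[:576] = True
dist = relative_distances(q_pos, k_pos, visual_mask)
```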
Unified Multimodal Models struggle to bridge the fundamental gap between the abstract representations needed for visual understanding and the detailed primitives required for generation. Existing appr...
Despite the remarkable capabilities of Multimodal Large Language Models (MLLMs), they still suffer from visual fading in long-context scenarios. Specifically, the attention to visual tokens diminishes...
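Visual fading, as described here, is measurable. One simple probe (a hypothetical diagnostic, not taken from the paper) is the share of softmaxed attention mass each query places on visual keys, tracked as the context grows:

```python
import torch

def visual_attention_share(attn, visual_mask):
    """attn: [heads, q_len, k_len] softmaxed attention weights.
    visual_mask: bool [k_len], True where the key is a visual token.
    Returns, per query, the fraction of attention landing on visual
    tokens, averaged over heads; a drop at long context indicates fading."""
    return attn[..., visual_mask].sum(dim=-1).mean(dim=0)
```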
Multimodal language models (MLMs) perform well on semantic vision-language tasks but fail at spatial reasoning that requires adopting another agent's visual perspective. These errors reflect a persist...
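"Adopting another agent's visual perspective" has a simple geometric core, sketched below as a 2D world-to-agent coordinate change. This is an illustrative formalization of the task, not the mechanism of the work summarized above.

```python
import numpy as np

def to_agent_frame(points, agent_pos, agent_yaw):
    """Re-express world-frame 2D points [n, 2] in a frame centered at
    agent_pos and rotated to the agent's heading agent_yaw (radians).
    In the returned frame, +x is the agent's forward direction."""
    c, s = np.cos(agent_yaw), np.sin(agent_yaw)
    R = np.array([[c, s], [-s, c]])  # rotation taking world axes to agent axes
    return (np.asarray(points) - np.asarray(agent_pos)) @ R.T
```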
Large Multimodal Models (LMMs) struggle to adapt to varying computational budgets due to numerous visual tokens. Previous methods attempted to reduce the number of visual tokens before or within LLMs. Ho...
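The "previous methods" alluded to here typically rank visual tokens by an importance score and drop the rest. A generic top-k baseline (not this paper's proposal; the scoring function and names are placeholders) looks like this:

```python
import torch

def prune_visual_tokens(visual_feats, scores, budget):
    """Keep the `budget` highest-scoring visual tokens.
    visual_feats: [n, d] token features; scores: [n] importance scores
    (e.g. attention received from a [CLS]/text token, or feature norm)."""
    keep = scores.topk(budget).indices.sort().values  # preserve original order
    return visual_feats[keep]
```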
Unified multimodal models (UMMs) that integrate understanding, reasoning, generation, and editing face inherent trade-offs between maintaining strong semantic comprehension and acquiring powerful gene...
Multimodal large language models (MLLMs) have achieved impressive performance across various tasks such as image captioning and visual question answering (VQA); however, they often struggle to accurately ...
Current research in multimodal models faces a key challenge where enhancing generative capabilities often comes at the expense of understanding, and vice versa. We analyze this trade-off and identify...
We present Innovator-VL, a scientific multimodal large language model designed to advance understanding and reasoning across diverse scientific domains while maintaining excellent performance on gener...
Recent multimodal models such as Contrastive Language-Image Pre-training (CLIP) have shown remarkable ability to align visual and linguistic representations. However, domains where small visual differ...
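CLIP's alignment comes from a symmetric contrastive objective over paired image/text embeddings, shown below in its standard form. One hedged reading of the fine-grained failure noted here is that this loss only requires separating pairs within a batch, not resolving small visual differences.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings ([n, d] each);
    matched image/text pairs sit on the diagonal of the logit matrix."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature  # [n, n] cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```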
Despite the exceptional reasoning capabilities of Multimodal Large Language Models (MLLMs), their adaptation into universal embedding models is significantly impeded by task conflict. To address this,...