Multimodal LLMs

Proof pending

25 papers

6.7 viability

-87% (30d)

Proof pending. Core topic summary fields are still materializing.

State of the Field

Recent work on multimodal large language models (MLLMs) targets known weaknesses in visual reasoning and cognitive inference. Several new frameworks improve how models integrate visual and textual information without extensive retraining. Inertia-aware Visual Excitation, for instance, aims to mitigate cognitive hallucinations by dynamically adjusting visual attention, while MicroWorld uses multimodal attributed property graphs to strengthen scientific reasoning in specialized domains such as microscopy. New benchmarks such as MMTR-Bench evaluate how well MLLMs reconstruct masked text from visual context. Beyond accuracy gains, this work has potential applications in urban planning, historical analysis, and biomedical research, where precise visual and contextual understanding is essential.

Last updated May 26, 2026

Topic trend

Topic-specific paper and score movement from the daily diff ledger.

Papers

1-10 of 25

Research Paper·Apr 13, 2026

Enhancing Multimodal Large Language Models for Ancient Chinese Character Evolution Analysis via Glyph-Driven Fine-Tuning

In recent years, rapid advances in Multimodal Large Language Models (MLLMs) have increasingly stimulated research on ancient Chinese scripts. As the evolution of written characters constitutes a funda...

8.0 viability·Has code

Research Paper·Apr 22, 2026

Leveraging Multimodal LLMs for Built Environment and Housing Attribute Assessment from Street-View Imagery

We present a novel framework for automatically evaluating building conditions nationwide in the United States by leveraging large language models (LLMs) and Google Street View (GSV) imagery. By fine-t...

8.0 viability

Research Paper·Apr 7, 2026

Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models

Diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive (AR) LLMs. Recently, this paradigm has been extended to multimodal tasks, leading to the development of...

8.0 viability

Research Paper·Apr 17, 2026

Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs

Multimodal Reasoning Models (MRMs) leveraging Chain-of-Thought (CoT) based thinking have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with gen...

7.0 viability

Research Paper·May 11, 2026

MicroWorld: Empowering Multimodal Large Language Models to Bridge the Microscopic Domain Gap with Multimodal Attribute Graph

Multimodal large language models (MLLMs) show remarkable potential for scientific reasoning, yet their performance in specialized domains such as microscopy remains limited by the scarcity of domain-s...

7.0 viability·Has code

Research Paper·Apr 9, 2026

Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization

Multimodal reasoning models (MRMs) trained with reinforcement learning with verifiable rewards (RLVR) show improved accuracy on visual reasoning benchmarks. However, we observe that accuracy gains oft...

7.0 viability

Research Paper·Apr 1, 2026

Look Twice: Training-Free Evidence Highlighting in Multimodal Large Language Models

Answering questions about images often requires combining visual understanding with external knowledge. Multimodal Large Language Models (MLLMs) provide a natural framework for this setting, but they ...

7.0 viability

Research Paper·Apr 3, 2026

Token Warping Helps MLLMs Look from Nearby Viewpoints

Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? While MLLMs perform well on visual reasoning, they remain ...

7.0 viability

Research Paper·Apr 3, 2026

CharTool: Tool-Integrated Visual Reasoning for Chart Understanding

Charts are ubiquitous in scientific and financial literature for presenting structured data. However, chart reasoning remains challenging for multimodal large language models (MLLMs) due to the lack o...

7.0 viability

Research Paper·Apr 23, 2026

Can MLLMs "Read" What is Missing?

We introduce MMTR-Bench, a benchmark designed to evaluate the intrinsic ability of Multimodal Large Language Models (MLLMs) to reconstruct masked text directly from visual context. Unlike conventional...

7.0 viability
