22 papers - avg viability 5.8
Vision-Language Models (VLMs) are becoming more efficient and better at reasoning, in part through bio-inspired techniques and adaptive sampling strategies. Recent work improves visual representation with methods such as training-free adaptive visual representations and dynamic feature modulation, which let VLMs process visual information more selectively. These methods target three recurring problems: computational inefficiency, redundant visual tokens, and the need for better alignment between visual and linguistic data. Frameworks that support real-time reasoning and robust domain adaptation matter most to builders deploying VLMs in practice, particularly in autonomous driving and complex visual reasoning tasks. As VLMs handle more diverse visual inputs and reasoning requirements, they become usable in a wider range of applications.
A VLM benchmark and dataset for understanding UML class diagrams, with a LoRA-based fine-tune outperforming existing models.
A fine-grained quantization strategy for large vision-language models that improves accuracy while reducing computational overhead.
Penguin-VL offers a lightweight, high-fidelity VLM solution for resource-constrained devices, outperforming leading VLMs in key tasks by leveraging a novel LLM-based vision encoder.
VLMs fine-tuned on a new ambiguous VQA dataset to generate strategic responses: they learn to recognize ambiguity, manage uncertainty, and respond with context-appropriate strategies.
A diagnostic benchmark for vision-language models to identify image distortions, revealing significant weaknesses in low-level perception.
LensVLM selectively expands compressed visual representations of text, enabling VLMs to maintain accuracy at significantly higher compression ratios.
A lightweight vision-language model for autonomous devices, optimized for local and edge deployment, with strong performance on Japanese and English benchmarks for tasks such as tool recognition and anomaly detection.
A method for quantifying the human visual exposome by coupling ecological momentary assessment with vision-language models, used to decode how the visible world is associated with mental health.
Perceptio enhances vision-language models with explicit 2D and 3D spatial reasoning by generating semantic segmentation and depth tokens directly within the autoregressive sequence.
Parallel-ICL speeds up inference in large vision-language models while maintaining accuracy through context chunking.