Recent developments in vision-language models (VLMs) increasingly focus on efficiency and interpretability while addressing real-world deployment challenges. Researchers are exploring compact architectures that retain strong performance without the computational overhead typical of large models; recent work, for instance, introduces post-training quantization strategies that let VLMs run effectively in resource-constrained environments. Tools like VisualScratchpad aim to improve model transparency, letting users analyze visual concepts during inference and identify failure modes. The same shift toward fine-grained analysis and adaptive response strategies shows up in new benchmarks that categorize ambiguity in visual question answering, pushing VLMs to manage uncertainty more gracefully. Collectively, these advances improve the usability of VLMs across domains including agriculture and industrial applications, and pave the way for more robust, reliable AI systems capable of nuanced reasoning and contextual understanding.
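To make the quantization point concrete, here is a minimal post-training dynamic-quantization sketch in PyTorch. The `TinyVLM` module is a hypothetical stand-in, not any of the models surveyed below; real VLM quantization pipelines typically involve calibration data and per-layer tuning beyond this one-liner.

```python
# Minimal post-training dynamic quantization sketch (PyTorch).
# TinyVLM is an illustrative stand-in whose heavy layers are nn.Linear.
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Toy stand-in: projects visual features to a text-logit space."""
    def __init__(self, vis_dim: int = 512, hid: int = 256, vocab: int = 1000):
        super().__init__()
        self.proj = nn.Linear(vis_dim, hid)
        self.head = nn.Linear(hid, vocab)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        return self.head(torch.relu(self.proj(visual_feats)))

model = TinyVLM().eval()

# Dynamic quantization rewrites nn.Linear weights to int8 and quantizes
# activations on the fly -- no calibration pass required.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

feats = torch.randn(1, 512)
print(quantized(feats).shape)  # torch.Size([1, 1000])
```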
Vision Language Model (VLM) development has largely relied on scaling model size, which hinders deployment on compute-constrained mobile and edge devices such as smartphones and robots. In this work, ...
Large Vision Language Models (LVLMs) have achieved remarkable success in a range of downstream tasks that require multimodal interaction, but their capabilities come with substantial computational and...
Visual Question Answering (VQA) is a core task for evaluating the capabilities of Vision-Language Models (VLMs). Existing VQA benchmarks primarily feature clear and unambiguous image-question pairs, w...
Large Vision Language Models show impressive performance across image and video understanding tasks, yet their computational cost grows rapidly with the number of visual tokens. Existing token pruning...
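As a rough illustration of the kind of visual-token pruning this line of work addresses, the sketch below keeps the top-k visual tokens ranked by how strongly a summary (CLS-style) token attends to them. The scoring criterion and `keep_ratio` are illustrative assumptions; published pruning methods use more refined importance measures.

```python
# Illustrative top-k visual-token pruning by attention score.
import torch

def prune_visual_tokens(
    visual_tokens: torch.Tensor,   # (batch, n_tokens, dim)
    cls_attention: torch.Tensor,   # (batch, n_tokens): CLS -> token attention
    keep_ratio: float = 0.25,
) -> torch.Tensor:
    n_keep = max(1, int(visual_tokens.shape[1] * keep_ratio))
    # Keep the tokens the summary token attends to most strongly.
    topk = cls_attention.topk(n_keep, dim=1).indices                 # (batch, n_keep)
    idx = topk.unsqueeze(-1).expand(-1, -1, visual_tokens.shape[-1])
    return visual_tokens.gather(1, idx)                              # (batch, n_keep, dim)

tokens = torch.randn(2, 576, 1024)  # e.g., a 24x24 patch grid
attn = torch.rand(2, 576)
print(prune_visual_tokens(tokens, attn).shape)  # torch.Size([2, 144, 1024])
```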
Large Vision Language Models (LVLMs) excel at semantic understanding but struggle with fine-grained spatial grounding, as the model must implicitly infer complex geometry without ever producing a spat...
Large vision-language models (LVLMs) employ multi-modal in-context learning (MM-ICL) to adapt to new tasks by leveraging demonstration examples. While increasing the number of demonstrations boosts pe...
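As a minimal sketch of how MM-ICL demonstrations might be assembled into an interleaved image-text prompt: the `{"type": ..., ...}` message format below is a generic assumption, not the prompt template of any specific model in this work.

```python
# Sketch: build an interleaved multi-modal in-context prompt from demos.
from dataclasses import dataclass

@dataclass
class Demo:
    image_path: str
    question: str
    answer: str

def build_mm_icl_prompt(
    demos: list[Demo], query_image: str, query_question: str
) -> list[dict]:
    messages: list[dict] = []
    for d in demos:
        # Each demonstration contributes an image followed by its Q/A text.
        messages.append({"type": "image", "path": d.image_path})
        messages.append({"type": "text", "text": f"Q: {d.question}\nA: {d.answer}"})
    # The query repeats the pattern but leaves the answer open.
    messages.append({"type": "image", "path": query_image})
    messages.append({"type": "text", "text": f"Q: {query_question}\nA:"})
    return messages

demos = [Demo("cat.jpg", "What animal is shown?", "A cat.")]
prompt = build_mm_icl_prompt(demos, "dog.jpg", "What animal is shown?")
print(len(prompt))  # 4 interleaved segments
```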
This paper introduces a synthetic benchmark to evaluate the performance of vision language models (VLMs) in generating plant simulation configurations for digital twins. While functional-structural pl...
The ability to distinguish subtle differences between visually similar images is essential for diverse domains such as industrial anomaly detection, medical imaging, and aerial surveillance. While com...
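One naive baseline for this fine-grained comparison task is to featurize both images patch by patch and rank patches by how much they differ. The sketch below mean-pools raw pixels as a stand-in "encoder" purely so it runs self-contained; any ViT-style backbone could be substituted for `patch_features`.

```python
# Sketch: localize the patches that differ most between two similar images.
import torch

def patch_features(image: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Naive 'encoder': mean-pool raw pixels per patch -> (n_patches, channels)."""
    c, _, _ = image.shape
    patches = image.unfold(1, patch, patch).unfold(2, patch, patch)
    return patches.mean(dim=(-1, -2)).reshape(c, -1).T

def most_changed_patches(img_a: torch.Tensor, img_b: torch.Tensor, k: int = 4):
    fa, fb = patch_features(img_a), patch_features(img_b)
    dist = (fa - fb).norm(dim=-1)   # per-patch L2 difference
    return dist.topk(k).indices     # indices of the k most-changed patches

a = torch.rand(3, 224, 224)
b = a.clone()
b[:, 100:120, 100:120] = 0.0        # inject a small local difference
print(most_changed_patches(a, b))   # flags patches around the injected region
```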
High-performing vision language models still produce incorrect answers, yet their failure modes are often difficult to explain. To make model internals more accessible and enable systematic debugging,...
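In the spirit of such debugging tools, a common interpretability building block is a linear concept probe over hidden states. The sketch below is not VisualScratchpad's API (which the excerpt does not specify); it uses synthetic activations with a planted signal only to show the probing pattern.

```python
# Sketch: linear probe testing whether a layer's hidden states encode a concept.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Pretend hidden states: 200 examples x 768 dims, labeled "contains a dog".
hidden = rng.normal(size=(200, 768))
labels = rng.integers(0, 2, size=200)
hidden[labels == 1, :8] += 1.5  # plant a weak, linearly recoverable signal

probe = LogisticRegression(max_iter=1000).fit(hidden[:150], labels[:150])
print(f"probe accuracy: {probe.score(hidden[150:], labels[150:]):.2f}")
# Accuracy well above chance suggests the layer linearly encodes the concept;
# chance-level accuracy points to a layer (or model) that does not.
```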
The safety and reliability of vision-language models (VLMs) are crucial to deploying trustworthy agentic AI systems. However, VLMs remain vulnerable to jailbreaking attacks that undermine their...