Recent advances in video understanding focus on improving models' ability to interpret complex visual data through better frameworks and methodologies. Benchmarks such as MultiAgent-EgoQA push models to process multiple egocentric videos simultaneously, a capability central to collaborative AI agents operating in real-world settings. Techniques such as Mask-to-Point learning refine visual foundation models to track dense points across video frames more reliably in dynamic environments. Models like SPARROW, which integrate spatial and temporal reasoning, and frameworks like HAVEN tackle coherence and context in long videos, with clear benefits for sectors such as entertainment and surveillance. Together, these efforts improve model performance and open the way to more efficient, practical applications in industries that depend on video data analysis.
Long video understanding remains challenging for Multi-modal Large Language Models (MLLMs) due to high memory costs and context-length limits. Prior approaches mitigate this by scoring and selecting frames...
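A minimal sketch of the scoring-and-selection idea this abstract describes, assuming query-conditioned cosine similarity as the frame score; the function name, the heuristic, and the budget are illustrative choices, not this paper's method:

```python
import numpy as np

def select_frames(frame_features: np.ndarray, query_feature: np.ndarray, budget: int):
    """Score frames by cosine similarity to the query and keep the top-`budget`.

    frame_features: (T, D) array of per-frame embeddings.
    query_feature:  (D,) embedding of the user question.
    Returns indices of selected frames in temporal order.
    """
    # Cosine similarity between each frame and the query.
    f = frame_features / np.linalg.norm(frame_features, axis=1, keepdims=True)
    q = query_feature / np.linalg.norm(query_feature)
    scores = f @ q
    # Keep the highest-scoring frames, then restore temporal order
    # so the MLLM sees them as a coherent subsequence.
    keep = np.argsort(scores)[-budget:]
    return np.sort(keep)

# Usage: 1,000 frames, keep 32 within the model's context budget.
feats = np.random.randn(1000, 512).astype(np.float32)
query = np.random.randn(512).astype(np.float32)
print(select_frames(feats, query, budget=32))
```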
Multimodal large language models (MLLMs) have advanced from image-level reasoning to pixel-level grounding, but extending these capabilities to videos remains challenging as models must achieve spatia...
We introduce Thinking with Spatial Code, a framework that transforms RGB video into explicit, temporally coherent 3D representations for physical-world visual question answering. We highlight the empi...
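Building an explicit 3D representation from video typically starts by lifting pixels into 3D; the sketch below assumes per-frame depth maps and known pinhole intrinsics (neither is stated in the abstract) and shows only the standard back-projection step:

```python
import numpy as np

def backproject_to_points(depth: np.ndarray, fx: float, fy: float,
                          cx: float, cy: float) -> np.ndarray:
    """Lift a depth map (H, W) to an (H*W, 3) point cloud in camera coordinates.

    Uses the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Usage: one 480x640 frame with synthetic depth and typical intrinsics.
depth = np.full((480, 640), 2.0, dtype=np.float32)  # 2 m everywhere
points = backproject_to_points(depth, fx=525.0, fy=525.0, cx=320.0, cy=240.0)
print(points.shape)  # (307200, 3)
```

Registering per-frame clouds into a shared world frame via camera poses would then supply the temporal coherence the abstract mentions.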
Video large language models have demonstrated remarkable capabilities in video understanding tasks. However, the redundancy of video tokens introduces significant computational overhead during inference...
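One common way to exploit that redundancy is similarity-based token dropping; the sketch below is a generic illustration under that assumption, with the threshold and the keep-last-token comparison chosen for clarity rather than taken from the paper:

```python
import numpy as np

def prune_redundant_tokens(tokens: np.ndarray, threshold: float = 0.95) -> np.ndarray:
    """Drop video tokens whose cosine similarity to the last kept token
    exceeds `threshold`, removing temporal redundancy before the LLM.

    tokens: (N, D) sequence of visual tokens in temporal order.
    Returns the kept (M, D) tokens, M <= N.
    """
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    kept = [0]  # always keep the first token
    for i in range(1, len(tokens)):
        if normed[i] @ normed[kept[-1]] < threshold:
            kept.append(i)
    return tokens[kept]

tokens = np.random.randn(2048, 256).astype(np.float32)
print(prune_redundant_tokens(tokens).shape)
```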
Tracking Any Point (TAP) has emerged as a fundamental tool for video understanding. Current approaches adapt Vision Foundation Models (VFMs) like DINOv2 via offline finetuning or test-time optimization...
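For intuition, frozen VFM features can already support a crude TAP baseline via nearest-neighbor feature matching between frames; the sketch below uses random stand-in features (a real system would use DINOv2 patch features) and is not the adaptation scheme this abstract refers to:

```python
import numpy as np

def track_point(feat_t: np.ndarray, feat_t1: np.ndarray, pt: tuple) -> tuple:
    """Track one query point from frame t to t+1 by nearest-neighbor matching
    in a dense feature map (H, W, D), e.g. patch features from a VFM.
    """
    y, x = pt
    query = feat_t[y, x]                      # (D,) feature at the query point
    h, w, d = feat_t1.shape
    flat = feat_t1.reshape(-1, d)
    # Cosine similarity between the query feature and every location in t+1.
    sims = (flat @ query) / (
        np.linalg.norm(flat, axis=1) * np.linalg.norm(query) + 1e-8)
    idx = int(sims.argmax())
    return idx // w, idx % w                  # (row, col) of the best match

# Usage with random stand-in features.
f0 = np.random.randn(32, 32, 384).astype(np.float32)
f1 = np.random.randn(32, 32, 384).astype(np.float32)
print(track_point(f0, f1, (10, 12)))
```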
As embodied models grow more powerful, humans will increasingly collaborate with multiple embodied AI agents at work and at home. To ensure better communication between human users and the multi-...
Multimodal Large Language Models (MLLMs) have shown strong performance on Video Temporal Grounding (VTG). However, their coarse recognition capabilities are insufficient for fine-grained temporal understanding...
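Fine-grained temporal grounding can be posed as finding the frame span most relevant to a text query; this brute-force search over per-frame relevance scores is an illustrative baseline, not the paper's model:

```python
import numpy as np

def ground_query(frame_scores: np.ndarray, min_len: int = 2) -> tuple:
    """Return the (start, end) frame span maximizing mean relevance score,
    a brute-force formulation of temporal grounding.

    frame_scores: (T,) per-frame relevance of the text query.
    """
    t = len(frame_scores)
    best, best_span = -np.inf, (0, min_len - 1)
    prefix = np.concatenate([[0.0], np.cumsum(frame_scores)])
    for s in range(t):
        for e in range(s + min_len - 1, t):
            mean = (prefix[e + 1] - prefix[s]) / (e - s + 1)
            if mean > best:
                best, best_span = mean, (s, e)
    return best_span

scores = np.array([0.1, 0.2, 0.9, 0.95, 0.8, 0.1], dtype=np.float32)
print(ground_query(scores))  # (2, 3)
```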
Video Large Language Models (VLMs) have achieved remarkable success in video understanding, but the significant computational cost from processing dense frames severely limits their practical applicat...
Recent advances in Streaming Video Understanding have enabled a new interaction paradigm where models respond proactively to user queries. Current proactive VideoLLMs rely on per-frame triggering decisions...
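Per-frame triggering can be sketched as a thresholded relevance test on each incoming frame embedding; the scoring function and threshold below are assumptions for illustration, not this paper's mechanism:

```python
import numpy as np

def streaming_responder(frame_stream, query_feat, threshold=0.8):
    """Proactive per-frame triggering: emit a response as soon as an incoming
    frame's relevance to the standing query crosses `threshold`.

    frame_stream: iterable of (D,) frame embeddings arriving over time.
    """
    q = query_feat / np.linalg.norm(query_feat)
    for t, feat in enumerate(frame_stream):
        score = float((feat / np.linalg.norm(feat)) @ q)
        if score >= threshold:
            yield t, score  # trigger: hand off to the responder here

# Usage with random embeddings; the threshold is set low so the demo fires.
stream = (np.random.randn(256).astype(np.float32) for _ in range(100))
query = np.random.randn(256).astype(np.float32)
for t, s in streaming_responder(stream, query, threshold=0.1):
    print(f"respond at frame {t} (score {s:.2f})")
    break
```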
Multimodal Large Language Models have demonstrated remarkable capabilities in video understanding, yet face prohibitive computational costs and performance degradation from "context rot" due to mass...