Current computer vision research increasingly targets robustness and efficiency through multimodal approaches and new architectures. Recent work on road surface classification illustrates this trend: it combines image and inertial data to improve performance across diverse environmental conditions, a critical need in predictive maintenance systems. Similarly, cross-modal learning is being applied to ship re-identification and plankton recognition, exploiting the complementary strengths of different sensing modalities while reducing reliance on large labeled datasets. Rotation-equivariant architectures are also notable, since they aim to make models resilient to geometric transformations, which is essential in real-world applications. Advances in real-time segmentation and aberration correction reflect the push toward practical, deployment-ready solutions that run efficiently on edge devices. Collectively, these developments indicate a shift toward more adaptable, data-efficient systems for complex real-world problems across domains.
Cross-modal ship re-identification (ReID) between optical and synthetic aperture radar (SAR) imagery is fundamentally challenged by the severe radiometric discrepancy between passive optical imaging a...
Road surface classification (RSC) is a key enabler for environment-aware predictive maintenance systems. However, existing RSC techniques often fail to generalize beyond narrow operational conditions ...
We introduce WAFT-Stereo, a simple and effective warping-based method for stereo matching. WAFT-Stereo demonstrates that cost volumes, a common design used in many leading methods, are not necessary f...
Automatic identification of screw types is important for industrial automation, robotics, and inventory management. However, publicly available datasets for screw classification are scarce, particular...
Quantifying human movement (kinematics) and musculoskeletal forces (kinetics) at scale, such as estimating quadriceps force during a sit-to-stand movement, could transform prediction, treatment, and m...
Rotation equivariance constitutes one of the most general and crucial structural priors for visual data, yet it remains notably absent from current Mamba-based vision architectures. Despite the succes...
This paper considers self-supervised cross-modal coordination as a strategy enabling utilization of multiple modalities and large volumes of unlabeled plankton data to build models for plankton recogn...
Vision-as-inverse-graphics, the concept of reconstructing an image as an editable graphics program, is a long-standing goal of computer vision. Yet even strong VLMs aren't able to achieve this in one-s...
Pre-trained perception models excel in generic image domains but degrade significantly in novel environments like indoor scenes. The conventional remedy is fine-tuning on downstream data, which incurs ...
Cross-View object geo-localization (CVOGL) aims to precisely determine the geographic coordinates of a query object from a ground or drone perspective by referencing a satellite map. Segmentation-base...