Recent research on Vision Transformers is increasingly focused on enhancing their adaptability and efficiency, addressing challenges such as computational demands and data scarcity. Techniques like AdapterTune introduce low-rank adapters that stabilize optimization in frozen models, significantly improving accuracy while minimizing parameter usage. Meanwhile, methods like Adaptive MLP Pruning and Hierarchical Auto-Pruning streamline model architectures by dynamically reducing unnecessary parameters without compromising performance, making these models more feasible for deployment on resource-constrained devices. Innovations such as the Channel-Aware Vision Transformer enhance feature fusion through dynamic attention mechanisms, improving representational expressiveness. Additionally, frameworks like Semi-Supervised Masked Autoencoders leverage abundant unlabeled data to boost performance in low-label scenarios, showcasing the potential for efficient training strategies. Overall, the field is shifting towards more flexible, resource-efficient models that can adapt to various tasks and environments, paving the way for broader applications in industries ranging from healthcare to autonomous systems.
Frozen-backbone transfer with Vision Transformers faces two under-addressed issues: optimization instability when adapters are naively inserted into a fixed feature extractor, and the absence of princ...
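The abstract above describes inserting low-rank adapters into a frozen feature extractor. As a rough illustration of that idea (not AdapterTune's actual formulation, which the excerpt does not detail), a low-rank residual update can be sketched in numpy; the zero-initialization of the up-projection, a common stabilization trick that makes the adapted model start as an identity over the frozen features, is an assumption here:

```python
import numpy as np

def lowrank_adapter(h, A, B, alpha=1.0):
    """Residual low-rank update: h + alpha * (h @ A) @ B.

    h: (n_tokens, d) features from the frozen backbone
    A: (d, r) down-projection, B: (r, d) up-projection, with rank r << d
    Only A and B are trained; the backbone stays fixed.
    """
    return h + alpha * (h @ A) @ B

rng = np.random.default_rng(0)
d, r, n = 64, 4, 10
h = rng.standard_normal((n, d))

# Zero-initializing B makes the adapter a no-op at the start of training,
# so optimization begins from the unmodified frozen features.
A = rng.standard_normal((d, r)) / np.sqrt(d)
B = np.zeros((r, d))

out = lowrank_adapter(h, A, B)
print(np.allclose(out, h))  # True: identity at initialization
```

The parameter cost is 2*d*r per adapter instead of d*d for a full linear layer, which is where the parameter savings come from.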
Large vision transformers exhibit impressive scalability, as their performance improves substantially with increased model capacity. Nevertheless, their cumbersome parameter counts result in exorbitant compu...
Vision Transformers (ViTs) have demonstrated strong performance across a range of computer vision tasks by modeling long-range spatial interactions via self-attention. However, channel-wise mixing in ...
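The excerpt mentions dynamic attention over channels for feature fusion. One common instantiation of channel attention is a squeeze-and-excitation-style gate; the sketch below uses that pattern purely for illustration, since the Channel-Aware Vision Transformer's actual mechanism is truncated in the abstract. The bottleneck shapes and sigmoid gate are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_gate(x, W1, W2):
    """Squeeze-and-excitation-style channel attention.

    x: (n_tokens, d) token features
    W1: (d, d//r), W2: (d//r, d) bottleneck MLP weights
    Returns x rescaled per channel by learned gates in (0, 1).
    """
    s = x.mean(axis=0)                          # squeeze: average over tokens
    g = sigmoid(np.maximum(s @ W1, 0.0) @ W2)   # excite: ReLU bottleneck MLP
    return x * g                                # gate broadcast over tokens

rng = np.random.default_rng(1)
n, d, r = 16, 32, 4
x = rng.standard_normal((n, d))
W1 = rng.standard_normal((d, d // r)) * 0.1
W2 = rng.standard_normal((d // r, d)) * 0.1
y = channel_gate(x, W1, W2)
print(y.shape)  # (16, 32)
```

Because the gate is computed from a global pooled summary, it reweights channels based on the whole token set, which is the sense in which such mixing is "dynamic" rather than fixed per layer.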
Vision Transformers require significant computational resources and memory bandwidth, severely limiting their deployment on edge devices. While recent structured pruning methods successfully reduce th...
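Structured pruning, as referenced above, removes whole units rather than individual weights so the resulting matrices shrink and deliver real speedups on edge hardware. The following is a generic magnitude-based sketch of pruning an MLP block's hidden units, not the specific method of the paper; the L2-norm importance score and keep ratio are illustrative assumptions:

```python
import numpy as np

def prune_mlp_neurons(W_in, W_out, keep_ratio=0.5):
    """Structured pruning of an MLP block's hidden units.

    W_in: (d, h) first linear layer, W_out: (h, d) second layer.
    Hidden units are ranked by the L2 norm of their incoming weights;
    only the strongest fraction is kept, shrinking both matrices.
    """
    h = W_in.shape[1]
    n_keep = max(1, int(h * keep_ratio))
    scores = np.linalg.norm(W_in, axis=0)          # importance per hidden unit
    keep = np.sort(np.argsort(scores)[-n_keep:])   # top-k, original order
    return W_in[:, keep], W_out[keep, :]

rng = np.random.default_rng(2)
d, h = 8, 32
W_in = rng.standard_normal((d, h))
W_out = rng.standard_normal((h, d))
W_in_p, W_out_p = prune_mlp_neurons(W_in, W_out, keep_ratio=0.25)
print(W_in_p.shape, W_out_p.shape)  # (8, 8) (8, 8)
```

Unlike unstructured sparsity, both pruned matrices here are genuinely smaller dense arrays, so memory and compute drop without requiring sparse-kernel support.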
We address the challenge of training Vision Transformers (ViTs) when labeled data is scarce but unlabeled data is abundant. We propose Semi-Supervised Masked Autoencoder (SSMAE), a framework that join...
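SSMAE builds on masked autoencoding; the details beyond joint training are truncated above, so the sketch below shows only the generic MAE-style random masking step that such frameworks share. The 75% mask ratio and patch-grid shapes are illustrative assumptions:

```python
import numpy as np

def random_mask(patches, mask_ratio=0.75, rng=None):
    """MAE-style random masking: keep a random subset of patch tokens.

    patches: (n_patches, d) patch embeddings.
    Returns (visible, mask) where mask[i] is True for patches that are
    hidden from the encoder and must be reconstructed by the decoder.
    """
    if rng is None:
        rng = np.random.default_rng()
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    keep = rng.permutation(n)[:n_keep]
    mask = np.ones(n, dtype=bool)
    mask[keep] = False
    return patches[keep], mask

rng = np.random.default_rng(3)
patches = rng.standard_normal((196, 64))   # 14x14 grid of patch embeddings
visible, mask = random_mask(patches, mask_ratio=0.75, rng=rng)
print(visible.shape, int(mask.sum()))  # (49, 64) 147
```

Because the reconstruction target comes from the image itself, this objective needs no labels, which is why it pairs naturally with a small supervised signal in low-label regimes.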
Vision Transformer (ViT)-based vision foundation models (VFMs) have achieved remarkable performance across diverse vision tasks, but suffer from quadratic complexity that limits scalability to long ...
Vision transformers (ViTs), especially feature foundation models like DINOv2, learn rich representations useful for many downstream tasks. However, architectural choices (such as positional encoding...