Recent advances in transformer architectures focus on efficiency and interpretability, two critical challenges for deployment and performance. Ultra-sparse embedding methods such as CSRv2 substantially reduce memory and computational costs, with reported speed and efficiency gains that matter for real-time applications. In parallel, frameworks like UAT-LITE tackle miscalibrated predictions in neural NLP models, adding uncertainty awareness without altering pretrained weights and thereby improving reliability in high-stakes settings. Innovations such as RASA address the relational bottleneck in transformers, enabling better multi-hop reasoning by incorporating relational structure into the attention mechanism. Together, these developments point toward more practical, deployable AI systems that balance performance with resource efficiency, positioning transformers to handle complex tasks across domains from natural language processing to structured data analysis.
In the era of large foundation models, the quality of embeddings has become a central determinant of downstream task performance and overall system capability. Yet widely used dense embeddings are oft...
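The abstract cuts off here, but the memory argument behind ultra-sparse embeddings is easy to make concrete. A back-of-the-envelope comparison of a dense float32 embedding table against a CSR-style layout (values, column indices, row pointers), with illustrative sizes that are not taken from the paper:

```python
# Toy comparison: memory for a dense embedding table vs. a CSR-style
# sparse layout. Sizes are hypothetical, chosen only for illustration.
vocab, dim, nnz_per_row = 50_000, 1024, 32

dense_bytes = vocab * dim * 4                                # float32 values
sparse_bytes = (vocab * nnz_per_row * (4 + 4)                # values + int32 column indices
                + (vocab + 1) * 4)                           # int32 row pointers
print(f"dense:  {dense_bytes / 1e6:.1f} MB")
print(f"sparse: {sparse_bytes / 1e6:.1f} MB")
print(f"ratio:  {dense_bytes / sparse_bytes:.1f}x")
```

With 32 nonzeros out of 1024 dimensions per row, the sparse layout is roughly an order of magnitude smaller, which is the kind of saving the summary above alludes to.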
Mechanistic interpretability typically relies on post-hoc analysis of trained networks. We instead adopt an interventional approach: testing hypotheses a priori by modifying architectural topology to ...
We introduce directional routing, a lightweight mechanism that gives each transformer attention head learned suppression directions controlled by a shared router, at 3.9% parameter cost. We train a 43...
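The abstract is truncated, so the mechanism's details are not available here. As one plausible reading, offered for intuition only: a "suppression direction" might mean a learned direction projected out of each head's output, scaled by a router gate in [0, 1]. A hypothetical sketch under that assumption, not the paper's actual construction:

```python
import numpy as np

def suppress(head_out, direction, gate):
    """Remove the component of head_out along `direction`, scaled by the
    router's gate. Both the projection and the gating are guesses at the
    mechanism; `direction` and `gate` would be learned in the real method."""
    d = direction / np.linalg.norm(direction)       # unit suppression direction
    return head_out - gate * (head_out @ d) * d     # gated projection removal

# gate=1 fully removes the component; gate=0 leaves the output unchanged.
out = suppress(np.array([1.0, 1.0]), np.array([1.0, 0.0]), gate=1.0)
print(out)  # component along the first axis is removed
```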
Neural NLP models are often miscalibrated, assigning high confidence to incorrect predictions, which undermines selective prediction and high-stakes deployment. Post-hoc calibration methods adjust out...
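For context on what "post-hoc calibration" typically means, temperature scaling is the canonical example (not necessarily what UAT-LITE builds on): a single scalar T rescales the logits before softmax, softening overconfident predictions without touching model weights.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

# Temperature scaling: dividing logits by T > 1 flattens the distribution,
# lowering the confidence of the top prediction while keeping its argmax.
logits = np.array([4.0, 1.0, 0.5])
for T in (1.0, 2.0):
    probs = softmax(logits / T)
    print(f"T={T}: top confidence {probs.max():.3f}")
```

The higher temperature reduces the top-class probability; fitting T on a held-out set is what makes this a post-hoc method, which is the class of approaches the abstract contrasts against.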
Rotary positional embeddings (RoPE) are widely used in large language models to encode token positions through multiplicative rotations, yet their behavior at long context lengths remains poorly chara...
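The rotation mechanism itself is standard and easy to sketch: each pair of dimensions is rotated by an angle proportional to the token's position, so dot products between rotated queries and keys depend only on their relative offset. A minimal NumPy sketch of this property:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to vector x at integer position pos.

    Dimension pairs (i, i + d/2) are rotated by angle pos * base**(-2i/d).
    """
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

# Relative-position property: attention scores between rotated queries and
# keys depend only on the position difference, not absolute positions.
rng_q, rng_k = np.random.default_rng(0), np.random.default_rng(1)
q, k = rng_q.normal(size=8), rng_k.normal(size=8)
s1 = rope(q, 5) @ rope(k, 3)        # offset -2
s2 = rope(q, 105) @ rope(k, 103)    # same offset, shifted by 100
print(np.isclose(s1, s2))
```

Because each rotation is orthogonal, norms are preserved and only relative offsets enter the score; the long-context behavior the abstract studies concerns how these rotation frequencies interact at positions far beyond training lengths.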
Transformers often display an attention sink: probability mass concentrates on a fixed, content-agnostic position. We prove that computing a simple trigger-conditional behavior necessarily induces a s...
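As a toy illustration of the claimed phenomenon (this construction is mine, not the paper's proof): a head can implement "respond only when a trigger appears" by parking probability mass on a fixed position whose key earns a constant, content-agnostic logit; content tokens only out-score it when the trigger is present.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

SINK_LOGIT = 2.0      # constant score for position 0, independent of content
TRIGGER_LOGIT = 6.0   # score a key earns only when it matches the trigger

def head_attention(trigger_present):
    # Logits over [sink, tok1, tok2, tok3]; ordinary tokens score near zero,
    # so absent the trigger, mass defaults to the fixed sink position.
    logits = np.array([SINK_LOGIT, 0.0, 0.0, 0.0])
    if trigger_present:
        logits[2] = TRIGGER_LOGIT
    return softmax(logits)

print(head_attention(False))  # mass concentrates on the sink position
print(head_attention(True))   # mass moves to the trigger token
```

The "sink" here is exactly a fixed, content-agnostic position attracting mass by default, which is the behavior the abstract formalizes.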
Transformers achieve remarkable performance across many domains, yet struggle with tasks requiring multi-hop relational reasoning over structured data. We analyze this limitation through circuit compl...
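To make "multi-hop relational reasoning over structured data" concrete, here is a hypothetical two-hop query of the kind such evaluations probe; the entities and relations are invented for illustration and do not come from the paper.

```python
# Two-hop query: "who manages the author of document d?"
# Answering requires composing two relations, not retrieving a single fact.
author_of = {"d1": "ana", "d2": "bo"}
manager_of = {"ana": "carla", "bo": "dan"}

def two_hop(doc):
    return manager_of[author_of[doc]]   # hop 1: author, hop 2: manager

print(two_hop("d1"))
```

A symbolic lookup composes the hops trivially; the abstract's circuit-complexity analysis concerns why a fixed-depth transformer has difficulty doing the same in-context.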