Recent advances in AI interpretability focus on improving the understanding of complex models through a range of new frameworks. One notable trend is the development of models that balance accuracy and interpretability, such as the Mixture of Concept Bottleneck Experts, which adapts its predictions to user needs. In addition, new methods for diagnosing generalization failures, inspired by medical biomarkers, provide robust indicators of model performance that could significantly improve deployment strategies in real-world applications. Techniques like AgentXRay reconstruct interpretable workflows from opaque systems, giving users more control over AI decisions. Frameworks such as Model Medicine promote a clinical approach to AI, emphasizing the systematic diagnosis and treatment of model disorders. Collectively, these efforts aim to bridge the gap between human understanding and machine learning, addressing commercial challenges in transparency and reliability across diverse applications.
Concept Bottleneck Models (CBMs) promote interpretability by grounding predictions in human-understandable concepts. However, existing CBMs typically fix their task predictor to a single linear or Boo...
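To make the concept-bottleneck idea concrete, the sketch below (an illustrative toy, not the Mixture of Concept Bottleneck Experts itself; all module names and dimensions are hypothetical) maps inputs to human-interpretable concept scores and lets a single linear task head predict from those scores alone, so every prediction can be traced back to its concepts.

```python
import torch
import torch.nn as nn

class SimpleCBM(nn.Module):
    """Minimal concept bottleneck: inputs -> concept scores -> label.

    The task head sees only the concept scores, so predictions can be
    attributed to human-interpretable concepts.
    """
    def __init__(self, n_features: int, n_concepts: int, n_classes: int):
        super().__init__()
        self.concept_encoder = nn.Linear(n_features, n_concepts)  # concept logits
        self.task_head = nn.Linear(n_concepts, n_classes)         # single linear predictor

    def forward(self, x: torch.Tensor):
        concepts = torch.sigmoid(self.concept_encoder(x))  # interpretable bottleneck
        logits = self.task_head(concepts)                  # prediction from concepts only
        return logits, concepts

model = SimpleCBM(n_features=64, n_concepts=10, n_classes=3)
logits, concepts = model(torch.randn(4, 64))
```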
The classic duck-rabbit illusion reveals that when visual evidence is ambiguous, the human brain must decide what it sees. But where exactly do human observers draw the line between "duck" and "rab...
Large Language Models have shown strong capabilities in complex problem solving, yet many agentic systems remain difficult to interpret and control due to opaque internal workflows. While some framewo...
Understanding how neural networks arrive at their predictions is essential for debugging, auditing, and deployment. Mechanistic interpretability pursues this goal by identifying circuits - minimal sub...
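As a rough illustration of how one might test whether a component belongs to a circuit (a generic ablation sketch over assumed toy modules, not the procedure from this abstract), the snippet below zeroes out a single hidden unit with a forward hook and measures how much the output shifts:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical two-layer network standing in for a larger model.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
x = torch.randn(32, 8)

def output_with_unit_ablated(model, x, unit_idx):
    """Run the model while zeroing one post-ReLU hidden unit via a forward hook."""
    def hook(_module, _inputs, out):
        out = out.clone()
        out[:, unit_idx] = 0.0  # ablate a single hidden unit
        return out
    handle = model[1].register_forward_hook(hook)  # hook on the ReLU's output
    try:
        return model(x)
    finally:
        handle.remove()

baseline = model(x)
for unit in range(16):
    effect = (baseline - output_with_unit_ablated(model, x, unit)).abs().mean().item()
    print(f"unit {unit:2d}: mean output change {effect:.4f}")
```

Units whose ablation barely moves the output are candidates for exclusion from a circuit; large changes suggest the unit is causally relevant to the behavior under study.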
Generalization, the ability to perform well beyond the training context, is a hallmark of biological and artificial intelligence, yet anticipating unseen failures remains a central challenge. Conventi...
Because deep neural networks (DNNs) are used pervasively, especially in high-stakes domains, their interpretability has received increasing attention. The general idea of rationale extraction ...
Vision-Language-Action (VLA) models have emerged as a promising approach for general-purpose robot manipulation. However, their generalization is inconsistent: while these models can perform impressiv...
Model Medicine is the science of understanding, diagnosing, treating, and preventing disorders in AI models, grounded in the principle that AI models -- like biological organisms -- have internal stru...
Interpreting the functionality (also known as concepts) of neurons is essential for understanding neural network decisions. Existing approaches describe neuron concepts by generating natural langua...
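A common precursor to such natural-language descriptions (a generic sketch, not this abstract's method; the encoder and data below are hypothetical) is to collect the inputs that most strongly activate a neuron and describe what they have in common:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical encoder and random inputs standing in for a real model and dataset.
encoder = nn.Sequential(nn.Linear(20, 32), nn.ReLU())
inputs = torch.randn(1000, 20)

neuron_idx = 7  # the neuron whose concept we want to characterize
with torch.no_grad():
    activations = encoder(inputs)[:, neuron_idx]

# Top-activating examples are the usual raw material for a concept description.
top_values, top_indices = activations.topk(k=5)
for rank, (idx, val) in enumerate(zip(top_indices.tolist(), top_values.tolist()), start=1):
    print(f"#{rank}: example {idx}, activation {val:.3f}")
```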
We study how individual training examples shape the internal computation of looped transformers, where a shared block is applied for $\tau$ recurrent iterations to enable latent reasoning. Existing train...
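To clarify the looped setup being studied, the minimal sketch below (hypothetical dimensions and block choice) applies one weight-shared transformer block for $\tau$ recurrent iterations to the same hidden state:

```python
import torch
import torch.nn as nn

class LoopedTransformer(nn.Module):
    """Apply a single shared transformer block for tau recurrent iterations."""
    def __init__(self, d_model: int = 64, n_heads: int = 4, tau: int = 4):
        super().__init__()
        # One block whose weights are reused at every iteration.
        self.shared_block = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.tau = tau

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        for _ in range(self.tau):  # latent reasoning via weight sharing
            h = self.shared_block(h)
        return h

model = LoopedTransformer()
hidden = model(torch.randn(2, 16, 64))  # (batch, sequence length, d_model)
```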