Current research in model interpretability focuses on understanding complex machine learning systems, particularly large language models and diffusion models. Recent work emphasizes methods that attribute model behavior to semantic features rather than individual data points, improving both scalability and explainability. For instance, new frameworks apply sparse autoencoders to diffusion language models to extract interpretable features, enabling more effective interventions than comparable approaches in autoregressive models. Zero-shot Shapley value estimation allows feature-importance assessment without direct model access, removing a significant barrier to real-world use. Work on structural sparsity in neural networks finds that sparse models are not inherently more interpretable, but can still yield useful insights when evaluated with comprehensive frameworks. Collectively, these advances aim to close the gap between model complexity and user understanding, strengthening trust and usability in commercial applications.
As large language models are increasingly trained and fine-tuned, practitioners need methods to identify which training data drive specific behaviors, particularly unintended ones. Training Data Attribution (TDA) methods address this need by tracing model behavior back to the training examples responsible for it.
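As a concrete illustration, here is a minimal sketch of one standard TDA technique, gradient-similarity attribution in the style of TracIn; it is not necessarily the method proposed in this work, and the function names and signatures are illustrative.

```python
import torch

def flat_grad(model, loss_fn, x, y):
    """Flatten the gradient of the loss on (x, y) w.r.t. all trainable parameters."""
    loss = loss_fn(model(x), y)
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def attribution_score(model, loss_fn, train_example, query_example):
    """Score a training example by the dot product of its loss gradient with
    the gradient of the behavior of interest (the query example). A large
    positive score suggests the example pushed the model toward that behavior."""
    g_train = flat_grad(model, loss_fn, *train_example)
    g_query = flat_grad(model, loss_fn, *query_example)
    return torch.dot(g_train, g_query).item()
```

In practice such scores are computed over many checkpoints and training examples, then ranked to surface the data most responsible for a behavior.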
Sparse autoencoders (SAEs) have become a standard tool for mechanistic interpretability in autoregressive large language models (LLMs), enabling researchers to extract sparse, human-interpretable features from internal activations.
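For reference, a minimal SAE of the kind used in this line of work looks like the sketch below: an overcomplete linear encoder/decoder trained to reconstruct activations under an L1 sparsity penalty. The architecture and the `l1_coeff` value are illustrative defaults, not those of any specific paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder trained to reconstruct model activations;
    the ReLU hidden code is the dictionary of candidate features."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = F.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)       # reconstruction of the activation
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # reconstruction error plus an L1 sparsity penalty on the feature code
    return F.mse_loss(x_hat, x) + l1_coeff * f.abs().sum(dim=-1).mean()
```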
Computing the importance of features in supervised classification tasks is critical for model interpretability. Shapley values are a widely used approach for explaining model predictions, but require repeated queries to the model to evaluate feature coalitions, which is infeasible when direct model access is unavailable.
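For context, the standard baseline that access-free methods improve on is Monte Carlo Shapley estimation via permutation sampling, which repeatedly queries a value function. A sketch follows; `value_fn`, which scores a coalition of features (e.g., the model's expected prediction with the remaining features marginalized out), is an assumed interface.

```python
import numpy as np

def shapley_mc(value_fn, n_features, n_samples=200, rng=None):
    """Permutation-sampling Monte Carlo estimate of Shapley values.
    value_fn(subset) -> payoff of the model restricted to `subset`."""
    if rng is None:
        rng = np.random.default_rng(0)
    phi = np.zeros(n_features)
    for _ in range(n_samples):
        perm = rng.permutation(n_features)
        subset, prev = [], value_fn(frozenset())
        for j in perm:
            subset.append(j)
            curr = value_fn(frozenset(subset))
            phi[j] += curr - prev   # marginal contribution of feature j
            prev = curr
    return phi / n_samples
```

Each sampled permutation requires `n_features` evaluations of `value_fn`, which is exactly the model-access cost that zero-shot estimation seeks to avoid.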
Sparse neural networks are often hypothesized to be more interpretable than dense models, motivated by findings that weight sparsity can produce compact circuits in language models. However, it remains unclear whether weight sparsity actually translates into interpretability gains in practice.
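As background, the kind of weight sparsity studied here can be produced with simple magnitude pruning; the sketch below uses PyTorch's pruning utilities and is illustrative of how such sparse models are obtained, not of the evaluation framework itself.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def magnitude_prune(model: nn.Module, amount: float = 0.9):
    """Zero out the smallest-magnitude weights in every Linear layer,
    yielding an unstructured-sparse network."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")  # bake the pruning mask into the weights
    return model
```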
When a language model asserts that "the capital of Australia is Sydney," does it know this is wrong? We characterize the geometry of correctness representations across 9 models from 5 architecture families.
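A common way to study such representations, which may or may not match this paper's exact method, is to fit a linear probe on hidden states and treat the learned weight vector as a candidate "correctness direction." A sketch follows; the function name and data layout are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_correctness_probe(hidden_states, labels):
    """Fit a linear probe separating correct from incorrect assertions.
    hidden_states: (n_examples, d_model) activations at a chosen layer.
    labels: 1 for correct statements, 0 for incorrect ones."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(hidden_states, labels)
    direction = probe.coef_[0]                      # candidate correctness direction
    return direction / np.linalg.norm(direction), probe
```

Comparing such directions across layers and models is one way to ask whether correctness is encoded consistently across architecture families.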