HateXScore is a four-component metric suite designed to evaluate the reasoning quality of model explanations in hate speech detection. It assesses conclusion explicitness, faithfulness, protected group identification, and logical consistency, serving as a diagnostic tool for interpretability failures.
In simpler terms, HateXScore checks whether an AI model that detects hate speech can correctly explain *why* it flagged something. It examines four specific aspects of each explanation to make sure it is clear, faithful to the model's decision, identifies the right protected groups, and holds together logically, helping content moderators decide how far to trust the AI.
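The definition above names the four components but not how they are scored or combined, so the sketch below is illustrative only. The component names follow the definition, while the 0-1 score range, the `hatex_score` function, and the unweighted-mean aggregation are assumptions, not the published method.

```python
from dataclasses import dataclass

@dataclass
class ExplanationScores:
    """Per-explanation scores for the four HateXScore components.

    The 0-1 range for each component is assumed for illustration;
    the definition above does not specify scales.
    """
    conclusion_explicitness: float   # does the explanation state its verdict outright?
    faithfulness: float              # does it reflect the model's actual decision evidence?
    group_identification: float      # does it name the correct protected group(s)?
    logical_consistency: float       # do its steps follow without contradiction?

def hatex_score(s: ExplanationScores) -> float:
    """Aggregate the four components into one diagnostic score.

    An unweighted mean is a placeholder; the real suite may weight
    components differently or report them separately.
    """
    components = (
        s.conclusion_explicitness,
        s.faithfulness,
        s.group_identification,
        s.logical_consistency,
    )
    if not all(0.0 <= c <= 1.0 for c in components):
        raise ValueError("component scores are expected in [0, 1]")
    return sum(components) / len(components)

# Example: an explanation that is clear and consistent but misidentifies
# the targeted group scores low on group_identification.
print(hatex_score(ExplanationScores(0.9, 0.8, 0.2, 0.85)))  # -> 0.6875
```

Inspecting the four components individually, rather than only an aggregate, is what makes the suite useful as a diagnostic for interpretability failures.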