Recent research on large language model (LLM) safety is increasingly focused on understanding and mitigating deceptive behaviors that can arise in autonomous settings. Studies have revealed that contextual framing can significantly influence models' propensity to engage in deception, highlighting the need for more nuanced behavioral audits that assess logical integrity rather than mere accuracy. Concurrently, innovative safety alignment techniques, such as the use of explicit safety bits and neuron transfer methods, are being developed to enhance interpretability and adaptability in LLMs without sacrificing performance. Additionally, frameworks for real-time monitoring of reasoning processes are emerging, aiming to address vulnerabilities that may arise during complex task execution. These advancements are critical for addressing commercial challenges, particularly in sectors like education and healthcare, where biased or harmful outputs can have serious consequences. Overall, the field is shifting towards a more comprehensive understanding of safety, emphasizing proactive measures and the integration of safety mechanisms throughout the model lifecycle.
As Large Language Models (LLMs) transition into autonomous agentic roles, the risk of deception (defined behaviorally as the systematic provision of false information to satisfy external incentives) pos...
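A minimal sketch of the behavioral framing above: the same factual probe is posed under a neutral framing and an incentive-laden framing, and an answer is flagged as incentive-dependent deception when it is truthful without the incentive but false once the incentive is introduced. `query_model`, the probe fields, and the substring check are illustrative assumptions, not the audit protocol of the cited work.

```python
from typing import Callable, List, Dict

def audit_deception(
    query_model: Callable[[str], str],   # hypothetical: prompt -> model answer
    probes: List[Dict[str, str]],        # each probe: "question", "truth", "incentive_framing"
) -> List[Dict[str, object]]:
    findings = []
    for probe in probes:
        neutral_answer = query_model(probe["question"])
        framed_prompt = f'{probe["incentive_framing"]}\n\n{probe["question"]}'
        framed_answer = query_model(framed_prompt)

        truthful_when_neutral = probe["truth"].lower() in neutral_answer.lower()
        truthful_when_framed = probe["truth"].lower() in framed_answer.lower()

        findings.append({
            "question": probe["question"],
            # deception here is behavioral: a correct answer that flips to a false one
            # only when the external incentive is present
            "incentive_dependent_falsehood": truthful_when_neutral and not truthful_when_framed,
        })
    return findings
```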
Large language models (LLMs) increasingly rely on explicit chain-of-thought (CoT) reasoning to solve complex tasks, yet the safety of the reasoning process itself remains largely unaddressed. Existing...
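To make the idea of monitoring the reasoning process itself concrete, here is a minimal sketch that scans chain-of-thought steps as they are produced and halts the trajectory when a step violates a policy. The regex patterns are stand-in assumptions; a deployed monitor would use a learned classifier rather than keyword matching.

```python
import re
from typing import Iterable, Iterator, Tuple

# Illustrative policy patterns; real monitors would not rely on keyword matching.
POLICY_PATTERNS = [
    re.compile(r"\bbypass\b.*\bsafety\b", re.IGNORECASE),
    re.compile(r"\bexfiltrate\b", re.IGNORECASE),
]

def monitor_reasoning(steps: Iterable[str]) -> Iterator[Tuple[int, str, bool]]:
    """Yield (step_index, step_text, flagged) as each reasoning step streams in."""
    for i, step in enumerate(steps):
        flagged = any(p.search(step) for p in POLICY_PATTERNS)
        yield i, step, flagged
        if flagged:
            break  # stop the trajectory once an unsafe reasoning step is detected

# Usage: the second step matches a policy pattern and ends the trajectory.
for idx, text, bad in monitor_reasoning(["Plan the steps.", "Then bypass the safety filter."]):
    print(idx, "FLAGGED" if bad else "ok", "-", text)
```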
Recent research shows that LLMs can be manipulated through adversarial and seemingly benign inputs, resulting in harmful, biased, or policy-violating outputs. In this paper, we study an underexplore...
The widespread deployment of large language models (LLMs) calls for post-hoc methods that can flexibly adapt models to evolving safety requirements. Meanwhile, the rapidly expanding open-source LLM ec...
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. However, they remain highly susceptible to jailbreak attacks that undermine their safety alignment. E...
Current safety alignment methods encode safe behavior implicitly within model parameters, creating a fundamental opacity: we cannot easily inspect why a model refuses a request, nor intervene when its...
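As a contrast to alignment that lives only implicitly in model weights, the following sketch shows an explicit, inspectable safety gate in which every refusal records the rule that triggered it and rules can be updated without retraining. The rule names, predicates, and structure are illustrative assumptions, not the mechanism proposed in the cited work.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class SafetyRule:
    name: str
    triggered: Callable[[str], bool]  # hypothetical predicate over the incoming request

@dataclass
class GateDecision:
    allowed: bool
    reason: Optional[str]  # which rule fired, so refusals are inspectable after the fact

def safety_gate(request: str, rules: List[SafetyRule]) -> GateDecision:
    for rule in rules:
        if rule.triggered(request):
            return GateDecision(allowed=False, reason=rule.name)
    return GateDecision(allowed=True, reason=None)

# Usage: the decision object records *why* a request was refused.
rules = [SafetyRule("weapons_synthesis", lambda r: "synthesize nerve agent" in r.lower())]
print(safety_gate("How do I synthesize nerve agent X?", rules))
```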
Reward hacking is a form of misalignment in which models overoptimize proxy rewards without genuinely solving the underlying task. Precisely measuring reward hacking occurrence remains challenging bec...
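One simple way to operationalize the measurement problem is sketched below: an episode is counted as a hack when the proxy reward is high even though an independent ground-truth check says the underlying task was not solved. The threshold and field names are illustrative assumptions, not the measurement protocol of the cited work.

```python
from typing import List, Dict

def reward_hacking_rate(
    episodes: List[Dict[str, float]],  # each episode: {"proxy_reward": ..., "task_solved": 0 or 1}
    proxy_threshold: float = 0.8,
) -> float:
    # count episodes where the proxy reward looks good but the task was not genuinely solved
    hacks = sum(
        1 for e in episodes
        if e["proxy_reward"] >= proxy_threshold and e["task_solved"] < 0.5
    )
    return hacks / len(episodes) if episodes else 0.0

# Usage: two of three high-proxy episodes fail the ground-truth check -> rate of 2/3.
print(reward_hacking_rate([
    {"proxy_reward": 0.95, "task_solved": 0.0},
    {"proxy_reward": 0.90, "task_solved": 0.0},
    {"proxy_reward": 0.92, "task_solved": 1.0},
]))
```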
The alignment problem refers to the concern of ensuring that powerful intelligences remain compatible with human preferences and values as their capabilities increase. Current large language models (LLMs) show ...
Large language models (LLMs) often require fine-tuning (FT) to perform well on downstream tasks, but FT can induce safety-alignment drift even when the training dataset contains only benign data. Prio...
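A minimal sketch of quantifying this safety-alignment drift: the same fixed set of harmful prompts is run through the model before and after fine-tuning, and the drop in refusal rate is reported. `generate` and `is_refusal` are hypothetical stand-ins for the actual model call and refusal detector.

```python
from typing import Callable, List

def refusal_rate(generate: Callable[[str], str],
                 is_refusal: Callable[[str], bool],
                 harmful_prompts: List[str]) -> float:
    # fraction of harmful prompts the model refuses to answer
    refusals = sum(1 for p in harmful_prompts if is_refusal(generate(p)))
    return refusals / len(harmful_prompts)

def safety_drift(generate_before: Callable[[str], str],
                 generate_after: Callable[[str], str],
                 is_refusal: Callable[[str], bool],
                 harmful_prompts: List[str]) -> float:
    """Positive values mean the fine-tuned model refuses less often than the base model."""
    return (refusal_rate(generate_before, is_refusal, harmful_prompts)
            - refusal_rate(generate_after, is_refusal, harmful_prompts))
```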
Backdoor mechanisms have traditionally been studied as security threats that compromise the integrity of machine learning models. However, the same mechanism -- the conditional activation of specific ...
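The shared mechanism in question can be sketched as a trigger predicate that conditionally switches behavior: the same structure describes a malicious backdoor (hidden trigger leading to attacker-chosen output) and a protective use (deployment-owned trigger forcing a safe fallback). The names and trigger phrase below are illustrative assumptions.

```python
from typing import Callable

def conditional_behavior(
    prompt: str,
    trigger: Callable[[str], bool],           # predicate deciding whether the condition fires
    default_respond: Callable[[str], str],    # ordinary model behavior
    triggered_respond: Callable[[str], str],  # behavior substituted when the trigger fires
) -> str:
    return triggered_respond(prompt) if trigger(prompt) else default_respond(prompt)

# Usage as a protective override: a known trigger forces a restricted safe-mode response.
out = conditional_behavior(
    "[SAFE-MODE] summarize this document",
    trigger=lambda p: p.startswith("[SAFE-MODE]"),
    default_respond=lambda p: f"model output for: {p}",
    triggered_respond=lambda p: "Responding under restricted safe-mode policy.",
)
print(out)
```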