Safety Alignment

Proof pending

3 papers

5.3 viability

Proof pending. This topic has not reached the minimum paper threshold yet.

Topic-linked question coverage is still building for this proof surface.

Papers

Research Paper · Mar 12, 2026

Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment

Safety alignment aims to ensure that large language models (LLMs) refuse harmful requests by post-training on harmful queries paired with refusal answers. Although safety alignment is widely adopted i...

7.0 viability

Research Paper · Jan 26, 2026

TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment

In recent years, safety risks associated with large language models have become increasingly prominent, highlighting the urgent need to mitigate the generation of toxic and harmful content. The mainst...

6.0 viability

Research Paper · Mar 17, 2026

MOSAIC: Composable Safety Alignment with Modular Control Tokens

Safety alignment in large language models (LLMs) is commonly implemented as a single static policy embedded in model parameters. However, real-world deployments often require context-dependent safety ...

3.0 viability


Use this topic page as a durable research-area proof surface.