Safety-Preserving Fine-tuning (SPF) is a method for adapting large language models (LLMs) to specific tasks without degrading their safety alignment. It addresses the trade-off between improving task performance and maintaining robustness against safety vulnerabilities such as jailbreak attacks.
In plain terms, Safety-Preserving Fine-tuning (SPF) helps large AI models learn new tasks without becoming unsafe or vulnerable to malicious attacks. It works by carefully constraining how the model updates its knowledge, so that it improves at a task while still retaining its safety behavior. This addresses a common problem: making models better at specific tasks often makes them less safe.
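The idea of "constraining how the model updates its knowledge" can be made concrete. One common realization (an illustrative assumption here, not a claim about any specific SPF implementation) is to add a regularizer that penalizes drift away from the original aligned model during fine-tuning. The toy sketch below uses a one-parameter "model": the task loss pulls the parameter toward a task optimum, while a quadratic anchor term (a stand-in for a KL-divergence penalty against the reference model) keeps it near the aligned starting point. All names (`fine_tune`, `w_ref`, `w_task`, `lam`) are hypothetical.

```python
# Toy illustration of safety-anchored fine-tuning (assumed mechanism, not a
# specific library's API). A 1-D parameter w is trained toward a task optimum
# w_task; a penalty lam * (w - w_ref)^2 anchors it to the aligned reference
# w_ref, limiting how far fine-tuning can drift from the safe starting point.

def fine_tune(w_ref: float, w_task: float, lam: float,
              steps: int = 500, lr: float = 0.05) -> float:
    """Gradient descent on: (w - w_task)^2 + lam * (w - w_ref)^2."""
    w = w_ref
    for _ in range(steps):
        grad_task = 2.0 * (w - w_task)          # gradient of the task loss
        grad_anchor = 2.0 * lam * (w - w_ref)   # gradient of the drift penalty
        w -= lr * (grad_task + grad_anchor)
    return w

# Unconstrained fine-tuning drifts all the way to the task optimum...
w_plain = fine_tune(w_ref=0.0, w_task=1.0, lam=0.0)   # converges near 1.0
# ...while the anchor term keeps the fine-tuned model closer to the
# aligned reference, trading some task gain for preserved behavior.
w_safe = fine_tune(w_ref=0.0, w_task=1.0, lam=1.0)    # converges near 0.5
```

The anchor weight `lam` controls the trade-off: larger values preserve more of the original (safe) behavior at the cost of task performance, which mirrors the tension the definitions above describe.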
Also known as: Safety-Preserving Fine-tuning, Safety-Prioritized Fine-tuning, Safe Fine-tuning, Alignment-Preserving Fine-tuning