Current research on aligning large language models (LLMs) increasingly focuses on interpretability, robustness, and cultural sensitivity, which are key challenges in deploying these models across diverse applications. Recent work emphasizes scalable, interpretable reward modeling; frameworks such as the Contrast-Driven Rubric Reward Model demonstrate improved data efficiency and bias mitigation. Studies also reveal significant gaps in cultural alignment, particularly around religious viewpoints in multilingual contexts, prompting calls for systematic audits to ensure equitable deployment. Privacy-preserving techniques are gaining traction, enabling cross-model alignment without compromising security, while approaches such as winsorized Direct Preference Optimization refine preference alignment by targeting specific noise types in training data. As the field matures, there is a clear shift toward integrating observational feedback and reference-guided evaluations, which strengthen alignment strategies and ultimately aim to produce LLMs that better reflect human values and preferences in real-world settings.
Reward modeling is essential for aligning Large Language Models (LLMs) with human preferences, yet conventional reward models suffer from poor interpretability and heavy reliance on costly expert annot...
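Conventional reward models of the kind described here are typically trained on pairwise human preferences via the Bradley-Terry objective. A minimal sketch of that objective on a single preference pair follows; the scalar rewards are toy values, and this illustrates the standard pairwise loss rather than any rubric-based method:

```python
import math

def bradley_terry_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood that the chosen response beats the rejected
    one under the Bradley-Terry model:
    P(chosen > rejected) = sigmoid(r_chosen - r_rejected)."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the reward margin favors the chosen response,
# and grows when the model scores the rejected response higher.
print(bradley_terry_loss(2.0, 0.0))  # small loss: margin favors chosen
print(bradley_terry_loss(0.0, 2.0))  # large loss: margin favors rejected
```

Training a reward model amounts to minimizing this loss over a dataset of annotated preference pairs, which is exactly where the annotation cost discussed above arises.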
Large Language Models (LLMs) are increasingly being deployed in multilingual, multicultural settings, yet their reliance on predominantly English-centric training data risks misalignment with the dive...
Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models, current reward modeling heavily relies on experimental feedback data collected from human annotato...
We study how to allocate a fixed supervised fine-tuning budget when three objectives must be balanced at once: multi-turn safety alignment, low over-refusal on benign boundary queries, and instruction...
Direct Preference Optimization (DPO) has emerged as a popular algorithm for aligning pretrained large language models with human preferences, owing to its simplicity and training stability. However, D...
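For reference, the DPO objective on a single preference pair can be sketched with scalar sequence log-probabilities under the policy and the frozen reference model. This is a minimal illustration of the published loss; `beta` and the toy log-prob values are assumptions:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO pairwise loss on one preference pair:
    L = -log sigmoid(beta * ((logpi_w - logref_w) - (logpi_l - logref_l))),
    where w is the chosen (winning) and l the rejected (losing) response."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy matches the reference, the margin is zero and the loss
# is log(2); shifting probability mass toward the chosen response lowers it.
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
print(dpo_loss(-4.0, -6.0, -5.0, -5.0, beta=0.1))
```

The simplicity of this closed-form loss, with no separately trained reward model or RL loop, is what the abstract credits for DPO's training stability.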
Best-of-N (BoN) sampling is a widely used inference-time alignment method for language models, whereby N candidate responses are sampled from a reference model and the one with the highest predicted r...
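The BoN procedure itself is simple to sketch. In the snippet below, `toy_generate` and the length-based `toy_reward` are hypothetical stand-ins for a reference model and a learned (imperfect) reward model:

```python
import random

def best_of_n(generate, reward_model, n, prompt):
    """Sample n candidate responses from the reference model and return
    the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward_model)

# Toy stand-ins: a "model" that emits random drafts and a proxy reward.
random.seed(0)
toy_generate = lambda prompt: f"{prompt} -> draft #{random.randint(0, 99)}"
toy_reward = lambda response: len(response)  # hypothetical proxy reward

print(best_of_n(toy_generate, toy_reward, n=8, prompt="Explain BoN"))
```

Because selection relies entirely on the predicted reward, BoN quality hinges on how well the reward model ranks candidates, which is the failure mode inference-time alignment work focuses on.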
Inference-time alignment effectively steers large language models (LLMs) by generating multiple candidates from a reference model and selecting among them with an imperfect reward model. However, curr...
Language models increasingly appear to learn similar representations, despite differences in training objectives, architectures, and data modalities. This emerging compatibility between independently ...
Direct Preference Optimization (DPO) aligns large language models by optimizing pairwise preferences and has shown remarkable effectiveness as a simple and scalable alternative to RLHF. However, in pr...
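One robustification in this vein, winsorization (mentioned earlier in connection with winsorized DPO), limits the influence of outliers, such as mislabeled preference pairs with extreme losses, by clipping values to empirical quantiles. A minimal sketch follows; the quantile choices are assumptions for illustration, not the exact published procedure:

```python
def winsorize(values, lower_q=0.05, upper_q=0.95):
    """Clip values to the given empirical quantiles, capping the
    influence of extreme entries (e.g., losses from noisy labels)."""
    ordered = sorted(values)
    lo = ordered[int(lower_q * (len(ordered) - 1))]
    hi = ordered[int(upper_q * (len(ordered) - 1))]
    return [min(max(v, lo), hi) for v in values]

losses = [0.2, 0.3, 0.25, 9.0, 0.28]  # one outlier, e.g. a noisy label
print(winsorize(losses))  # the 9.0 is clipped to the upper quantile
```

Unlike discarding suspect pairs outright, clipping keeps every example in the batch while bounding how much any single noisy pair can move the gradient.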
While Reinforcement Learning with Verifiable Rewards (RLVR) has shown strong effectiveness in reasoning tasks, it cannot be directly applied to non-verifiable domains lacking ground-truth verifiers, s...
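In the verifiable domains RLVR targets, the reward is a programmatic check against ground truth rather than a learned model. A minimal sketch assuming an exact-match verifier (real verifiers for math or code are considerably more elaborate, e.g. numeric normalization or unit-test execution):

```python
def verifiable_reward(response: str, ground_truth: str) -> float:
    """RLVR-style binary reward: 1.0 if the model's final answer matches
    the ground truth exactly, else 0.0. Only applicable in domains where
    such a verifier exists (e.g., math answers, code against tests)."""
    return 1.0 if response.strip() == ground_truth.strip() else 0.0

print(verifiable_reward("42", "42"))  # correct answer earns full reward
print(verifiable_reward("41", "42"))  # incorrect answer earns nothing
```

The abstract's point is precisely that no such check exists in non-verifiable domains, so some substitute reward signal is needed there.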