Recent research in AI alignment increasingly focuses on how large language models (LLMs) can be aligned with human values and preferences more efficiently and more representatively. Work on alignment pretraining shows that discourse about AI present in training data can significantly influence alignment outcomes, producing self-fulfilling misalignment when negative narratives dominate. Approaches such as LLMdoctor and Reward Informed Fine-Tuning (RIFT) streamline alignment by optimizing behavior at test time and by repurposing negative samples, respectively, reducing reliance on costly expert data. Frameworks like Democratic Preference Optimization tackle demographic biases in preference data collection so that models reflect a broader spectrum of human values, and methods such as Density-Guided Response Optimization enable alignment in resource-scarce settings by leveraging implicit community signals. Collectively, these advances point toward more nuanced, efficient, and inclusive approaches to aligning AI systems with complex human values.
Pretraining corpora contain extensive discourse about AI systems, yet the causal influence of this discourse on downstream alignment remains poorly understood. If prevailing descriptions of AI behavio...
Aligning Large Language Models (LLMs) with human preferences is critical, yet traditional fine-tuning methods are computationally expensive and inflexible. While test-time alignment offers a promising...
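The abstract above is truncated before it reaches its method, so the following is only a generic illustration of what test-time alignment can look like: best-of-N reranking, where several candidate responses are sampled at inference time and the one a reward model scores highest is returned. The names `best_of_n`, `generate_fn`, and `reward_fn` are illustrative stand-ins, not an API from the paper.

```python
# A minimal best-of-N sketch of test-time alignment: rather than fine-tuning the
# model, sample several candidate responses and return the one preferred by a
# reward model. `generate_fn` and `reward_fn` are hypothetical callables standing
# in for an LLM sampler and a learned reward model.
from typing import Callable, List


def best_of_n(prompt: str,
              generate_fn: Callable[[str], str],
              reward_fn: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n candidates for `prompt` and return the highest-reward one."""
    candidates: List[str] = [generate_fn(prompt) for _ in range(n)]
    scores = [reward_fn(prompt, c) for c in candidates]
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index]


if __name__ == "__main__":
    # Toy stand-ins: a "sampler" that appends a random suffix and a "reward" that
    # prefers longer answers; a real setup would use an LLM and a trained reward model.
    import random
    toy_generate = lambda p: p + " answer-" + str(random.randint(0, 9)) * random.randint(1, 5)
    toy_reward = lambda p, r: float(len(r))
    print(best_of_n("Explain RLHF briefly:", toy_generate, toy_reward, n=4))
```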
Direct alignment methods are increasingly used to align large language models (LLMs) with human preferences. However, many real-world alignment problems involve multiple conflicting objectives, where ...
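As background on what a direct alignment method optimizes, here is a minimal sketch of a DPO-style loss computed from precomputed per-sequence log-probabilities under the policy and a frozen reference model. It covers only the single-objective case, not the multi-objective setting the abstract alludes to, and the function and argument names are illustrative.

```python
# Minimal sketch of a DPO-style direct alignment loss. Assumes the log-probabilities
# of the chosen and rejected responses have already been computed under both the
# policy and a frozen reference model.
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Average DPO loss over a batch of preference pairs."""
    # Implicit rewards are log-probability ratios against the reference model.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    # Increase the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(beta * (chosen_reward - rejected_reward)).mean()


if __name__ == "__main__":
    logp = lambda: torch.randn(4)  # toy per-sequence log-probs for a batch of 4
    print(dpo_loss(logp(), logp(), logp(), logp()).item())
```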
Whose values should AI systems learn? Preference-based alignment methods like RLHF derive their training signal from human raters, yet these rater pools are typically convenience samples that systemat...
While Supervised Fine-Tuning (SFT) and Rejection Sampling Fine-Tuning (RFT) are standard for LLM alignment, they either rely on costly expert data or discard valuable negative samples, leading to data...
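For context, a minimal sketch of plain Rejection Sampling Fine-Tuning: sample several candidates per prompt, keep only those a reward model accepts, and fine-tune on the survivors. The rejected samples are simply dropped, which is the data inefficiency the abstract points at. `generate_fn`, `reward_fn`, and the acceptance threshold are illustrative assumptions, not the paper's interface.

```python
# Minimal sketch of Rejection Sampling Fine-Tuning (RFT): sample candidate responses
# per prompt, keep only those a reward model accepts, and build an SFT dataset from
# them. The rejected (negative) samples are thrown away and carry no training signal.
from typing import Callable, List, Tuple


def build_rft_dataset(prompts: List[str],
                      generate_fn: Callable[[str], str],
                      reward_fn: Callable[[str, str], float],
                      n_samples: int = 8,
                      threshold: float = 0.5) -> List[Tuple[str, str]]:
    """Return (prompt, response) pairs whose reward clears the threshold."""
    dataset: List[Tuple[str, str]] = []
    for prompt in prompts:
        for _ in range(n_samples):
            response = generate_fn(prompt)
            if reward_fn(prompt, response) >= threshold:
                dataset.append((prompt, response))
            # else: the negative sample is discarded, unused by plain RFT
    return dataset
```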
Reliable AI systems require large language models (LLMs) to exhibit behaviors aligned with human preferences and values. However, most existing alignment approaches operate at training time and rely o...
Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerge...
Language models deployed in online communities must adapt to norms that vary across social, cultural, and domain-specific contexts. Prior alignment approaches rely on explicit preference supervision o...
This paper introduces a methodological framework for empirically testing AI alignment strategies through structured multi-model dialogue. Drawing on Peace Studies traditions - particularly interest-ba...
Recent empirical results have demonstrated that training large language models (LLMs) with negative-only feedback can match or exceed standard reinforcement learning from human feedback (RLHF). Negati...
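This abstract is also cut off before describing its method; as one generic way to turn negative-only feedback into a training signal, the sketch below applies an unlikelihood-style penalty that lowers the probability of tokens from a dispreferred response. It is offered purely as an illustration of the general idea and is not necessarily the formulation used in the paper.

```python
# Generic sketch of a negative-only training signal: an unlikelihood-style loss that
# pushes down the probability of tokens taken from a rejected response.
import torch


def unlikelihood_loss(logits: torch.Tensor, negative_tokens: torch.Tensor) -> torch.Tensor:
    """logits: [seq_len, vocab]; negative_tokens: [seq_len] token ids of a rejected response."""
    log_probs = torch.log_softmax(logits, dim=-1)
    token_logp = log_probs.gather(-1, negative_tokens.unsqueeze(-1)).squeeze(-1)
    # Penalize probability mass on the rejected tokens: -log(1 - p(token)).
    return -torch.log1p(-token_logp.exp().clamp(max=1 - 1e-6)).mean()


if __name__ == "__main__":
    vocab, seq_len = 100, 12
    logits = torch.randn(seq_len, vocab)
    rejected = torch.randint(0, vocab, (seq_len,))
    print(unlikelihood_loss(logits, rejected).item())
```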