Hey everyone, Ernis here, and welcome back to PaperLedge! Today, we're diving into a fascinating piece of research that gets to the heart of how AI learns our values – or doesn't! We're talking about Large Language Models, or LLMs, those powerful AI systems that are becoming increasingly woven into our daily lives.
Think about it: these models are answering our questions, writing our emails, even helping us make important decisions. That means they need to understand, and hopefully share, our values. The big question is: how do they learn what's right and wrong?
Now, a lot of previous research has focused on checking whether these LLMs already align with human values after they've been fully trained. But this paper takes a different and, in my opinion, much more insightful approach. It's like peeking behind the curtain to see how the magic actually happens. Instead of just examining the finished product, the researchers study the entire training process, specifically the "post-training" phase, to understand how and when these values get baked in.
The research team essentially dissected the post-training process, looking at two key ingredients: the algorithms used to train the models and the data they’re trained on. They wanted to understand how each contributes to value alignment. Imagine it like teaching a child – are their values shaped more by the teaching method (the algorithm) or by the examples they see (the data)?
They experimented with big-name model families like Llama-3 and Qwen-3, across a range of sizes. They put these models through different post-training methods, including Supervised Fine-Tuning (SFT) and Preference Optimization (a family of algorithms that teach models which responses humans prefer), using popular datasets designed for each stage.
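For the code-curious listeners, here's a rough sketch of what those two training objectives look like in PyTorch. This is my own illustration, not the paper's code, and DPO is just one member of the preference-optimization family the episode is talking about:

```python
import torch.nn.functional as F

def sft_loss(logits, target_ids):
    # Supervised Fine-Tuning: plain next-token cross-entropy on the demonstrated
    # response. logits: (batch, seq_len, vocab); target_ids: (batch, seq_len),
    # with prompt positions set to -100 so only response tokens are trained on.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target_ids.reshape(-1),
        ignore_index=-100,
    )

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # DPO-style preference objective: nudge the policy to widen the log-prob gap
    # between the "chosen" and "rejected" responses, measured relative to a
    # frozen reference model (often the SFT checkpoint itself).
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```

The structural difference is the point: SFT imitates demonstrations token by token, while preference optimization only ever sees which of two responses was preferred.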
Here’s the key takeaway: They found that the SFT phase, which is where models are directly shown examples of how to respond to prompts, has the biggest impact on establishing a model's values. Think of SFT as the foundational value programming. The surprising part? Subsequent Preference Optimization, which is meant to fine-tune the model based on human preferences, often doesn't significantly change those initial values. It's like giving a house a fresh coat of paint: the structure underneath stays exactly the same.
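If you ever want to sanity-check that finding on your own checkpoints, one crude approach is to ask the SFT model and the preference-optimized model the same value-laden questions and compare what comes back. Everything below (the checkpoint paths, the probe questions) is a placeholder I made up for illustration, not the paper's actual evaluation setup:

```python
from transformers import pipeline

# Placeholder paths; swap in your own SFT and preference-optimized checkpoints.
sft_model = pipeline("text-generation", model="path/to/sft-checkpoint")
po_model = pipeline("text-generation", model="path/to/preference-optimized-checkpoint")

# Toy value probes; a real study would use an established value questionnaire.
probes = [
    "Is it ever acceptable to lie to protect someone's feelings?",
    "Should individual freedom outweigh collective safety?",
]

for question in probes:
    sft_answer = sft_model(question, max_new_tokens=80)[0]["generated_text"]
    po_answer = po_model(question, max_new_tokens=80)[0]["generated_text"]
    print(f"Q: {question}\nSFT model: {sft_answer}\nPO model:  {po_answer}\n")
```

A real study would score the answers rather than eyeball them, but even this toy version makes the point: if the two sets of answers barely differ, the preference stage didn't move the needle on values.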
"the SFT phase generally establishes a model's values, and subsequent preference optimization rarely re-aligns these values."But the researchers didn’t stop there! They even created their own "synthetic" preference dataset, which allowed them to control and manipulate the values the models were learning. This is where things get really interesting. They discovered that even when the models were fed the same preference data, different Preference Optimization algorithms led to different value alignment outcomes! So, the how you teach is as important as what you teach.
Think of it like baking a cake. You can have the exact same recipe (the data), but if you use different baking methods (the algorithms) – maybe one oven is convection, the other isn't – you'll end up with slightly different cakes.
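To make that "same recipe, different ovens" idea concrete, here's a toy version of how a synthetic preference dataset can encode a controlled value. The value, prompts, and responses below are all invented for illustration, and the specific algorithms the authors compared may differ from the ones named in the comment:

```python
import json

# Toy example: encode a single target value ("privacy") into preference pairs.
# The "chosen" response always expresses the target value; "rejected" opposes it.
prompts = [
    "Should an app share user data with advertisers to improve recommendations?",
    "Is it fine for an employer to monitor employees' personal messages?",
]

pro_privacy = ("No. Personal data should stay under the user's control "
               "unless they explicitly consent to sharing it.")
anti_privacy = ("Yes. The convenience and business benefits outweigh "
                "individual privacy concerns.")

dataset = [
    {"prompt": p, "chosen": pro_privacy, "rejected": anti_privacy}
    for p in prompts
]

# The same pairs could now be fed to different preference-optimization
# algorithms (e.g. DPO, IPO, KTO) to see whether they absorb the value equally.
print(json.dumps(dataset[0], indent=2))
```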
So, why does all of this matter? Because if a model's values are largely locked in during SFT, that's the stage where we need to be most careful about the data we show it, and the choice of preference-optimization algorithm clearly isn't a neutral implementation detail either.
This research also raises some fascinating questions for future work.
That's all for this episode of PaperLedge. I hope this has shed some light on the complex world of AI value alignment. Until next time, keep learning and keep questioning!