Machine Learning - Online SFT for LLM Reasoning: Surprising Effectiveness of Self-Tuning without Rewards

2025-10-22

Hey PaperLedge learning crew, Ernis here, ready to dive into some seriously cool AI wizardry! Today, we're cracking open a paper that tackles a big challenge: how to make Large Language Models, or LLMs – think of them as super-smart chatbots – even better at reasoning, especially when it comes to complex stuff like math problems.

Now, usually, training these LLMs to think better is a bit like teaching a dog new tricks. You need to reward them when they get it right, which, in AI terms, means setting up a whole reward system. This can be tricky and time-consuming. But what if the LLM could, in a way, teach itself?

That's precisely what this paper proposes with something they call Online Supervised Finetuning (OSFT). It's like a self-help program for AI! The basic idea is simple: the LLM tries to solve a problem, then immediately learns from its own attempt, with no one scoring whether it was right or wrong.

Think of it like this: you're trying to learn a new recipe. Instead of having a chef constantly telling you what to do, you try making the dish yourself. Then, you immediately analyze what went well, what didn't, and adjust your approach for the next time. That's OSFT in a nutshell!
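For anyone who prefers code to cooking, here is a minimal sketch of that generate-then-learn loop. It is a toy, not the authors' implementation: the "model" is just a list of scores (logits) over three candidate answers, and the names `osft_step` and `run_osft`, the learning rate, and the sampling temperature are all assumptions made for illustration. In particular, sampling a bit more sharply than the model trains is a choice of this sketch, not something described in the episode.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature means sharper."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def osft_step(logits, rng, lr=0.5, sample_temperature=0.5):
    """One toy self-tuning step: sample an answer, then train on it.

    No reward is computed anywhere; the sampled answer is simply
    treated as the supervised target.
    """
    # 1. The model produces its own attempt (hypothetical setting:
    #    sampling is sharper than the training distribution).
    sample_probs = softmax(logits, sample_temperature)
    answer = rng.choices(range(len(logits)), weights=sample_probs, k=1)[0]

    # 2. Supervised finetuning on that attempt: one gradient-ascent step
    #    on log p(answer), whose gradient w.r.t. the logits is
    #    one_hot(answer) - p.
    train_probs = softmax(logits, 1.0)
    new_logits = [
        z + lr * ((1.0 if i == answer else 0.0) - p)
        for i, (z, p) in enumerate(zip(logits, train_probs))
    ]
    return new_logits, answer


def run_osft(logits, steps=50, seed=0):
    """Repeat the generate-then-learn step with a seeded random generator."""
    rng = random.Random(seed)
    for _ in range(steps):
        logits, _ = osft_step(logits, rng)
    return logits


if __name__ == "__main__":
    start = [1.0, 0.5, 0.0]
    print("before:", softmax(start))
    print("after: ", softmax(run_osft(start)))
```

Each step nudges the score of whatever the model just produced upward and every other score downward, so nothing outside the model's own outputs ever enters the loop.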

The cool thing is, OSFT cuts out the need for a complex reward system. It's reward-free! The LLM simply learns from its own outputs, one attempt at a time. The authors explain this with what they call "latent knowledge": the model already knows some things from its initial training, and OSFT helps it unlock its own potential.

The major mechanism of OSFT lies in facilitating the model's own existing preference (latent knowledge) learned from pretraining, which leads to reasoning ability improvement.
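One way to build intuition for "facilitating the model's own existing preference" is to compute the average direction of a self-training update on a toy model. This is a sketch under stated assumptions, not the paper's analysis: the function name `expected_update`, the three-answer model, and the idea of sampling at a lower temperature than the model trains at are all hypothetical choices for illustration. The one hard fact used here is standard: if answers are sampled from exactly the distribution being trained, the expected log-likelihood gradient is zero.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature means sharper."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_update(logits, sample_temperature):
    """Average SFT gradient on the logits over self-generated answers.

    For one sampled answer y the gradient of log p(y) is one_hot(y) - p.
    Averaging over y drawn at `sample_temperature` gives
    p_sample - p_train, where p_train uses temperature 1.
    """
    p_sample = softmax(logits, sample_temperature)
    p_train = softmax(logits, 1.0)
    return [ps - pt for ps, pt in zip(p_sample, p_train)]


if __name__ == "__main__":
    logits = [1.0, 0.5, 0.0]  # the model already prefers answer 0
    print("same temperature:", expected_update(logits, 1.0))
    print("sharper sampling:", expected_update(logits, 0.5))
```

With matching temperatures the average update vanishes, while sharper sampling pushes the already-preferred answer up and the others down. In this toy setting, at least, self-training amplifies what the model already leans toward rather than teaching it anything new.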

The researchers put OSFT to the test on some seriously tough math problems. And guess what? It performed just as well as, or even better than, LLMs trained with complicated reward systems such as GRPO, the reward-based method the authors compare it against.

What's really exciting is that OSFT also seems efficient and reliable. The researchers ran a bunch of additional experiments to check exactly that, and the results are pretty convincing.

So, why does all this matter?

For AI researchers: OSFT offers a simpler and potentially more effective way to train LLMs for reasoning, which could lead to breakthroughs in AI capabilities.
For developers: Imagine being able to improve your AI models' problem-solving abilities without needing to build complex reward systems. OSFT could make AI development much easier and faster.
For everyone else: Better reasoning in AI could lead to smarter virtual assistants, more accurate medical diagnoses, and more efficient solutions to complex global problems. It's all about making AI a more helpful and capable tool for humanity.

Now, I'm left wondering... if an LLM can teach itself through OSFT, could we apply similar principles to other areas of AI training? Could this "self-help" approach be useful for teaching AI to be more creative, or even more ethical?

Also, how far can we push this? Is there a limit to how much an LLM can improve through self-learning alone, or will it eventually need external input to reach its full potential?

You can find the code for this project on GitHub at https://github.com/ElementQi/OnlineSFT.

That's all for today's deep dive, learning crew! Keep those questions coming, and I'll see you next time on PaperLedge.

Credit to Paper authors: Mengqi Li, Lei Zhao, Anthony Man-Cho So, Ruoyu Sun, Xiao Li
