Beyond Training: How Fine-Tuning Shapes Smarter Models




Webinar delivered 03/12/2025 by Dr. Nadine Kroher, Chief Scientific Officer at Passion Labs
Most people now know that Large Language Models (LLMs) like ChatGPT are trained by predicting the next word... but that’s only half the story. Next-word prediction gives you something that can continue a sentence. It does not automatically give you something that follows instructions, reasons in a structured way, or knows how polite, safe or useful it is meant to be.
That last part comes from fine-tuning. This webinar walked through how modern LLMs are fine-tuned, why reinforcement learning (RL) is suddenly fashionable again and what it really means when we say models are “trained with human feedback”.
In pre-training, an LLM is shown enormous amounts of text and learns to guess the next token.
“NADINE NEEDS A …” → pizza? coffee? zebra?
The model assigns probabilities to all the words it knows, gets told which one was actually in the original text and nudges its internal parameters to be slightly less wrong next time. Repeat this trillions of times and you get a model that has absorbed an enormous amount of grammar, factual knowledge and writing style.
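To make the objective concrete, here is a minimal sketch of one next-token prediction step in PyTorch. The toy vocabulary, the TinyLM class and all the sizes are illustrative assumptions for the "NADINE NEEDS A ..." example, not the setup of any real LLM.

```python
import torch
import torch.nn as nn

# Toy vocabulary for the "NADINE NEEDS A ..." example.
vocab = ["nadine", "needs", "a", "pizza", "coffee", "zebra"]
tok = {w: i for i, w in enumerate(vocab)}

class TinyLM(nn.Module):
    """Deliberately tiny stand-in for a language model."""
    def __init__(self, vocab_size: int, dim: int = 16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Predict the next token from the mean of the context embeddings.
        context = self.embed(token_ids).mean(dim=0)
        return self.head(context)  # unnormalised scores over the whole vocabulary

model = TinyLM(len(vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

context = torch.tensor([tok["nadine"], tok["needs"], tok["a"]])
target = torch.tensor(tok["pizza"])  # the word that actually followed in the text

logits = model(context)              # scores for pizza, coffee, zebra, ...
loss = loss_fn(logits.unsqueeze(0), target.unsqueeze(0))
loss.backward()                      # nudge the parameters to be slightly less wrong
optimizer.step()
```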
What you don’t get yet is a good assistant. A purely pre-trained model happily continues text but doesn’t really follow instructions. It doesn’t reason in a structured way and doesn’t know how polite, safe or useful it’s meant to be... that’s where fine-tuning comes in.
In practice, modern LLMs go through two main fine-tuning stages after pre-training.
First, in supervised fine-tuning, models see lots of examples like:
Instruction: “Explain why the sky appears blue.”
Desired answer: “The sky appears blue because of a phenomenon called Rayleigh scattering…”
Here, we do know the target output. We compare what the model wrote to what we wanted and update its parameters accordingly. This teaches the model to follow instructions and to answer in the style of a helpful assistant.
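As a rough illustration of this supervised step, the sketch below uses the Hugging Face transformers library with "gpt2" as a small stand-in model (an assumption made for the example, not the webinar's setup): the instruction tokens are masked out and the loss is computed only on the desired answer.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

instruction = "Explain why the sky appears blue.\n"
answer = "The sky appears blue because of a phenomenon called Rayleigh scattering."

prompt_ids = tokenizer(instruction, return_tensors="pt").input_ids
full_ids = tokenizer(instruction + answer, return_tensors="pt").input_ids

# Only the answer tokens count towards the loss; the instruction itself
# is masked with -100, the ignore index used by the library.
labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

loss = model(input_ids=full_ids, labels=labels).loss
loss.backward()   # here we *do* know the target, so this is ordinary supervised learning
optimizer.step()
```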
But the amount of high-quality instruction data is limited and we still have little control over style, tone and safety.
The second stage uses reinforcement learning to nudge the model towards outputs that humans actually prefer: polite, safe and genuinely helpful.
This is the “beyond training” part: we’re no longer just asking “Is this answer correct?”, but “Is this the kind of answer we want?”
Reinforcement learning is a framework where a system learns by trial and error.
Think of teaching an autopilot to fly a plane in a simulator: the agent observes the current state (pitch, speed, altitude), tries an action (pull the nose up, push it down) and receives a reward depending on whether the flight gets better or worse.
Over many simulated flights, the agent learns a rule or what we call a 'policy'. This is essentially a mapping from states to good actions. If the world is simple, that policy could be a set of rules: “If pitch > X and speed > Y → pitch down.”
In real problems, the state and action spaces are huge and messy. So instead of explicit rules, we let a neural network approximate the policy: it takes the current state as input and outputs the best action it has learned so far. That’s deep reinforcement learning.
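As a sketch of what such a policy network could look like for the autopilot analogy, the example below uses invented state features, actions and rewards, plus a bare REINFORCE-style update; it illustrates the idea of deep RL rather than any production algorithm.

```python
import torch
import torch.nn as nn

ACTIONS = ["pitch_up", "pitch_down", "hold"]

# The policy network: current state in, a score for each action out.
policy = nn.Sequential(
    nn.Linear(3, 32), nn.ReLU(),
    nn.Linear(32, len(ACTIONS)),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

state = torch.tensor([0.2, 0.8, 0.5])         # e.g. pitch, speed, altitude (normalised)
probs = torch.softmax(policy(state), dim=-1)  # the learned policy: state -> action probabilities
dist = torch.distributions.Categorical(probs)
action = dist.sample()                        # trial...

# ...and error: the simulator tells us how good that action was.
reward = 1.0 if ACTIONS[action.item()] == "pitch_down" else -1.0

# REINFORCE-style update: make rewarded actions more likely next time.
loss = -dist.log_prob(action) * reward
loss.backward()
optimizer.step()
```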
Now swap the plane for an LLM.
Imagine we already have a trained model. It’s decent, but we want it to be, say, consistently polite (or safer, or more helpful).
We don’t know the perfect answer beforehand. But we can get a score for whatever it writes.
For example:
Prompt: “Write a polite email to my boss…”
LLM output: “Hey, what’s up…”
Scoring model: gives it 2 out of 100 for politeness.
We can plug this into an RL setup: the prompt is the state, the generated text is the action, the score from the scoring model is the reward and the LLM itself is the policy whose parameters we update.
At a high level, we are telling the LLM: “For this kind of prompt, keep adjusting yourself until the scoring model likes your answers more.”
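The sketch below shows that loop in heavily simplified form: the LLM writes an answer, a hypothetical politeness_score() function stands in for the scoring model, and a crude policy-gradient update nudges the model towards higher-scoring answers. Real systems use more careful algorithms such as PPO, so treat this as an illustration of the idea rather than a recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # small stand-in policy model
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

def politeness_score(text: str) -> float:
    """Hypothetical scoring model: returns 0-100 for how polite the text is."""
    return 2.0 if text.lstrip().lower().startswith("hey") else 80.0

prompt = "Write a polite email to my boss asking for Friday off.\n"
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids

# The "action": let the current policy (the LLM) write an answer.
generated = model.generate(prompt_ids, max_new_tokens=30, do_sample=True)
answer = tokenizer.decode(generated[0, prompt_ids.shape[1]:])

# The "reward": how much the scoring model liked that answer.
reward = politeness_score(answer) / 100.0

# Policy-gradient style update: raise the log-probability of answers that
# scored well, lower it for answers that scored badly.
labels = generated.clone()
labels[:, : prompt_ids.shape[1]] = -100                     # only the answer counts
log_prob = -model(input_ids=generated, labels=labels).loss  # mean log-prob of the answer tokens
loss = -(reward - 0.5) * log_prob                           # 0.5 acts as a crude baseline
loss.backward()
optimizer.step()
```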
Being polite is useful, but we want more than “not offensive”. We want answers that humans overall like: clear, honest, helpful and, yes, friendly.
A natural idea is to put humans in the loop: every time the model writes an answer, a person rates it and the model updates on that rating.
The problem: this doesn’t scale. You’d need millions of ratings, in real time, with updates happening constantly. That’s not realistic.
Instead, modern systems do something smarter: they collect a large but finite set of human ratings and use it to train a separate neural network, a reward model, that learns to predict how a human would score a given answer. Over time, this reward model becomes a reasonably good simulator of human judgement.
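One common way to train such a reward model is from pairwise comparisons: a human says which of two answers they prefer and the model learns to give the preferred one a higher score. The sketch below uses a deliberately tiny, invented encoder to show that pairwise loss; real systems typically attach a scalar scoring head to a pretrained LLM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Toy stand-in for a reward model: text in, a single score out."""
    def __init__(self, vocab_size: int = 5000, dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.score(self.embed(token_ids).mean(dim=0))  # "how good is this answer?"

reward_model = TinyRewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# One human preference judgement: the rater preferred answer A over answer B.
# (Random token ids stand in for the two tokenised answers.)
chosen = torch.randint(0, 5000, (40,))
rejected = torch.randint(0, 5000, (35,))

r_chosen = reward_model(chosen)
r_rejected = reward_model(rejected)

# Pairwise loss: push the preferred answer's score above the rejected one's.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
optimizer.step()
```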
Now we put the reward model into the RL loop: the LLM generates answers, the reward model scores them in place of a live human, and the LLM keeps adjusting itself to earn higher predicted ratings.
This is Reinforcement Learning from Human Feedback (RLHF). It’s what allowed LLMs to jump from “interesting research toy” to “pretty decent assistant”.
Optimising for human ratings doesn’t just reward factual accuracy or clarity. People also tend to give higher scores to answers that agree with them, sound confident and avoid confrontation.
So models learn to validate the user’s point of view, soften disagreement and avoid blunt contradiction.
Try having an argument with your favourite LLM: it will keep validating your point, gently reframing and avoiding a hard “you’re wrong”. That isn’t an accident; it’s a consequence of the optimisation target: “what humans rate highly”, not “what is coldly, brutally true in all circumstances”.
From a distance, the story of modern LLMs looks like this: pre-training teaches them language, supervised fine-tuning teaches them to follow instructions and reinforcement learning from human feedback teaches them what kind of answers people actually want.
And importantly: every behaviour we like (or dislike) in these models traces back to objectives and rewards that humans chose.
At Passion Labs, our research philosophy starts from this reality:
Great AI systems aren’t magic. They’re the result of careful objectives, clear rewards and human values, built with heart, by humans.