Beyond Training: How Fine-Tuning Shapes Smarter Models

Research
Dr Nadine Kroher
Chief Scientific Officer

Webinar delivered 03/12/2025 by Dr. Nadine Kroher, Chief Scientific Officer at Passion Labs

Most people now know that Large Language Models (LLMs) like ChatGPT are trained by predicting the next word... but that’s only half the story. Next-word prediction gives you something that can continue a sentence. It does not automatically give you something that:

  • listens to instructions
  • avoids toxic or offensive language
  • sounds helpful, careful and human-friendly

That last part comes from fine-tuning. This webinar walked through how modern LLMs are fine-tuned, why reinforcement learning (RL) is suddenly fashionable again and what it really means when we say models are “trained with human feedback”.

Recap: What Pre-Training Actually Does

In pre-training, an LLM is shown enormous amounts of text and learns to guess the next token.

“NADINE NEEDS A …” → pizza? coffee? zebra?

The model assigns probabilities to all the words it knows, gets told which one was actually in the original text and nudges its internal parameters to be slightly less wrong next time. Repeat this trillions of times and you get a model that has absorbed:

  • syntax (how sentences are structured)
  • semantics (what words tend to mean in context)
  • contextual relationships (what tends to go with what)
  • some factual knowledge, though not always reliably
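
To make that step concrete, here is a minimal sketch in PyTorch (a toy vocabulary and a deliberately tiny stand-in for the model, not anything from the webinar itself): the model produces a score for every word it knows, cross-entropy compares those scores with the word that actually came next and the optimiser nudges the parameters to be slightly less wrong.

```python
import torch
import torch.nn as nn

# Toy setup: a tiny vocabulary and a deliberately simple "language model".
# Real LLMs use a Transformer over tens of thousands of tokens, but one
# pre-training step has exactly this shape.
vocab = ["NADINE", "NEEDS", "A", "pizza", "coffee", "zebra"]
vocab_size = len(vocab)

model = nn.Sequential(
    nn.Embedding(vocab_size, 16),   # turn token ids into vectors
    nn.Flatten(),                   # concatenate the context vectors
    nn.Linear(16 * 3, vocab_size),  # one score (logit) per vocabulary word
)
optimiser = torch.optim.SGD(model.parameters(), lr=0.1)

context = torch.tensor([[0, 1, 2]])    # "NADINE NEEDS A ..."
target = torch.tensor([4])             # the original text continued with "coffee"

logits = model(context)                              # scores for every word in the vocabulary
loss = nn.functional.cross_entropy(logits, target)   # how wrong was the guess?
loss.backward()                                      # compute the parameter nudges
optimiser.step()                                     # apply them; repeat trillions of times
```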

What you don’t get yet is a good assistant. A purely pre-trained model happily continues text but doesn’t really follow instructions. It doesn’t reason in a structured way and doesn’t know how polite, safe or useful it’s meant to be... that’s where fine-tuning comes in.


Two Stages of Fine-Tuning

In practice, modern LLMs go through two main fine-tuning stages after pre-training.

1: Supervised Fine-Tuning: Learning to Follow Instructions

First, models see lots of examples like:

Instruction: “Explain why the sky appears blue.”

Desired answer: “The sky appears blue because of a phenomenon called Rayleigh scattering…”

Here, we do know the target output. We compare what the model wrote to what we wanted and update its parameters accordingly. This:

  • teaches the model to treat inputs as tasks rather than random text
  • greatly improves its ability to follow instructions
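
As a rough sketch of how one instruction-answer pair becomes a supervised update (toy token ids and a placeholder model stand in for a real tokenizer and Transformer; masking the instruction tokens out of the loss is a common convention, not a universal rule):

```python
import torch
import torch.nn.functional as F

# Toy token ids standing in for a real tokenizer's output.
instruction_ids = torch.tensor([11, 42, 7, 3])      # "Explain why the sky appears blue."
answer_ids      = torch.tensor([88, 19, 5, 60, 2])  # "The sky appears blue because ..."

input_ids = torch.cat([instruction_ids, answer_ids]).unsqueeze(0)

# Labels: predict the next token at every position, but only count the loss on
# the answer part -- we want gradients for "answering", not for "reciting the
# instruction". -100 is the conventional "ignore this position" label.
labels = input_ids.clone()
labels[0, : len(instruction_ids)] = -100

def toy_lm(ids, vocab_size=128):
    # Placeholder for a real Transformer: one logit per vocabulary entry at
    # every position, with the shapes a causal language model would produce.
    return torch.randn(ids.shape[0], ids.shape[1], vocab_size, requires_grad=True)

logits = toy_lm(input_ids)

# Standard next-token shift: position t predicts token t+1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, logits.shape[-1]),
    labels[:, 1:].reshape(-1),
    ignore_index=-100,
)
loss.backward()  # in a real setup, an optimiser step would follow
```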

But the amount of high-quality instruction data is limited and we still have little control over style, tone and safety.

2: RL-Based Fine-Tuning: Making Models “Better Behaved”

The second stage uses reinforcement learning to nudge the model towards outputs that:

  • users find more helpful or clearer
  • are less toxic or harmful
  • align more closely with human preferences

This is the “beyond training” part: we’re no longer just asking “Is this answer correct?”, but “Is this the kind of answer we want?”

A Gentle Introduction to Reinforcement Learning

Reinforcement learning is a framework where a system learns by trial and error.

Think of teaching an autopilot to fly a plane in a simulator:

  • The agent (a neural network) controls the plane.
  • It chooses actions: pitch up, pitch down, more thrust, less thrust.
  • The environment (the simulator) applies physics and updates the flight state.
  • The agent observes states: speed, altitude, pitch, attitude.
  • A reward tells it how well it’s doing (e.g. high reward for stable flight, large penalty for crashing).

Over many simulated flights, the agent learns a rule, or what we call a 'policy': essentially a mapping from states to good actions. If the world is simple, that policy could be a set of explicit rules: “If pitch > X and speed > Y → pitch down.”

In real problems, the state and action spaces are huge and messy. So instead of explicit rules, we let a neural network approximate the policy: it takes the current state as input and outputs the best action it has learned so far. That’s deep reinforcement learning.
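
Here is a minimal sketch of that idea (the 'simulator' is just a stub and the update is a bare-bones REINFORCE-style rule, nothing like a production autopilot): a small network plays the role of the policy, actions are sampled by trial and error and actions taken during well-rewarded flights are made more likely.

```python
import torch
import torch.nn as nn

# The policy: a small network mapping a (normalised) flight state to action probabilities.
# State: [speed, altitude, pitch]; actions: pitch up, pitch down, more thrust, less thrust.
policy = nn.Sequential(nn.Linear(3, 32), nn.Tanh(), nn.Linear(32, 4))
optimiser = torch.optim.Adam(policy.parameters(), lr=1e-3)

def simulator_step(state, action):
    # Stub standing in for the simulator's physics: returns the next state and a
    # reward (positive for stable flight, a large penalty for losing control).
    next_state = state + 0.01 * torch.randn(3)
    reward = 1.0 if abs(next_state[2]) < 0.3 else -100.0
    return next_state, reward

state = torch.tensor([0.8, 0.5, 0.0])        # normalised speed, altitude, pitch
log_probs, rewards = [], []

for _ in range(50):                                 # one simulated "flight"
    probs = torch.softmax(policy(state), dim=-1)    # the policy: state -> action probabilities
    action = torch.multinomial(probs, 1).item()     # trial and error: sample an action
    state, reward = simulator_step(state, action)
    log_probs.append(torch.log(probs[action]))
    rewards.append(reward)

# REINFORCE-style update: make the actions taken during well-rewarded flights more likely.
optimiser.zero_grad()
loss = -sum(rewards) * torch.stack(log_probs).sum()
loss.backward()
optimiser.step()
```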

Framing LLM Fine-Tuning as an RL Problem

Now swap the plane for an LLM.

Imagine we already have a trained model. It’s decent, but we want it to:

  • be more polite
  • be safer
  • or simply align better with our idea of “good answers”

We don’t know the perfect answer beforehand. But we can get a score for whatever it writes.

For example:

Prompt: “Write a polite email to my boss…”

LLM output: “Hey, what’s up…”

Scoring model: gives it 2 out of 100 for politeness.

We can plug this into an RL setup:

  • States → the prompt plus everything the model has generated so far
  • Actions → the next token the model chooses to output
  • Policy → the LLM’s own parameters (its internal weights)
  • Environment → a reward model that examines the output and gives a score
  • Reward → that score, which we try to maximise

At a high level, we are telling the LLM: “For this kind of prompt, keep adjusting yourself until the scoring model likes your answers more.”
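
In pseudocode, one 'episode' of this setup might look like the sketch below. Every object and method here (sample_next_token, score and so on) is a hypothetical stand-in, not a real API; the point is just to make the mapping visible.

```python
# Purely illustrative: the LLM plays the role of the policy, each chosen token
# is an action, the state is the prompt plus everything generated so far, and
# the reward model hands back the score we will later try to maximise.

def generate_episode(llm, reward_model, prompt, max_tokens=200):
    state = prompt                                 # state: prompt + text generated so far
    chosen_tokens = []
    for _ in range(max_tokens):
        token = llm.sample_next_token(state)       # action: the next token the policy picks
        chosen_tokens.append(token)
        state = state + token                      # the state grows with every action
        if token == "<end_of_text>":
            break
    reward = reward_model.score(prompt, state)     # environment: score the finished output
    return chosen_tokens, reward                   # reward: the number we try to maximise
```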

RLHF: Reinforcement Learning with Human Feedback

Being polite is useful, but we want more than “not offensive”. We want answers that humans overall like: clear, honest, helpful and, yes, friendly.

A natural idea is to put humans in the loop:

  1. LLM generates an answer.
  2. A person rates it (thumbs up/down, or 1–10).
  3. We use that rating as a reward.

The problem: this doesn’t scale. You’d need millions of ratings, in real time, with updates happening constantly. That’s not realistic.

The Actual Solution: Train a Reward Model

Instead, modern systems do something smarter.

Step 1: Collect Human Ratings (Offline)

  • Show human annotators a lot of prompt + LLM response pairs.
  • Ask them to rate which answers are better, safer, clearer, etc.
  • Store all of these ratings in a dataset.

Step 2: Train a Reward Model

  • Train a separate ML model that takes in the prompt and response as input and tries to predict the human rating.
  • Because we have the real rating, we can compute an error and adjust the model, just like any supervised learning task.

Over time, this reward model becomes a reasonably good simulator of human judgement.
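
A minimal sketch of that supervised step (a toy scoring network and a stand-in text encoder; many real systems train on pairwise comparisons such as “answer A was preferred over answer B” rather than absolute ratings, but the shape of the update is the same):

```python
import torch
import torch.nn as nn

def embed(prompt, response, dim=64):
    # Stand-in for a real text encoder: in practice this would be a language
    # model's own representation of the prompt and response, not a random vector.
    return torch.randn(dim)

# Toy reward model: a small scoring head that maps the (prompt, response)
# representation to a single predicted rating.
reward_model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
optimiser = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# One stored annotation from Step 1: a prompt, the LLM's response and the
# human rating (here scaled to 0..1).
prompt = "Write a polite email to my boss..."
response = "Hey, what's up..."
human_rating = torch.tensor([0.02])       # 2 out of 100 for politeness

predicted = reward_model(embed(prompt, response))
loss = nn.functional.mse_loss(predicted, human_rating)   # error vs. the real rating
loss.backward()                                          # adjust, just like any supervised task
optimiser.step()
```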

Step 3: Use the Reward Model During Fine-Tuning

Now we put the reward model into the RL loop:

  1. The LLM generates a response to a prompt.
  2. The reward model predicts how a human would rate that response.
  3. We treat that prediction as a reward.
  4. We adjust the LLM to maximise that predicted reward.

This is Reinforcement Learning with Human Feedback (RLHF). It’s what allowed LLMs to jump from “interesting research toy” to “pretty decent assistant”.
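
Putting the pieces together, one update of that loop might look like the sketch below. Again, every object is a stand-in rather than a real library API, and production systems typically use PPO or a related algorithm; the sketch also includes the penalty such systems use to keep the fine-tuned model close to its supervised starting point.

```python
# Illustrative RLHF update: a plain REINFORCE-style rule with a drift penalty
# in place of the full PPO machinery real systems tend to use.

def rlhf_step(llm, reference_llm, reward_model, prompt, optimiser, kl_weight=0.1):
    # 1. The LLM (the policy) generates a response and keeps the
    #    log-probabilities of the tokens it chose.
    response, log_probs = llm.generate_with_log_probs(prompt)

    # 2. + 3. The reward model predicts how a human would rate the response,
    #         and that prediction is used as the reward.
    reward = reward_model.score(prompt, response)

    # Penalty for drifting too far from the original model: without it, pure
    # reward maximisation tends to find degenerate answers that "game" the scorer.
    drift = (log_probs - reference_llm.log_probs(prompt, response)).sum()

    # 4. Adjust the LLM so that highly rated (and not-too-drifted) responses
    #    become more likely.
    loss = -(reward - kl_weight * drift) * log_probs.sum()
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```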

The Side Effects: Why Models Are So Nice

Optimising for human ratings doesn’t just reward factual accuracy or clarity. People also tend to give higher scores to answers that:

  • sound friendly and supportive
  • agree with them
  • avoid confrontation

So models learn to:

  • be extremely diplomatic
  • avoid strong disagreement
  • steer away from toxic or divisive content
  • lean toward “safe” positions on contentious topics

Try having an argument with your favourite LLM: it will keep validating your point, gently reframing and avoiding a hard “you’re wrong”. That isn’t an accident; it’s a consequence of the optimisation target: “what humans rate highly”, not “what is coldly, brutally true in all circumstances”.

What You Should Take Away

From a distance, the story of modern LLMs looks like this:

  1. Pre-training teaches models to predict the next token and learn the structure of language.
  2. Supervised fine-tuning teaches them to follow instructions.
  3. RL-based fine-tuning, often via RLHF, teaches them to behave in ways humans like: safer, clearer, more polite.

And importantly:

  • Humans are not literally “inside” the training loop when you talk to a model.
  • But human preferences are baked in via the reward model that mimics human ratings.
  • Fine-tuning, not just pre-training, is what makes LLMs feel like usable tools instead of random text generators.

At Passion Labs, our research philosophy starts from this reality:

Great AI systems aren’t magic. They’re the result of careful objectives, clear rewards and human values, built with heart, by humans.
