Markov Decision Processes Explained: The Framework That Powers Reinforcement Learning

Most people encounter reinforcement learning through its headline acts: Google DeepMind’s AlphaGo defeating world champions, robots teaching themselves to walk, recommendation engines choosing your next Netflix binge. What rarely gets discussed is the mathematical scaffolding underneath all of these moments.
This mathematical scaffolding has a name: Markov Decision Processes (MDPs).
In a recent Passion Academy session, I walked the team through the foundational concepts of MDPs as a practical framework for thinking about how intelligent systems learn to make decisions. What emerged was a working philosophy that has implications far beyond robotics and games.
With reinforcement learning you are no longer programming behaviour. You are programming incentives.
Before MDPs, the dominant paradigm in computing was explicit instruction. If you wanted a machine to do something, you wrote the rules and factored in every contingency.
Reinforcement learning represents a fundamental break from that model. Instead of encoding rules, you define an environment, a set of actions, and a reward signal - then you let the agent figure out the rest. The machine learns a policy: a map from any given situation to the best available action.
This matters because real-world problems are often too complex, too dynamic, or too poorly understood to reduce to hard-coded rules. Chess has more possible games than atoms in the observable universe. A steak-cooking robot must balance competing signals such as temperature, time and customer preference in real time. MDPs provide the formal language for modelling all of these competing signals and considerations.
An MDP is defined by four components: a set of states the agent can occupy, a set of actions it can take, a reward function, and the transition dynamics that determine how actions move the agent between states.
In an MDP system, the agent moves through states, takes actions, receives rewards, and transitions to new states. The goal is to find the optimal policy: a sequence of decisions that maximises cumulative reward over time.
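The four components can be written down directly for a toy problem. The sketch below is purely illustrative - the states, probabilities and reward values are invented, not taken from any particular system:

```python
import random

# A toy MDP. All names and numbers here are illustrative.
states = ["start", "middle", "goal"]
actions = ["left", "right"]

# transitions[(state, action)] -> list of (next_state, probability)
transitions = {
    ("start", "right"): [("middle", 0.9), ("start", 0.1)],
    ("start", "left"): [("start", 1.0)],
    ("middle", "right"): [("goal", 0.8), ("start", 0.2)],
    ("middle", "left"): [("start", 1.0)],
}

# rewards[(state, action, next_state)] -> immediate reward
rewards = {("middle", "right", "goal"): 10.0}

def step(state, action):
    """One agent-environment interaction: sample a next state, return the reward."""
    outcomes = transitions[(state, action)]
    next_states, probs = zip(*outcomes)
    next_state = random.choices(next_states, weights=probs)[0]
    reward = rewards.get((state, action, next_state), 0.0)
    return next_state, reward
```

An agent's life in this environment is just repeated calls to `step`, with a policy choosing the action each time.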
The reward scheme doesn't just influence what the agent learns. It defines what the agent is.
Here is where most introductions to RL go wrong: they treat the reward as an afterthought. If you get the reward function wrong, your agent learns the wrong thing - often in ways that are technically impressive and practically useless.
Consider chess. Reward an agent only for winning, and it learns nothing about intermediate decisions and may stumble into good moves without understanding why. Reward it for capturing pieces, and you produce a tactically aggressive agent that sacrifices long-term position for short-term material gain.
A well-designed reward system often involves negative rewards too. In a maze problem, penalising every step encourages the agent to find the fastest route. The agent learns urgency because urgency was built into the incentive structure.
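As a sketch of that maze incentive structure (the grid size and goal cell are hypothetical), a step penalty makes shorter routes strictly more valuable:

```python
# Illustrative maze reward: every step costs -1, reaching the
# (hypothetical) goal cell pays +10. The step penalty is what makes
# shorter routes worth more than longer ones.
GOAL = (3, 3)

def maze_reward(next_state):
    return 10.0 if next_state == GOAL else -1.0

# A 6-step route to the goal beats a 10-step route:
short_route = [-1.0] * 5 + [10.0]  # 5 penalised steps, then the goal
long_route = [-1.0] * 9 + [10.0]   # 9 penalised steps, then the goal
assert sum(short_route) > sum(long_route)
```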
Continuous tasks introduce a problem: if the agent accumulates rewards indefinitely, the total can grow without bound and the maths breaks down. The solution is discounting - a parameter, γ (gamma), between 0 and 1. Each future reward is multiplied by γ raised to the power of how many steps away it is, shrinking the influence of distant rewards toward zero.
This encodes a genuine decision about agent behaviour. A low γ produces a myopic agent optimising for immediate reward - a day trader, or a sprinter. A high γ produces a patient agent weighing long-term consequences - a long-term investor, or a marathon runner pacing themselves.
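The day-trader-versus-investor contrast can be made concrete with a few lines of arithmetic. The reward streams below are invented for illustration:

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each scaled by gamma**t for its delay t."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

# Two reward streams: a quick small payoff vs. a delayed larger one.
impatient = [5.0, 0.0, 0.0, 0.0]  # 5 now
patient = [0.0, 0.0, 0.0, 8.0]    # 8 after three steps

# A myopic agent (low gamma) values the immediate reward more...
assert discounted_return(impatient, 0.5) > discounted_return(patient, 0.5)
# ...while a patient agent (high gamma) prefers the larger delayed one.
assert discounted_return(patient, 0.95) > discounted_return(impatient, 0.95)
```

The same agent facing the same rewards makes opposite choices depending only on γ.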
For an MDP to work cleanly, it must satisfy the Markov Property: the current state must contain all information necessary to make the optimal decision. History is irrelevant. Only the present moment matters.
Chess is Markovian - the board position tells you everything. Poker is not, because hidden information makes the observable state insufficient. The same applies to movie recommendations, which depend on watch history, not just the current session.
When a problem is not inherently Markovian, you can make it Markovian by expanding the state definition to include relevant history. It is an approximation, but a principled one.
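One common form of that state expansion is simply bundling the last k observations into the state. A minimal sketch, with an invented observation stream:

```python
from collections import deque

class HistoryState:
    """Wrap a non-Markovian observation stream into an (approximately)
    Markovian state by bundling the last k observations together."""

    def __init__(self, k):
        self.buffer = deque(maxlen=k)

    def observe(self, observation):
        self.buffer.append(observation)
        # The agent's "state" is now the whole recent window,
        # not just the latest observation.
        return tuple(self.buffer)

h = HistoryState(k=3)
h.observe("a")
h.observe("b")
state = h.observe("c")
# state == ("a", "b", "c"): the present plus the relevant history
```

Choosing k is part of the approximation: too small and the state is still insufficient, too large and the state space explodes.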
The Bellman equation defines the value of a state recursively: the immediate reward plus the discounted expected value of the next state. This connects every decision to every future consequence in a single tractable expression - but it also creates a rapidly branching tree of possibilities. Discounting is what keeps that tree manageable.
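Applying that recursion repeatedly until the values stop changing is value iteration. Here is a minimal sketch on a tiny two-state MDP invented for the purpose:

```python
# Value iteration, applying the Bellman equation:
#   V(s) = max_a [ sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s')) ]
# The two-state MDP below is invented purely for illustration.
P = {  # P[(state, action)] -> list of (next_state, probability, reward)
    ("s0", "stay"): [("s0", 1.0, 0.0)],
    ("s0", "go"): [("s1", 1.0, 1.0)],
    ("s1", "stay"): [("s1", 1.0, 2.0)],
    ("s1", "go"): [("s0", 1.0, 0.0)],
}
states = ["s0", "s1"]
gamma = 0.9

V = {s: 0.0 for s in states}
for _ in range(200):  # enough sweeps for approximate convergence
    V = {
        s: max(
            sum(p * (r + gamma * V[s2]) for s2, p, r in P[(s, a)])
            for a in ("stay", "go")
        )
        for s in states
    }
```

With γ = 0.9 the values converge geometrically, which is the discounting keeping the tree manageable in practice.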
MDPs vs. Reinforcement Learning - these terms are often conflated. MDPs define the problem: states, actions, rewards, transition dynamics. RL is the training methodology - Q-learning, policy gradients, actor-critic methods - used to find the optimal policy within that model. MDPs first. RL second. The model before the method.
If an agent's behaviour is entirely determined by its reward function, then the most consequential decisions in any RL system are made by the people who design that function. This is not a technical observation. It is a governance one.
In high-stakes domains - military systems, autonomous vehicles, critical infrastructure - the reward function is a policy document. One practical mitigation is action masking: hard-limiting the action space so the agent simply cannot take certain actions, regardless of what the reward signal might incentivise. In RL, as in most AI applications, the interesting problems are rarely purely technical.
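Action masking is simple to sketch: the agent still scores every action, but anything outside the permitted set is removed before selection, so no reward signal can make the agent choose it. The action names and values below are hypothetical:

```python
# Sketch of action masking. The agent's learned scores (q_values) are
# filtered against a hard allow-list before any selection happens.
def select_action(q_values, allowed_actions):
    """Pick the highest-scoring action among those permitted."""
    masked = {a: q for a, q in q_values.items() if a in allowed_actions}
    if not masked:
        raise ValueError("no permitted action available")
    return max(masked, key=masked.get)

q = {"advance": 5.0, "hold": 1.0, "fire": 9.0}
# Even though "fire" has the highest learned value, masking makes it
# impossible to select:
action = select_action(q, allowed_actions={"advance", "hold"})
# action == "advance"
```

The constraint lives outside the learned values entirely, which is exactly the point: it holds no matter what the reward function teaches the agent.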
Overall, MDPs provide a framework for modelling decision-making under uncertainty - one that is remarkably general, technically rigorous, and practically powerful.
The real skill lies in taking a messy, real-world problem and asking the right questions. What is the state space? What actions are available? What does success actually look like - and how do you encode that honestly? What is the cost of failure?
If you get those answers right, the mathematics will follow. If you get them wrong, you will have a highly optimised agent solving the wrong problem very efficiently.
Ultimately, MDPs are worth understanding - not just as a prerequisite for reinforcement learning, but as a discipline for thinking clearly about complex systems.