Reinforcement Learning: From Textbook to Real World

Dr. Fabio Rodríguez

Senior ML Engineer

Reinforcement learning is one of the most exciting and misunderstood areas of modern AI. Unlike supervised learning, where a model learns from labelled examples, RL takes a fundamentally different approach: it learns by doing. In a recent Passion Academy session, Dr. Fabio Rodríguez walked us through how RL actually works, where it's being applied today and what a real-world deployment looks like in practice.

Learning Through Experience

RL is built around a simple loop. A system observes its environment, takes an action designed to maximise long-term outcomes and receives feedback in the form of a reward or penalty. It adjusts its strategy and repeats.
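
The loop above can be sketched in a few lines of Python. This is a minimal illustration rather than anything shown in the session: the `Corridor` environment, the `train` function and all hyperparameters are hypothetical, and tabular Q-learning is just one common way to do the "adjust its strategy" step.

```python
import random


class Corridor:
    """Hypothetical toy environment: start in cell 0, the goal is the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 0 = left, 1 = right; the position is clamped to the corridor
        move = 1 if action == 1 else -1
        self.state = max(0, min(self.length - 1, self.state + move))
        done = self.state == self.length - 1
        reward = 1.0 if done else -0.01  # small cost per step, payoff at the goal
        return self.state, reward, done


def train(episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    env = Corridor()
    q = [[0.0, 0.0] for _ in range(env.length)]  # q[state][action]
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # observe, then act: explore occasionally, otherwise exploit
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            # receive feedback from the environment
            next_state, reward, done = env.step(action)
            # adjust the strategy (Q-learning update), then repeat
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = train()
policy = ["right" if right > left else "left" for left, right in q_table[:-1]]
print(policy)
```

Nobody tells the agent that "right" is correct. With these settings the printed policy should settle on moving right in every cell, purely from the rewards it collects.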

What makes this powerful is that the system doesn't need to be told the right answer in advance. It discovers what works through experience, optimising for long-term goals rather than short-term fixes. This makes it particularly suited to environments that are complex, dynamic or simply too large for humans to reason through manually.
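
One common way to make "long-term goals rather than short-term fixes" concrete is the discounted return, where a reward received k steps in the future is weighted by gamma to the power k. The sketch below assumes a discount factor of 0.9 and two made-up reward sequences; neither comes from the session.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards, with a reward k steps ahead weighted by gamma**k."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


quick_fix = [1.0, 0.0, 0.0, 0.0]  # small payoff straight away
patient = [0.0, 0.0, 0.0, 5.0]    # larger payoff three steps later

print(discounted_return(quick_fix))  # 1.0
print(discounted_return(patient))    # about 3.645
```

Even after discounting, the patient sequence scores higher, so an agent maximising this quantity would accept a delay for the bigger payoff.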

Where RL Shines

Dr. Rodríguez walked through four classic problem types that illustrate where RL outperforms traditional approaches.

Strategic decisions with near-infinite scenarios. TD-Gammon learned backgammon from 300,000 games of self-play and went on to rival the world's best human players, discovering opening strategies that human experts had historically overlooked.

Real-time logistics with unpredictable demand. An elevator dispatch system trained across the equivalent of 60,000 simulated hours of peak traffic significantly cut average wait times and extreme delays compared to legacy rule-based dispatchers.

Resource allocation under finite capacity. A telecom channel assignment agent trained on simulated call traffic blocked significantly fewer calls than the best existing dynamic heuristics, while remaining lightweight enough to run in real time.

Complex scheduling with clashing dependencies. Applied to NASA shuttle payload planning, an RL agent iteratively repaired broken schedules and consistently produced shorter project timelines than traditional simulated annealing methods.

Real-World Research

Two current papers show where RL is heading.

In aircraft maintenance, scheduling tasks across a live fleet means constantly adapting to faults, ground times and shifting time windows. An RL agent using Deep Q-Learning learned to either pick the best available slot or create a new one, with rewards tied to time slack and access cost. It demonstrated that complex, high-stakes scheduling problems can be handled adaptively when the reward signal is well designed.
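
To give a feel for what "rewards tied to time slack and access cost" could look like, here is a rough sketch. It is not the paper's formulation: the `Placement` fields, the weights and the overdue penalty are all assumptions made for illustration. The `td_target` function is the standard Q-learning target that a Deep Q-Learning network is trained to match.

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical action ids for the two choices described above
USE_EXISTING_SLOT = 0
CREATE_NEW_SLOT = 1


@dataclass
class Placement:
    slack_days: float   # days left before the task's due date (negative = overdue)
    access_cost: float  # assumed cost of gaining access to the aircraft for this slot


def slot_reward(p: Placement, slack_weight: float = 0.1,
                cost_weight: float = 1.0, overdue_penalty: float = 10.0) -> float:
    """Assumed shaping: penalise unused slack and access cost, punish overdue tasks."""
    if p.slack_days < 0:
        return -overdue_penalty
    return -(slack_weight * p.slack_days + cost_weight * p.access_cost)


def td_target(reward: float, next_q: Sequence[float],
              gamma: float = 0.99, done: bool = False) -> float:
    """Standard Q-learning target: reward plus discounted best next-state value."""
    return reward if done else reward + gamma * max(next_q)


r = slot_reward(Placement(slack_days=2.0, access_cost=1.0))
# next_q is indexed by action id: [USE_EXISTING_SLOT, CREATE_NEW_SLOT]
print(td_target(r, next_q=[0.5, -0.3]))
```

The point of the sketch is that the agent's behaviour follows from the weights. Change how slack and access cost are traded off and the learned scheduling policy changes with it, which is why reward design carries so much of the load.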

In agentic memory management, the question is how LLM agents should decide what to store, update or delete across long tasks. This research treats memory itself as an RL problem. The agent is rewarded for task success and memory efficiency, and penalised for poor recall. The key insight: memory becomes an active, learned policy rather than a passive store. As agents take on more complex multi-step tasks, this will matter a lot.
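
A reward of this shape might be sketched as follows. The action names mirror the store, update and delete choices above, but the function, its arguments and the weights are hypothetical and not taken from the research.

```python
from enum import Enum


class MemoryAction(Enum):
    """The action space described above: what the agent can do with a memory entry."""
    STORE = "store"
    UPDATE = "update"
    DELETE = "delete"


def memory_reward(task_succeeded, entries_used, capacity, recall_misses,
                  efficiency_weight=0.2, miss_penalty=0.5):
    """Assumed reward: task success plus a bonus for a lean memory, minus recall misses."""
    success_term = 1.0 if task_succeeded else 0.0
    efficiency_term = efficiency_weight * (1.0 - entries_used / capacity)
    recall_term = miss_penalty * recall_misses
    return success_term + efficiency_term - recall_term


lean = memory_reward(task_succeeded=True, entries_used=20, capacity=100, recall_misses=0)
bloated = memory_reward(task_succeeded=True, entries_used=90, capacity=100, recall_misses=2)
print(lean > bloated)
```

Both episodes succeed at the task, yet the agent that kept a small, reliable memory earns more. That difference is the signal a learned memory policy would be trained on.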

How Passion Labs Deploys RL

The session closed with Passion Labs' four-stage delivery model. Discovery maps the core operational bottleneck and defines the environment. Design builds a high-fidelity simulator so the agent can learn safely before touching anything real. Development compresses years of operational experience into days of training. Deployment integrates via API or web, with continuous optimisation as real data flows in.

The emphasis on simulation is deliberate. If the simulator is good enough, an agent can learn in days what would otherwise take years of live exposure.

The Broader Picture

Games, logistics, resource allocation, scheduling, maintenance, memory management. These aren't niche applications; they represent some of the most operationally significant challenges businesses face today.

RL won't replace human judgement, but in environments too large to enumerate and too dynamic for fixed rules, it offers something genuinely different: a system that gets better the more it operates. As tooling matures and simulation gets cheaper, expect RL to move from research curiosity to operational standard across far more industries than most currently anticipate.
