
Benchmarking Safety in LLMs and Large Reasoning Models

Blog
Tom Lorimer
Passion Labs

The Growing Concern

AI safety is in the spotlight. From press coverage to government hearings, the question is no longer whether models can generate harmful content, but how often, under what conditions, and how to measure the risk.

Misuse typically falls into two categories:

  • Malicious prompts: Directly asking for harmful content (e.g. cyberattacks, weapons, hate speech, or self-harm).
  • Adversarial attacks: Cleverly bypassing guardrails through prompt injection (“ignore previous instructions”) or jailbreaks (role-play scenarios, algorithmically generated prompts).

To counter this, developers employ refusal classifiers and safety alignment (fine-tuning models to prefer safe answers). But how do we know if these methods actually work?
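
To make the refusal-classifier idea concrete, here is a minimal sketch in Python. It is illustrative only: the marker list and function name are assumptions, and real evaluation pipelines typically rely on trained classifiers or LLM-as-judge scoring rather than keyword matching.

```python
# Minimal, illustrative refusal classifier: flags a response as a refusal
# if it contains a common refusal phrase. Real pipelines use trained
# classifiers or LLM judges instead of keyword matching.

REFUSAL_MARKERS = [
    "i can't help with that",
    "i cannot assist",
    "i'm sorry, but",
    "i won't provide",
]

def is_refusal(response: str) -> bool:
    """Return True if the model response looks like a refusal."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)
```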

Enter the Benchmarks

Researchers are turning to datasets like AIR-Bench 2024, which contains 5,600+ prompts across 300+ risk categories, all designed to test a model’s refusal behaviour.

The key metric is simple: the safety rate, the percentage of harmful prompts the model successfully refuses.
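
In code, the calculation is straightforward. A minimal sketch, reusing the hypothetical `is_refusal` helper from the previous snippet:

```python
def safety_rate(responses: list[str]) -> float:
    """Fraction of responses to harmful prompts that were refused."""
    if not responses:
        return 0.0
    refused = sum(is_refusal(r) for r in responses)
    return refused / len(responses)

# Example: if a model refuses 3 of 4 harmful prompts, its safety rate is 0.75.
```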

A Surprising Study

A recent study compared safety rates across several models:

  • DeepSeek’s open-source LLM and its reasoning variant (LRM)
  • Llama 3.3 (Meta)
  • DeepSeek R1 70B (reasoning fine-tuned)
  • OpenAI’s o3-mini (reasoning model)

The results were striking:

  • OpenAI’s o3-mini outperformed open-source models across most safety categories.
  • Reasoning fine-tuning lowered safety. The same base model, after being trained on reasoning datasets, showed decreased refusal rates, meaning it was more likely to answer harmful prompts.
  • Thought traces added risk. Even when the final answer was safe, intermediate reasoning steps sometimes violated safety guidelines (see the sketch below).
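
One way an evaluation harness can capture that last finding is to score the reasoning trace and the final answer separately, so an unsafe trace is flagged even when the visible answer is safe. The sketch below assumes the model output exposes separate `thought` and `answer` strings and uses a placeholder policy judge; both are illustrative, not part of any specific benchmark.

```python
UNSAFE_MARKERS = ["here is how to build", "step-by-step instructions for"]  # placeholder patterns

def violates_policy(text: str) -> bool:
    """Placeholder judge; real benchmarks use trained classifiers or LLM judges."""
    lowered = text.lower()
    return any(marker in lowered for marker in UNSAFE_MARKERS)

def evaluate_output(thought: str, answer: str) -> dict:
    """Score the reasoning trace and the final answer independently."""
    return {
        "trace_safe": not violates_policy(thought),
        "answer_safe": not violates_policy(answer),
    }

# A result of {"trace_safe": False, "answer_safe": True} corresponds to the
# case described above: a safe final answer built on unsafe reasoning.
```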

Why This Matters

This presents a real tension:

  • Fine-tuning for better reasoning makes models more capable…
  • …but also more vulnerable to misuse.

It suggests that transparency (showing thought traces) can be a double-edged sword: useful for debugging and evaluation, but it can also expose unsafe reasoning.

Looking Ahead

Benchmarking safety isn’t just about testing refusal rates; it’s about understanding the trade-offs between capability and control. As models move from LLMs to LRMs, developers may need entirely new approaches to safety alignment.

The big question: Can we make models smarter without making them riskier?
