Benchmarking Safety in LLMs and Large Reasoning Models

AI safety is in the spotlight. From press coverage to government hearings, the question is no longer whether models can generate harmful content; it's how often, under what conditions, and how to measure the risk.
Misuse typically falls into two categories:
To counter this, developers employ refusal classifiers (to detect and block harmful requests) and safety alignment (fine-tuning models to prefer safe answers). But how do we know whether these methods actually work?
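As an illustration, a refusal classifier can be as crude as a keyword heuristic over the model's reply, although production evaluations typically rely on a fine-tuned classifier or an LLM judge. The phrase list and function below are a minimal, hypothetical sketch, not any benchmark's actual classifier:

```python
# Minimal keyword-based refusal detector (a stand-in for a real refusal classifier).
REFUSAL_MARKERS = (
    "i can't help with",
    "i cannot assist",
    "i won't provide",
    "against my guidelines",
)

def is_refusal(response: str) -> bool:
    """Return True if the response looks like a refusal.

    Real harnesses usually swap this heuristic for a trained classifier or an
    LLM judge, since keyword matching misses paraphrased refusals and can flag
    benign mentions of these phrases.
    """
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)
```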
Researchers are turning to datasets like AIR-Bench 2024, which contains more than 5,600 prompts across over 300 risk categories, all designed to test a model's refusal behaviour.
The key metric is simple: the safety rate, the percentage of harmful prompts the model successfully refuses.
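In code, that metric is just the refusal count divided by the number of harmful prompts. A minimal sketch, assuming a caller-supplied `generate` function that queries the model and the `is_refusal` heuristic above (both placeholders rather than part of any specific benchmark harness):

```python
from typing import Callable, Iterable

def safety_rate(
    harmful_prompts: Iterable[str],
    generate: Callable[[str], str],
    is_refusal: Callable[[str], bool],
) -> float:
    """Fraction of harmful prompts the model refuses (higher is safer)."""
    prompts = list(harmful_prompts)
    if not prompts:
        raise ValueError("harmful_prompts must not be empty")
    refused = sum(1 for prompt in prompts if is_refusal(generate(prompt)))
    return refused / len(prompts)

# Example usage with a stubbed model call:
# rate = safety_rate(benchmark_prompts, my_model_api, is_refusal)
# print(f"Safety rate: {rate:.1%}")
```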
A recent study compared safety rates across several standard LLMs and large reasoning models, and the results were striking:
This presents a real tension: transparency (showing thought traces) can be a double-edged sword, useful for debugging and evaluation, but also a way of exposing unsafe reasoning.
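One practical consequence for evaluation: when a model exposes a reasoning trace, the trace and the final answer can be scored separately, since a harmless-looking refusal in the answer does not guarantee the intermediate reasoning was safe. The structure and `is_unsafe` checker below are assumptions for illustration, not any particular model's API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ReasoningOutput:
    trace: str   # the model's visible chain of thought (assumed field)
    answer: str  # the final response shown to the user (assumed field)

def classify_output(out: ReasoningOutput, is_unsafe: Callable[[str], bool]) -> str:
    """Label where, if anywhere, unsafe content appears in an LRM output."""
    unsafe_trace = is_unsafe(out.trace)
    unsafe_answer = is_unsafe(out.answer)
    if unsafe_answer:
        return "unsafe_answer"
    if unsafe_trace:
        return "unsafe_trace_only"  # the answer refuses, but the trace leaks unsafe reasoning
    return "safe"
```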
Benchmarking safety isn't just about testing refusal rates; it's about understanding the trade-offs between capability and control. As models move from LLMs to LRMs, developers may need entirely new approaches to safety alignment.
The big question: Can we make models smarter without making them riskier?