When Embeddings Hit Their Limits

Text embeddings sit quietly behind almost every AI system you use, from chatbots and search tools to recommendation engines. They’re what allow systems to “understand” similarity: when two pieces of text mean roughly the same thing, their embeddings sit close together in a mathematical space.
A new paper from DeepMind takes a closer look at the limits of that idea. It shows that, under certain conditions, even very large embeddings can’t perfectly capture all possible relationships between queries and documents, and it backs that claim up with a clever experiment.
So, what does this mean for people building retrieval-augmented generation (RAG) systems every day?
When you ask a question, say, “How do I reset my company email?”, the system converts that question into a vector, a kind of numerical fingerprint.
It does the same for all documents in its knowledge base. Then it finds which vectors are closest together; those are assumed to be the most relevant results.
That’s the basic mechanism behind everything from internal search tools to support chatbots.
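Here’s a minimal sketch of that mechanism. The `embed` function below is only a stand-in for whatever embedding model a real system would call; with this dummy the scores are meaningless, but the shape of the pipeline is the same.

```python
# Minimal sketch of embedding-based retrieval (illustrative only).
# `embed` is a placeholder: it returns a deterministic-per-run random vector,
# so the ranking is meaningless until a real embedding model is plugged in.
import numpy as np

def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)          # a real model would return a learned vector
    return v / np.linalg.norm(v)      # unit-normalize so dot product = cosine similarity

docs = [
    "To reset your company email, open the account portal and choose 'Reset password'.",
    "The cafeteria is open from 8am to 3pm on weekdays.",
    "VPN access requires a ticket to the IT helpdesk.",
]
doc_vecs = np.stack([embed(d) for d in docs])   # one vector per document

query_vec = embed("How do I reset my company email?")
scores = doc_vecs @ query_vec                   # cosine similarity to each document
best = int(np.argmax(scores))                   # closest vector = assumed most relevant
print(docs[best])
```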
The DeepMind team started by asking a theoretical question:
Are there cases where no fixed-size embedding space can represent all the right relationships between questions and answers?
Their answer was yes, there are always some patterns that can’t be captured, no matter how many dimensions the embedding has.
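Roughly, and paraphrasing the flavor of the argument rather than the paper’s exact theorem, you can write the “right relationships” as a binary relevance matrix that d-dimensional inner products have to reproduce:

```latex
% My paraphrase of the flavor of the argument, not the paper's exact statement.
% Queries get embeddings u_i and documents get embeddings v_j, both in R^d.
\[
  A \in \{0,1\}^{m \times n}, \qquad
  A_{ij} = 1 \iff \text{document } j \text{ is a correct answer to query } i .
\]
% Retrieval gets every query exactly right only if each query can separate its
% relevant documents from the rest by score, i.e. there are thresholds \tau_i with
\[
  \mathrm{sign}\!\left(u_i^{\top} v_j - \tau_i\right) = 2A_{ij} - 1
  \quad \text{for all } i, j .
\]
% Results of this kind tie the smallest workable d to the sign rank of 2A - 1,
% and there are relevance patterns whose sign rank grows with the number of
% documents, so no fixed embedding dimension can capture them all.
```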
To test this in practice, they built what they call a “limit dataset.”
It’s intentionally simple:
Each question has exactly two correct answers.
The challenge: can an embedding model place those two correct documents close to the question and everything else far away?
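To get a feel for why this is hard, here is a small toy version of the setup (my own sketch in PyTorch, not the paper’s code or data): a handful of documents, one query for every pair of documents, and query/document vectors optimized directly, with no language model involved. Whatever fraction of queries stays unsolved at a given dimension is then a limit of the vector space itself, not of any particular embedding model.

```python
# Toy version of the "two correct answers" setup (my own sketch, not the paper's code).
# Query and document vectors are optimized directly ("free embeddings"), so any
# failure reflects the capacity of a d-dimensional space, not a particular model.
import itertools
import torch

def fraction_solved(n_docs: int, dim: int, steps: int = 2000, seed: int = 0) -> float:
    torch.manual_seed(seed)
    pairs = list(itertools.combinations(range(n_docs), 2))   # one query per pair of docs
    relevant = torch.tensor(pairs)                            # (n_queries, 2)
    docs = torch.randn(n_docs, dim, requires_grad=True)
    queries = torch.randn(len(pairs), dim, requires_grad=True)

    # Boolean mask that is True at the irrelevant documents of each query.
    irrelevant = torch.ones(len(pairs), n_docs, dtype=torch.bool)
    irrelevant[torch.arange(len(pairs)).unsqueeze(1), relevant] = False

    opt = torch.optim.Adam([docs, queries], lr=0.05)
    for _ in range(steps):
        scores = queries @ docs.T                             # (n_queries, n_docs)
        pos = scores.gather(1, relevant)                      # scores of the 2 correct docs
        neg = scores.masked_fill(~irrelevant, float("-inf")).max(dim=1).values
        # Hinge loss: each correct document should beat the best incorrect one by a margin.
        loss = torch.relu(1.0 + neg.unsqueeze(1) - pos).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():
        top2 = (queries @ docs.T).topk(2, dim=1).indices
        solved = (top2.sort(dim=1).values == relevant.sort(dim=1).values).all(dim=1)
        return solved.float().mean().item()

for d in (2, 4, 16):
    print(f"dim={d}: fraction of queries with the correct top-2: {fraction_solved(12, d):.2f}")
```

The paper’s claim is stronger than anything this toy can show, but the exercise makes the question concrete: below some dimension, no arrangement of vectors solves every query, no matter how freely you are allowed to choose them.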
Surprisingly, today’s high-performing neural embedding models (the kind used across modern AI systems) struggled with this setup.
It’s a fascinating result because it demonstrates, with both theory and experiment, that embeddings have structural limits.
Some relationships between questions and answers are just too constrained to fit neatly into a single-vector space.
Does any of this matter for most real-world retrieval systems? Usually not.
Typical applications (customer support, document search, chat assistants) don’t look like this highly structured “two-correct-answers” puzzle.
In practice, embedding models work well across messy, overlapping human language.
When retrieval fails, it’s usually due to query phrasing, chunking, or missing data, not the geometry of the embedding space itself.
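As a concrete example of one of those mundane failure modes, a fixed-size chunker (hypothetical settings below) can split the sentence that answers a question across two chunks, so neither chunk’s embedding ends up close to the query:

```python
# Illustration of a common, non-geometric failure mode: naive fixed-size chunking
# can cut the answering sentence in half, leaving no single chunk that matches the query.
def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

doc = ("Password resets are handled in the account portal. "
       "For company email specifically, choose 'Email' then 'Reset password'.")
for i, c in enumerate(chunk(doc, size=60, overlap=10)):
    print(i, repr(c))   # note how the email-reset instruction is split across chunks
```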
This research is important because it reminds us that embedding-based retrieval isn’t magic; it’s an approximation.
It works beautifully most of the time, but not for every possible structure of text relationships.
It also highlights why alternative retrieval designs (multi-vector models, sparse methods such as BM25, and rerankers among them) are worth exploring alongside single-vector embeddings.
AI progress often comes from practical experiments, from what works and what doesn’t, rather than from theoretical proofs. Even so, studies like this one from DeepMind are essential. They help us understand the boundaries of the tools we rely on and remind us where creativity in model design still matters.
For most builders, the message isn’t to abandon embeddings; it’s to know when they might need help.
And for researchers, it’s another step towards grounding our everyday tools in solid theory.
Reference
Weller, Orion, et al. “On the theoretical limitations of embedding-based retrieval.” arXiv preprint arXiv:2508.21038 (2025).