SLAP: A New Architecture for Joint Audio-Text Embeddings

You may be familiar with multimodal embedding spaces: systems that map different kinds of data (images, video, audio, text) into a shared high-dimensional space.
The idea is simple but powerful:
If trained properly, embeddings for related pairs (e.g. a song and its caption) sit close together, while unrelated ones are far apart.
This enables tasks such as cross-modal retrieval (finding a track from a text description) and zero-shot tagging.
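To make that concrete, here is a tiny sketch of text-to-audio retrieval in a shared space; the embeddings are random stand-ins for what trained audio and text encoders would produce.

```python
import torch
import torch.nn.functional as F

# Stand-in embeddings: 4 audio clips and 4 captions, already mapped into the
# same 512-dimensional shared space by (hypothetical) trained encoders.
audio_emb = F.normalize(torch.randn(4, 512), dim=-1)
text_emb = F.normalize(torch.randn(4, 512), dim=-1)

# Text-to-audio retrieval: rank clips by cosine similarity to one caption.
query = text_emb[0]                 # e.g. the caption "a mellow late-night piano ballad"
scores = audio_emb @ query          # unit vectors, so the dot product is cosine similarity
best_clip = scores.argmax().item()  # index of the closest audio clip
```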
OpenAI’s CLIP pioneered this approach for images, and CLAP extended it to audio. Both rely on contrastive learning with positive (matching) and negative (non-matching) pairs.
Contrastive methods like CLIP/CLAP work well, but they have drawbacks: they depend on large batches to supply enough informative negative pairs, and they tend to leave a "modality gap", with audio embeddings and text embeddings clustering in separate regions rather than genuinely mixing.
This means the shared space isn't leveraged as efficiently as it could be.
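For reference, here is a minimal sketch of the CLIP-style symmetric contrastive objective. It makes the batch-size dependence concrete: every other item in the batch acts as a negative, so small batches mean few, often uninformative negatives.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (audio, text) pairs.

    Row i of each tensor is a matched pair; every *other* row acts as a negative,
    which is why contrastive training leans on large batch sizes.
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(audio_emb.size(0))         # matched pairs sit on the diagonal
    loss_a2t = F.cross_entropy(logits, targets)       # audio -> text direction
    loss_t2a = F.cross_entropy(logits.t(), targets)   # text -> audio direction
    return (loss_a2t + loss_t2a) / 2
```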
SLAP (Siamese Language-Audio Pretraining) tackles these issues by eliminating the need for negative pairs.
Why is that tricky? Without negatives, nothing stops both encoders from collapsing to a trivial solution that maps every input to the same point, making every pair "close" for free.
SLAP borrows an idea from image representation learning called Bootstrap Your Own Latent (BYOL).
For both the audio and text branches, SLAP introduces a target network: a slowly updated copy of the encoder whose weights are an exponential moving average (EMA) of the online encoder's weights. As in BYOL, the online network is trained (through a small predictor head) to predict the target network's embedding, and gradients never flow through the target (a stop-gradient).
This asymmetry between the online and target branches keeps the embeddings from collapsing, without requiring any negative pairs.
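Below is a hedged PyTorch sketch of these BYOL-style mechanics: an online encoder with a predictor head, an EMA-updated target copy, and a stop-gradient on the target side. The class and function names are mine, not the paper's.

```python
import copy
import torch
import torch.nn.functional as F

class ByolStyleBranch(torch.nn.Module):
    """Illustrative online/target pair for one modality (names are mine, not the paper's)."""

    def __init__(self, encoder: torch.nn.Module, dim: int = 512, tau: float = 0.996):
        super().__init__()
        self.online = encoder                      # trained by gradient descent
        self.predictor = torch.nn.Sequential(      # small head on the online side only
            torch.nn.Linear(dim, dim), torch.nn.ReLU(), torch.nn.Linear(dim, dim)
        )
        self.target = copy.deepcopy(encoder)       # EMA copy, never trained directly
        for p in self.target.parameters():
            p.requires_grad = False
        self.tau = tau

    @torch.no_grad()
    def update_target(self):
        # EMA update: the target slowly tracks the online weights.
        for p_online, p_target in zip(self.online.parameters(), self.target.parameters()):
            p_target.mul_(self.tau).add_(p_online, alpha=1 - self.tau)


def prediction_loss(online_branch, target_branch, x_online, x_target):
    """The online branch predicts the target branch's embedding; no negative pairs involved."""
    pred = online_branch.predictor(online_branch.online(x_online))
    with torch.no_grad():                          # stop-gradient through the target
        tgt = target_branch.target(x_target)
    pred, tgt = F.normalize(pred, dim=-1), F.normalize(tgt, dim=-1)
    return 2 - 2 * (pred * tgt).sum(dim=-1).mean() # BYOL-style cosine prediction loss
```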
SLAP optimises a combination of two losses, and the balance between them is crucial: too much weight on either side leads to trivial or unstable solutions.
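Purely as an illustration of how such a balance might look in code, here is a continuation of the sketch above. The split into cross-modal and within-modality prediction terms, the toy encoders, and the λ weight are my assumptions for illustration, not necessarily the paper's exact formulation.

```python
# Hypothetical toy encoders standing in for real audio / text backbones.
audio_branch = ByolStyleBranch(torch.nn.Linear(128, 512))
text_branch = ByolStyleBranch(torch.nn.Linear(768, 512))
audio_batch = torch.randn(8, 128)   # stand-in for audio features
text_batch = torch.randn(8, 768)    # stand-in for text features

lambda_ = 0.5  # hypothetical balancing weight between the two loss types

# Cross-modal terms: each modality's online network predicts the *other*
# modality's target embedding for the matched pair.
cross_modal = (
    prediction_loss(audio_branch, text_branch, audio_batch, text_batch)
    + prediction_loss(text_branch, audio_branch, text_batch, audio_batch)
)

# Within-modality terms: each online network also predicts its own target
# (in practice this would typically use two augmented views of the input).
within_modality = (
    prediction_loss(audio_branch, audio_branch, audio_batch, audio_batch)
    + prediction_loss(text_branch, text_branch, text_batch, text_batch)
)

total_loss = lambda_ * cross_modal + (1 - lambda_) * within_modality
total_loss.backward()
audio_branch.update_target()  # EMA step after each optimisation step
text_branch.update_target()
```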
The paper reports compelling empirical improvements over contrastive baselines.
In short, SLAP learns a more genuinely multimodal representation while sidestepping the inefficiencies of contrastive learning.
SLAP is an important step toward more effective multimodal systems, particularly for music understanding and generation. By fixing structural issues in the embedding space, it opens the door to better music search and tagging, and to richer text conditioning for music generation models.
It’s a reminder that even in a field dominated by scale, architectural innovations still matter.
Reference
Guinot, Julien, et al. “SLAP: Siamese Language-Audio Pretraining Without Negative Samples for Music Understanding.” arXiv preprint arXiv:2506.17815 (2025).