
SLAP: A New Architecture for Joint Audio-Text Embeddings

Research
Dr Nadine Kroher
Chief Scientific Officer

The Context

You may be familiar with multimodal embedding spaces: systems that map different kinds of data (images, video, audio, text) into a shared high-dimensional space.

The idea is simple but powerful:

  • An audio track goes through an audio encoder.
  • A textual description goes through a text encoder.
  • Both outputs are embedding vectors.

If trained properly, embeddings for related pairs (e.g. a song and its caption) sit close together, while unrelated ones are far apart.
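
To make that concrete, here is a minimal sketch of the dual-encoder setup in PyTorch. The toy encoders, feature dimensions, and embedding size are illustrative stand-ins, not the architecture from any particular paper:

```python
import torch
import torch.nn.functional as F

EMBED_DIM = 256  # dimensionality of the shared space (illustrative)

# Stand-ins for the real networks: in practice these are large audio and
# text models with a projection head mapping into the same space.
audio_encoder = torch.nn.Sequential(
    torch.nn.Linear(128, 512), torch.nn.ReLU(), torch.nn.Linear(512, EMBED_DIM)
)
text_encoder = torch.nn.Sequential(
    torch.nn.Linear(768, 512), torch.nn.ReLU(), torch.nn.Linear(512, EMBED_DIM)
)

audio_features = torch.randn(1, 128)  # e.g. pooled spectrogram features
text_features = torch.randn(1, 768)   # e.g. pooled caption features

# L2-normalising both outputs makes cosine similarity the natural measure
# of how close a track and a caption sit in the shared space.
a = F.normalize(audio_encoder(audio_features), dim=-1)
t = F.normalize(text_encoder(text_features), dim=-1)

similarity = (a * t).sum(dim=-1)  # near 1 for a well-matched pair after training
```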

This enables:

  • Retrieval: type “upbeat jazz saxophone” and retrieve matching tracks.
  • Recommendation: cluster songs with similar embeddings.
  • Music captioning: generate descriptions based on nearest-neighbour captions.
  • Conditional generation: guide music creation with text prompts.
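
Retrieval, for example, then reduces to a nearest-neighbour search over precomputed audio embeddings. A minimal sketch, assuming the catalogue rows and the query are already L2-normalised so the dot product equals cosine similarity:

```python
import torch
import torch.nn.functional as F

def retrieve(query_embedding: torch.Tensor,
             catalogue_embeddings: torch.Tensor,
             top_k: int = 5) -> torch.Tensor:
    """Return the indices of the top-k catalogue tracks for a text query."""
    scores = catalogue_embeddings @ query_embedding  # (num_tracks,) cosine scores
    return scores.topk(top_k).indices

# Toy usage: 10,000 tracks embedded into a 256-dim shared space.
catalogue = F.normalize(torch.randn(10_000, 256), dim=-1)
query = F.normalize(torch.randn(256), dim=-1)  # e.g. the embedded text "upbeat jazz saxophone"
print(retrieve(query, catalogue))
```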

OpenAI’s CLIP pioneered this approach for images, and CLAP extended it to audio. Both rely on contrastive learning with positive (matching) and negative (non-matching) pairs.
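
In CLIP-style training, every caption in a batch is the positive for its own track and a negative for every other track. A minimal sketch of the symmetric contrastive (InfoNCE) loss, with an illustrative fixed temperature:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(audio_emb: torch.Tensor,
                    text_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matching (audio, text) pairs.

    Row i of `audio_emb` and row i of `text_emb` form a positive pair;
    every other row in the batch serves as a negative.
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.T / temperature                  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)  # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```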

The Limitations

Contrastive methods like CLIP/CLAP work well, but they have drawbacks:

  • Large batch sizes needed: contrastive losses rely on many in-batch negatives, so small batches degrade performance while large ones demand heavy compute.
  • Modality gap: audio and text embeddings tend to form separate clusters rather than truly mixing, leading to strange “cone-like” structures in the embedding space.

This means the shared space isn’t as efficiently leveraged as it could be.
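
One rough way to quantify the modality gap (a common diagnostic, not something specific to the SLAP paper) is the distance between the centroids of the two modalities' embeddings:

```python
import torch

def modality_gap(audio_emb: torch.Tensor, text_emb: torch.Tensor) -> float:
    """Euclidean distance between the audio and text embedding centroids.

    Assumes both inputs are already L2-normalised (rows on the unit sphere).
    A well-mixed space has a small gap; contrastively trained models often
    show a clearly non-zero one.
    """
    return (audio_emb.mean(dim=0) - text_emb.mean(dim=0)).norm().item()
```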

Enter SLAP

SLAP (Siamese Language-Audio Pretraining) tackles these issues by eliminating the need for negative pairs.

Why is that tricky?

  • If you only minimise the distance between matching pairs, the network can collapse into a trivial solution: always outputting the same embedding vector, regardless of input.
  • Contrastive losses avoid this by comparing to negatives, but that introduces the modality gap and scaling issues.

SLAP borrows an idea from image representation learning called Bootstrap Your Own Latent (BYOL).

How It Works

For both the audio and text branches, SLAP introduces a target network:

  • Initially random, but updated over time as an exponential moving average of the encoder network’s weights.
  • Effectively, the target network is a “lagging version” of the encoder.

This keeps the embeddings from collapsing while avoiding the need for negative pairs.
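
The target-network update itself is just an exponential moving average over parameters. A minimal sketch, with an illustrative decay value rather than the one used in the paper:

```python
import copy
import torch

@torch.no_grad()
def ema_update(target: torch.nn.Module, online: torch.nn.Module, decay: float = 0.996) -> None:
    """Nudge each target parameter toward the online encoder:
    target <- decay * target + (1 - decay) * online.
    The target network is never updated by gradient descent."""
    for t_param, o_param in zip(target.parameters(), online.parameters()):
        t_param.mul_(decay).add_(o_param, alpha=1.0 - decay)

# Toy usage: the target starts as a copy of the (randomly initialised) online
# encoder and then lags behind it as training proceeds.
online_encoder = torch.nn.Linear(128, 256)
target_encoder = copy.deepcopy(online_encoder)
ema_update(target_encoder, online_encoder)  # called after every optimiser step
```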

SLAP optimises with two types of loss:

  • Intermodal loss: audio vs. text across target/encoder pairs.
  • Intramodal loss: each branch compared to its own target.

A balance between the two is crucial: too much weight on either side leads to trivial or unstable solutions.
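
Putting the pieces together, a training objective along these lines could look like the sketch below. It rests on BYOL-style assumptions (negative cosine similarity as the per-pair loss, target outputs detached from the gradient, a single weighting factor between the two terms; in BYOL the online outputs would additionally pass through a small predictor head, omitted here for brevity) and is an illustration rather than the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def pair_loss(prediction: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """BYOL-style term: negative cosine similarity, no negative pairs needed."""
    prediction = F.normalize(prediction, dim=-1)
    target = F.normalize(target, dim=-1)
    return -(prediction * target).sum(dim=-1).mean()

def slap_style_loss(audio_pred: torch.Tensor, text_pred: torch.Tensor,
                    audio_target: torch.Tensor, text_target: torch.Tensor,
                    inter_weight: float = 0.5) -> torch.Tensor:
    """Weighted sum of intermodal and intramodal terms.

    `audio_pred` / `text_pred` come from the online encoders; `audio_target`
    / `text_target` come from the EMA target networks and are detached so
    no gradients flow through them.
    """
    audio_target = audio_target.detach()
    text_target = text_target.detach()

    # Intermodal: each online branch predicts the *other* modality's target.
    inter = pair_loss(audio_pred, text_target) + pair_loss(text_pred, audio_target)
    # Intramodal: each online branch also predicts its own modality's target.
    intra = pair_loss(audio_pred, audio_target) + pair_loss(text_pred, text_target)

    return inter_weight * inter + (1.0 - inter_weight) * intra
```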

Results

The paper shows compelling improvements:

  • Embedding space: SLAP reduces the modality gap, with audio and text embeddings more naturally interwoven.
  • Downstream tasks: genre classification, instrument recognition, tagging — SLAP is competitive with or outperforms CLAP and MusiCLaP (a strong baseline).
  • Retrieval tasks: consistently strong, often better than contrastive approaches.

In short, SLAP learns a more genuinely multimodal representation while sidestepping the inefficiencies of contrastive learning.

Why It Matters

SLAP is an important step toward more effective multimodal systems, particularly for music understanding and generation. By fixing structural issues in the embedding space, it opens the door to:

  • More accurate search and retrieval.
  • Richer recommendation and tagging.
  • Stronger foundations for generative models conditioned on text.

It’s a reminder that even in a field dominated by scale, architectural innovations still matter.

Reference

Guinot, Julien, et al. “SLAP: Siamese Language-Audio Pretraining Without Negative Samples for Music Understanding.” arXiv preprint arXiv:2506.17815 (2025).
