
SLAP: A New Architecture for Joint Audio-Text Embeddings

Research
Dr Nadine Kroher
Chief Scientific Officer

The Context

You may be familiar with multimodal embedding spaces: systems that map different kinds of data (images, video, audio, text) into a shared high-dimensional space.

The idea is simple but powerful:

  • An audio track goes through an audio encoder.
  • A textual description goes through a text encoder.
  • Both outputs are embedding vectors.

If trained properly, embeddings for related pairs (e.g. a song and its caption) sit close together, while unrelated ones are far apart.
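
To make that concrete, here is a minimal sketch of the dual-encoder setup in PyTorch. The toy encoders, feature dimensions, and embedding size are illustrative stand-ins, not the architecture from any particular paper:

```python
import torch
import torch.nn.functional as F

EMBED_DIM = 256  # dimensionality of the shared space (illustrative)

# Stand-ins for the real networks: in practice these are large audio and
# text models with a projection head mapping into the same space.
audio_encoder = torch.nn.Sequential(
    torch.nn.Linear(128, 512), torch.nn.ReLU(), torch.nn.Linear(512, EMBED_DIM)
)
text_encoder = torch.nn.Sequential(
    torch.nn.Linear(768, 512), torch.nn.ReLU(), torch.nn.Linear(512, EMBED_DIM)
)

audio_features = torch.randn(1, 128)  # e.g. pooled spectrogram features
text_features = torch.randn(1, 768)   # e.g. pooled caption features

# L2-normalising both outputs makes cosine similarity the natural measure
# of how close a track and a caption sit in the shared space.
a = F.normalize(audio_encoder(audio_features), dim=-1)
t = F.normalize(text_encoder(text_features), dim=-1)

similarity = (a * t).sum(dim=-1)  # near 1 for a well-matched pair after training
```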

This enables:

  • Retrieval: type “upbeat jazz saxophone” and retrieve matching tracks.
  • Recommendation: cluster songs with similar embeddings.
  • Music captioning: generate descriptions based on nearest-neighbour captions.
  • Conditional generation: guide music creation with text prompts.
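
Retrieval, for example, then reduces to a nearest-neighbour search over precomputed audio embeddings. A minimal sketch, assuming the catalogue rows and the query are already L2-normalised so the dot product equals cosine similarity:

```python
import torch
import torch.nn.functional as F

def retrieve(query_embedding: torch.Tensor,
             catalogue_embeddings: torch.Tensor,
             top_k: int = 5) -> torch.Tensor:
    """Return the indices of the top-k catalogue tracks for a text query."""
    scores = catalogue_embeddings @ query_embedding  # (num_tracks,) cosine scores
    return scores.topk(top_k).indices

# Toy usage: 10,000 tracks embedded into a 256-dim shared space.
catalogue = F.normalize(torch.randn(10_000, 256), dim=-1)
query = F.normalize(torch.randn(256), dim=-1)  # e.g. the embedded text "upbeat jazz saxophone"
print(retrieve(query, catalogue))
```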

OpenAI’s CLIP pioneered this approach for images, and CLAP extended it to audio. Both rely on contrastive learning with positive (matching) and negative (non-matching) pairs.
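
In CLIP-style training, every caption in a batch is the positive for its own track and a negative for every other track. A minimal sketch of the symmetric contrastive (InfoNCE) loss, with an illustrative fixed temperature:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(audio_emb: torch.Tensor,
                    text_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matching (audio, text) pairs.

    Row i of `audio_emb` and row i of `text_emb` form a positive pair;
    every other row in the batch serves as a negative.
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.T / temperature                  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)  # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```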

The Limitations

Contrastive methods like CLIP/CLAP work well, but they have drawbacks:

  • Large batch sizes needed: contrastive losses rely on many in-batch negatives, so small batches degrade performance while large ones demand heavy compute.
  • Modality gap: audio and text embeddings tend to form separate clusters rather than truly mixing, leading to strange “cone-like” structures in the embedding space.

This means the shared space isn’t as efficiently leveraged as it could be.
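
One rough way to quantify the modality gap (a common diagnostic, not something specific to the SLAP paper) is the distance between the centroids of the two modalities' embeddings:

```python
import torch

def modality_gap(audio_emb: torch.Tensor, text_emb: torch.Tensor) -> float:
    """Euclidean distance between the audio and text embedding centroids.

    Assumes both inputs are already L2-normalised (rows on the unit sphere).
    A well-mixed space has a small gap; contrastively trained models often
    show a clearly non-zero one.
    """
    return (audio_emb.mean(dim=0) - text_emb.mean(dim=0)).norm().item()
```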

Enter SLAP

SLAP (Siamese Language-Audio Pretraining) tackles these issues by eliminating the need for negative pairs.

Why is that tricky?

  • If you only minimise the distance between matching pairs, the network can collapse into a trivial solution: always outputting the same embedding vector, regardless of input.
  • Contrastive losses avoid this by comparing to negatives, but that introduces the modality gap and scaling issues.

SLAP borrows an idea from image representation learning called Bootstrap Your Own Latent (BYOL).

How It Works

For both the audio and text branches, SLAP introduces a target network:

  • Initially random, but updated over time as an exponential moving average of the encoder network’s weights.
  • Effectively, the target network is a “lagging version” of the encoder.

This keeps the embeddings from collapsing while avoiding the need for negative pairs.
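
The target-network update itself is just an exponential moving average over parameters. A minimal sketch, with an illustrative decay value rather than the one used in the paper:

```python
import copy
import torch

@torch.no_grad()
def ema_update(target: torch.nn.Module, online: torch.nn.Module, decay: float = 0.996) -> None:
    """Nudge each target parameter toward the online encoder:
    target <- decay * target + (1 - decay) * online.
    The target network is never updated by gradient descent."""
    for t_param, o_param in zip(target.parameters(), online.parameters()):
        t_param.mul_(decay).add_(o_param, alpha=1.0 - decay)

# Toy usage: the target starts as a copy of the (randomly initialised) online
# encoder and then lags behind it as training proceeds.
online_encoder = torch.nn.Linear(128, 256)
target_encoder = copy.deepcopy(online_encoder)
ema_update(target_encoder, online_encoder)  # called after every optimiser step
```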

SLAP optimises with two types of loss:

  • Intermodal loss: audio vs. text across target/encoder pairs.
  • Intramodal loss: each branch compared to its own target.

A balance between the two is crucial: too much weight on either side leads to trivial or unstable solutions.
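
Putting the pieces together, a training objective along these lines could look like the sketch below. It rests on BYOL-style assumptions (negative cosine similarity as the per-pair loss, target outputs detached from the gradient, a single weighting factor between the two terms; in BYOL the online outputs would additionally pass through a small predictor head, omitted here for brevity) and is an illustration rather than the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def pair_loss(prediction: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """BYOL-style term: negative cosine similarity, no negative pairs needed."""
    prediction = F.normalize(prediction, dim=-1)
    target = F.normalize(target, dim=-1)
    return -(prediction * target).sum(dim=-1).mean()

def slap_style_loss(audio_pred: torch.Tensor, text_pred: torch.Tensor,
                    audio_target: torch.Tensor, text_target: torch.Tensor,
                    inter_weight: float = 0.5) -> torch.Tensor:
    """Weighted sum of intermodal and intramodal terms.

    `audio_pred` / `text_pred` come from the online encoders; `audio_target`
    / `text_target` come from the EMA target networks and are detached so
    no gradients flow through them.
    """
    audio_target = audio_target.detach()
    text_target = text_target.detach()

    # Intermodal: each online branch predicts the *other* modality's target.
    inter = pair_loss(audio_pred, text_target) + pair_loss(text_pred, audio_target)
    # Intramodal: each online branch also predicts its own modality's target.
    intra = pair_loss(audio_pred, audio_target) + pair_loss(text_pred, text_target)

    return inter_weight * inter + (1.0 - inter_weight) * intra
```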

Results

The paper shows compelling improvements:

  • Embedding space: SLAP reduces the modality gap, with audio and text embeddings more naturally interwoven.
  • Downstream tasks: genre classification, instrument recognition, tagging — SLAP is competitive with or outperforms CLAP and MusiCLaP (a strong baseline).
  • Retrieval tasks: consistently strong, often better than contrastive approaches.

In short, SLAP learns a more genuinely multimodal representation while sidestepping the inefficiencies of contrastive learning.

Why It Matters

SLAP is an important step toward more effective multimodal systems, particularly for music understanding and generation. By fixing structural issues in the embedding space, it opens the door to:

  • More accurate search and retrieval.
  • Richer recommendation and tagging.
  • Stronger foundations for generative models conditioned on text.

It’s a reminder that even in a field dominated by scale, architectural innovations still matter.

Reference

Guinot, Julien, et al. “SLAP: Siamese Language-Audio Pretraining Without Negative Samples for Music Understanding.” arXiv preprint arXiv:2506.17815 (2025).
