CLASP: Contrastive Language-Speech Pretraining for Multilingual Multimodal Information Retrieval

Qatar Computing Research Institute

CLASP is a novel lightweight multilingual multimodal representation model designed for audio-text retrieval. It generates rich multilingual semantic embeddings for sentence-level audio that can be used across a variety of speech-text tasks. The paper was published at the ECIR 2025 conference.

CLASP training and evaluation pipeline.
Overview of the two proposed strategies for the fusion encoder architecture.

Abstract

This study introduces CLASP (Contrastive Language-Speech Pretraining), a multilingual, multimodal representation tailored for audio-text information retrieval. CLASP leverages the synergy between spoken content and textual data. During training, we utilize our newly introduced speech-text dataset, which encompasses 15 diverse categories ranging from fiction to religion. CLASP's audio component integrates audio spectrograms with a pre-trained self-supervised speech model, while its language encoding counterpart employs a sentence encoder pre-trained on over 100 languages. This unified lightweight model bridges the gap between various modalities and languages, enhancing its effectiveness in handling and retrieving multilingual and multimodal data. Our evaluations across multiple languages demonstrate that CLASP establishes new benchmarks in HITS@1, MRR, and meanR metrics, outperforming traditional ASR-based retrieval methods that rely on transcribing speech into text for subsequent text retrieval, especially in specific scenarios.

Summary

  • We introduce CLASP (Contrastive Language-Speech Pretraining), a novel lightweight multilingual, multimodal representation designed for audio-text retrieval.
  • We introduce a diverse paired speech-text dataset (Speech Brown) in 15 categories, encompassing a wide range of topics from fiction to religion.
  • We show that the combination of audio spectrograms with a pre-trained self-supervised speech model improves audio encoding in retrieval applications.
  • Evaluations in multiple languages demonstrate that CLASP sets new benchmarks in HITS@1, Mean Reciprocal Rank (MRR), and Mean Rank (meanR) metrics.
  • CLASP leverages audio spectrograms in addition to self-supervised speech encoding in a contrastive learning framework to enhance semantic representation (a minimal sketch follows this list).
  • CLASP is simpler, faster, and more size-efficient than ASR-based retrieval pipelines.
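
As a concrete illustration of the fusion and contrastive-training ideas above, here is a minimal, self-contained sketch (not the released CLASP code): a pooled self-supervised speech embedding is fused with spectrogram features by concatenation and trained with a symmetric CLIP-style contrastive loss against sentence embeddings. Module names, dimensions, and the toy inputs are assumptions; in CLASP the speech embedding would come from a model such as HuBERT and the text embedding from LaBSE.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionSpeechEncoder(nn.Module):
    """Fuse a pooled self-supervised speech embedding with spectrogram
    features by concatenation, then project into the shared space."""
    def __init__(self, ssl_dim=768, spec_dim=128, shared_dim=768):
        super().__init__()
        # Hypothetical spectrogram branch: a small CNN over 80 log-mel bands.
        self.spec_branch = nn.Sequential(
            nn.Conv1d(80, spec_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),          # mean-pool over time
        )
        self.proj = nn.Linear(ssl_dim + spec_dim, shared_dim)

    def forward(self, ssl_emb, log_mel):
        # ssl_emb: (B, ssl_dim) pooled HuBERT-style embedding
        # log_mel: (B, 80, T) log-mel spectrogram
        spec_emb = self.spec_branch(log_mel).squeeze(-1)
        fused = torch.cat([ssl_emb, spec_emb], dim=-1)    # concatenation fusion
        return F.normalize(self.proj(fused), dim=-1)

def contrastive_loss(speech_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the speech-text similarity matrix:
    matched pairs are positives, other in-batch pairs are negatives."""
    logits = speech_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Toy usage with random tensors standing in for real encoder outputs.
encoder = FusionSpeechEncoder()
speech = encoder(torch.randn(8, 768), torch.randn(8, 80, 200))
text = F.normalize(torch.randn(8, 768), dim=-1)           # e.g. LaBSE embeddings
loss = contrastive_loss(speech, text)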

Speech Brown Dataset

Speech Brown dataset synthesis pipeline.
  • The Speech Brown Dataset is a comprehensive speech-text paired corpus spanning 15 diverse categories including science fiction, religion, romance, and news, providing rich contextual variety for speech processing research.
  • Comprising over 55,000 sentence-level samples synthesized with the NVIDIA Tacotron 2 text-to-speech model, the dataset offers high-quality audio-text pairs for multimodal learning applications (a construction sketch follows this list).
  • With approximately 30 GB of data, the dataset features an average of 19 tokens and 96.72 characters per sample, making it ideal for both short-form speech recognition and text-to-speech development.
  • Each sample is meticulously categorized across domains like adventure, belles_lettres, editorial, government, hobbies, humor, learned, and more, enabling domain-specific speech processing research.
  • The dataset's balanced representation across multiple genres provides researchers with a robust foundation for developing and evaluating speech-text retrieval systems and multimodal language models.
  • Built upon the renowned Brown corpus, this speech-enhanced version extends its utility to modern speech processing applications while maintaining the linguistic diversity of the original collection.
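
For orientation, the sketch below shows the general shape of such a construction pipeline, assuming the NLTK copy of the Brown corpus; it is illustrative only, not the released tooling, and the synthesize function is a hypothetical placeholder for the Tacotron 2 synthesis step.

import nltk
from nltk.corpus import brown

nltk.download("brown", quiet=True)

def synthesize(text):
    """Hypothetical placeholder for the Tacotron 2 + vocoder step."""
    raise NotImplementedError

pairs = []
for category in brown.categories():              # 15 genres, e.g. 'fiction', 'religion'
    for sent in brown.sents(categories=category)[:10]:
        text = " ".join(sent)
        if 5 <= len(sent) <= 40:                 # keep sentence-level, speakable lengths
            pairs.append({"category": category, "text": text})
            # audio = synthesize(text)           # enable once a TTS backend is wired in
print(len(pairs), "speech-text candidates prepared")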

Results and Analysis

  • CLASP outperforms Wav2Vec2 and nearly matches HuBERT in retrieval metrics, while reducing model size by approximately 50% and improving inference speed by around 10%, demonstrating superior efficiency without sacrificing performance.
  • With significantly lower meanR (7.71) than HuBERT (17.84) and Wav2Vec2 (38.3), CLASP ranks correct matches higher and handles outliers better, yielding more reliable search results (the metrics are defined in the sketch after this list).
  • Despite being trained on only ~130 hours of data—far less than HuBERT's 60,000+ hours—CLASP achieves comparable performance with less than 1.5% drop in HITS@1 and only ~0.8% reduction in MRR, showcasing remarkable data efficiency.
  • Integrating spectrogram features with self-supervised speech encoding boosts HITS@1 by approximately 3%, highlighting the benefit of the abstract features extracted from spectrograms for semantic understanding.
  • Our experiments reveal that the combination of LaBSE text encoder with HuBERT speech embeddings using the concatenation fusion strategy outperforms alternative architectures, achieving the highest retrieval scores across evaluation metrics.
  • CLASP retains strong retrieval performance across the diverse languages in our multilingual evaluation, reflecting the coverage of its sentence encoder, which was pre-trained on over 100 languages.
  • Our contrastive learning approach enables CLASP to capture nuanced semantic relationships by learning from both positive matches and negative samples, resulting in more robust and discriminative representations for audio-text retrieval.
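
The HITS@1, MRR, and meanR values quoted above can be computed from a speech-to-text similarity matrix as in the short sketch below (a toy example with random embeddings, not our evaluation code); row i's ground-truth match is assumed to sit in column i.

import numpy as np

def retrieval_metrics(sim):
    """sim: (N, N) similarity matrix whose diagonal holds the true pairs."""
    order = np.argsort(-sim, axis=1)                               # best match first
    ranks = np.argwhere(order == np.arange(len(sim))[:, None])[:, 1] + 1
    return {
        "HITS@1": float(np.mean(ranks == 1)),
        "MRR": float(np.mean(1.0 / ranks)),
        "meanR": float(np.mean(ranks)),
    }

# Toy usage: random embeddings stand in for CLASP speech/text embeddings.
rng = np.random.default_rng(0)
speech, text = rng.normal(size=(100, 64)), rng.normal(size=(100, 64))
print(retrieval_metrics(speech @ text.T))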
t-SNE visualization of sentence embeddings across modalities, demonstrating effective projection into a shared representation space for the test dataset.
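
A minimal sketch of how such a plot can be produced, assuming precomputed speech and text embeddings (random stand-ins here): project both modalities jointly with t-SNE and color the points by modality.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
speech_emb = rng.normal(size=(200, 768))      # stand-in for CLASP speech embeddings
text_emb = rng.normal(size=(200, 768))        # stand-in for LaBSE text embeddings

points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
    np.vstack([speech_emb, text_emb]))
plt.scatter(points[:200, 0], points[:200, 1], s=8, label="speech")
plt.scatter(points[200:, 0], points[200:, 1], s=8, label="text")
plt.legend()
plt.title("Speech vs. text embeddings (t-SNE)")
plt.savefig("tsne_modalities.png", dpi=150)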

BibTeX

@inproceedings{10.1007/978-3-031-88717-8_2,
  author    = {Abootorabi, Mohammad Mahdi and Asgari, Ehsaneddin},
  title     = {CLASP: Contrastive Language-Speech Pretraining for Multilingual Multimodal Information Retrieval},
  year      = {2025},
  isbn      = {978-3-031-88716-1},
  publisher = {Springer-Verlag},
  address   = {Berlin, Heidelberg},
  url       = {https://doi.org/10.1007/978-3-031-88717-8_2},
  doi       = {10.1007/978-3-031-88717-8_2},
  abstract  = {This study introduces CLASP (Contrastive Language-Speech Pretraining), a multilingual, multimodal representation tailored for audio-text information retrieval. CLASP leverages the synergy between spoken content and textual data. During training, we utilize our newly introduced speech-text dataset, which encompasses 15 diverse categories ranging from fiction to religion. CLASP’s audio component integrates audio spectrograms with a pre-trained self-supervised speech model, while its language encoding counterpart employs a sentence encoder pre-trained on over 100 languages. This unified lightweight model bridges the gap between various modalities and languages, enhancing its effectiveness in handling and retrieving multilingual and multimodal data. Our evaluations across multiple languages demonstrate that CLASP establishes new benchmarks in HITS@1, MRR, and meanR metrics, outperforming traditional ASR-based retrieval methods that rely on transcribing speech into text for subsequent text retrieval, especially in specific scenarios.},
  booktitle = {Advances in Information Retrieval: 47th European Conference on Information Retrieval, ECIR 2025, Lucca, Italy, April 6–10, 2025, Proceedings, Part IV},
  pages     = {10–20},
  numpages  = {11},
  keywords  = {Multimodal IR, Speech Retrieval, Contrastive Learning},
  location  = {Lucca, Italy}
}