Research at GAMMA Lab

Publications

Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities
ICML 2025 (Under Submission)

We introduce Audio Flamingo 2 (AF2), an Audio-Language Model (ALM) with advanced audio understanding and reasoning capabilities. AF2 leverages (i) a custom CLAP model, (ii) synthetic Audio QA data for fine-grained audio reasoning, and (iii) a multi-stage curriculum learning strategy. AF2 achieves state-of-the-art performance with only a 3B-parameter small language model, surpassing large open-source and proprietary models across 20+ benchmarks. Next, for the first time, we extend audio understanding to long audio segments (30 seconds to 5 minutes) and propose LongAudio, a large and novel dataset for training ALMs on long audio captioning and question-answering tasks. Fine-tuning AF2 on LongAudio leads to exceptional performance on our proposed LongAudioBench, an expert-annotated benchmark for evaluating ALMs on long audio understanding. We conduct extensive ablation studies to confirm the efficacy of our approach.
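
As a rough illustration of the multi-stage curriculum idea, the sketch below filters training data by clip duration and task type in successive stages; the stage names, duration caps, and data mixes are illustrative assumptions, not AF2's actual training recipe.

```python
# Minimal sketch of a multi-stage curriculum schedule (illustrative only).
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    max_audio_sec: float      # longest clip admitted in this stage
    tasks: tuple              # task types mixed into this stage

CURRICULUM = [
    Stage("alignment",       max_audio_sec=30.0,  tasks=("captioning",)),
    Stage("fine_grained_qa", max_audio_sec=30.0,  tasks=("captioning", "audio_qa")),
    Stage("long_audio",      max_audio_sec=300.0, tasks=("long_captioning", "long_qa")),
]

def run_curriculum(dataset, train_one_epoch, epochs_per_stage=1):
    """dataset: iterable of dicts with 'duration_sec' and 'task' keys.
    train_one_epoch: user-supplied training callback (hypothetical)."""
    for stage in CURRICULUM:
        subset = [ex for ex in dataset
                  if ex["duration_sec"] <= stage.max_audio_sec
                  and ex["task"] in stage.tasks]
        for _ in range(epochs_per_stage):
            train_one_epoch(subset, stage.name)

# Example with dummy data and a no-op training callback:
data = [{"duration_sec": 12, "task": "captioning"},
        {"duration_sec": 240, "task": "long_qa"}]
run_curriculum(data, lambda batch, name: print(name, len(batch)))
```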

GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
EMNLP 2024 (Oral)

We propose GAMA, a novel Large Audio-Language Model (LALM) that is capable of responding accurately to complex questions about an input audio. GAMA benefits from a mixture of encoders and synthetic data generated using a novel data generation pipeline we propose. GAMA currently stands as the state-of-the-art LALM on various audio understanding, reasoning, and hallucination benchmarks.
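
The sketch below illustrates the general "mixture of audio encoders feeding an LLM" pattern in PyTorch: features from several encoders are concatenated and projected into the language model's embedding space. The encoder stand-ins and dimensions are placeholders, not GAMA's exact architecture.

```python
import torch
import torch.nn as nn

class MixtureAudioFrontend(nn.Module):
    def __init__(self, llm_dim=2048, enc_dims=(512, 768)):
        super().__init__()
        # Stand-ins for pretrained audio encoders (e.g., a CLAP-style and an
        # AST-style encoder); here simple projections over mel frames so the
        # example stays self-contained.
        self.encoders = nn.ModuleList([nn.Linear(128, d) for d in enc_dims])
        self.project = nn.Linear(sum(enc_dims), llm_dim)

    def forward(self, mel):                 # mel: (batch, frames, 128)
        feats = [enc(mel) for enc in self.encoders]
        fused = torch.cat(feats, dim=-1)    # concatenate along feature dim
        return self.project(fused)          # audio tokens in LLM space

frontend = MixtureAudioFrontend()
audio_tokens = frontend(torch.randn(2, 100, 128))
# audio_tokens would be prepended to the LLM's text token embeddings.
print(audio_tokens.shape)                   # torch.Size([2, 100, 2048])
```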

MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
ICLR 2025 (Spotlight)

We introduce MMAU (Massive Multi-Task Audio Understanding and Reasoning Benchmark), a comprehensive benchmark designed to evaluate Large Audio-Language Models (LALMs) on tasks that demand expert-level knowledge and complex reasoning. MMAU includes 10,000 meticulously curated audio clips paired with human-annotated natural language questions and answers, covering speech, environmental sounds, and music. The benchmark features information extraction and reasoning questions that require models to demonstrate 27 distinct skills across unique and challenging tasks. Notably, even the advanced Gemini Pro v1.5 achieves only 52.97% accuracy, and the state-of-the-art open-source Qwen2-Audio achieves 52.50%, underscoring significant potential for improvement.
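
For concreteness, here is a minimal sketch of how accuracy on a multiple-choice audio QA benchmark of this kind is typically computed, broken down by domain; the field names and the `predict` callable are assumptions, not the official MMAU evaluation code.

```python
from collections import defaultdict

def evaluate(examples, predict):
    """examples: list of dicts with 'audio', 'question', 'choices',
    'answer', and 'domain' (e.g. speech / sound / music) keys.
    predict: callable returning the model's chosen option string."""
    hits, totals = defaultdict(int), defaultdict(int)
    for ex in examples:
        pred = predict(ex["audio"], ex["question"], ex["choices"])
        totals[ex["domain"]] += 1
        hits[ex["domain"]] += int(pred == ex["answer"])
    per_domain = {d: hits[d] / totals[d] for d in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_domain
```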

ProSE: Diffusion Priors for Speech Enhancement
NAACL 2025

We propose ProSE (diffusion-based Priors for Speech Enhancement), a novel methodology based on an alternative framework for applying diffusion models (DMs) to speech enhancement (SE). Specifically, we first apply denoising diffusion probabilistic models (DDPMs) to generate priors in a latent space, exploiting their powerful distribution-mapping capabilities. The priors are then integrated into a transformer-based regression model for SE, where they guide the enhancement process. Since the diffusion process is applied in a compact latent space, the diffusion model requires fewer iterations than a traditional DM to obtain accurate estimates. Additionally, using a regression model for SE avoids the distortion issues caused by misaligned details generated by DMs.
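
The sketch below illustrates the two-step idea in PyTorch: a DDPM-style reverse process samples a prior in a compact latent space, and a transformer regressor is conditioned on that prior alongside the noisy speech features. The network sizes, noise schedule, and concatenation-based conditioning are assumptions, not ProSE's exact design.

```python
import torch
import torch.nn as nn

T = 50
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

# Tiny stand-in noise predictor over a 64-d latent plus a timestep feature.
denoiser = nn.Sequential(nn.Linear(64 + 1, 128), nn.SiLU(), nn.Linear(128, 64))

@torch.no_grad()
def sample_prior(batch):
    x = torch.randn(batch, 64)                          # start from noise
    for t in reversed(range(T)):
        t_feat = torch.full((batch, 1), t / T)
        eps = denoiser(torch.cat([x, t_feat], dim=-1))  # predict noise
        mean = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x                                            # latent prior

class PriorConditionedSE(nn.Module):
    def __init__(self, feat_dim=257, prior_dim=64, d_model=256):
        super().__init__()
        self.in_proj = nn.Linear(feat_dim + prior_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.out_proj = nn.Linear(d_model, feat_dim)    # enhanced spectrogram

    def forward(self, noisy_spec, prior):  # (B, frames, feat_dim), (B, prior_dim)
        prior = prior.unsqueeze(1).expand(-1, noisy_spec.size(1), -1)
        h = self.in_proj(torch.cat([noisy_spec, prior], dim=-1))
        return self.out_proj(self.backbone(h))

model = PriorConditionedSE()
enhanced = model(torch.randn(2, 100, 257), sample_prior(2))
```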

CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models
ICLR 2024

We introduce CompA, a benchmark specifically designed to address gaps in compositional reasoning in audio-language models (ALMs). CompA includes two expert-annotated benchmarks: CompA-order, which evaluates how well an ALM understands the sequence of acoustic events, and CompA-attribute, which tests the model’s ability to associate attributes with specific sounds. Each test instance contains audio-caption pairs with the same events but in varying compositions, challenging the model to match audio accurately to captions. Using CompA, we demonstrate that current ALMs, including CLAP, struggle with complex compositional reasoning. To improve performance, we propose CompA-CLAP, a fine-tuned model that leverages compositionally-aware hard negatives and a new modular contrastive learning objective, significantly enhancing compositional reasoning capabilities across both benchmarks.
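
As a minimal illustration of training with compositionally-aware hard negatives, the sketch below adds reordered-caption embeddings as extra negatives for their own audio inside a standard contrastive loss; it is an illustrative stand-in, not CompA-CLAP's exact modular objective.

```python
import torch
import torch.nn.functional as F

def contrastive_with_hard_negatives(audio_emb, text_emb, hard_neg_emb, tau=0.07):
    """audio_emb, text_emb: (B, D) matched audio-caption pairs.
    hard_neg_emb: (B, K, D) reordered / attribute-swapped captions per audio."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    n = F.normalize(hard_neg_emb, dim=-1)

    in_batch = a @ t.T / tau                       # (B, B): positives on diagonal
    hard = torch.einsum("bd,bkd->bk", a, n) / tau  # (B, K): hard negatives
    logits = torch.cat([in_batch, hard], dim=1)    # (B, B + K)
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)

loss = contrastive_with_hard_negatives(
    torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 4, 512))
```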

Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data
ICLR 2025

We present Synthio, a novel method for generating synthetic data specifically for audio classification. Our approach first involves aligning a Text-to-Audio generation model with the target dataset through preference optimization. We then introduce an iterative prompting method with large language models (LLMs) to generate diverse and consistent audio captions, which are used to prompt the Text-to-Audio generation model for synthetic data creation. By augmenting small-scale audio classification datasets with data generated by Synthio, we achieve up to a 39% performance improvement on benchmark datasets.
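
The skeleton below sketches the overall augmentation loop: an LLM proposes captions for each class, a preference-aligned text-to-audio model renders them, and off-target generations are filtered out and fed back for revision. All helper callables (`llm_captions`, `t2a_generate`, `clap_similarity`) are hypothetical placeholders, not Synthio's released API.

```python
def synthesize_augmentations(class_names, n_per_class, rounds,
                             llm_captions, t2a_generate, clap_similarity,
                             threshold=0.3):
    augmented = []
    for label in class_names:
        kept, feedback = [], []
        for _ in range(rounds):
            captions = llm_captions(label, n_per_class - len(kept), feedback)
            for cap in captions:
                audio = t2a_generate(cap)              # preference-aligned T2A
                if clap_similarity(audio, label) >= threshold:
                    kept.append((audio, label))        # consistent with label
                else:
                    feedback.append(cap)               # ask the LLM to revise
            if len(kept) >= n_per_class:
                break
        augmented.extend(kept[:n_per_class])
    return augmented
```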

EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning
EMNLP 2024 (Oral)

We introduce EH-MAM (Easy-to-Hard adaptive Masked Acoustic Modeling), a novel self-supervised approach for speech representation learning. EH-MAM enables better learning from unsupervised data by using an adaptive masking strategy that gradually increases the difficulty of the pretext SSL task and by selectively reconstructing challenging regions within the speech input. EH-MAM outperforms several state-of-the-art baselines across various low-resource speech recognition and SUPERB benchmarks by 5%-10%.
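
The sketch below shows one way an easy-to-hard masking schedule can be realized: the hardest-to-reconstruct frames are preferentially masked and the masking ratio grows over training. The difficulty scores and the linear schedule are illustrative assumptions, not EH-MAM's exact procedure.

```python
import torch

def adaptive_mask(frame_difficulty, step, total_steps,
                  start_ratio=0.2, end_ratio=0.6):
    """frame_difficulty: (B, T) per-frame reconstruction loss from a frozen/EMA
    teacher (higher = harder). Returns a boolean mask (B, T); True = masked."""
    ratio = start_ratio + (end_ratio - start_ratio) * min(step / total_steps, 1.0)
    k = max(1, int(ratio * frame_difficulty.size(1)))
    hard_idx = frame_difficulty.topk(k, dim=1).indices          # hardest frames
    mask = torch.zeros_like(frame_difficulty, dtype=torch.bool)
    rows = torch.arange(frame_difficulty.size(0)).unsqueeze(1)
    mask[rows, hard_idx] = True
    return mask

# Early training masks few frames (easier task); later training masks more,
# focused on regions the model still reconstructs poorly.
mask = adaptive_mask(torch.rand(4, 200), step=1000, total_steps=10000)
```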

PAT: Parameter-Free Audio-Text Aligner to Boost Zero-Shot Audio Classification
NAACL 2025

We introduce PAT (Parameter-free Audio-Text aligner), a novel training and parameter-free method designed to boost zero-shot audio classification performance with audio-language models. PAT achieves this by improving test-time audio-text alignment, enhancing representations for both modalities through mutual feedback. PAT outperforms vanilla zero-shot audio classification with significant margins of 0.42%-27.0%.
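
As a rough sketch of parameter-free, test-time mutual feedback, the code below refines each modality's embeddings with a softmax-weighted average of the other modality's embeddings and then classifies by cosine similarity; the temperature and mixing weight are assumptions, not PAT's exact formulation.

```python
import torch
import torch.nn.functional as F

def mutual_feedback(audio_emb, text_emb, tau=0.01, alpha=0.5):
    a = F.normalize(audio_emb, dim=-1)          # (B, D) test audio
    t = F.normalize(text_emb, dim=-1)           # (C, D) class prompts
    attn_a2t = F.softmax(a @ t.T / tau, dim=-1) # (B, C) audio attends to text
    attn_t2a = F.softmax(t @ a.T / tau, dim=-1) # (C, B) text attends to audio
    a_ref = F.normalize(a + alpha * attn_a2t @ t, dim=-1)
    t_ref = F.normalize(t + alpha * attn_t2a @ a, dim=-1)
    return a_ref, t_ref                         # no learned parameters anywhere

def zero_shot_logits(audio_emb, text_emb):
    a_ref, t_ref = mutual_feedback(audio_emb, text_emb)
    return a_ref @ t_ref.T                      # (B, C) class scores

logits = zero_shot_logits(torch.randn(16, 512), torch.randn(10, 512))
pred = logits.argmax(dim=-1)
```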

Do Audio-Language Models Understand Linguistic Variations?
NAACL 2025

We propose RobustCLAP, a compute-efficient technique that enhances audio-language representations to be robust to linguistic variations. We observe that existing ALMs struggle to generalize effectively to linguistically diverse textual queries. RobustCLAP addresses this challenge by reformulating the contrastive loss in CLAP architectures with a multi-view contrastive learning objective. This approach improves text-to-audio retrieval performance by 0.8%-13% across various benchmarks.
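
A minimal sketch of a multi-view contrastive objective is shown below: each audio clip is paired with several paraphrases of its caption, and the symmetric CLAP-style InfoNCE loss is averaged over the caption views. This is an illustrative reformulation, not the exact RobustCLAP loss.

```python
import torch
import torch.nn.functional as F

def multiview_clap_loss(audio_emb, text_views, tau=0.07):
    """audio_emb: (B, D); text_views: (V, B, D), V paraphrases per caption."""
    a = F.normalize(audio_emb, dim=-1)
    labels = torch.arange(a.size(0), device=a.device)
    losses = []
    for t in text_views:                         # one caption view at a time
        t = F.normalize(t, dim=-1)
        logits = a @ t.T / tau
        losses.append(0.5 * (F.cross_entropy(logits, labels) +
                             F.cross_entropy(logits.T, labels)))
    return torch.stack(losses).mean()

loss = multiview_clap_loss(torch.randn(8, 512), torch.randn(3, 8, 512))
```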

ReCLAP: Improving Zero Shot Audio Classification by Describing Sounds
ICASSP 2025

We present ReCLAP, a novel approach to enhance zero-shot audio classification performance in CLAP-like Audio-Language Models. Our method first involves training a CLAP model using a unique caption augmentation technique, where audio captions are rewritten to describe individual acoustic events from an auditory perspective. To further improve zero-shot audio classification, we introduce a novel prompt augmentation strategy that generates custom prompts for each category by rephrasing labels to describe the sounds associated with each category. ReCLAP achieves state-of-the-art performance on retrieval benchmarks and boosts zero-shot audio classification accuracy by 1%-18% across seven zero-shot classification benchmarks.
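
The sketch below illustrates the prompt-augmentation side of this idea for zero-shot classification: each label gets several prompts describing its sound, their text embeddings are averaged into a class prototype, and audio is classified by cosine similarity. The example prompts and the `encode_text` / `encode_audio` callables are hypothetical, not ReCLAP's released pipeline.

```python
import torch
import torch.nn.functional as F

# Hypothetical sound-descriptive rephrasings of two class labels.
PROMPTS = {
    "dog": ["a dog barking sharply in short bursts",
            "repetitive loud barks of a dog"],
    "rain": ["steady patter of raindrops on a hard surface",
             "continuous light hiss of falling rain"],
}

def class_prototypes(encode_text):
    protos = []
    for label, prompts in PROMPTS.items():
        emb = F.normalize(encode_text(prompts), dim=-1)   # (P, D) per label
        protos.append(F.normalize(emb.mean(dim=0), dim=-1))
    return torch.stack(protos)                            # (C, D)

def classify(audio, encode_audio, encode_text):
    a = F.normalize(encode_audio(audio), dim=-1)          # (B, D)
    return (a @ class_prototypes(encode_text).T).argmax(dim=-1)
```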

SLICER: Symmetrical Learning of Instance and Cluster-level Efficient Representations
ICASSP 2023

We introduce SLICER (Symmetrical Learning of Instance and Cluster-level Efficient Representations), a Self-Supervised Learning approach for pre-training audio encoders on unlabeled data to enhance generalization across diverse audio processing tasks. SLICER learns fine-grained audio representations by combining clustering and contrastive learning with a symmetric loss between student and teacher encoders. SLICER sets state-of-the-art results on the LAPE Benchmark, surpassing prior methods trained on larger datasets.
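
The sketch below combines an instance-level contrastive term with a cluster-level prediction term under a symmetric student/teacher loss. The prototype count, temperatures, and the omitted EMA teacher update are assumptions, not SLICER's exact recipe.

```python
import torch
import torch.nn.functional as F

def slicer_style_loss(student_a, teacher_b, student_b, teacher_a,
                      prototypes, tau_i=0.1, tau_c=0.05):
    """student_*/teacher_*: (B, D) projections of two augmented views
    (the teacher is a momentum copy; no gradients flow through it).
    prototypes: (K, D) learnable cluster centers."""
    def one_direction(s, t):
        s, t = F.normalize(s, dim=-1), F.normalize(t.detach(), dim=-1)
        p = F.normalize(prototypes, dim=-1)
        labels = torch.arange(s.size(0), device=s.device)
        inst = F.cross_entropy(s @ t.T / tau_i, labels)        # instance level
        # cluster level: student assignment should match the teacher's
        cluster = F.cross_entropy(s @ p.T / tau_c,
                                  F.softmax(t @ p.T / tau_c, dim=-1))
        return inst + cluster
    # symmetric: view A as student vs. view B as teacher, and vice versa
    return 0.5 * (one_direction(student_a, teacher_b) +
                  one_direction(student_b, teacher_a))

protos = torch.randn(32, 256, requires_grad=True)
loss = slicer_style_loss(torch.randn(8, 256), torch.randn(8, 256),
                         torch.randn(8, 256), torch.randn(8, 256), protos)
```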

MAST: Multiscale Audio Spectrogram Transformer

We introduce MAST (Multiscale Audio Spectrogram Transformer), a novel audio encoder that incorporates multiscale feature hierarchies into the Audio Spectrogram Transformer (AST). MAST progressively expands embedding dimensions while reducing temporal resolution, using a pyramid structure to capture both low-level acoustic details in early layers and high-level features in deeper layers. MAST shows its effectiveness by achieving a 3.4% average accuracy gain over AST across various downstream tasks.
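
A minimal PyTorch sketch of the pyramid idea follows: the temporal resolution is halved and the embedding dimension increased between transformer stages. The stage depths and dimensions are illustrative, not MAST's exact configuration.

```python
import torch
import torch.nn as nn

class PyramidAudioTransformer(nn.Module):
    def __init__(self, in_dim=128, dims=(96, 192, 384), depth=2, n_classes=50):
        super().__init__()
        self.embed = nn.Linear(in_dim, dims[0])
        self.stages, self.transitions = nn.ModuleList(), nn.ModuleList()
        for i, d in enumerate(dims):
            layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
            self.stages.append(nn.TransformerEncoder(layer, num_layers=depth))
            if i + 1 < len(dims):
                # expand channels, then pool time by 2 between stages
                self.transitions.append(nn.Linear(d, dims[i + 1]))
        self.head = nn.Linear(dims[-1], n_classes)

    def forward(self, mel):                      # mel: (B, frames, in_dim)
        x = self.embed(mel)
        for i, stage in enumerate(self.stages):
            x = stage(x)
            if i < len(self.transitions):
                x = self.transitions[i](x)
                x = nn.functional.avg_pool1d(x.transpose(1, 2), 2).transpose(1, 2)
        return self.head(x.mean(dim=1))          # clip-level prediction

model = PyramidAudioTransformer()
logits = model(torch.randn(2, 400, 128))
```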

Stable Distillation: Regularizing Continued Pre-training for Low-Resource Automatic Speech Recognition

We introduce Stable Distillation, a novel approach for continued self-supervised learning (SSL) pre-training aimed at adapting SSL models to low-resource Automatic Speech Recognition (ASR) domains. Our method leverages self-distillation as a regularization technique to address mismatches between source and target domains. Specifically, we first conduct standard continued pre-training on a target ASR dataset to obtain a "teacher" model; the same model then acts as the "student" and is trained to replicate the teacher's representations. Our proposed method achieves performance improvements of 0.8-7% when applied to existing speech encoders across various low-resource ASR settings.
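
The sketch below shows self-distillation used as a regularizer during continued pre-training: the usual SSL loss on target-domain audio is combined with a penalty that keeps the student's representations close to those of a frozen teacher snapshot. The distance and weighting are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def stable_distillation_step(batch, student, teacher, ssl_loss_fn, lam=1.0):
    """batch: target-domain audio features; student/teacher: callables mapping
    a batch to representations (teacher is a frozen snapshot); ssl_loss_fn:
    the continued pre-training objective (e.g., masked prediction)."""
    ssl_loss = ssl_loss_fn(student, batch)          # standard SSL pretext loss
    with torch.no_grad():
        teacher_repr = teacher(batch)               # frozen teacher targets
    distill = F.mse_loss(student(batch), teacher_repr)
    return ssl_loss + lam * distill                 # regularized total loss
```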

FusDom: Combining In-Domain and Out-of-Domain Knowledge for Continuous Self-Supervised Learning

We introduce FusDom, a novel approach for continued pre-training in self-supervised learning (SSL) that enhances Automatic Speech Recognition (ASR) performance without catastrophic forgetting. FusDom adapts SSL models to target domains while preserving knowledge from prior domains by jointly utilizing two identical SSL models, a teacher and a student, connected by a cross-attention head that solves the pretext task for continued pre-training. FusDom shows its robustness by significantly improving target-domain ASR performance (0.2%-7.3%) while maintaining source-domain performance.
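
As a rough illustration, the sketch below fuses student (adapting) and teacher (frozen, prior-domain) representations with a single cross-attention head whose output would feed the pretext objective; the dimensions and single-layer head are illustrative, not FusDom's exact architecture.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, student_feats, teacher_feats):
        # Queries come from the adapting student, keys/values from the frozen
        # teacher, so prior-domain knowledge keeps shaping the pretext targets.
        fused, _ = self.attn(student_feats, teacher_feats, teacher_feats)
        return self.norm(student_feats + fused)

fusion = CrossAttentionFusion()
out = fusion(torch.randn(2, 100, 768), torch.randn(2, 100, 768))
# `out` would be passed to the SSL pretext head (e.g., masked prediction).
```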

TSPE: Task-Specific Prompt Ensemble for Improved Zero-Shot Audio Classification

We introduce TSPE (Task-Specific Prompt Ensemble), a novel training-free approach to enhance the zero-shot audio classification performance of Audio-Language Models (ALMs). Unlike generic text prompts, TSPE generates context-rich, task-specific prompts by incorporating key sound attributes and sources, improving audio-text alignment. We show that TSPE significantly improves performance across 12 diverse audio classification datasets, achieving an absolute accuracy improvement of up to 16.36% compared to standard zero-shot evaluations.
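
The sketch below illustrates the prompt-ensemble idea: each label is expanded into several prompts mentioning plausible sound attributes and sources, their text embeddings are averaged into a prototype, and zero-shot classification uses the averaged prototypes. The templates, attribute lists, and `encode_text` callable are hypothetical, not TSPE's released prompt sets.

```python
import torch
import torch.nn.functional as F

# Hypothetical attribute and source vocabularies used to expand each label.
ATTRIBUTES = ["loud", "faint", "continuous"]
SOURCES = {"siren": "an emergency vehicle", "applause": "a crowd of people"}

def build_prompts(label):
    return [f"a {attr} sound of {label} coming from {SOURCES[label]}"
            for attr in ATTRIBUTES]

def ensemble_prototypes(labels, encode_text):
    protos = []
    for label in labels:
        emb = F.normalize(encode_text(build_prompts(label)), dim=-1)  # (P, D)
        protos.append(F.normalize(emb.mean(dim=0), dim=-1))           # ensemble
    return torch.stack(protos)                                        # (C, D)

def classify(audio_emb, labels, encode_text):
    a = F.normalize(audio_emb, dim=-1)
    return (a @ ensemble_prototypes(labels, encode_text).T).argmax(dim=-1)
```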
