Research at GAMMA Lab

Publications

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
Under Review

We present Audio Flamingo 3 (AF3), a fully open state-of-the-art (SOTA) large audio-language model that advances reasoning and understanding across speech, sound, and music. AF3 introduces: (i) AF-Whisper, a unified audio encoder trained with a novel strategy for joint representation learning across the three modalities of speech, sound, and music; (ii) flexible, on-demand thinking, allowing the model to deliberately think before answering; (iii) multi-turn, multi-audio chat; (iv) long audio understanding and reasoning (including speech) up to 10 minutes; and (v) voice-to-voice interaction. To enable these capabilities, we propose several large-scale training datasets curated using novel strategies, including AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and train AF3 with a novel five-stage curriculum-based training strategy. AF3 achieves new SOTA results on more than 20 (long) audio understanding and reasoning benchmarks. We will open-source all our code, data, and checkpoints upon paper acceptance.

AF3 Project Image

Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities
ICML 2025

We introduce Audio Flamingo 2 (AF2), an Audio-Language Model (ALM) with advanced audio understanding and reasoning capabilities. AF2 leverages (i) a custom CLAP model, (ii) synthetic Audio QA data for fine-grained audio reasoning, and (iii) a multi-stage curriculum learning strategy. AF2 achieves state-of-the-art performance with only a 3B-parameter small language model, surpassing large open-source and proprietary models across 20+ benchmarks. For the first time, we also extend audio understanding to long audio segments (30 seconds to 5 minutes) and propose LongAudio, a large and novel dataset for training ALMs on long audio captioning and question-answering tasks. Fine-tuning AF2 on LongAudio leads to exceptional performance on our proposed LongAudioBench, an expert-annotated benchmark for evaluating ALMs on long audio understanding. We conduct extensive ablation studies to confirm the efficacy of our approach.

AF2 Project Image

GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
EMNLP 2024 (Oral)

We propose GAMA, a novel Large Audio-Language Model (LALM) capable of responding accurately to complex questions about an input audio clip. GAMA benefits from a mixture of encoders and from synthetic data generated with a novel data-generation pipeline we propose. GAMA currently stands as the state-of-the-art LALM on various audio understanding, reasoning, and hallucination benchmarks.

GAMA Project Image

MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
ICLR 2025 (Spotlight)

We introduce MMAU (Massive Multi-Task Audio Understanding and Reasoning Benchmark), a comprehensive benchmark designed to evaluate Large Audio-Language Models (LALMs) on tasks that demand expert-level knowledge and complex reasoning. MMAU includes 10,000 meticulously curated audio clips paired with human-annotated natural language questions and answers, covering speech, environmental sounds, and music. The benchmark features information extraction and reasoning questions that require models to demonstrate 27 distinct skills across unique and challenging tasks. Notably, even the advanced Gemini Pro v1.5 achieves only 52.97% accuracy, and the state-of-the-art open-source Qwen2-Audio achieves 52.50%, underscoring significant potential for improvement.

MMAU Project Image

Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation
ACL 2025 (Findings)

Generative Error Correction (GEC) has emerged as a powerful post-processing method for enhancing the performance of Automatic Speech Recognition (ASR) systems. However, we show that GEC models struggle to generalize beyond the specific types of errors encountered during training, limiting their ability to correct new, unseen errors at test time, particularly in out-of-domain (OOD) scenarios. This problem is amplified for named entities (NEs): in addition to insufficient contextual information or knowledge about the NEs, novel NEs keep emerging. To address these issues, we propose DARAG (Data- and Retrieval-Augmented Generative Error Correction), a novel approach designed to improve GEC for ASR in in-domain (ID) and OOD scenarios. We augment the GEC training dataset with synthetic data generated by prompting LLMs and text-to-speech models, thereby simulating additional errors from which the model can learn. For OOD scenarios, we simulate test-time errors from new domains similarly and in an unsupervised fashion. Additionally, to better handle named entities, we introduce retrieval-augmented correction, augmenting the input with entities retrieved from a database. Our approach is simple, scalable, and both domain- and language-agnostic. We experiment on multiple datasets and settings, showing that DARAG outperforms all our baselines, achieving 8%–30% relative WER improvements in ID settings and 10%–33% in OOD settings.
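
To make the retrieval-augmented correction step concrete, here is a minimal sketch of how a GEC prompt could be assembled from ASR N-best hypotheses plus entities fuzzily retrieved from a database; the entity list, matching heuristic, and prompt template are illustrative assumptions, not the paper's exact pipeline.

# Illustrative sketch of retrieval-augmented GEC prompt construction.
# The entity database, retrieval heuristic, and prompt template are
# placeholders for demonstration, not the released DARAG pipeline.
from difflib import get_close_matches

ENTITY_DB = ["Thiruvananthapuram", "Kottayam", "Nvidia NeMo", "LibriSpeech"]

def retrieve_entities(hypotheses, k=3, cutoff=0.7):
    """Return database entities that fuzzily match words in the N-best list."""
    words = {w for hyp in hypotheses for w in hyp.split()}
    hits = []
    for word in words:
        hits += get_close_matches(word, ENTITY_DB, n=1, cutoff=cutoff)
    return sorted(set(hits))[:k]

def build_gec_prompt(hypotheses):
    """Assemble an LLM prompt from ASR hypotheses plus retrieved entities."""
    entities = retrieve_entities(hypotheses)
    lines = ["Correct the ASR transcription using the hypotheses and entities below."]
    lines += [f"Hypothesis {i + 1}: {h}" for i, h in enumerate(hypotheses)]
    lines.append("Relevant entities: " + ", ".join(entities))
    lines.append("Corrected transcription:")
    return "\n".join(lines)

print(build_gec_prompt(["flight to thiruvanthapuram tomorrow",
                        "flight to tiruvananthapuram tomorrow"]))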

FF Project Image

ProSE: Diffusion Priors for Speech Enhancement
NAACL 2025 (Oral)

We propose ProSE (diffusion-based Priors for Speech Enhancement), a novel methodology based on an alternative framework for applying diffusion models (DMs) to speech enhancement (SE). Specifically, we first apply denoising diffusion probabilistic models (DDPMs) to generate priors in a latent space, exploiting their powerful distribution-mapping capabilities. The priors are then integrated into a transformer-based regression model for SE, where they guide the enhancement process. Since the diffusion process is applied to a compact latent space, the diffusion model requires fewer iterations than a traditional DM to obtain accurate estimates. Additionally, using a regression model for SE avoids the distortion caused by misaligned details that DMs can generate.
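
A minimal sketch of this two-part design, assuming toy dimensions and a stand-in for the DDPM prior: a compact latent computed from the noisy features conditions a transformer regression model through cross-attention.

# Sketch of a latent prior conditioning a transformer regression model for SE.
# Dimensions, the toy prior, and the cross-attention conditioning are assumptions.
import torch
import torch.nn as nn

class LatentPrior(nn.Module):
    """Stand-in for the DDPM prior: maps noisy features to a short latent sequence."""
    def __init__(self, feat_dim=80, latent_dim=64, latent_len=16):
        super().__init__()
        self.latent_len = latent_len
        self.net = nn.Sequential(nn.Linear(feat_dim, 256), nn.GELU(),
                                 nn.Linear(256, latent_dim))

    def forward(self, noisy):                      # noisy: (B, T, feat_dim)
        z = self.net(noisy)                        # (B, T, latent_dim)
        # Pool to a short latent sequence that stands in for the sampled prior.
        return nn.functional.adaptive_avg_pool1d(z.transpose(1, 2),
                                                 self.latent_len).transpose(1, 2)

class PriorGuidedEnhancer(nn.Module):
    """Transformer regression model that cross-attends to the prior."""
    def __init__(self, feat_dim=80, d_model=128, latent_dim=64):
        super().__init__()
        self.in_proj = nn.Linear(feat_dim, d_model)
        self.prior_proj = nn.Linear(latent_dim, d_model)
        self.encoder = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.out_proj = nn.Linear(d_model, feat_dim)

    def forward(self, noisy, prior):               # (B, T, feat), (B, L, latent)
        x = self.encoder(self.in_proj(noisy))
        p = self.prior_proj(prior)
        guided, _ = self.cross_attn(query=x, key=p, value=p)
        return self.out_proj(x + guided)           # enhanced features, (B, T, feat)

noisy = torch.randn(2, 200, 80)                    # batch of noisy log-mel features
prior = LatentPrior()(noisy)
enhanced = PriorGuidedEnhancer()(noisy, prior)
print(enhanced.shape)                              # torch.Size([2, 200, 80])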

ProSE Project Image

MULTIVOX: A Benchmark for Evaluating Voice Assistants for Multimodal Interactions
Under Review

The rapid progress of Large Language Models (LLMs) has empowered omni models to act as voice assistants capable of understanding spoken dialogues. These models can process multimodal inputs beyond text, such as speech and visual data, enabling more context-aware interactions. However, current benchmarks fall short in comprehensively evaluating how well these models generate context-aware responses, particularly when it comes to implicitly understanding fine-grained speech characteristics such as pitch, emotion, timbre, and volume, or the environmental acoustic context such as background sounds. Additionally, they inadequately assess the ability of models to align paralinguistic cues with complementary visual signals to inform their responses. To address these gaps, we introduce MultiVox, the first omni voice assistant benchmark designed to evaluate the ability of voice assistants to integrate spoken and visual cues, including paralinguistic speech features, for truly multimodal understanding. Specifically, MultiVox includes 1000 human-annotated and recorded speech dialogues that encompass diverse paralinguistic features and a range of visual cues such as images and videos. Our evaluation of 9 state-of-the-art models reveals that, although humans excel at these tasks, current models consistently struggle to produce contextually grounded responses. Our benchmark will be open-sourced.

Vox Project Image

CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models
ICLR 2024

We introduce CompA, a benchmark specifically designed to address gaps in compositional reasoning in audio-language models (ALMs). CompA includes two expert-annotated benchmarks: CompA-order, which evaluates how well an ALM understands the sequence of acoustic events, and CompA-attribute, which tests the model’s ability to associate attributes with specific sounds. Each test instance contains audio-caption pairs with the same events but in varying compositions, challenging the model to match audio accurately to captions. Using CompA, we demonstrate that current ALMs, including CLAP, struggle with complex compositional reasoning. To improve performance, we propose CompA-CLAP, a fine-tuned model that leverages compositionally-aware hard negatives and a new modular contrastive learning objective, significantly enhancing compositional reasoning capabilities across both benchmarks.
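
A minimal sketch of contrastive training with compositionally-aware hard negatives, where each audio's hard-negative captions (same events, scrambled composition) are appended to the InfoNCE denominator; shapes and temperature are assumptions, and this is not the paper's full modular objective.

# Sketch of a contrastive objective with compositionally-aware hard negatives.
import torch
import torch.nn.functional as F

def compositional_contrastive_loss(audio_emb, caption_emb, hard_neg_emb, tau=0.07):
    """
    audio_emb:    (B, D)    audio embeddings
    caption_emb:  (B, D)    matching caption embeddings
    hard_neg_emb: (B, K, D) K hard-negative captions per audio
    """
    a = F.normalize(audio_emb, dim=-1)
    c = F.normalize(caption_emb, dim=-1)
    n = F.normalize(hard_neg_emb, dim=-1)

    pos = (a * c).sum(-1, keepdim=True) / tau              # (B, 1)
    in_batch = a @ c.t() / tau                              # (B, B) easy in-batch negatives
    hard = torch.einsum("bd,bkd->bk", a, n) / tau           # (B, K) hard negatives

    B = a.size(0)
    off_diag = ~torch.eye(B, dtype=torch.bool, device=a.device)
    logits = torch.cat([pos, in_batch[off_diag].view(B, B - 1), hard], dim=1)
    labels = torch.zeros(B, dtype=torch.long, device=a.device)  # positive is column 0
    return F.cross_entropy(logits, labels)

loss = compositional_contrastive_loss(torch.randn(8, 512),
                                      torch.randn(8, 512),
                                      torch.randn(8, 4, 512))
print(loss.item())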

CompA Project Image

Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data
ICLR 2025

We present Synthio, a novel method for generating synthetic data specifically for audio classification. Our approach first involves aligning a Text-to-Audio generation model with the target dataset through preference optimization. We then introduce an iterative prompting method with large language models (LLMs) to generate diverse and consistent audio captions, which are used to prompt the Text-to-Audio generation model for synthetic data creation. By augmenting small-scale audio classification datasets with data generated by Synthio, we achieve up to a 39% performance improvement on benchmark datasets.
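
A minimal sketch of the resulting generate-filter-revise loop, with the LLM caption generator, the preference-aligned text-to-audio model, and the CLAP-style consistency check all stubbed as placeholders rather than the released pipeline.

# Illustrative sketch of Synthio-style augmentation: an LLM proposes captions,
# a (preference-aligned) text-to-audio model synthesizes clips, and a
# consistency check decides whether to keep a sample or request new captions.
import random

def llm_generate_captions(label, n=4):
    """Stub for LLM prompting: return diverse captions for a class label."""
    templates = ["A recording of {l}.", "{l} heard in a quiet room.",
                 "Distant {l} with background noise.", "A short clip of {l}."]
    return [t.format(l=label) for t in templates[:n]]

def text_to_audio(caption):
    """Stub for the aligned T2A model: return a fake one-second waveform."""
    return [random.random() for _ in range(16000)]

def clap_consistency(caption, audio):
    """Stub for a CLAP-style audio-text similarity score in [0, 1]."""
    return random.random()

def synthesize_for_label(label, per_label=8, threshold=0.5, max_rounds=3):
    kept = []
    captions = llm_generate_captions(label)
    for _ in range(max_rounds):
        for cap in captions:
            audio = text_to_audio(cap)
            if clap_consistency(cap, audio) >= threshold:
                kept.append((audio, label))
            if len(kept) >= per_label:
                return kept
        # Revise: ask the LLM for fresh captions when too few samples pass.
        captions = llm_generate_captions(label)
    return kept

print(len(synthesize_for_label("dog barking")))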

Synthio Project Image

PAT: Parameter-Free Audio-Text Aligner to Boost Zero-Shot Audio Classification
NAACL 2025 (Oral)

We introduce PAT (Parameter-free Audio-Text aligner), a novel training- and parameter-free method designed to boost zero-shot audio classification performance with audio-language models. PAT achieves this by improving test-time audio-text alignment, enhancing representations for both modalities through mutual feedback. PAT outperforms vanilla zero-shot audio classification by significant margins of 0.42%–27.0%.
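
A minimal sketch of what parameter-free test-time alignment can look like: each modality's embeddings are refined with a similarity-weighted mix of the other modality's embeddings before cosine classification. The mixing rule and hyperparameters are assumptions, not PAT's exact formulation.

# Sketch of parameter-free test-time audio-text mutual feedback.
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def refined_zero_shot(audio_emb, text_emb, alpha=0.5, tau=0.05):
    """audio_emb: (N, D) test clips; text_emb: (C, D) class prompts."""
    a, t = l2norm(audio_emb), l2norm(text_emb)
    sim = a @ t.T                                    # (N, C) cosine similarities

    # Mutual feedback: mix in the other modality, weighted by softmax similarity.
    w_at = np.exp(sim / tau); w_at /= w_at.sum(1, keepdims=True)
    w_ta = np.exp(sim.T / tau); w_ta /= w_ta.sum(1, keepdims=True)
    a_ref = l2norm((1 - alpha) * a + alpha * (w_at @ t))
    t_ref = l2norm((1 - alpha) * t + alpha * (w_ta @ a))

    return (a_ref @ t_ref.T).argmax(1)               # predicted class indices

preds = refined_zero_shot(np.random.randn(10, 512), np.random.randn(5, 512))
print(preds)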

PAT Project Image

EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning
EMNLP 2025 (Oral)

We introduce EH-MAM (Easy-to-Hard adaptive Masked Acoustic Modeling), a novel self-supervised approach for speech representation learning. EH-MAM enables better learning from unsupervised data by using an adaptive masking strategy that gradually increases the difficulty of the pretext SSL task and by selectively reconstructing challenging regions within the speech input. EH-MAM outperforms several state-of-the-art baselines across various low-resource speech recognition and SUPERB benchmarks by 5%-10%.
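
A minimal sketch of an easy-to-hard masking schedule, assuming per-frame difficulty scores (e.g., from a lightweight reconstruction predictor) are already available; the linear ramp and the random/hard mixing are illustrative assumptions.

# Sketch of an easy-to-hard adaptive masking schedule: the masking budget grows
# over training, and later steps increasingly target the hardest frames.
import torch

def adaptive_mask(difficulty, step, total_steps, min_ratio=0.2, max_ratio=0.5):
    """
    difficulty: (B, T) per-frame difficulty scores (higher = harder).
    Returns a boolean mask of shape (B, T); True = frame is masked.
    """
    progress = min(step / total_steps, 1.0)
    ratio = min_ratio + (max_ratio - min_ratio) * progress
    B, T = difficulty.shape
    k = max(1, int(ratio * T))

    # Early in training, mask mostly random frames; later, mostly hard ones.
    scores = progress * difficulty + (1 - progress) * torch.rand_like(difficulty)
    idx = scores.topk(k, dim=1).indices

    mask = torch.zeros(B, T, dtype=torch.bool)
    mask[torch.arange(B).unsqueeze(1), idx] = True
    return mask

mask = adaptive_mask(torch.rand(4, 100), step=8000, total_steps=10000)
print(mask.float().mean().item())   # fraction of masked frames, ~0.44 here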

EH-MAM Project Image

Do Audio-Language Models Understand Linguistic Variations?
NAACL 2025

We propose RobustCLAP, a compute-efficient technique that makes audio-language representations robust to linguistic variations. We observe that existing audio-language models (ALMs) struggle to generalize effectively to linguistically diverse textual queries. RobustCLAP addresses this challenge by reformulating the contrastive loss in CLAP architectures with a multi-view contrastive learning objective, improving text-to-audio retrieval performance by 0.8%-13% across various benchmarks.
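
A minimal sketch of a multi-view contrastive objective, where the symmetric audio-text InfoNCE is averaged over several paraphrased caption views; shapes and temperature are assumptions rather than the paper's exact loss.

# Sketch of a multi-view audio-text contrastive loss over paraphrased captions.
import torch
import torch.nn.functional as F

def multiview_clap_loss(audio_emb, caption_views, tau=0.07):
    """
    audio_emb:     (B, D)    audio embeddings
    caption_views: (V, B, D) V paraphrased caption embeddings per audio
    """
    a = F.normalize(audio_emb, dim=-1)
    labels = torch.arange(a.size(0), device=a.device)
    losses = []
    for view in caption_views:                       # (B, D) per caption view
        t = F.normalize(view, dim=-1)
        logits = a @ t.t() / tau
        losses.append(0.5 * (F.cross_entropy(logits, labels) +
                             F.cross_entropy(logits.t(), labels)))
    return torch.stack(losses).mean()

loss = multiview_clap_loss(torch.randn(8, 512), torch.randn(3, 8, 512))
print(loss.item())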

RobustCLAP Project Image

ReCLAP: Improving Zero Shot Audio Classification by Describing Sounds
ICASSP 2025

We present ReCLAP, a novel approach to enhance zero-shot audio classification performance in CLAP-like Audio-Language Models. Our method first involves training a CLAP model using a unique caption augmentation technique, where audio captions are rewritten to describe individual acoustic events from an auditory perspective. To further improve zero-shot audio classification, we introduce a novel prompt augmentation strategy that generates custom prompts for each category by rephrasing labels to describe sounds associated with each category. ReCLAP achieves state-of-the-art performance on retrieval benchmarks and boosts zero-shot audio classification accuracy by 1%-18% across seven zero-shot classification benchmarks.

ReCLAP Project Image

TSPE: Task-Specific Prompt Ensemble for Improved Zero-Shot Audio Classification
ICASSP 2025 SALMA Workshop

We introduce TSPE (Task-Specific Prompt Ensemble), a novel training-free approach to enhance the zero-shot audio classification performance of Audio-Language Models (ALMs). Unlike generic text prompts, TSPE generates context-rich, task-specific prompts by incorporating key sound attributes and sources, improving audio-text alignment. We show that TSPE significantly improves performance across 12 diverse audio classification datasets, achieving an absolute accuracy improvement of up to 16.36% compared to standard zero-shot evaluations.
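
A minimal sketch of prompt ensembling for zero-shot classification, with illustrative templates and stubbed encoders: each label's attribute- and source-aware prompts are embedded and averaged into a class prototype, and audio is classified by cosine similarity.

# Sketch of attribute-aware prompt ensembling for zero-shot audio classification.
# The templates and the random "encoders" are stand-ins, not the released code.
import numpy as np

rng = np.random.default_rng(0)

def embed_text(prompt, dim=512):
    """Stub for the ALM text encoder."""
    return rng.standard_normal(dim)

TEMPLATES = [
    "the sound of {label}",
    "a loud {label} heard from a nearby source",
    "a faint {label} in the background of a recording",
    "{label} produced indoors with some reverberation",
]

def class_prototypes(labels):
    protos = []
    for label in labels:
        embs = np.stack([embed_text(t.format(label=label)) for t in TEMPLATES])
        embs /= np.linalg.norm(embs, axis=1, keepdims=True)
        protos.append(embs.mean(0))                 # ensemble = mean prompt embedding
    protos = np.stack(protos)
    return protos / np.linalg.norm(protos, axis=1, keepdims=True)

def classify(audio_emb, protos):
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    return (a @ protos.T).argmax(1)

protos = class_prototypes(["dog bark", "siren", "rain"])
print(classify(rng.standard_normal((4, 512)), protos))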

TSPE Project Image

SILA: Signal-to-Language Augmentation for Enhanced Control in Text-to-Audio Generation
Under Review

The field of text-to-audio generation has seen significant advancements, yet the ability to finely control the acoustic characteristics of generated audio remains under-explored. In this paper, we introduce a novel yet simple approach to generating sound effects with control over key acoustic parameters such as loudness, pitch, reverb, fade, brightness, noise, and duration, enabling creative applications in sound design and content creation. These parameters extend beyond traditional Digital Signal Processing (DSP) techniques, incorporating learned representations that capture the subtleties of how sound characteristics are shaped in context, enabling richer and more nuanced control over the generated audio. Our approach is model-agnostic and is based on learning the disentanglement between audio semantics and its acoustic features. It not only enhances the versatility and expressiveness of text-to-audio generation but also opens new avenues for creative audio production and sound design. Objective and subjective evaluations demonstrate the effectiveness of our approach in producing high-quality, customizable audio outputs that align closely with user specifications.
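
A minimal sketch of the signal-to-language idea, assuming a librosa-based measurement step: a few acoustic parameters are computed from the waveform and appended to the caption as control text; the thresholds and phrasings are illustrative, not the paper's pipeline.

# Sketch of signal-to-language caption augmentation from measured audio features.
import numpy as np
import librosa

def describe_signal(y, sr):
    duration = librosa.get_duration(y=y, sr=sr)
    loud_db = 20 * np.log10(np.maximum(librosa.feature.rms(y=y).mean(), 1e-8))
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()

    phrases = [f"duration of {duration:.1f} seconds"]
    phrases.append("high loudness" if loud_db > -20 else "low loudness")
    phrases.append("bright timbre" if centroid > 3000 else "dark timbre")
    return ", ".join(phrases)

def augment_caption(caption, y, sr):
    """Append measured acoustic parameters to the caption as control text."""
    return f"{caption} [{describe_signal(y, sr)}]"

y, sr = librosa.load(librosa.ex("trumpet"))     # bundled librosa example clip
print(augment_caption("A trumpet playing a melody.", y, sr))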

SILA Project Image

SLICER: Symmetrical Learning of Instance and Cluster-level Efficient Representations
ICASSP 2023

We introduce SLICER (Symmetrical Learning of Instance and Cluster-level Efficient Representations), a Self-Supervised Learning approach for pre-training audio encoders on unlabeled data to enhance generalization across diverse audio processing tasks. SLICER learns fine-grained audio representations by combining clustering and contrastive learning with a symmetric loss between student and teacher encoders. SLICER sets state-of-the-art results on the LAPE Benchmark, surpassing prior methods trained on larger datasets.
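
A minimal sketch combining an instance-level symmetric InfoNCE with a cluster-level symmetric cross-entropy between student and teacher projections; dimensions, cluster count, and temperatures are assumptions, not the released training code.

# Sketch of symmetric instance- and cluster-level objectives between student
# and teacher embeddings.
import torch
import torch.nn.functional as F

def instance_loss(s, t, tau=0.1):
    """Symmetric InfoNCE between student and teacher embeddings, (B, D) each."""
    s, t = F.normalize(s, dim=-1), F.normalize(t, dim=-1)
    logits = s @ t.t() / tau
    labels = torch.arange(s.size(0), device=s.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def cluster_loss(s_logits, t_logits):
    """Symmetric cross-entropy between soft cluster assignments, (B, K) each."""
    p_s, p_t = s_logits.softmax(-1), t_logits.softmax(-1)
    return 0.5 * (F.cross_entropy(s_logits, p_t.detach()) +
                  F.cross_entropy(t_logits, p_s.detach()))

B, D, K = 16, 256, 64
student, teacher = torch.randn(B, D), torch.randn(B, D)
cluster_head = torch.nn.Linear(D, K)
loss = instance_loss(student, teacher) + \
       cluster_loss(cluster_head(student), cluster_head(teacher))
print(loss.item())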

SLICER Project Image

MAST: Multiscale Audio Spectrogram Transformer

We introduce MAST (Multiscale Audio Spectrogram Transformer), a novel audio encoder that incorporates multiscale feature hierarchies into the Audio Spectrogram Transformer (AST). MAST progressively expands embedding dimensions while reducing temporal resolution, using a pyramid structure to capture both low-level acoustic details in early layers and high-level features in deeper layers. MAST shows its effectiveness by achieving a 3.4% average accuracy gain over AST across various downstream tasks.
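
A minimal sketch of the pyramid idea, with assumed stage widths and depths: transformer stages are separated by blocks that halve the temporal resolution and expand the embedding dimension, so early stages see fine-grained features and later stages see coarse, high-level ones.

# Sketch of a multiscale transformer pyramid over spectrogram patch embeddings.
import torch
import torch.nn as nn

class Downsample(nn.Module):
    """Halve the sequence length and project to a wider embedding."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size=2, stride=2)
        self.proj = nn.Linear(d_in, d_out)

    def forward(self, x):                               # (B, T, d_in)
        x = self.pool(x.transpose(1, 2)).transpose(1, 2)
        return self.proj(x)                             # (B, T // 2, d_out)

class PyramidEncoder(nn.Module):
    def __init__(self, in_dim=128, dims=(128, 256, 512), depths=(2, 2, 2)):
        super().__init__()
        stages, d_prev = [], in_dim
        for d, depth in zip(dims, depths):
            if d != d_prev:
                stages.append(Downsample(d_prev, d))
            layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
            stages.append(nn.TransformerEncoder(layer, num_layers=depth))
            d_prev = d
        self.stages = nn.Sequential(*stages)

    def forward(self, spec_patches):                    # (B, T, in_dim)
        return self.stages(spec_patches)

out = PyramidEncoder()(torch.randn(2, 400, 128))
print(out.shape)                                        # torch.Size([2, 100, 512])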

MAST Project Image

Stable Distillation: Regularizing Continued Pre-training for Low-Resource Automatic Speech Recognition

We introduce Stable Distillation, a novel approach for continued self-supervised learning (SSL) pre-training aimed at adapting SSL models to low-resource Automatic Speech Recognition (ASR) domains. Our method leverages self-distillation as a regularization technique to address mismatches between source and target domains. Specifically, we first conduct standard continued pre-training on a target ASR dataset to obtain a "teacher" model; the model is then trained again as a "student" that is asked to replicate the teacher's representations. Our proposed method achieves performance improvements of 0.8%–7% on existing speech encoders across various low-resource ASR settings.
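
A minimal sketch of self-distillation as a regularizer during continued pre-training, assuming a generic SSL objective: the student's SSL loss is combined with an MSE to the frozen teacher's representations. The loss weighting and the toy models are assumptions.

# Sketch of continued pre-training regularized by self-distillation to a
# frozen teacher checkpoint.
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, ssl_loss_fn, batch, lam=1.0):
    """One training step; `ssl_loss_fn(model, batch)` returns (loss, features)."""
    ssl_loss, student_feats = ssl_loss_fn(student, batch)
    with torch.no_grad():
        _, teacher_feats = ssl_loss_fn(teacher, batch)
    distill = F.mse_loss(student_feats, teacher_feats)
    return ssl_loss + lam * distill

# Toy usage with stand-in models and a dummy reconstruction-style objective.
student = torch.nn.Linear(80, 80)
teacher = torch.nn.Linear(80, 80)
teacher.load_state_dict(student.state_dict())     # teacher = frozen first-round model
for p in teacher.parameters():
    p.requires_grad_(False)

def toy_ssl_loss(model, batch):
    feats = model(batch)
    return F.mse_loss(feats, batch), feats         # pretend pretext objective

loss = distillation_step(student, teacher, toy_ssl_loss, torch.randn(4, 80))
loss.backward()
print(loss.item())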

Stable Distillation Project Image

FusDom: Combining In-Domain and Out-of-Domain Knowledge for Continuous Self-Supervised Learning

We introduce FusDom, a novel approach for continued pre-training in self-supervised learning (SSL) to enhance Automatic Speech Recognition (ASR) performance without catastrophic forgetting. FusDom adapts SSL models to target domains while preserving knowledge from prior domains by jointly utilizing two identical SSL models, a teacher and a student, connected by a cross-attention head that solves the pre-task for continued pre-training. FusDom shows its robustness by significantly improving target domains' ASR performance (0.2%–7.3%) while maintaining source domain performance.
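
A minimal sketch of the cross-attention head, with assumed shapes, stand-in encoders, and a toy pretext head: student (target-domain) features attend to frozen teacher (source-domain) features, and the fused representation is used to solve the pretext task.

# Sketch of a cross-attention head fusing student and frozen-teacher features
# for the continued pre-training pretext task.
import torch
import torch.nn as nn

class CrossAttentionHead(nn.Module):
    def __init__(self, dim=256, vocab=320):        # vocab: e.g. quantized targets
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.pretext = nn.Linear(dim, vocab)

    def forward(self, student_feats, teacher_feats):
        fused, _ = self.attn(query=student_feats,
                             key=teacher_feats, value=teacher_feats)
        return self.pretext(student_feats + fused)  # logits for the pretext task

# Toy usage with stand-in SSL encoders (identical architecture, teacher frozen).
student = nn.GRU(80, 256, batch_first=True)
teacher = nn.GRU(80, 256, batch_first=True)
for p in teacher.parameters():
    p.requires_grad_(False)

speech = torch.randn(2, 150, 80)                    # (B, T, features)
s_feats, _ = student(speech)
with torch.no_grad():
    t_feats, _ = teacher(speech)
logits = CrossAttentionHead()(s_feats, t_feats)
print(logits.shape)                                 # torch.Size([2, 150, 320])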

FusDom Project Image
