
Audio Processing in the Age of Large Language Models


Our Goal

Audio comprehension (speech, non-speech sounds, and music) is essential for AI agents to interact effectively with the world. Yet research in audio processing has lagged behind areas like language and vision, hindered by limited datasets and by the lack of architectures and training methods suited to the inherent complexities of audio. The rise of Large Language Models (LLMs) offers promising new directions: they have shown a remarkable ability to understand and reason about the world through language, advancing foundational audio tasks like Automatic Speech Recognition (ASR), cross-modal retrieval, and audio captioning. While essential, these tasks only scratch the surface of the complex reasoning needed to approach the level of skilled human cognition.
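As a concrete illustration of one such foundational task, the sketch below transcribes a clip with an off-the-shelf ASR pipeline from Hugging Face transformers; the Whisper checkpoint and the file name are illustrative choices, not part of our stack.

    # Minimal ASR sketch using the transformers pipeline API.
    # "openai/whisper-small" and "clip.wav" are illustrative choices.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
    result = asr("clip.wav")  # any common audio file; ffmpeg handles decoding/resampling
    print(result["text"])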


At GAMMA Lab, UMD, we aim to bridge this gap with a range of solutions, starting with GAMA (EMNLP 2024), our large audio-language model designed for advanced audio perception and complex reasoning. GAMA combines a specialized architecture, optimized audio encoding, and a novel alignment dataset, and it leads benchmarks for audio understanding, reasoning, and hallucination reduction. Good representations are key to advancing perception, and GAMA builds on our earlier work on learning strong audio representations from unlabeled data: MAST and SLICER (ICASSP 2023) and EH-MAM (EMNLP 2024). Complementing this, we introduced ReCLAP, a state-of-the-art audio-language encoder, and CompA, one of the first projects to tackle compositional reasoning in audio-language models, a critical challenge given audio's inherently compositional nature.
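Encoders like ReCLAP follow the contrastive audio-text (CLAP-style) recipe, which reduces zero-shot retrieval and classification to comparing embeddings. Below is a minimal sketch of that usage pattern, using the openly available LAION CLAP checkpoint in transformers as a stand-in; ReCLAP's own checkpoint and API may differ, and the waveform here is a random placeholder.

    # CLAP-style zero-shot retrieval: score an audio clip against candidate captions.
    # The LAION checkpoint is a stand-in for a ReCLAP-style encoder.
    import numpy as np
    import torch
    from transformers import ClapModel, ClapProcessor

    model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
    processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

    texts = ["a dog barking", "rain falling on a roof", "a jazz trumpet solo"]
    audio = np.random.randn(48000).astype(np.float32)  # placeholder: 1 s of noise at 48 kHz

    inputs = processor(text=texts, audios=[audio], sampling_rate=48000,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = outputs.logits_per_audio.softmax(dim=-1)  # similarity over the captions
    print(dict(zip(texts, probs[0].tolist())))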

Looking forward, we envision Large Audio-Language Models (LALMs) becoming integral to daily life, capable of conversational speech QA, information-extraction-based QA, and answering knowledge-driven questions about diverse audio inputs. We aim to extend GAMA to process audio inputs longer than 30 seconds and, ultimately, to interpret multimodal content by integrating visual input, enabling complex question answering over long videos. Achieving these ambitious goals requires both better data and better architectures. Synthio, our latest synthetic data generation framework, supports this mission by generating data for complex audio understanding. Progress must also be measurable, so we are committed to comprehensive benchmarks: our recent release, MMAU, rigorously tests LALMs on real-world tasks, and we plan additional benchmarks focused on advanced reasoning over long and multi-audio scenarios.
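Synthio-style augmentation starts from text-to-audio generation. The sketch below produces one labeled synthetic clip, using the AudioLDM pipeline from diffusers as a stand-in generator; Synthio itself adds steps such as aligning the generator with the target dataset and diversifying captions, which are omitted here, and the prompt and file names are illustrative.

    # Text-to-audio generation as the core of a synthetic-data loop.
    # AudioLDM here is a stand-in generator, not Synthio's full pipeline.
    import scipy.io.wavfile
    import torch
    from diffusers import AudioLDMPipeline

    pipe = AudioLDMPipeline.from_pretrained(
        "cvssp/audioldm-s-full-v2", torch_dtype=torch.float16
    ).to("cuda")

    prompt = "a crowded train station with overhead announcements"  # illustrative class description
    audio = pipe(prompt, num_inference_steps=50, audio_length_in_s=5.0).audios[0]

    # AudioLDM outputs 16 kHz mono audio; save it as a training example.
    scipy.io.wavfile.write("synthetic_station.wav", rate=16000, data=audio)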

By releasing these resources as open source, including our audio-language models, encoders, and synthetic data frameworks, we are accelerating audio-language intelligence to meet the demands of tomorrow's AI applications.

People

GAMMA Lab @ Department of Computer Science, UMD

Updates!

Jan 2025: 3 papers accepted to ICLR 2025!

Jan 2025: 3 papers accepted to NAACL 2025!

Jan 2025: 3 papers accepted to ICASSP 2025!

Sept 2024: We released MMAU, the most comprehensive audio understanding and reasoning benchmark yet!

Sept 2024: 2 papers accepted to EMNLP 2024 as oral presentations!