
Audio Processing in the Age of Large Language Models


Our Goal

Audio comprehension (speech, non-speech sounds, and music) is essential for AI agents to interact effectively with the world. Yet research in audio processing has lagged behind areas like language and vision, hindered by limited datasets and by the lack of architectures and training methods suited to the inherent complexities of audio. The rise of Large Language Models (LLMs) offers promising new directions: they have shown a remarkable ability to understand and reason about the world through language, advancing foundational audio tasks like Automatic Speech Recognition (ASR), cross-modal retrieval, and audio captioning. While essential, these tasks only scratch the surface of the complex reasoning needed to approach the level of skilled human cognition.
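As a concrete illustration of one such foundational task, the sketch below transcribes a clip with an off-the-shelf ASR pipeline from Hugging Face transformers; the Whisper checkpoint and the file name are illustrative choices, not part of our stack.

    # Minimal ASR sketch using the transformers pipeline API.
    # "openai/whisper-small" and "clip.wav" are illustrative choices.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
    result = asr("clip.wav")  # any common audio file; ffmpeg handles decoding/resampling
    print(result["text"])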


At GAMMA Lab, UMD, we aim to bridge this gap with a range of solutions, starting with GAMA (EMNLP 2024), our large audio-language model designed for advanced audio perception and complex reasoning. GAMA combines a specialized architecture, optimized audio encoding, and a novel alignment dataset, and it leads benchmarks for audio understanding, reasoning, and hallucination reduction. Good representations are key to advancing perception, and GAMA builds on our earlier work on learning strong audio representations from unlabeled data: MAST and SLICER (ICASSP 2023) and EH-MAM (EMNLP 2024). Complementing this, we introduced ReCLAP, a state-of-the-art audio-language encoder, and CompA, one of the first projects to tackle compositional reasoning in audio-language models, a critical challenge given audio's inherently compositional nature.
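Encoders like ReCLAP follow the contrastive audio-text (CLAP-style) recipe, which reduces zero-shot retrieval and classification to comparing embeddings. Below is a minimal sketch of that usage pattern, using the openly available LAION CLAP checkpoint in transformers as a stand-in; ReCLAP's own checkpoint and API may differ, and the waveform here is a random placeholder.

    # CLAP-style zero-shot retrieval: score an audio clip against candidate captions.
    # The LAION checkpoint is a stand-in for a ReCLAP-style encoder.
    import numpy as np
    import torch
    from transformers import ClapModel, ClapProcessor

    model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
    processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

    texts = ["a dog barking", "rain falling on a roof", "a jazz trumpet solo"]
    audio = np.random.randn(48000).astype(np.float32)  # placeholder: 1 s of noise at 48 kHz

    inputs = processor(text=texts, audios=[audio], sampling_rate=48000,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = outputs.logits_per_audio.softmax(dim=-1)  # similarity over the captions
    print(dict(zip(texts, probs[0].tolist())))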

Looking forward, we envision Large Audio-Language Models (LALMs) becoming integral to daily life, capable of conversational speech QA, information-extraction-based QA, and answering knowledge-driven questions about diverse audio inputs. We aim to extend GAMA to process audio inputs longer than 30 seconds and, ultimately, to interpret multimodal content by integrating visual input, enabling complex question answering over long videos. Achieving these ambitious goals requires both better data and better architectures. Synthio, our latest synthetic data generation framework, supports this mission by generating data for complex audio understanding. Progress must also be measurable, so we are committed to comprehensive benchmarks: our recent release, MMAU, rigorously tests LALMs on real-world tasks, and we plan additional benchmarks focused on advanced reasoning over long and multi-audio scenarios.
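Synthio-style augmentation starts from text-to-audio generation. The sketch below produces one labeled synthetic clip, using the AudioLDM pipeline from diffusers as a stand-in generator; Synthio itself adds steps such as aligning the generator with the target dataset and diversifying captions, which are omitted here, and the prompt and file names are illustrative.

    # Text-to-audio generation as the core of a synthetic-data loop.
    # AudioLDM here is a stand-in generator, not Synthio's full pipeline.
    import scipy.io.wavfile
    import torch
    from diffusers import AudioLDMPipeline

    pipe = AudioLDMPipeline.from_pretrained(
        "cvssp/audioldm-s-full-v2", torch_dtype=torch.float16
    ).to("cuda")

    prompt = "a crowded train station with overhead announcements"  # illustrative class description
    audio = pipe(prompt, num_inference_steps=50, audio_length_in_s=5.0).audios[0]

    # AudioLDM outputs 16 kHz mono audio; save it as a training example.
    scipy.io.wavfile.write("synthetic_station.wav", rate=16000, data=audio)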

By releasing these resources as open source, including our audio-language models, encoders, and synthetic data frameworks, we are accelerating audio-language intelligence to meet the demands of tomorrow's AI applications.

People

GAMMA Lab @ Department of Computer Science, UMD

Updates!

Jan 2025: 3 papers accepted to ICLR 2025!

Jan 2025: 3 papers accepted to NAACL 2025!

Jan 2025: 3 papers accepted to ICASSP 2025!

Sept 2024: We released MMAU, the most comprehensive audio understanding and reasoning benchmark yet!

Sept 2024: 2 papers accepted to EMNLP 2024 as oral presentations!