MMAU

A Massive Multi-Task Audio Understanding and Reasoning Benchmark

S Sakshi*,1, Utkarsh Tyagi*,1, Sonal Kumar*,1, Ashish Seth*,1, Ramaneswaran Selvakumar*,1, Oriol Nieto2, Ramani Duraiswami1, Sreyan Ghosh*,†,1, Dinesh Manocha†,1
1University of Maryland, College Park, USA, 2Adobe, USA

*Equal Technical Contribution †Equal Advising
Correspondence: ssakshi@umd.edu, sonalkum@umd.edu, sreyang@umd.edu

Overview of the MMAU dataset.


MMAU Benchmark

Introduction

We present MMAU: a novel benchmark designed to evaluate multimodal audio understanding models on tasks requiring expert-level knowledge and complex reasoning. MMAU comprises 10k carefully curated audio clips paired with human-annotated natural language questions and answers spanning speech, environmental sounds, and music. It features 27 diverse tasks, including 12 information-retrieval types and 15 reasoning types, challenging models to perform at the level of human experts in complex, multimodal audio understanding. Unlike existing benchmarks, MMAU emphasizes advanced perception and reasoning with domain-specific knowledge, challenging models to tackle tasks akin to those faced by experts. We assess 18 open-source and proprietary (Large) Audio-Language Models, demonstrating the significant challenges posed by MMAU. Notably, even the most advanced Gemini 1.5 achieves only 66.15% accuracy, and the state-of-the-art open-source Qwen2-Audio achieves only 55.4%, highlighting considerable room for improvement. We believe MMAU will drive the audio and multimodal research community to develop more advanced audio understanding models capable of solving complex audio tasks.


MMAU is uniquely designed to test LALMs’ advanced cognitive abilities, challenging models with questions that require complex, deliberate reasoning and knowledge retrieval grounded in audio perception. To our knowledge, MMAU stands as the first comprehensive benchmark to rigorously assess these capabilities, filling a critical gap in the evaluation of LALMs.


Comparisons with Other Benchmarks

To further distinguish MMAU from existing benchmarks, we summarize the details of each benchmark in the figure below. In terms of breadth, prior benchmarks are often restricted to specific domains and question types, and the range of audio they cover is also limited.


Comparison of MMAU with existing audio understanding and reasoning benchmarks across various statistics. MMAU covers all three domains—speech, sound, and music—while having the highest number of information-extraction and complex reasoning tasks.


Complex AQA

Each example below lists the question, the audio clip ID, the answer options, the ground-truth answer, and the predictions of Qwen2-Audio and Gemini.

Question: From the given utterance, count the number of words that contain at least one stressed phoneme.
Audio Id: 00050
Options: ["two", "one", "thirteen", "four"]
Answer: two | Qwen2-Audio: four | Gemini: thirteen

Question: Why can the last line be interpreted as sarcastic?
Audio Id: 1_537
Options: ["Sheldon always gets mail first.", "Penny dislikes reading magazines.", "Sheldon is uninterested in physics.", "Penny loves physics journals."]
Answer: Sheldon is uninterested in physics. | Qwen2-Audio: Penny loves physics journals. | Gemini: Penny dislikes reading magazines.

Question: What primary element is featured in the foreground of the audio?
Audio Id: zopos1B6Elc
Options: ["Digital clicking sounds", "String instruments", "Percussion beats", "Vocal harmonies"]
Answer: Digital clicking sounds | Qwen2-Audio: String instruments | Gemini: Vocal harmonies

Question: How does the interaction between different octaves in the piano contribute to the mood?
Audio Id: 1066198.2min
Options: ["Adds a touch of comedy", "Creates harmony and peace", "Emphasizes a romantic connection", "Builds tension and conflict"]
Answer: Builds tension and conflict | Qwen2-Audio: Creates harmony and peace | Gemini: Creates harmony and peace

Question: What type of vocals are featured in the audio?
Audio Id: uAgizG1hYw0
Options: ["Female vocal", "Child vocal", "Male vocal", "Group vocal"]
Answer: Female vocal | Qwen2-Audio: Male vocal | Gemini: Group vocal

Experiment Results

Leaderboard

We evaluate a range of models, including LALMs and LLMs. For each type, we consider both closed-source and open-source models.

[Leaderboard: models are grouped as Human Expert, Open-Source, and Proprietary; for each model (Name, Size), accuracy is reported on the Test-mini and Test splits for Sound, Music, Speech, and the overall Average.]
Overall results of different models on the MMAU leaderboard. The best-performing model in each category is in-bold, and the second best is underlined. *: results provided by the authors.
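As a rough illustration of how the leaderboard columns can be produced, the sketch below aggregates per-example correctness into per-domain accuracy and an average. The input layout and the macro-averaging over domains are assumptions made for illustration, not the authors' scoring script.

```python
from collections import defaultdict


def leaderboard_row(results):
    """Aggregate per-example results into per-domain accuracy and the average.

    `results` is assumed to be an iterable of (domain, is_correct) pairs,
    with domain in {"sound", "music", "speech"}.
    """
    totals, correct = defaultdict(int), defaultdict(int)
    for domain, is_correct in results:
        totals[domain] += 1
        correct[domain] += int(is_correct)
    per_domain = {d: 100.0 * correct[d] / totals[d] for d in totals}
    # Macro-average over the three domains (assumed convention).
    per_domain["avg"] = sum(per_domain.values()) / len(per_domain)
    return per_domain


# Toy example: 2/3 correct on sound, 1/2 on music, 1/1 on speech.
demo = [("sound", True), ("sound", True), ("sound", False),
        ("music", True), ("music", False), ("speech", True)]
print(leaderboard_row(demo))  # approx. {'sound': 66.7, 'music': 50.0, 'speech': 100.0, 'avg': 72.2}
```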

Error Analysis

We analyze the errors made by Qwen2-Audio-Instruct and Gemini Pro v1.5. The figure below breaks down the error types each model makes across 500 instances. The dominant category for both models is Perceptual Errors, accounting for 55% of Qwen2-Audio-Instruct's errors and 64% of Gemini Pro v1.5's, indicating that both models struggle primarily with accurately perceiving and understanding the audio inputs.


Distribution of human-annotated error types across 500 instances for Qwen2-Audio-Instruct (Left) and Gemini Pro v1.5 (Right).
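The percentages above are simple category proportions over the annotated instances; a minimal sketch of that tally is shown below. The category labels are hypothetical placeholders, not the paper's exact taxonomy.

```python
from collections import Counter


def error_distribution(annotations):
    """Return the percentage share of each human-annotated error type.

    `annotations` is assumed to be a list of category labels, one per
    analyzed instance (labels here are illustrative).
    """
    counts = Counter(annotations)
    n = len(annotations)
    return {category: 100.0 * count / n for category, count in counts.items()}


# Toy check: 55 perceptual errors out of 100 instances -> 55%.
labels = ["perceptual"] * 55 + ["reasoning"] * 30 + ["knowledge"] * 15
print(error_distribution(labels))  # {'perceptual': 55.0, 'reasoning': 30.0, 'knowledge': 15.0}
```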


BibTeX

@misc{sakshi2024mmaumassivemultitaskaudio,
        title={MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark}, 
        author={S Sakshi and Utkarsh Tyagi and Sonal Kumar and Ashish Seth and Ramaneswaran Selvakumar and Oriol Nieto and Ramani Duraiswami and Sreyan Ghosh and Dinesh Manocha},
        year={2024},
        eprint={2410.19168},
        archivePrefix={arXiv},
        primaryClass={eess.AS},
        url={https://arxiv.org/abs/2410.19168}, 
  }