
We introduce MUSIC-AVQA-R, the first dataset for evaluating the robustness of AVQA models. Its construction involves two key processes: rephrasing and splitting. The former rephrases the questions in the test split of MUSIC-AVQA; the latter categorizes questions into frequent (head) and rare (tail) subsets.

We followed previous work in partitioning the dataset into "head" and "tail" categories. Based on how often each answer occurs in the dataset, answers with a count greater than $1.2$ times the mean, denoted as $\mu(a)$, were categorized as "head", while those with a count less than $1.2\mu(a)$ were categorized as "tail".
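The thresholding rule above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `split_head_tail`, the flat list of answer strings, and the toy data are hypothetical, and the sketch assumes that $\mu(a)$ is the mean occurrence count over distinct answers, computed over whatever group of questions is being split. The description does not say how an answer whose count equals exactly $1.2\mu(a)$ is handled; the sketch places it in "tail" as an assumption.

```python
from collections import Counter


def split_head_tail(answers, factor=1.2):
    """Split distinct answers into head and tail sets by frequency.

    `answers` is a flat list of answer strings (hypothetical input format).
    Returns (head, tail, threshold).
    """
    counts = Counter(answers)
    if not counts:
        raise ValueError("answers must not be empty")

    # Assumption: mu(a) is the mean occurrence count over distinct answers.
    mean_count = sum(counts.values()) / len(counts)
    threshold = factor * mean_count

    # Head: count strictly greater than 1.2 * mu(a).
    head = {a for a, c in counts.items() if c > threshold}
    # Tail: everything else. Assumption: a count exactly equal to the
    # threshold lands here, since the source leaves that case unspecified.
    tail = set(counts) - head
    return head, tail, threshold


# Toy data (made up for illustration, not taken from MUSIC-AVQA-R).
answers = ["yes"] * 6 + ["no"] * 3 + ["two"] * 2 + ["piano"]
head, tail, threshold = split_head_tail(answers)
print(sorted(head), sorted(tail))
```

In this toy example the four distinct answers have counts 6, 3, 2, and 1, so the mean is 3 and the threshold is 3.6; only "yes" exceeds it and becomes a head answer.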