Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Temporal and cross-modal attention for audio-visual zero-shot learning

Otniel-Bogdan Mercea, Thomas Hummel, A. Sophia Koepke, Zeynep Akata

Published: 2022-07-20
Tasks: GZSL Video Classification, Video Classification, Zero-Shot Learning

Abstract

Audio-visual generalised zero-shot learning for video classification requires understanding the relations between the audio and visual information in order to recognise samples from novel, previously unseen classes at test time. The natural semantic and temporal alignment between audio and visual streams in videos can be exploited to learn powerful representations that generalise to unseen classes at test time. We propose a multi-modal and Temporal Cross-attention Framework (TCaF) for audio-visual generalised zero-shot learning. Its inputs are temporally aligned audio and visual features that are obtained from pre-trained networks. Encouraging the framework to focus on cross-modal correspondence across time instead of self-attention within the modalities boosts the performance significantly. We show that our proposed framework that ingests temporal features yields state-of-the-art performance on the UCF-GZSL, VGGSound-GZSL, and ActivityNet-GZSL benchmarks for (generalised) zero-shot learning. Code for reproducing all results is available at https://github.com/ExplainableML/TCAF-GZSL.
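The core idea of attending across modalities rather than within them can be sketched as plain scaled dot-product attention in which queries come from one modality and keys/values from the other. A minimal NumPy sketch, not the paper's implementation; the feature dimensions and the `cross_attention` helper are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention where queries come from one
    modality and keys/values from the other."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (Tq, Tk) cross-modal affinities
    weights = softmax(scores, axis=-1)       # attend over the other modality's time steps
    return weights @ values                  # (Tq, d) attended features

rng = np.random.default_rng(0)
T, d = 4, 8                                  # 4 time steps, 8-dim features (illustrative)
audio = rng.normal(size=(T, d))              # temporally aligned audio features
visual = rng.normal(size=(T, d))             # temporally aligned visual features

# Each modality attends to the other across time, rather than to itself.
visual_attended = cross_attention(visual, audio, audio)
audio_attended = cross_attention(audio, visual, visual)
print(visual_attended.shape)  # (4, 8)
```

In the full model these features would be produced by pre-trained audio and visual backbones and the attention would use learned projections; this sketch only shows the cross-modal attention pattern itself.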

Results

Task                 Dataset                    Metric   Value   Model
Zero-Shot Learning   ActivityNet-GZSL (main)    HM       10.71   TCaF
Zero-Shot Learning   ActivityNet-GZSL (main)    ZSL       7.91   TCaF
Zero-Shot Learning   VGGSound-GZSL (cls)        HM        8.77   TCaF
Zero-Shot Learning   VGGSound-GZSL (cls)        ZSL       7.41   TCaF
Zero-Shot Learning   ActivityNet-GZSL (cls)     HM       12.2    TCaF
Zero-Shot Learning   ActivityNet-GZSL (cls)     ZSL       7.96   TCaF
Zero-Shot Learning   VGGSound-GZSL (main)       HM        7.33   TCaF
Zero-Shot Learning   VGGSound-GZSL (main)       ZSL       6.06   TCaF
Zero-Shot Learning   UCF-GZSL (cls)             HM       50.78   TCaF
Zero-Shot Learning   UCF-GZSL (cls)             ZSL      44.64   TCaF
Zero-Shot Learning   UCF-GZSL (main)            HM       31.72   TCaF
Zero-Shot Learning   UCF-GZSL (main)            ZSL      24.81   TCaF
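For reference, the HM column is the harmonic mean of the seen- and unseen-class accuracies, the standard generalised zero-shot learning metric. A minimal sketch with made-up accuracies (not values from the paper):

```python
def harmonic_mean(seen_acc, unseen_acc):
    """GZSL harmonic mean of seen- and unseen-class accuracies."""
    return 2 * seen_acc * unseen_acc / (seen_acc + unseen_acc)

# Hypothetical accuracies, purely to illustrate how HM penalises imbalance:
print(harmonic_mean(40.0, 20.0))  # 26.666...
print(harmonic_mean(30.0, 30.0))  # 30.0 -- balanced performance is rewarded
```

Because HM is dominated by the weaker of the two accuracies, a model cannot score well by performing strongly on seen classes alone.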

Related Papers

GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)
DEARLi: Decoupled Enhancement of Recognition and Localization for Semi-supervised Panoptic Segmentation (2025-07-14)
ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment (2025-06-28)
EVA: Mixture-of-Experts Semantic Variant Alignment for Compositional Zero-Shot Learning (2025-06-26)
Zero-Shot Learning for Obsolescence Risk Forecasting (2025-06-26)
SEZ-HARN: Self-Explainable Zero-shot Human Activity Recognition Network (2025-06-25)
A Multi-Scale Spatial Attention-Based Zero-Shot Learning Framework for Low-Light Image Enhancement (2025-06-23)
Generalizable Agent Modeling for Agent Collaboration-Competition Adaptation with Multi-Retrieval and Dynamic Generation (2025-06-20)