TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/Multimodal Open-Vocabulary Video Classification via Pre-Tr...

Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models

Rui Qian, Yeqing Li, Zheng Xu, Ming-Hsuan Yang, Serge Belongie, Yin Cui

2022-07-15Optical Flow EstimationZero-Shot Action RecognitionVideo Classification
PaperPDF

Abstract

Utilizing vision and language models (VLMs) pre-trained on large-scale image-text pairs is becoming a promising paradigm for open-vocabulary visual recognition. In this work, we extend this paradigm by leveraging motion and audio that naturally exist in video. We present \textbf{MOV}, a simple yet effective method for \textbf{M}ultimodal \textbf{O}pen-\textbf{V}ocabulary video classification. In MOV, we directly use the vision encoder from pre-trained VLMs with minimal modifications to encode video, optical flow and audio spectrogram. We design a cross-modal fusion mechanism to aggregate complimentary multimodal information. Experiments on Kinetics-700 and VGGSound show that introducing flow or audio modality brings large performance gains over the pre-trained VLM and existing methods. Specifically, MOV greatly improves the accuracy on base classes, while generalizes better on novel classes. MOV achieves state-of-the-art results on UCF and HMDB zero-shot video classification benchmarks, significantly outperforming both traditional zero-shot methods and recent methods based on VLMs. Code and models will be released.

Results

TaskDatasetMetricValueModel
Zero-Shot Action RecognitionUCF101Top-1 Accuracy87.1MOV (ViT-L/14)
Zero-Shot Action RecognitionUCF101Top-1 Accuracy82.6MOV (ViT-B/16)
Zero-Shot Action RecognitionHMDB51Top-1 Accuracy64.7MOV (ViT-L/14)
Zero-Shot Action RecognitionHMDB51Top-1 Accuracy60.8MOV (ViT-B/16)

Related Papers

Channel-wise Motion Features for Efficient Motion Segmentation2025-07-17An Efficient Approach for Muscle Segmentation and 3D Reconstruction Using Keypoint Tracking in MRI Scan2025-07-11Learning to Track Any Points from Human Motion2025-07-08TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation2025-07-07MEMFOF: High-Resolution Training for Memory-Efficient Multi-Frame Optical Flow Estimation2025-06-29ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment2025-06-28EndoFlow-SLAM: Real-Time Endoscopic SLAM with Flow-Constrained Gaussian Splatting2025-06-26WAFT: Warping-Alone Field Transforms for Optical Flow2025-06-26