Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding

Guo Chen, Yifei Huang, Jilan Xu, Baoqi Pei, Zhe Chen, Zhiqi Li, Jiahao Wang, Kunchang Li, Tong Lu, Limin Wang

2024-03-14 · Moment Retrieval · Video Understanding · Temporal Action Localization

Paper · PDF · Code (official)

Abstract

Understanding videos is one of the fundamental directions in computer vision research, with extensive efforts dedicated to exploring architectures such as RNNs, 3D CNNs, and Transformers. The recently proposed state space model architecture, exemplified by Mamba, shows promise in extending its success in long-sequence modeling to video modeling. To assess whether Mamba can be a viable alternative to Transformers in the video understanding domain, we conduct a comprehensive set of studies, probing the different roles Mamba can play in modeling videos and investigating the diverse tasks in which Mamba may exhibit superiority. We categorize Mamba into four roles for modeling videos, derive a Video Mamba Suite composed of 14 models/modules, and evaluate them on 12 video understanding tasks. Our extensive experiments reveal the strong potential of Mamba on both video-only and video-language tasks, while showing promising efficiency-performance trade-offs. We hope this work provides valuable data points and insights for future research on video understanding. The code is public: https://github.com/OpenGVLab/video-mamba-suite.
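For readers unfamiliar with state space models, here is a minimal sketch of the linear recurrence that Mamba-style blocks build on. This is illustrative only (the names and constants are not from the official video-mamba-suite code, which uses a selective scan with learned, input-dependent parameters):

```python
# Minimal sketch of the (non-selective) state-space recurrence generalized by
# Mamba-style blocks: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t, applied per
# channel over a sequence of frame features. Parameters a, b, c are fixed here
# for illustration; in Mamba they are learned and input-dependent.

def ssm_scan(x, a=0.9, b=0.1, c=1.0):
    """Run a 1-D linear state-space recurrence over a sequence x."""
    h, ys = 0.0, []
    for x_t in x:
        h = a * h + b * x_t      # state update: decayed memory plus new input
        ys.append(c * h)         # linear readout of the hidden state
    return ys

# Toy "video": one scalar feature per frame. A single impulse at frame 0
# decays geometrically through the state, showing the long-range memory.
frames = [1.0, 0.0, 0.0, 0.0]
print(ssm_scan(frames))
```

Because the recurrence is linear, it can be evaluated either step by step at inference time or as a parallel scan during training, which is the efficiency trade-off the abstract alludes to.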

Results

Temporal Action Localization (identical results are also mirrored on the Video, Action Localization, and Zero-Shot Learning leaderboards):

| Dataset | Metric | Value | Model |
|---|---|---|---|
| HACS | Average-mAP | 44.56 | ActionMamba (InternVideo2-6B) |
| HACS | mAP@0.5 | 64.02 | ActionMamba (InternVideo2-6B) |
| HACS | mAP@0.75 | 45.71 | ActionMamba (InternVideo2-6B) |
| HACS | mAP@0.95 | 13.34 | ActionMamba (InternVideo2-6B) |
| ActivityNet-1.3 | mAP | 42.02 | ActionMamba (InternVideo2-6B) |
| ActivityNet-1.3 | mAP IoU@0.5 | 62.43 | ActionMamba (InternVideo2-6B) |
| ActivityNet-1.3 | mAP IoU@0.75 | 43.49 | ActionMamba (InternVideo2-6B) |
| ActivityNet-1.3 | mAP IoU@0.95 | 10.23 | ActionMamba (InternVideo2-6B) |
| FineAction | mAP | 29.04 | ActionMamba (InternVideo2-6B) |
| FineAction | mAP IoU@0.5 | 45.44 | ActionMamba (InternVideo2-6B) |
| FineAction | mAP IoU@0.75 | 28.82 | ActionMamba (InternVideo2-6B) |
| FineAction | mAP IoU@0.95 | 6.79 | ActionMamba (InternVideo2-6B) |
| THUMOS'14 | Avg mAP (0.3:0.7) | 72.72 | ActionMamba (InternVideo2-6B) |
| THUMOS'14 | mAP IoU@0.3 | 86.89 | ActionMamba (InternVideo2-6B) |
| THUMOS'14 | mAP IoU@0.4 | 83.09 | ActionMamba (InternVideo2-6B) |
| THUMOS'14 | mAP IoU@0.5 | 76.9 | ActionMamba (InternVideo2-6B) |
| THUMOS'14 | mAP IoU@0.6 | 65.91 | ActionMamba (InternVideo2-6B) |
| THUMOS'14 | mAP IoU@0.7 | 50.82 | ActionMamba (InternVideo2-6B) |

Moment Retrieval:

| Dataset | Metric | Value | Model |
|---|---|---|---|
| Charades-STA | R@1 IoU=0.5 | 57.18 | video-mamba-suite |
| Charades-STA | R@1 IoU=0.7 | 36.05 | video-mamba-suite |
| QVHighlights | R@1 IoU=0.5 | 66.65 | video-mamba-suite |
| QVHighlights | R@1 IoU=0.7 | 52.19 | video-mamba-suite |
| QVHighlights | mAP | 45.18 | video-mamba-suite |
| QVHighlights | mAP@0.5 | 64.37 | video-mamba-suite |
| QVHighlights | mAP@0.75 | 46.68 | video-mamba-suite |
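The headline THUMOS'14 number can be reproduced from the per-threshold rows above: "Avg mAP (0.3:0.7)" is conventionally the mean of mAP over IoU thresholds 0.3 through 0.7. A short sketch (the `temporal_iou` helper is hypothetical, included only to show the overlap criterion behind the IoU@t columns, not taken from the paper's code):

```python
# Average mAP across IoU thresholds, using the per-threshold THUMOS'14
# values reported in the results table above.
per_iou_map = {0.3: 86.89, 0.4: 83.09, 0.5: 76.9, 0.6: 65.91, 0.7: 50.82}
avg_map = sum(per_iou_map.values()) / len(per_iou_map)
print(round(avg_map, 2))  # 72.72, matching the reported Avg mAP (0.3:0.7)

# Temporal IoU between two (start, end) segments in seconds -- the overlap
# criterion behind the IoU@t thresholds. Hypothetical illustration.
def temporal_iou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0
```

A predicted segment counts as a true positive at threshold t only if its temporal IoU with a ground-truth segment of the same class is at least t, which is why mAP falls monotonically as t rises in the table.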

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)
Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation (2025-07-08)
Omni-Video: Democratizing Unified Video Understanding and Generation (2025-07-08)
MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding (2025-07-08)