Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Make Your Training Flexible: Towards Deployment-Efficient Video Models

Chenting Wang, Kunchang Li, Tianxiang Jiang, Xiangyu Zeng, Yi Wang, Limin Wang

2025-03-18 · Action Classification · Zero-Shot Video Retrieval
Paper · PDF · Code (official)

Abstract

Popular video training methods mainly operate on a fixed number of tokens sampled from a predetermined spatiotemporal grid, resulting in sub-optimal accuracy-computation trade-offs due to inherent video redundancy. They also lack adaptability to varying computational budgets for downstream tasks, hindering applications of the most competitive model in real-world scenes. We thus propose a new test setting, Token Optimization, for maximized input information across budgets, which optimizes the size-limited set of input tokens through token selection from more suitably sampled videos. To this end, we propose a novel augmentation tool termed Flux. By making the sampling grid flexible and leveraging token selection, it is easily adopted in most popular video training frameworks, boosting model robustness with nearly no additional cost. We integrate Flux in large-scale video pre-training, and the resulting FluxViT establishes new state-of-the-art results across extensive tasks at standard costs. Notably, with 1/4 tokens only, it can still match the performance of previous state-of-the-art models with Token Optimization, yielding nearly 90% savings. All models and data are available at https://github.com/OpenGVLab/FluxViT.
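The core Token Optimization idea described above — over-sampling tokens from a denser spatiotemporal grid, then selecting a size-limited subset that fits the compute budget — can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the norm-based importance score is a placeholder assumption, and `sample_tokens` is a hypothetical helper, not an API from the FluxViT repository.

```python
import numpy as np

def sample_tokens(video_tokens: np.ndarray, budget: int) -> np.ndarray:
    """Select a size-limited subset of tokens from an over-sampled video.

    video_tokens: (N, D) array of N patch tokens drawn from a denser-than-usual
        spatiotemporal sampling grid (N > budget).
    budget: number of tokens the downstream model can afford.

    The scoring rule here (token L2 norm) is only a stand-in; the paper's
    Flux tool defines its own selection criterion.
    """
    scores = np.linalg.norm(video_tokens, axis=1)  # placeholder importance proxy
    keep = np.argsort(scores)[-budget:]            # indices of the top-`budget` tokens
    return video_tokens[np.sort(keep)]             # preserve original token order

# Over-sample 1024 tokens, then keep a 256-token budget (roughly 1/4 of the tokens,
# mirroring the abstract's "1/4 tokens only" setting).
tokens = np.random.default_rng(0).normal(size=(1024, 768))
selected = sample_tokens(tokens, budget=256)
print(selected.shape)  # (256, 768)
```

Because selection happens at the input rather than inside the architecture, a scheme like this can wrap an existing video training pipeline without changing the backbone, which is consistent with the abstract's claim that Flux drops into popular frameworks at nearly no additional cost.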

Results

Task                      | Dataset      | Metric              | Value | Model
Action Classification     | Kinetics-400 | Acc@1               | 90    | FluxViT-B
Action Classification     | Kinetics-400 | Parameters (M)      | 97    | FluxViT-B
Action Classification     | Kinetics-400 | Acc@1               | 88    | FluxViT-S
Action Classification     | Kinetics-400 | Parameters (M)      | 24    | FluxViT-S
Zero-Shot Video Retrieval | MSR-VTT      | text-to-video R@1   | 49.9  | FluxViT-B
Zero-Shot Video Retrieval | MSR-VTT      | text-to-video R@5   | 71    | FluxViT-B
Zero-Shot Video Retrieval | MSR-VTT      | text-to-video R@10  | 79.6  | FluxViT-B
Zero-Shot Video Retrieval | MSR-VTT      | video-to-text R@1   | 49.4  | FluxViT-B
Zero-Shot Video Retrieval | MSR-VTT      | video-to-text R@5   | 73.9  | FluxViT-B
Zero-Shot Video Retrieval | MSR-VTT      | video-to-text R@10  | 82.4  | FluxViT-B
Zero-Shot Video Retrieval | MSR-VTT      | text-to-video R@1   | 45    | FluxViT-S
Zero-Shot Video Retrieval | MSR-VTT      | text-to-video R@5   | 67.5  | FluxViT-S
Zero-Shot Video Retrieval | MSR-VTT      | text-to-video R@10  | 75.8  | FluxViT-S
Zero-Shot Video Retrieval | MSR-VTT      | video-to-text R@1   | 44.9  | FluxViT-S
Zero-Shot Video Retrieval | MSR-VTT      | video-to-text R@5   | 68.2  | FluxViT-S
Zero-Shot Video Retrieval | MSR-VTT      | video-to-text R@10  | 76.5  | FluxViT-S

Related Papers

SurgBench: A Unified Large-Scale Benchmark for Surgical Video Analysis (2025-06-09)
From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos (2025-06-05)
Spatio-Temporal Joint Density Driven Learning for Skeleton-Based Action Recognition (2025-05-29)
SoccerChat: Integrating Multimodal Data for Enhanced Soccer Game Understanding (2025-05-22)
Mouse Lockbox Dataset: Behavior Recognition for Mice Solving Lockboxes (2025-05-21)
Domain Adaptation of VLM for Soccer Video Understanding (2025-05-20)
OwlSight: A Robust Illumination Adaptation Framework for Dark Video Human Action Recognition (2025-03-30)
CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition (2025-03-30)