Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Towards Weakly Supervised End-to-end Learning for Long-video Action Recognition

Jiaming Zhou, Hanjun Li, Kun-Yu Lin, Junwei Liang

2023-11-28 · Tasks: Action Segmentation, Action Classification, Weakly Supervised Action Segmentation (Action Set), Temporal Sentence Grounding, Long-video Activity Recognition, Action Recognition, Temporal Action Localization, Action Understanding
Paper · PDF

Abstract

Developing end-to-end action recognition models on long videos is fundamental to long-video action understanding. Because end-to-end training on entire long videos is prohibitively expensive, existing works generally train models on short clips trimmed from the long videos. However, this "trimming-then-training" practice requires action interval annotations for clip-level supervision, i.e., knowing which actions are trimmed into which clips. Unfortunately, collecting such annotations is very expensive and prevents model training at scale. To this end, this work builds a weakly supervised end-to-end framework for training recognition models on long videos with only video-level action category labels. Without knowing the precise temporal locations of actions in long videos, our proposed weakly supervised framework, AdaptFocus, estimates where and how likely actions will occur in order to adaptively focus on informative action clips for end-to-end training. The effectiveness of AdaptFocus is demonstrated on three long-video datasets. Furthermore, for downstream long-video tasks, AdaptFocus provides a weakly supervised feature-extraction pipeline that yields more robust long-video features, significantly advancing the state-of-the-art methods on those tasks. We will release the code and models.
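The abstract describes the mechanism only at a high level: score candidate clips of a long video by how likely they are to contain actions, then spend the expensive end-to-end training budget on the highest-scoring clips, supervised only by video-level labels. The PyTorch sketch below is a minimal illustration of that idea, not the authors' AdaptFocus implementation; the class name, the cheap per-clip features, the top-k selection, and the weighted pooling are all hypothetical choices made for illustration.

```python
# Hypothetical sketch of weakly supervised "adaptive focusing" on long videos.
# Not the authors' released code; names and design choices are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveClipFocus(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int, clips_per_video: int = 4):
        super().__init__()
        self.backbone = backbone                   # heavy clip encoder (e.g. an I3D/MViT-style network)
        self.clip_scorer = nn.Linear(feat_dim, 1)  # scores how informative each candidate clip is
        self.classifier = nn.Linear(feat_dim, num_classes)
        self.k = clips_per_video

    def forward(self, cheap_feats: torch.Tensor, clips: torch.Tensor) -> torch.Tensor:
        # cheap_feats: (B, N, D)  lightweight features for all N candidate clips of each long video
        # clips:       (B, N, C, T, H, W)  raw clips; only k of them get the expensive backbone pass
        scores = self.clip_scorer(cheap_feats).squeeze(-1)      # (B, N) "where actions likely occur"
        weights = scores.softmax(dim=-1)
        idx = weights.topk(self.k, dim=-1).indices              # (B, k) most informative clips
        logits = []
        for b in range(clips.size(0)):
            feats = self.backbone(clips[b, idx[b]])             # (k, D) end-to-end trainable pass
            w = weights[b, idx[b]]
            w = w / w.sum()                                     # renormalize over the selected clips
            pooled = (w.unsqueeze(-1) * feats).sum(dim=0)       # score-weighted pooling keeps the
            logits.append(self.classifier(pooled))              # scorer in the gradient path
        return torch.stack(logits)                              # (B, num_classes) video-level logits


# Training would use only video-level action labels (weak supervision), e.g.:
# loss = F.binary_cross_entropy_with_logits(model(cheap_feats, clips), video_labels)
```

In this sketch, only the k selected clips pass through the heavy backbone, which is what makes end-to-end optimization on long videos affordable; how the real framework scores, samples, and supervises clips is specified in the paper itself.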

Results

Task | Dataset | Metric | Value | Model
Video Understanding | Charades-STA | R1@0.5 | 62.4 | AdaFocus (Full, MViT-Charades-Pretrain-feature, MMN model)
Video Understanding | Charades-STA | R1@0.7 | 38.6 | AdaFocus (Full, MViT-Charades-Pretrain-feature, MMN model)
Video Understanding | Charades-STA | R5@0.5 | 89.4 | AdaFocus (Full, MViT-Charades-Pretrain-feature, MMN model)
Video Understanding | Charades-STA | R5@0.7 | 66.4 | AdaFocus (Full, MViT-Charades-Pretrain-feature, MMN model)
Video Understanding | Charades-STA | R1@0.5 | 56.7 | AdaFocus (Full, I3D-Charades-Pretrain-feature, MMN model)
Video Understanding | Charades-STA | R1@0.7 | 35.6 | AdaFocus (Full, I3D-Charades-Pretrain-feature, MMN model)
Video Understanding | Charades-STA | R5@0.5 | 87.9 | AdaFocus (Full, I3D-Charades-Pretrain-feature, MMN model)
Video Understanding | Charades-STA | R5@0.7 | 65 | AdaFocus (Full, I3D-Charades-Pretrain-feature, MMN model)
Video Understanding | Charades-STA | R1@0.5 | 51.7 | AdaFocus (Weak, MViT-Charades-Pretrain-feature, CPL model)
Video Understanding | Charades-STA | R1@0.7 | 23.2 | AdaFocus (Weak, MViT-Charades-Pretrain-feature, CPL model)
Video Understanding | Charades-STA | R5@0.5 | 85.2 | AdaFocus (Weak, MViT-Charades-Pretrain-feature, CPL model)
Video Understanding | Charades-STA | R5@0.7 | 52.6 | AdaFocus (Weak, MViT-Charades-Pretrain-feature, CPL model)
Video Understanding | Charades-STA | R1@0.5 | 49.1 | AdaFocus (Weak, I3D-Charades-Pretrain-feature, CPL model)
Video Understanding | Charades-STA | R1@0.7 | 22.4 | AdaFocus (Weak, I3D-Charades-Pretrain-feature, CPL model)
Video Understanding | Charades-STA | R5@0.5 | 84.2 | AdaFocus (Weak, I3D-Charades-Pretrain-feature, CPL model)
Video Understanding | Charades-STA | R5@0.7 | 51.8 | AdaFocus (Weak, I3D-Charades-Pretrain-feature, CPL model)
Video Understanding | Charades-STA | R1@0.5 | 50.1 | AdaFocus (Semi-weak, MViT-Charades-Pretrain-feature, D3G model)
Video Understanding | Charades-STA | R1@0.7 | 21.8 | AdaFocus (Semi-weak, MViT-Charades-Pretrain-feature, D3G model)
Video Understanding | Charades-STA | R5@0.5 | 86.1 | AdaFocus (Semi-weak, MViT-Charades-Pretrain-feature, D3G model)
Video Understanding | Charades-STA | R5@0.7 | 54.6 | AdaFocus (Semi-weak, MViT-Charades-Pretrain-feature, D3G model)
Video Understanding | Charades-STA | R1@0.5 | 46.9 | AdaFocus (Semi-weak, I3D-Charades-Pretrain-feature, D3G model)
Video Understanding | Charades-STA | R1@0.7 | 21.1 | AdaFocus (Semi-weak, I3D-Charades-Pretrain-feature, D3G model)
Video Understanding | Charades-STA | R5@0.5 | 79.3 | AdaFocus (Semi-weak, I3D-Charades-Pretrain-feature, D3G model)
Video Understanding | Charades-STA | R5@0.7 | 49.2 | AdaFocus (Semi-weak, I3D-Charades-Pretrain-feature, D3G model)
Video Understanding | Breakfast | mAP | 79.5 | AdaFocus (MViT-Breakfast-Pretrain-feature, GHRM)
Video Understanding | Breakfast | mAP | 79.2 | AdaFocus (MViT-Breakfast-Pretrain-feature, Timeception)
Video Understanding | Breakfast | mAP | 70.4 | AdaFocus (I3D-Breakfast-Pretrain-feature, Timeception)
Video Understanding | Breakfast | mAP | 69.6 | AdaFocus (I3D-Breakfast-Pretrain-feature, GHRM)
Video | Charades-STA | R1@0.5 | 62.4 | AdaFocus (Full, MViT-Charades-Pretrain-feature, MMN model)
Video | Charades-STA | R1@0.7 | 38.6 | AdaFocus (Full, MViT-Charades-Pretrain-feature, MMN model)
Video | Charades-STA | R5@0.5 | 89.4 | AdaFocus (Full, MViT-Charades-Pretrain-feature, MMN model)
Video | Charades-STA | R5@0.7 | 66.4 | AdaFocus (Full, MViT-Charades-Pretrain-feature, MMN model)
Video | Charades-STA | R1@0.5 | 56.7 | AdaFocus (Full, I3D-Charades-Pretrain-feature, MMN model)
Video | Charades-STA | R1@0.7 | 35.6 | AdaFocus (Full, I3D-Charades-Pretrain-feature, MMN model)
Video | Charades-STA | R5@0.5 | 87.9 | AdaFocus (Full, I3D-Charades-Pretrain-feature, MMN model)
Video | Charades-STA | R5@0.7 | 65 | AdaFocus (Full, I3D-Charades-Pretrain-feature, MMN model)
Video | Charades-STA | R1@0.5 | 51.7 | AdaFocus (Weak, MViT-Charades-Pretrain-feature, CPL model)
Video | Charades-STA | R1@0.7 | 23.2 | AdaFocus (Weak, MViT-Charades-Pretrain-feature, CPL model)
Video | Charades-STA | R5@0.5 | 85.2 | AdaFocus (Weak, MViT-Charades-Pretrain-feature, CPL model)
Video | Charades-STA | R5@0.7 | 52.6 | AdaFocus (Weak, MViT-Charades-Pretrain-feature, CPL model)
Video | Charades-STA | R1@0.5 | 49.1 | AdaFocus (Weak, I3D-Charades-Pretrain-feature, CPL model)
Video | Charades-STA | R1@0.7 | 22.4 | AdaFocus (Weak, I3D-Charades-Pretrain-feature, CPL model)
Video | Charades-STA | R5@0.5 | 84.2 | AdaFocus (Weak, I3D-Charades-Pretrain-feature, CPL model)
Video | Charades-STA | R5@0.7 | 51.8 | AdaFocus (Weak, I3D-Charades-Pretrain-feature, CPL model)
Video | Charades-STA | R1@0.5 | 50.1 | AdaFocus (Semi-weak, MViT-Charades-Pretrain-feature, D3G model)
Video | Charades-STA | R1@0.7 | 21.8 | AdaFocus (Semi-weak, MViT-Charades-Pretrain-feature, D3G model)
Video | Charades-STA | R5@0.5 | 86.1 | AdaFocus (Semi-weak, MViT-Charades-Pretrain-feature, D3G model)
Video | Charades-STA | R5@0.7 | 54.6 | AdaFocus (Semi-weak, MViT-Charades-Pretrain-feature, D3G model)
Video | Charades-STA | R1@0.5 | 46.9 | AdaFocus (Semi-weak, I3D-Charades-Pretrain-feature, D3G model)
Video | Charades-STA | R1@0.7 | 21.1 | AdaFocus (Semi-weak, I3D-Charades-Pretrain-feature, D3G model)
Video | Charades-STA | R5@0.5 | 79.3 | AdaFocus (Semi-weak, I3D-Charades-Pretrain-feature, D3G model)
Video | Charades-STA | R5@0.7 | 49.2 | AdaFocus (Semi-weak, I3D-Charades-Pretrain-feature, D3G model)
Video | Breakfast | mAP | 79.5 | AdaFocus (MViT-Breakfast-Pretrain-feature, GHRM)
Video | Breakfast | mAP | 79.2 | AdaFocus (MViT-Breakfast-Pretrain-feature, Timeception)
Video | Breakfast | mAP | 70.4 | AdaFocus (I3D-Breakfast-Pretrain-feature, Timeception)
Video | Breakfast | mAP | 69.6 | AdaFocus (I3D-Breakfast-Pretrain-feature, GHRM)
Video | Charades | mAP | 47.8 | AdaFocus (weak supervision, MViT-B-24, 32x3)
Video | Charades | mAP | 41.4 | AdaFocus (weak supervision, MViT-B-K400-pretrain, 16x4)
Video | Charades | mAP | 41.2 | AdaFocus (weak supervision, X3D-L, 32x3)
Video | Charades | mAP | 39.3 | AdaFocus (weak supervision, Slowfast-R50, 16x8)
Action Localization | Breakfast | Acc | 78 | AdaFocus (newly extracted I3D-features, LT-Context model)
Action Localization | Breakfast | Average F1 | 76.2 | AdaFocus (newly extracted I3D-features, LT-Context model)
Action Localization | Breakfast | Edit | 78.3 | AdaFocus (newly extracted I3D-features, LT-Context model)
Action Localization | Breakfast | F1@10% | 82.1 | AdaFocus (newly extracted I3D-features, LT-Context model)
Action Localization | Breakfast | F1@25% | 79 | AdaFocus (newly extracted I3D-features, LT-Context model)
Action Localization | Breakfast | F1@50% | 67.5 | AdaFocus (newly extracted I3D-features, LT-Context model)
Action Localization | Breakfast | Acc | 49.6 | AdaFocus (newly extracted I3D-features, POC model)
Action Segmentation | Breakfast | Acc | 78 | AdaFocus (newly extracted I3D-features, LT-Context model)
Action Segmentation | Breakfast | Average F1 | 76.2 | AdaFocus (newly extracted I3D-features, LT-Context model)
Action Segmentation | Breakfast | Edit | 78.3 | AdaFocus (newly extracted I3D-features, LT-Context model)
Action Segmentation | Breakfast | F1@10% | 82.1 | AdaFocus (newly extracted I3D-features, LT-Context model)
Action Segmentation | Breakfast | F1@25% | 79 | AdaFocus (newly extracted I3D-features, LT-Context model)
Action Segmentation | Breakfast | F1@50% | 67.5 | AdaFocus (newly extracted I3D-features, LT-Context model)
Action Segmentation | Breakfast | Acc | 49.6 | AdaFocus (newly extracted I3D-features, POC model)
Temporal Sentence Grounding | Charades-STA | R1@0.5 | 62.4 | AdaFocus (Full, MViT-Charades-Pretrain-feature, MMN model)
Temporal Sentence Grounding | Charades-STA | R1@0.7 | 38.6 | AdaFocus (Full, MViT-Charades-Pretrain-feature, MMN model)
Temporal Sentence Grounding | Charades-STA | R5@0.5 | 89.4 | AdaFocus (Full, MViT-Charades-Pretrain-feature, MMN model)
Temporal Sentence Grounding | Charades-STA | R5@0.7 | 66.4 | AdaFocus (Full, MViT-Charades-Pretrain-feature, MMN model)
Temporal Sentence Grounding | Charades-STA | R1@0.5 | 56.7 | AdaFocus (Full, I3D-Charades-Pretrain-feature, MMN model)
Temporal Sentence Grounding | Charades-STA | R1@0.7 | 35.6 | AdaFocus (Full, I3D-Charades-Pretrain-feature, MMN model)
Temporal Sentence Grounding | Charades-STA | R5@0.5 | 87.9 | AdaFocus (Full, I3D-Charades-Pretrain-feature, MMN model)
Temporal Sentence Grounding | Charades-STA | R5@0.7 | 65 | AdaFocus (Full, I3D-Charades-Pretrain-feature, MMN model)
Temporal Sentence Grounding | Charades-STA | R1@0.5 | 51.7 | AdaFocus (Weak, MViT-Charades-Pretrain-feature, CPL model)
Temporal Sentence Grounding | Charades-STA | R1@0.7 | 23.2 | AdaFocus (Weak, MViT-Charades-Pretrain-feature, CPL model)
Temporal Sentence Grounding | Charades-STA | R5@0.5 | 85.2 | AdaFocus (Weak, MViT-Charades-Pretrain-feature, CPL model)
Temporal Sentence Grounding | Charades-STA | R5@0.7 | 52.6 | AdaFocus (Weak, MViT-Charades-Pretrain-feature, CPL model)
Temporal Sentence Grounding | Charades-STA | R1@0.5 | 49.1 | AdaFocus (Weak, I3D-Charades-Pretrain-feature, CPL model)
Temporal Sentence Grounding | Charades-STA | R1@0.7 | 22.4 | AdaFocus (Weak, I3D-Charades-Pretrain-feature, CPL model)
Temporal Sentence Grounding | Charades-STA | R5@0.5 | 84.2 | AdaFocus (Weak, I3D-Charades-Pretrain-feature, CPL model)
Temporal Sentence Grounding | Charades-STA | R5@0.7 | 51.8 | AdaFocus (Weak, I3D-Charades-Pretrain-feature, CPL model)
Temporal Sentence Grounding | Charades-STA | R1@0.5 | 50.1 | AdaFocus (Semi-weak, MViT-Charades-Pretrain-feature, D3G model)
Temporal Sentence Grounding | Charades-STA | R1@0.7 | 21.8 | AdaFocus (Semi-weak, MViT-Charades-Pretrain-feature, D3G model)
Temporal Sentence Grounding | Charades-STA | R5@0.5 | 86.1 | AdaFocus (Semi-weak, MViT-Charades-Pretrain-feature, D3G model)
Temporal Sentence Grounding | Charades-STA | R5@0.7 | 54.6 | AdaFocus (Semi-weak, MViT-Charades-Pretrain-feature, D3G model)
Temporal Sentence Grounding | Charades-STA | R1@0.5 | 46.9 | AdaFocus (Semi-weak, I3D-Charades-Pretrain-feature, D3G model)
Temporal Sentence Grounding | Charades-STA | R1@0.7 | 21.1 | AdaFocus (Semi-weak, I3D-Charades-Pretrain-feature, D3G model)
Temporal Sentence Grounding | Charades-STA | R5@0.5 | 79.3 | AdaFocus (Semi-weak, I3D-Charades-Pretrain-feature, D3G model)
Temporal Sentence Grounding | Charades-STA | R5@0.7 | 49.2 | AdaFocus (Semi-weak, I3D-Charades-Pretrain-feature, D3G model)
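For reference, the R1@m and R5@m columns in the Charades-STA rows follow the standard temporal sentence grounding convention: the percentage of sentence queries for which at least one of the top-1 (or top-5) predicted moments has temporal IoU of at least m with the ground-truth moment. The Breakfast and Charades rows report mAP, and the segmentation rows report frame accuracy, segmental Edit score, and segmental F1 at several overlap thresholds. A minimal sketch of the recall computation is shown below; the function names and data layout are illustrative, not taken from the paper's evaluation code.

```python
# Sketch of the R{K}@{m} metric used in the Charades-STA rows above.
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(predictions, ground_truths, k=1, iou_thresh=0.5):
    """predictions: per-query ranked lists of (start, end) moments;
    ground_truths: one (start, end) moment per query.
    Returns the percentage of queries with a top-k hit at the given IoU."""
    hits = sum(
        any(temporal_iou(p, gt) >= iou_thresh for p in preds[:k])
        for preds, gt in zip(predictions, ground_truths)
    )
    return 100.0 * hits / len(ground_truths)


# Example usage corresponding to the table's columns (hypothetical inputs):
# r1_05 = recall_at_k(preds, gts, k=1, iou_thresh=0.5)   # R1@0.5
# r5_07 = recall_at_k(preds, gts, k=5, iou_thresh=0.7)   # R5@0.7
```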

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
Self-supervised pretraining of vision transformers for animal behavioral analysis and neural encoding (2025-07-13)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
LLaVA-Pose: Enhancing Human Pose and Action Understanding via Keypoint-Integrated Instruction Tuning (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)