Anticipative Video Transformer

Rohit Girdhar, Kristen Grauman

2021-06-03 | ICCV 2021 10 | Action Anticipation
Paper | PDF | Code (official)

Abstract

We propose Anticipative Video Transformer (AVT), an end-to-end attention-based video modeling architecture that attends to the previously observed video in order to anticipate future actions. We train the model jointly to predict the next action in a video sequence, while also learning frame feature encoders that are predictive of successive future frames' features. Compared to existing temporal aggregation strategies, AVT has the advantage of maintaining the sequential progression of observed actions while still capturing long-range dependencies, both of which are critical for the anticipation task. Through extensive experiments, we show that AVT obtains the best reported performance on four popular action anticipation benchmarks: EpicKitchens-55, EpicKitchens-100, EGTEA Gaze+, and 50-Salads; it also wins first place in the EpicKitchens-100 CVPR'21 challenge.
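The two ideas in the abstract, attending only to previously observed frames and training jointly on next-action prediction and future-feature prediction, can be illustrated with a small sketch. This is not the authors' implementation: the function names (`causal_attention`, `joint_loss`), the use of plain scaled dot-product attention without learned projections, the squared-error feature loss, and the `feat_weight` parameter are all assumptions made for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_attention(frames):
    """Output t is a weighted mix of frames 0..t only (no look-ahead)."""
    d = len(frames[0])
    out = []
    for t, query in enumerate(frames):
        past = frames[: t + 1]  # the causal mask: frames after t are excluded
        weights = softmax([dot(query, key) / math.sqrt(d) for key in past])
        out.append([sum(w * f[i] for w, f in zip(weights, past)) for i in range(d)])
    return out

def joint_loss(frames, next_action, class_weights, feat_weight=1.0):
    """Next-action cross-entropy plus future-feature regression (hypothetical form)."""
    ctx = causal_attention(frames)
    # Classify the upcoming action from the context at the last observed frame.
    logits = [dot(w, ctx[-1]) for w in class_weights]
    action_loss = -math.log(softmax(logits)[next_action])
    # Treat the context at step t as a prediction of the feature at step t + 1.
    pairs = list(zip(ctx[:-1], frames[1:]))
    feat_loss = 0.0
    if pairs:
        feat_loss = sum(
            sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)
            for pred, target in pairs
        ) / len(pairs)
    return action_loss + feat_weight * feat_loss, action_loss, feat_loss
```

Because each output depends only on earlier frames, replacing the final frame leaves every earlier context vector unchanged, which is the property that lets one forward pass supervise a prediction at every time step.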

Results

Each result below is listed under five tasks: Activity Recognition, Action Recognition, Action Anticipation, 2D Human Pose Estimation, and Action Recognition In Videos.

| Dataset | Metric | Value | Model |
|---|---|---|---|
| EPIC-KITCHENS-55 (Seen test set (S1)) | Top 1 Accuracy - Act. | 16.84 | AVT+ |
| EPIC-KITCHENS-55 (Seen test set (S1)) | Top 1 Accuracy - Noun | 20.16 | AVT+ |
| EPIC-KITCHENS-55 (Seen test set (S1)) | Top 1 Accuracy - Verb | 34.36 | AVT+ |
| EPIC-KITCHENS-55 (Seen test set (S1)) | Top 5 Accuracy - Act. | 36.52 | AVT+ |
| EPIC-KITCHENS-55 (Seen test set (S1)) | Top 5 Accuracy - Noun | 51.57 | AVT+ |
| EPIC-KITCHENS-55 (Seen test set (S1)) | Top 5 Accuracy - Verb | 80.03 | AVT+ |
| EPIC-KITCHENS-100 (test) | Recall@5 | 16.7 | AVT++ |
| EPIC-KITCHENS-100 (test) | Recall@5 | 12.6 | AVT+ |
| EPIC-KITCHENS-55 (Unseen test set (S2)) | Top 1 Accuracy - Act. | 10.41 | AVT+ |
| EPIC-KITCHENS-55 (Unseen test set (S2)) | Top 1 Accuracy - Noun | 15.64 | AVT+ |
| EPIC-KITCHENS-55 (Unseen test set (S2)) | Top 1 Accuracy - Verb | 30.66 | AVT+ |
| EPIC-KITCHENS-55 (Unseen test set (S2)) | Top 5 Accuracy - Act. | 24.27 | AVT+ |
| EPIC-KITCHENS-55 (Unseen test set (S2)) | Top 5 Accuracy - Noun | 40.76 | AVT+ |
| EPIC-KITCHENS-55 (Unseen test set (S2)) | Top 5 Accuracy - Verb | 72.17 | AVT+ |
| EPIC-KITCHENS-100 | Recall@5 | 15.9 | AVT+ |
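
The results use two kinds of metric: top-k accuracy and Recall@5. The page does not define either, so the sketch below rests on assumptions: top-k accuracy is taken as the fraction of samples whose true label is among the k highest-scoring classes, and Recall@5 is taken to be class-mean top-5 recall (recall computed per class, then averaged). The function names and the toy scores are hypothetical.

```python
from collections import defaultdict

def top_k(row, k):
    """Indices of the k highest-scoring classes for one sample."""
    return sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]

def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the top-k classes."""
    hits = sum(label in top_k(row, k) for row, label in zip(scores, labels))
    return hits / len(labels)

def class_mean_recall_at_k(scores, labels, k):
    """Top-k recall per class, averaged over the classes present in labels."""
    hit = defaultdict(int)
    total = defaultdict(int)
    for row, label in zip(scores, labels):
        total[label] += 1
        hit[label] += label in top_k(row, k)
    return sum(hit[c] / total[c] for c in total) / len(total)

# Toy example: 4 samples, 3 classes (values are made up).
scores = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.7, 0.2, 0.1],
    [0.1, 0.2, 0.7],
]
labels = [0, 0, 1, 2]
acc1 = top_k_accuracy(scores, labels, k=1)            # 3 of 4 samples correct
rec1 = class_mean_recall_at_k(scores, labels, k=1)    # classes score 1, 0, 1
```

The two metrics diverge when classes are imbalanced: in the toy data the frequent class 0 lifts accuracy to 0.75, while the class-mean figure is 2/3 because the missed class 1 counts as much as any other class.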

Related Papers

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning (2025-06-11)
Vision and Intention Boost Large Language Model in Long-Term Action Anticipation (2025-05-03)
Hierarchical and Multimodal Data for Daily Activity Understanding (2025-04-24)
Action Anticipation from SoccerNet Football Video Broadcasts (2025-04-16)
ICPR 2024 Competition on Rider Intention Prediction (2025-03-11)
Learning to Generate Long-term Future Narrations Describing Activities of Daily Living (2025-03-03)
Multimodal Large Models Are Effective Action Anticipators (2025-01-01)
MANTA: Diffusion Mamba for Efficient and Effective Stochastic Long-Term Dense Action Anticipation (2025-01-01)