Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Anticipative Feature Fusion Transformer for Multi-Modal Action Anticipation

Zeyun Zhong, David Schneider, Michael Voit, Rainer Stiefelhagen, Jürgen Beyerer

2022-10-23 | Action Anticipation

Abstract

Although human action anticipation is an inherently multi-modal task, state-of-the-art methods on well-known action anticipation datasets leverage this data by applying ensemble methods and averaging the scores of unimodal anticipation networks. In this work, we introduce transformer-based modality fusion techniques, which unify multi-modal data at an early stage. Our Anticipative Feature Fusion Transformer (AFFT) proves to be superior to popular score fusion approaches and presents state-of-the-art results, outperforming previous methods on EpicKitchens-100 and EGTEA Gaze+. Our model is easily extensible and allows for adding new modalities without architectural changes. Consequently, we extracted audio features on EpicKitchens-100, which we add to the set of commonly used features in the community.
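
The contrast the abstract draws between score fusion and early feature fusion can be illustrated with a small sketch. This is not the AFFT architecture: `score_fusion` shows the baseline of averaging unimodal class probabilities, while `attention_feature_fusion` is a hypothetical, parameter-free stand-in that mixes per-modality feature vectors with one dot-product attention step before any classifier would be applied. All function names and the toy inputs are assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def score_fusion(logits_per_modality):
    """Late fusion baseline: average the class probabilities that each
    unimodal network predicts (one logit vector per modality)."""
    probs = [softmax(logits) for logits in logits_per_modality]
    n = len(probs)
    num_classes = len(probs[0])
    return [sum(p[c] for p in probs) / n for c in range(num_classes)]


def attention_feature_fusion(features_per_modality):
    """Early fusion stand-in (NOT the paper's model): treat each modality's
    feature vector as a token, apply one unparameterized scaled dot-product
    self-attention step across tokens, then mean-pool into a single vector."""
    d = len(features_per_modality[0])
    scale = math.sqrt(d)
    attended = []
    for query in features_per_modality:
        # Attention weights of this modality over all modalities.
        weights = softmax([
            sum(q * k for q, k in zip(query, key)) / scale
            for key in features_per_modality
        ])
        attended.append([
            sum(w * value[i] for w, value in zip(weights, features_per_modality))
            for i in range(d)
        ])
    n = len(attended)
    return [sum(token[i] for token in attended) / n for i in range(d)]


# Toy inputs (assumed): two modalities, three classes / four feature dims.
rgb_logits, audio_logits = [2.0, 0.5, 0.1], [0.3, 1.5, 0.2]
late = score_fusion([rgb_logits, audio_logits])

rgb_feat, audio_feat = [0.2, 0.1, 0.4, 0.3], [0.5, 0.0, 0.1, 0.2]
early = attention_feature_fusion([rgb_feat, audio_feat])
```

In the late variant, each modality is reduced to class scores before the modalities ever interact. In the early variant, the modalities exchange information at the feature level, and a single prediction head would operate on the fused vector. Adding a modality only means appending one more token to the input list, which loosely mirrors the extensibility claim in the abstract.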

Results

Task                          Dataset                   Metric    Value  Model
Activity Recognition          EPIC-KITCHENS-100 (test)  Recall@5  14.9   AFFT
Activity Recognition          EPIC-KITCHENS-100         Recall@5  18.5   AFFT
Action Recognition            EPIC-KITCHENS-100 (test)  Recall@5  14.9   AFFT
Action Recognition            EPIC-KITCHENS-100         Recall@5  18.5   AFFT
Action Anticipation           EPIC-KITCHENS-100 (test)  Recall@5  14.9   AFFT
Action Anticipation           EPIC-KITCHENS-100         Recall@5  18.5   AFFT
2D Human Pose Estimation      EPIC-KITCHENS-100 (test)  Recall@5  14.9   AFFT
2D Human Pose Estimation      EPIC-KITCHENS-100         Recall@5  18.5   AFFT
Action Recognition In Videos  EPIC-KITCHENS-100 (test)  Recall@5  14.9   AFFT
Action Recognition In Videos  EPIC-KITCHENS-100         Recall@5  18.5   AFFT
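
For readers unfamiliar with the metric, here is a minimal sketch of how a Recall@5 figure might be computed. It assumes the reported numbers are class-mean top-5 recall (recall is computed per ground-truth class and then averaged over classes); this page does not confirm that definition. The function name and toy data are hypothetical.

```python
from collections import defaultdict


def class_mean_topk_recall(scores, labels, k=5):
    """Class-mean top-k recall (assumed definition): for each ground-truth
    class, the fraction of its samples whose label is among the k
    highest-scoring predictions, averaged over the classes that occur."""
    if not labels:
        raise ValueError("labels must not be empty")
    hits = defaultdict(int)
    counts = defaultdict(int)
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this sample.
        topk = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        counts[label] += 1
        if label in topk:
            hits[label] += 1
    recalls = [hits[c] / counts[c] for c in counts]
    return sum(recalls) / len(recalls)


# Toy data (assumed): three samples, three classes.
toy_scores = [
    [0.9, 0.1, 0.0],  # true class 0, ranked first
    [0.2, 0.7, 0.1],  # true class 1, ranked first
    [0.6, 0.3, 0.1],  # true class 1, ranked second
]
toy_labels = [0, 1, 1]
recall_at_1 = class_mean_topk_recall(toy_scores, toy_labels, k=1)
recall_at_2 = class_mean_topk_recall(toy_scores, toy_labels, k=2)
```

Averaging over classes rather than over samples keeps frequent actions from dominating the score, which matters on long-tailed datasets.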

Related Papers

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning (2025-06-11)
Vision and Intention Boost Large Language Model in Long-Term Action Anticipation (2025-05-03)
Hierarchical and Multimodal Data for Daily Activity Understanding (2025-04-24)
Action Anticipation from SoccerNet Football Video Broadcasts (2025-04-16)
ICPR 2024 Competition on Rider Intention Prediction (2025-03-11)
Learning to Generate Long-term Future Narrations Describing Activities of Daily Living (2025-03-03)
Multimodal Large Models Are Effective Action Anticipators (2025-01-01)
MANTA: Diffusion Mamba for Efficient and Effective Stochastic Long-Term Dense Action Anticipation (2025-01-01)