Multi-Modal Temporal Convolutional Network for Anticipating Actions in Egocentric Videos

Olga Zatsarynna, Yazan Abu Farha, Juergen Gall

2021-07-18

Action Anticipation, Autonomous Driving, Self-Driving Cars

Abstract

Anticipating human actions is an important task that must be addressed to develop reliable intelligent agents, such as self-driving cars or robot assistants. While the ability to make accurate future predictions is crucial when designing anticipation approaches, the speed at which inference is performed is no less important. Methods that are accurate but not sufficiently fast introduce high latency into the decision process, which increases the reaction time of the underlying system. This poses a problem for domains such as autonomous driving, where reaction time is crucial. In this work, we propose a simple and effective multi-modal architecture based on temporal convolutions. Our approach stacks a hierarchy of temporal convolutional layers and, to ensure fast prediction, does not rely on recurrent layers. We further introduce a multi-modal fusion mechanism that captures the pairwise interactions between RGB, flow, and object modalities. Results on two large-scale datasets of egocentric videos, EPIC-Kitchens-55 and EPIC-Kitchens-100, show that our approach achieves performance comparable to state-of-the-art approaches while being significantly faster.
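The two ideas named in the abstract, a stack of temporal convolutions without recurrence and a fusion step over modality pairs, can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify kernel sizes, strides, feature dimensions, or the fusion operator, so the scalar per-frame features, the averaging kernels, the stride of 2, the element-wise product used for fusion, and all function names below are assumptions chosen only to keep the example small and runnable.

```python
from itertools import combinations


def temporal_conv(seq, kernel, stride=1):
    """Valid 1D convolution over time for a sequence of scalar features."""
    k = len(kernel)
    return [
        sum(w * seq[t + i] for i, w in enumerate(kernel))
        for t in range(0, len(seq) - k + 1, stride)
    ]


def temporal_stack(seq, kernels, stride=2):
    """Apply a hierarchy of temporal convolutions; each layer shortens the sequence."""
    for kernel in kernels:
        seq = temporal_conv(seq, kernel, stride)
    return seq


def pairwise_fusion(features):
    """Combine every pair of modalities.

    The element-wise product is an assumed stand-in for the paper's
    fusion mechanism, which the abstract does not describe in detail.
    """
    fused = {}
    for (name_a, a), (name_b, b) in combinations(features.items(), 2):
        fused[(name_a, name_b)] = [x * y for x, y in zip(a, b)]
    return fused


# Toy per-frame features for the three modalities named in the abstract.
streams = {
    "rgb": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "flow": [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    "obj": [2.0] * 8,
}

# Two layers of averaging kernels (hypothetical weights, not learned).
kernels = [[0.5, 0.5], [0.5, 0.5]]

encoded = {name: temporal_stack(seq, kernels) for name, seq in streams.items()}
fused = pairwise_fusion(encoded)

for pair, values in fused.items():
    print(pair, values)
```

Each modality is encoded in a single feed-forward pass over the whole window, with no step-by-step recurrent state, which is the property the abstract credits for fast prediction. With three modalities the fusion step yields three pairs: (rgb, flow), (rgb, obj), and (flow, obj).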

Results

Task	Dataset	Metric	Value	Model
Activity Recognition	EPIC-KITCHENS-100 (test)	recall@5	11	TBN
Action Recognition	EPIC-KITCHENS-100 (test)	recall@5	11	TBN
Action Anticipation	EPIC-KITCHENS-100 (test)	recall@5	11	TBN
2D Human Pose Estimation	EPIC-KITCHENS-100 (test)	recall@5	11	TBN
Action Recognition In Videos	EPIC-KITCHENS-100 (test)	recall@5	11	TBN
