ECO: Efficient Convolutional Network for Online Video Understanding

Mohammadreza Zolfaghari, Kamaljeet Singh, Thomas Brox

2018-04-24 | ECCV 2018

Tasks: Video Retrieval, Action Classification, Video Captioning, Video Classification, General Classification, Video Understanding, Action Recognition, Retrieval


Abstract

The state of the art in video understanding suffers from two problems: (1) The major part of reasoning is performed locally in the video, therefore, it misses important relationships within actions that span several seconds. (2) While there are local methods with fast per-frame processing, the processing of the whole video is not efficient and hampers fast video retrieval or online classification of long-term activities. In this paper, we introduce a network architecture that takes long-term content into account and enables fast per-video processing at the same time. The architecture is based on merging long-term content already in the network rather than in a post-hoc fusion. Together with a sampling strategy, which exploits that neighboring frames are largely redundant, this yields high-quality action classification and video captioning at up to 230 videos per second, where each video can consist of a few hundred frames. The approach achieves competitive performance across all datasets while being 10x to 80x faster than state-of-the-art methods.
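The sampling idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes the video is divided into a fixed number of roughly equal segments and that one frame is drawn at random from each segment, so that the selected frames cover the whole video while skipping redundant neighbors. The function name, parameter names, and the segment count of 16 are illustrative assumptions and are not taken from the text above.

```python
import random


def sample_frame_indices(num_frames, num_segments, rng=None):
    """Return one frame index per segment (hypothetical helper).

    The video is split into `num_segments` roughly equal segments and a
    single index is drawn uniformly at random from each one.
    """
    if num_frames <= 0 or num_segments <= 0:
        raise ValueError("num_frames and num_segments must be positive")
    rng = rng or random.Random()
    indices = []
    for i in range(num_segments):
        # Integer segment boundaries [start, end) for segment i.
        start = (i * num_frames) // num_segments
        end = ((i + 1) * num_frames) // num_segments
        # If the video has fewer frames than segments, a segment can be
        # empty; fall back to its start frame (frames then repeat).
        end = max(end, start + 1)
        indices.append(rng.randrange(start, end))
    return indices


# Example: pick 16 frames from a 300-frame video, seeded for repeatability.
frame_ids = sample_frame_indices(300, 16, random.Random(0))
```

With this scheme, the number of frames passed to the network stays constant regardless of video length, which is one plausible way to read the claim that a video of a few hundred frames can be processed quickly.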

Results

Task	Dataset	Metric	Value	Model
Activity Recognition	Something-Something V1	Top 1 Accuracy	46.4	ECO-Net (ImageNet pretrained)
Activity Recognition	Something-Something V1	Top 1 Accuracy	46.4	ECO-Net
Action Recognition	Something-Something V1	Top 1 Accuracy	46.4	ECO-Net (ImageNet pretrained)
Action Recognition	Something-Something V1	Top 1 Accuracy	46.4	ECO-Net
