Asynchronous Temporal Fields for Action Recognition

Gunnar A. Sigurdsson, Santosh Divvala, Ali Farhadi, Abhinav Gupta

2016-12-19CVPR 2017 7Action Classification Temporal Localization Action Recognition Temporal Action Localization Variational Inference

Paper PDF Code(official)Code

Abstract

Actions are more than just movements and trajectories: we cook to eat and we hold a cup to drink from it. A thorough understanding of videos requires going beyond appearance modeling and necessitates reasoning about the sequence of activities, as well as the higher-level constructs such as intentions. But how do we model and reason about these? We propose a fully-connected temporal CRF model for reasoning over various aspects of activities that includes objects, actions, and intentions, where the potentials are predicted by a deep network. End-to-end training of such structured models is a challenging endeavor: For inference and learning we need to construct mini-batches consisting of whole videos, leading to mini-batches with only a few videos. This causes high-correlation between data points leading to breakdown of the backprop algorithm. To address this challenge, we present an asynchronous variational inference method that allows efficient end-to-end training. Our method achieves a classification mAP of 22.4% on the Charades benchmark, outperforming the state-of-the-art (17.2% mAP), and offers equal gains on the task of temporal localization.

Results

Task	Dataset	Metric	Value	Model
Video	Charades	MAP	22.4	Asyn-TF
Action Detection	Charades	mAP	9.6	Sigurdsson et al.

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains2025-07-17 DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition2025-07-16 Interpretable Bayesian Tensor Network Kernel Machines with Automatic Rank and Feature Selection2025-07-15 Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment2025-07-01 EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception2025-06-26 Scalable Bayesian Low-Rank Adaptation of Large Language Models via Stochastic Variational Subspace Inference2025-06-26 Feature Hallucination for Self-supervised Action Recognition2025-06-25 CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition2025-06-25