Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Learning Latent Sub-events in Activity Videos Using Temporal Attention Filters

AJ Piergiovanni, Chenyou Fan, Michael S. Ryoo

2016-05-26 · Activity Recognition In Videos · Action Classification · Human Activity Recognition · Action Recognition In Videos · Activity Recognition

Paper · PDF · Code

Abstract

In this paper, we newly introduce the concept of temporal attention filters, and describe how they can be used for human activity recognition from videos. Many high-level activities are often composed of multiple temporal parts (e.g., sub-events) with different durations/speeds, and our objective is to make the model explicitly learn such temporal structure using multiple attention filters and benefit from them. Our temporal filters are designed to be fully differentiable, allowing end-to-end training of the temporal filters together with the underlying frame-based or segment-based convolutional neural network architectures. This paper presents an approach of learning a set of optimal static temporal attention filters to be shared across different videos, and extends this approach to dynamically adjust attention filters per test video using recurrent long short-term memory networks (LSTMs). This allows our temporal attention filters to learn latent sub-events specific to each activity. We experimentally confirm that the proposed concept of temporal attention filters benefits activity recognition, and we visualize the learned latent sub-events.
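The abstract describes each temporal attention filter as a fully differentiable function that pools a variable-length sequence of per-frame features into a fixed number of summary vectors, one per latent sub-event. A minimal NumPy sketch of that pooling idea follows, using a bank of Gaussian kernels parameterized by a center, a stride between kernels, and a width; the function and parameter names are illustrative (in the paper these parameters are learned end-to-end, either as shared static values or predicted per video by an LSTM), not the authors' actual code.

```python
import numpy as np

def temporal_attention_filter(T, N, center, stride, sigma):
    """Build an N x T matrix of temporal attention weights.

    Each of the N rows is a normalized Gaussian over the T frame
    indices; row centers are spaced `stride` apart around `center`.
    All operations are differentiable in (center, stride, sigma),
    which is what allows end-to-end training in the paper.
    """
    t = np.arange(T, dtype=np.float64)                      # frame indices 0..T-1
    mu = center + (np.arange(N) - (N - 1) / 2.0) * stride   # per-row Gaussian centers
    w = np.exp(-0.5 * ((t[None, :] - mu[:, None]) / sigma) ** 2)
    return w / w.sum(axis=1, keepdims=True)                 # each row sums to 1

def apply_filter(features, weights):
    """Pool per-frame features (T x D) into N summary vectors (N x D)."""
    return weights @ features

# Toy usage: 30 frames of 8-dim features pooled into 4 sub-event vectors.
T, N, D = 30, 4, 8
weights = temporal_attention_filter(T, N, center=15.0, stride=5.0, sigma=2.0)
feats = np.random.randn(T, D)
pooled = apply_filter(feats, weights)   # shape (4, 8), one vector per latent sub-event
```

Because the output shape depends only on N and D, not on T, the same filter bank can pool videos of different lengths into a fixed-size representation for the downstream classifier.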

Results

Task | Dataset | Metric | Value | Model
Video | DogCentric | Accuracy | 98.55 | VTFSA
Temporal Action Localization | DogCentric | Accuracy | 98.55 | VTFSA
Zero-Shot Learning | DogCentric | Accuracy | 98.55 | VTFSA
Action Localization | DogCentric | Accuracy | 98.55 | VTFSA
Activity Recognition In Videos | DogCentric | Accuracy | 98.55 | VTFSA

Related Papers

ZKP-FedEval: Verifiable and Privacy-Preserving Federated Evaluation using Zero-Knowledge Proofs (2025-07-15)
SEZ-HARN: Self-Explainable Zero-shot Human Activity Recognition Network (2025-06-25)
Efficient Retail Video Annotation: A Robust Key Frame Generation Approach for Product and Customer Interaction Analysis (2025-06-17)
DeSPITE: Exploring Contrastive Deep Skeleton-Pointcloud-IMU-Text Embeddings for Advanced Point Cloud Human Activity Understanding (2025-06-16)
MORIC: CSI Delay-Doppler Decomposition for Robust Wi-Fi-based Human Activity Recognition (2025-06-15)
AgentSense: Virtual Sensor Data Generation Using LLM Agents in Simulated Home Environments (2025-06-13)
ScalableHD: Scalable and High-Throughput Hyperdimensional Computing Inference on Multi-Core CPUs (2025-06-10)
SurgBench: A Unified Large-Scale Benchmark for Surgical Video Analysis (2025-06-09)