AssembleNet++: Assembling Modality Representations via Attention Connections

Michael S. Ryoo, AJ Piergiovanni, Juhana Kangaspunta, Anelia Angelova

2020-08-18
Action Classification, Activity Recognition

Abstract

We create a family of powerful video models that are able to: (i) learn interactions between semantic object information and raw appearance and motion features, and (ii) deploy attention in order to better learn the importance of features at each convolutional block of the network. A new network component named peer-attention is introduced, which dynamically learns the attention weights using another block or input modality. Even without pre-training, our models outperform previous work on standard public activity recognition datasets with continuous videos, establishing a new state of the art. We also confirm that our findings, namely neural connections from the object modality and the use of peer-attention, are generally applicable to different existing architectures, improving their performance. We explicitly name our model AssembleNet++. The code will be available at: https://sites.google.com/corp/view/assemblenet/
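The abstract describes peer-attention only at a high level: the attention weights applied to one block are computed from a different block or input modality. The following is a minimal sketch of that idea, not the authors' implementation. It assumes channel-wise gating (global average pooling of the peer features, a linear projection, then a sigmoid), and the names `peer_attention`, `weights`, and `bias` are hypothetical, chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def peer_attention(x, peer, weights, bias):
    """Gate the channels of `x` using attention computed from `peer`.

    x:       list of C_x channels, each a flat list of feature values
    peer:    list of C_p channels from another block or modality
    weights: C_x rows of C_p values (assumed learned projection)
    bias:    C_x values (assumed learned bias)
    Returns (gated_x, gates).
    """
    # Global average pooling: one descriptor per peer channel.
    pooled = [sum(ch) / len(ch) for ch in peer]
    # Linear projection followed by a sigmoid: one gate in (0, 1) per channel of x.
    gates = [
        sigmoid(sum(w * p for w, p in zip(row, pooled)) + b)
        for row, b in zip(weights, bias)
    ]
    # Scale every value in a channel of x by that channel's gate.
    gated = [[g * v for v in ch] for g, ch in zip(gates, x)]
    return gated, gates


# Toy example: 2 appearance channels gated by 3 "object" channels.
x = [[1.0, 2.0], [3.0, 4.0]]
peer = [[0.0, 0.0], [2.0, 4.0], [1.0, 1.0]]
weights = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # untrained: all zeros
bias = [0.0, 0.0]
gated, gates = peer_attention(x, peer, weights, bias)
```

With all-zero weights every gate is sigmoid(0) = 0.5, so each channel of `x` is simply halved. In a trained model the projection would be learned, so the peer features decide which channels of `x` are emphasized or suppressed.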

Results

Task	Dataset	Metric	Value	Model
Video	Charades	mAP	59.8	AssembleNet++ 50
Video	Charades	mAP	54.98	AssembleNet++ 50 without object
Video	Toyota Smarthome dataset	CS	63.6	AssembleNet++
