Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Cooperative Cross-Stream Network for Discriminative Action Representation

Jingran Zhang, Fumin Shen, Xing Xu, Heng Tao Shen

2019-08-27 · Action Recognition · Temporal Action Localization
Paper · PDF

Abstract

Spatial- and temporal-stream models have achieved great success in video action recognition. Most existing works focus on designing effective feature fusion methods while training the two streams separately. However, such approaches make it hard to ensure discriminability or to exploit the complementary information between the streams. In this work, we propose a novel cooperative cross-stream network that investigates the conjoint information across multiple modalities. Feature extraction for the spatial and temporal stream networks is performed jointly in an end-to-end manner: complementary information from the different modalities is extracted through a connection block that explores correlations between the stream features. Furthermore, unlike conventional ConvNets that learn deep separable features with only a cross-entropy loss, our model enhances the discriminative power of the learned features and reduces the undesired modality discrepancy by jointly optimizing a modality ranking constraint and a cross-entropy loss over both homogeneous and heterogeneous modalities. The modality ranking constraint consists of an intra-modality discriminative embedding and an inter-modality triplet constraint, and it reduces both intra-modality and cross-modality feature variations. Experiments on three benchmark datasets demonstrate that, by cooperatively extracting appearance and motion features, our method achieves state-of-the-art or competitive performance compared with existing results.
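
A minimal PyTorch-style sketch of the joint objective described above (cross-entropy on each stream plus a modality ranking constraint built from an intra-modality discriminative embedding and an inter-modality triplet term) is given below. The function names, margins, and the loss weight `lam` are assumptions for illustration; this is not the authors' released implementation, only a sketch that mirrors the structure stated in the abstract.

```python
# Illustrative sketch only: hypothetical names and hyperparameters,
# not the paper's released code.
import torch
import torch.nn.functional as F

def intra_modality_embedding_loss(feats, labels, margin=0.3):
    """Within one modality, pull same-class features together and push
    different-class features apart (hinge on mean pairwise distances).
    Assumes the batch contains repeated classes."""
    n = feats.size(0)
    dists = torch.cdist(feats, feats)                       # (N, N) Euclidean distances
    eye = torch.eye(n, dtype=torch.bool, device=feats.device)
    same = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    diff = labels.unsqueeze(0) != labels.unsqueeze(1)
    pos = dists[same].mean()                                 # mean positive-pair distance
    neg = dists[diff].mean()                                 # mean negative-pair distance
    return F.relu(pos - neg + margin)

def inter_modality_triplet_loss(rgb_feats, flow_feats, labels, margin=0.3):
    """Anchor in one modality, same-class positives in the other modality,
    hardest different-class negative in the other modality."""
    dists = torch.cdist(rgb_feats, flow_feats)               # (N, N) cross-modality distances
    same = labels.unsqueeze(1) == labels.unsqueeze(0)
    pos = (dists * same).sum(1) / same.sum(1).clamp(min=1)   # mean cross-modal positive distance
    neg = dists.masked_fill(same, float("inf")).min(1).values  # hardest cross-modal negative
    return F.relu(pos - neg + margin).mean()

def ccs_objective(rgb_logits, flow_logits, rgb_feats, flow_feats, labels, lam=1.0):
    """Cross-entropy on both streams plus the modality ranking constraint."""
    ce = F.cross_entropy(rgb_logits, labels) + F.cross_entropy(flow_logits, labels)
    ranking = (intra_modality_embedding_loss(rgb_feats, labels)
               + intra_modality_embedding_loss(flow_feats, labels)
               + inter_modality_triplet_loss(rgb_feats, flow_feats, labels))
    return ce + lam * ranking
```

The exact form of each term, the pair/triplet mining strategy, and the loss weighting follow the paper itself; the sketch only reflects the combination of intra-modality and cross-modality constraints with the classification loss.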

Results

Task                 | Dataset                | Metric                       | Value | Model
Activity Recognition | HMDB-51                | Average accuracy of 3 splits | 81.9  | CCS + TSN (ImageNet+Kinetics pretrained)
Activity Recognition | Something-Something V2 | Top-1 Accuracy               | 61.2  | CCS + two-stream + TRN
Activity Recognition | Something-Something V2 | Top-5 Accuracy               | 89.3  | CCS + two-stream + TRN
Activity Recognition | UCF101                 | 3-fold Accuracy              | 97.4  | CCS + TSN (ImageNet+Kinetics pretrained)
Action Recognition   | HMDB-51                | Average accuracy of 3 splits | 81.9  | CCS + TSN (ImageNet+Kinetics pretrained)
Action Recognition   | Something-Something V2 | Top-1 Accuracy               | 61.2  | CCS + two-stream + TRN
Action Recognition   | Something-Something V2 | Top-5 Accuracy               | 89.3  | CCS + two-stream + TRN
Action Recognition   | UCF101                 | 3-fold Accuracy              | 97.4  | CCS + TSN (ImageNet+Kinetics pretrained)
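
The models above combine CCS with TSN or TRN backbones and report accuracy after combining the appearance (RGB) and motion (optical flow) streams. As a minimal sketch, standard weighted late fusion of per-stream softmax scores at test time looks like the snippet below; the stream weights are assumptions for illustration and are not taken from the paper.

```python
# Hypothetical late-fusion sketch; the per-stream weights are illustrative assumptions.
import torch
import torch.nn.functional as F

def fuse_two_stream_scores(rgb_logits, flow_logits, w_rgb=1.0, w_flow=1.5):
    """Weighted average of softmax scores from the appearance and motion
    streams, followed by a top-1 class decision."""
    probs = (w_rgb * F.softmax(rgb_logits, dim=-1)
             + w_flow * F.softmax(flow_logits, dim=-1))
    return probs.argmax(dim=-1)
```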

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Adapting Vision-Language Models for Evaluating World Models (2025-06-22)