Learning Implicit Temporal Alignment for Few-shot Video Classification

Songyang Zhang, Jiale Zhou, Xuming He

2021-05-11

Few-Shot Learning, Video Classification, Classification, Action Recognition In Videos

Abstract

Few-shot video classification aims to learn new video categories from only a few labeled examples, alleviating the burden of costly annotation in real-world applications. However, it is particularly challenging to learn a class-invariant spatial-temporal representation in such a setting. To address this, we propose a novel matching-based few-shot learning strategy for video sequences. Our main idea is to introduce an implicit temporal alignment for a video pair, capable of estimating the similarity between them in an accurate and robust manner. Moreover, we design an effective context encoding module to incorporate spatial and feature channel context, resulting in better modeling of intra-class variations. To train our model, we develop a multi-task loss for learning video matching, leading to video features with better generalization. Extensive experimental results on two challenging benchmarks show that our method outperforms prior art by a sizable margin on Something-Something V2 and achieves competitive results on Kinetics.
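
To make the idea of matching a video pair through a soft alignment more concrete, the following is a minimal sketch, not the ITANet architecture itself. It assumes each video is already encoded as a list of per-frame feature vectors, and it swaps in a simple attention-style alignment: every query frame is compared against a softmax-weighted mixture of support frames. The function names (`aligned_similarity`, `classify`), the `temperature` parameter, and the averaging over shots are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def softmax(xs, temperature):
    """Numerically stable softmax with a temperature (assumed hyperparameter)."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def aligned_similarity(query, support, temperature=0.1):
    """Hypothetical soft-alignment score between two videos.

    Each video is a list of per-frame feature vectors. For every query
    frame, support frames are mixed with attention weights, and the query
    frame is compared to that mixture. No explicit frame-to-frame path is
    computed, which is the sense in which the alignment is implicit here.
    """
    scores = []
    for q in query:
        weights = softmax([cosine(q, s) for s in support], temperature)
        aligned = [
            sum(w * s[d] for w, s in zip(weights, support))
            for d in range(len(q))
        ]
        scores.append(cosine(q, aligned))
    return sum(scores) / len(scores)


def classify(query, support_set, temperature=0.1):
    """Return the label whose support videos best match the query.

    support_set maps a class label to its list of support videos
    (one video per shot); scores are averaged over shots (an assumption).
    """
    def class_score(label):
        videos = support_set[label]
        return sum(aligned_similarity(query, v, temperature) for v in videos) / len(videos)

    return max(support_set, key=class_score)
```

In this sketch a support video that contains the same frames as the query at different time steps still scores highly, because the attention weights pick out the matching frames wherever they occur. A trained model would learn the frame features and the alignment jointly rather than rely on fixed cosine similarity.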

Results

Task	Dataset	Metric	Value	Model
Activity Recognition	FS-Something-Something V2-Full	Top-1 Accuracy(5-Way-1-Shot)	49.2	ITANet
Activity Recognition	FS-Something-Something V2-Full	Top-1 Accuracy(5-Way-5-Shot)	62.3	ITANet
Activity Recognition	FS-Something-Something V2-Full	Top-1 Accuracy(5-Way-1-Shot)	42.8	OTAM[3]++
Activity Recognition	FS-Something-Something V2-Full	Top-1 Accuracy(5-Way-5-Shot)	52.3	OTAM[3]++
Activity Recognition	FS-Something-Something V2-Small	Top-1 Accuracy(5-Way-1-Shot)	39.8	ITANet
Activity Recognition	FS-Something-Something V2-Small	Top-1 Accuracy(5-Way-5-Shot)	53.7	ITANet
Activity Recognition	FS-Something-Something V2-Small	Top-1 Accuracy(5-Way-1-Shot)	36.2	CMN[35]
Activity Recognition	FS-Something-Something V2-Small	Top-1 Accuracy(5-Way-5-Shot)	48.8	CMN[35]
Action Recognition	FS-Something-Something V2-Full	Top-1 Accuracy(5-Way-1-Shot)	49.2	ITANet
Action Recognition	FS-Something-Something V2-Full	Top-1 Accuracy(5-Way-5-Shot)	62.3	ITANet
Action Recognition	FS-Something-Something V2-Full	Top-1 Accuracy(5-Way-1-Shot)	42.8	OTAM[3]++
Action Recognition	FS-Something-Something V2-Full	Top-1 Accuracy(5-Way-5-Shot)	52.3	OTAM[3]++
Action Recognition	FS-Something-Something V2-Small	Top-1 Accuracy(5-Way-1-Shot)	39.8	ITANet
Action Recognition	FS-Something-Something V2-Small	Top-1 Accuracy(5-Way-5-Shot)	53.7	ITANet
Action Recognition	FS-Something-Something V2-Small	Top-1 Accuracy(5-Way-1-Shot)	36.2	CMN[35]
Action Recognition	FS-Something-Something V2-Small	Top-1 Accuracy(5-Way-5-Shot)	48.8	CMN[35]
Action Recognition In Videos	FS-Something-Something V2-Full	Top-1 Accuracy(5-Way-1-Shot)	49.2	ITANet
Action Recognition In Videos	FS-Something-Something V2-Full	Top-1 Accuracy(5-Way-5-Shot)	62.3	ITANet
Action Recognition In Videos	FS-Something-Something V2-Full	Top-1 Accuracy(5-Way-1-Shot)	42.8	OTAM[3]++
Action Recognition In Videos	FS-Something-Something V2-Full	Top-1 Accuracy(5-Way-5-Shot)	52.3	OTAM[3]++
Action Recognition In Videos	FS-Something-Something V2-Small	Top-1 Accuracy(5-Way-1-Shot)	39.8	ITANet
Action Recognition In Videos	FS-Something-Something V2-Small	Top-1 Accuracy(5-Way-5-Shot)	53.7	ITANet
Action Recognition In Videos	FS-Something-Something V2-Small	Top-1 Accuracy(5-Way-1-Shot)	36.2	CMN[35]
Action Recognition In Videos	FS-Something-Something V2-Small	Top-1 Accuracy(5-Way-5-Shot)	48.8	CMN[35]
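
The "5-Way-1-Shot" and "5-Way-5-Shot" metrics above refer to episodic evaluation: each episode draws 5 classes, with 1 or 5 labeled support videos per class, and top-1 accuracy is measured on held-out query videos from those classes. The sketch below illustrates that protocol under stated assumptions. The function names, the number of episodes, the number of queries per class, and the fixed random seed are hypothetical and are not taken from the paper.

```python
import random


def sample_episode(videos_by_class, n_way=5, k_shot=1, n_query=1, rng=None):
    """Draw one N-way K-shot episode.

    videos_by_class maps a class label to a list of videos. Returns a
    support dict (label -> K videos) and a list of (video, label) queries.
    """
    rng = rng or random.Random(0)
    classes = rng.sample(sorted(videos_by_class), n_way)
    support, queries = {}, []
    for label in classes:
        picked = rng.sample(videos_by_class[label], k_shot + n_query)
        support[label] = picked[:k_shot]
        queries.extend((video, label) for video in picked[k_shot:])
    return support, queries


def top1_accuracy(videos_by_class, classify_fn, n_episodes=100,
                  n_way=5, k_shot=1, n_query=1, seed=0):
    """Top-1 accuracy (in percent) averaged over all queries of all episodes.

    classify_fn(video, support) must return a predicted class label.
    """
    rng = random.Random(seed)
    correct = total = 0
    for _ in range(n_episodes):
        support, queries = sample_episode(
            videos_by_class, n_way, k_shot, n_query, rng
        )
        for video, label in queries:
            correct += int(classify_fn(video, support) == label)
            total += 1
    return 100.0 * correct / total
```

Any matching function can be plugged in as `classify_fn`; switching `k_shot` from 1 to 5 gives the 5-shot setting with everything else unchanged.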
