Yuncong Yang, Jiawei Ma, Shiyuan Huang, Long Chen, Xudong Lin, Guangxing Han, Shih-Fu Chang
Video representation learning has been successful in video-text pre-training for zero-shot transfer, where each sentence is trained to be close to its paired video clips in a common feature space. For long videos, given a paragraph of description in which the sentences describe different segments of the video, matching all sentence-clip pairs aligns the paragraph and the full video only implicitly. However, such unit-level comparison may ignore global temporal context, which inevitably limits the generalization ability. In this paper, we propose a contrastive learning framework, TempCLR, to compare the full video and the paragraph explicitly. As the video/paragraph is formulated as a sequence of clips/sentences, under the constraint of their temporal order, we use dynamic time warping to compute the minimum cumulative cost over sentence-clip pairs as the sequence-level distance. To explore the temporal dynamics, we break the consistency of temporal succession by shuffling video clips w.r.t. temporal granularity. The resulting clip/sentence representations perceive the temporal information and thus facilitate the sequence alignment. In addition to pre-training on videos and paragraphs, our approach also generalizes to matching between video instances. We evaluate our approach on video retrieval, action step localization, and few-shot action recognition, and achieve consistent performance gains across all three tasks. Detailed ablation studies are provided to justify the design of the approach.
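The sequence-level distance described in the abstract can be illustrated with a small sketch of classic dynamic time warping over a sentence-clip cost matrix. This is not the authors' implementation: the function names `cosine_cost` and `dtw_distance`, the use of cosine distance as the pairwise cost, and the toy one-hot embeddings are all assumptions made for illustration, and the sketch covers only the distance computation, not the contrastive training objective.

```python
import numpy as np


def cosine_cost(sentences, clips):
    """Pairwise cost matrix (assumed here to be 1 - cosine similarity).

    sentences: array of shape (num_sentences, dim)
    clips:     array of shape (num_clips, dim)
    returns:   array of shape (num_sentences, num_clips)
    """
    s = sentences / np.linalg.norm(sentences, axis=1, keepdims=True)
    c = clips / np.linalg.norm(clips, axis=1, keepdims=True)
    return 1.0 - s @ c.T


def dtw_distance(cost):
    """Minimum cumulative cost of a monotonic alignment path through `cost`.

    The path starts at (first sentence, first clip), ends at
    (last sentence, last clip), and never moves backwards in time.
    """
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev
    return float(acc[n, m])


# Toy example: four sentences, each perfectly matched by one clip.
sentences = np.eye(4)
clips_in_order = np.eye(4)
clips_shuffled = clips_in_order[::-1]  # temporal order reversed

d_ordered = dtw_distance(cosine_cost(sentences, clips_in_order))
d_shuffled = dtw_distance(cosine_cost(sentences, clips_shuffled))
print(d_ordered, d_shuffled)
```

In this toy setting the correctly ordered video gets a lower sequence-level distance than the shuffled one, which is the property that makes shuffled clip sequences usable as negatives in a contrastive setup.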
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@1 | 74.5 | TempCLR |
| Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@5 | 94.6 | TempCLR |
| Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@10 | 97 | TempCLR |
| Long Video Retrieval (Background Removed) | YouCook2 | DTW R@1 | 83.5 | TempCLR |
| Long Video Retrieval (Background Removed) | YouCook2 | DTW R@5 | 97.2 | TempCLR |
| Long Video Retrieval (Background Removed) | YouCook2 | DTW R@10 | 99.3 | TempCLR |
| Long Video Retrieval (Background Removed) | YouCook2 | OTAM R@1 | 84.9 | TempCLR |
| Long Video Retrieval (Background Removed) | YouCook2 | OTAM R@5 | 97.9 | TempCLR |
| Long Video Retrieval (Background Removed) | YouCook2 | OTAM R@10 | 99.5 | TempCLR |
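
The R@1, R@5, and R@10 columns above are presumably standard recall-at-K retrieval scores, reported as percentages, with the prefix (Cap. Avg., DTW, OTAM) naming the measure used to rank candidates. As a minimal sketch of how such a score could be computed from a query-by-candidate distance matrix, assuming that the correct candidate for query `i` sits at index `i` (the function name `recall_at_k` and the toy distances are hypothetical, not taken from the paper):

```python
import numpy as np


def recall_at_k(distances, k):
    """Percentage of queries whose true match is among the k nearest candidates.

    distances[i, j] is the distance between query i and candidate j.
    The ground-truth candidate for query i is assumed to be candidate i.
    """
    top_k = np.argsort(distances, axis=1)[:, :k]
    targets = np.arange(distances.shape[0])[:, None]
    hits = (top_k == targets).any(axis=1)
    return 100.0 * float(hits.mean())


# Toy distance matrix for three paragraph queries and three videos.
toy_distances = np.array([
    [0.1, 0.9, 0.8],
    [0.2, 0.7, 0.3],
    [0.9, 0.4, 0.5],
])

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(toy_distances, k):.1f}")
```

Any sequence-level distance could be plugged in as `distances`; only the ranking it induces matters for the score.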