Pritam Sarkar, Ali Etemad
We present CrissCross, a self-supervised framework for learning audio-visual representations. Our framework introduces a novel notion: in addition to learning intra-modal and standard 'synchronous' cross-modal relations, CrissCross also learns 'asynchronous' cross-modal relations. We perform in-depth studies showing that by relaxing the temporal synchronicity between the audio and visual modalities, the network learns strong, generalized representations useful for a variety of downstream tasks. To pretrain our proposed solution, we use three datasets of varying sizes: Kinetics-Sound, Kinetics400, and AudioSet. The learned representations are evaluated on a number of downstream tasks, namely action recognition, sound classification, and action retrieval. Our experiments show that CrissCross either outperforms or performs on par with the current state-of-the-art self-supervised methods on action recognition and action retrieval with UCF101 and HMDB51, as well as sound classification with ESC50 and DCASE. Moreover, CrissCross outperforms fully-supervised pretraining when pretrained on Kinetics-Sound. The code and pretrained models are available on the project website.
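To make the three kinds of relations concrete, the following minimal sketch enumerates them for two clips drawn from the same video. It is an illustration only, not the authors' implementation: the function name `enumerate_relations`, the uniform sampling of two clip start times, and the labels `intra`, `sync`, and `async` are all assumptions, since the abstract does not specify how clips are sampled or how the objective is computed.

```python
import random
from itertools import product


def enumerate_relations(duration_s, clip_len_s, rng):
    """List the audio/visual pairings for two clips sampled from one video.

    Hypothetical helper: sampling scheme and labels are assumptions.
    Each entry is (relation, (modality, start_time), (modality, start_time)).
    """
    max_start = duration_s - clip_len_s
    if max_start < 0:
        raise ValueError("clip is longer than the video")

    # Two independently sampled clip start times (assumed uniform).
    starts = {1: rng.uniform(0.0, max_start), 2: rng.uniform(0.0, max_start)}

    relations = []
    # Intra-modal: same modality, different timestamps.
    for modality in ("audio", "visual"):
        relations.append(("intra", (modality, starts[1]), (modality, starts[2])))

    # Cross-modal: audio vs. visual. Same timestamp is 'synchronous',
    # different timestamps is 'asynchronous'.
    for i, j in product((1, 2), repeat=2):
        kind = "sync" if i == j else "async"
        relations.append((kind, ("audio", starts[i]), ("visual", starts[j])))

    return relations


relations = enumerate_relations(10.0, 2.0, random.Random(0))
for kind, first, second in relations:
    print(kind, first, second)
```

Under these assumptions, two sampled clips give two intra-modal pairs, two synchronous cross-modal pairs, and two asynchronous cross-modal pairs. A conventional audio-visual setup would use only the synchronous pairs, whereas the framework described above also learns from the asynchronous ones.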
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Activity Recognition | UCF101 | 3-fold Accuracy | 92.4 | CrissCross (AudioSet) |
| Activity Recognition | UCF101 | 3-fold Accuracy | 91.5 | CrissCross (Kinetics400) |
| Activity Recognition | UCF101 | 3-fold Accuracy | 88.3 | CrissCross (Kinetics-Sound) |
| Activity Recognition | HMDB51 | Top-1 Accuracy | 66.8 | CrissCross (AudioSet) |
| Activity Recognition | HMDB51 | Top-1 Accuracy | 64.7 | CrissCross (Kinetics400) |
| Activity Recognition | HMDB51 | Top-1 Accuracy | 60.5 | CrissCross (Kinetics-Sound) |
| Audio Classification | DCASE | Top-1 Accuracy | 97 | CrissCross (AudioSet) |
| Audio Classification | DCASE | Top-1 Accuracy | 96 | CrissCross (Kinetics400) |
| Audio Classification | DCASE | Top-1 Accuracy | 93 | CrissCross (Kinetics-Sound) |
| Action Recognition | UCF101 | 3-fold Accuracy | 92.4 | CrissCross (AudioSet) |
| Action Recognition | UCF101 | 3-fold Accuracy | 91.5 | CrissCross (Kinetics400) |
| Action Recognition | UCF101 | 3-fold Accuracy | 88.3 | CrissCross (Kinetics-Sound) |
| Action Recognition | HMDB51 | Top-1 Accuracy | 66.8 | CrissCross (AudioSet) |
| Action Recognition | HMDB51 | Top-1 Accuracy | 64.7 | CrissCross (Kinetics400) |
| Action Recognition | HMDB51 | Top-1 Accuracy | 60.5 | CrissCross (Kinetics-Sound) |
| Classification | DCASE | Top-1 Accuracy | 97 | CrissCross (AudioSet) |
| Classification | DCASE | Top-1 Accuracy | 96 | CrissCross (Kinetics400) |
| Classification | DCASE | Top-1 Accuracy | 93 | CrissCross (Kinetics-Sound) |