Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


ViSiL: Fine-grained Spatio-Temporal Video Similarity Learning

Giorgos Kordopatis-Zilos, Symeon Papadopoulos, Ioannis Patras, Ioannis Kompatsiaris

Published: 2019-08-20 · ICCV 2019
Tasks: Video Retrieval, Retrieval
Links: Paper · PDF · Code (official)

Abstract

In this paper we introduce ViSiL, a Video Similarity Learning architecture that considers fine-grained Spatio-Temporal relations between pairs of videos -- such relations are typically lost in previous video retrieval approaches that embed the whole frame or even the whole video into a vector descriptor before the similarity estimation. By contrast, our Convolutional Neural Network (CNN)-based approach is trained to calculate video-to-video similarity from refined frame-to-frame similarity matrices, so as to consider both intra- and inter-frame relations. In the proposed method, pairwise frame similarity is estimated by applying Tensor Dot (TD) followed by Chamfer Similarity (CS) on regional CNN frame features - this avoids feature aggregation before the similarity calculation between frames. Subsequently, the similarity matrix between all video frames is fed to a four-layer CNN, and then summarized using Chamfer Similarity (CS) into a video-to-video similarity score -- this avoids feature aggregation before the similarity calculation between videos and captures the temporal similarity patterns between matching frame sequences. We train the proposed network using a triplet loss scheme and evaluate it on five public benchmark datasets on four different video retrieval problems where we demonstrate large improvements in comparison to the state of the art. The implementation of ViSiL is publicly available.
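The core similarity computation described in the abstract can be sketched in a few lines of NumPy. This is a simplified illustration, not the official implementation: it assumes L2-normalized regional features, and it replaces the four-layer CNN that ViSiL applies to the frame-to-frame similarity matrix with a direct Chamfer Similarity step.

```python
import numpy as np

def chamfer_similarity(sim):
    """Chamfer Similarity (CS): mean over rows of the max over columns."""
    return sim.max(axis=1).mean()

def frame_to_frame_similarity(f1, f2):
    """Tensor Dot (TD) + CS between two frames' regional features.

    f1: (R1, D), f2: (R2, D) L2-normalized region vectors.
    """
    region_sim = f1 @ f2.T          # TD over the feature axis -> (R1, R2)
    return chamfer_similarity(region_sim)

def video_similarity(video1, video2):
    """Video-to-video similarity from the frame similarity matrix.

    video1: (N1, R, D), video2: (N2, R, D) regional features per frame.
    Note: ViSiL refines the (N1, N2) matrix with a four-layer CNN
    before the final CS; that step is omitted here for brevity.
    """
    S = np.array([[frame_to_frame_similarity(a, b) for b in video2]
                  for a in video1])
    return chamfer_similarity(S)
```

With normalized features, a video compared against itself scores 1.0, since each row's maximum region (and frame) match is its own copy; training with the triplet loss mentioned in the abstract then pushes similarities of matching pairs above those of non-matching pairs by a margin.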

Results

Task             Dataset    Model         mAP (DSVR)  mAP (CSVR)  mAP (ISVR)
Video Retrieval  FIVR-200K  ViSiLv (pt)   0.899       0.854       0.723
Video Retrieval  FIVR-200K  ViSiLv (tf)   0.892       0.841       0.702
Video Retrieval  FIVR-200K  ViSiLf        0.843       0.797       0.660
Video Retrieval  FIVR-200K  ViSiLsym      0.833       0.792       0.654
