Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


CenterCLIP: Token Clustering for Efficient Text-Video Retrieval

Shuai Zhao, Linchao Zhu, Xiaohan Wang, Yi Yang

2022-05-02 · Video Retrieval · Clustering · Retrieval

Paper · PDF · Code (official)

Abstract

Recently, large-scale pre-training methods like CLIP have made great progress in multi-modal research such as text-video retrieval. In CLIP, transformers are vital for modeling complex multi-modal relations. However, in CLIP's vision transformer, the essential visual tokenization process, which produces discrete visual token sequences, generates many homogeneous tokens due to the redundant nature of consecutive, similar frames in videos. This significantly increases computation costs and hinders the deployment of video retrieval models in web applications. In this paper, to reduce the number of redundant video tokens, we design a multi-segment token clustering algorithm that finds the most representative tokens and drops the non-essential ones. As frame redundancy occurs mostly in consecutive frames, we divide videos into multiple segments and conduct segment-level clustering. Center tokens from each segment are then concatenated into a new sequence, while their original spatial-temporal relations are well preserved. We instantiate two clustering algorithms to efficiently find deterministic medoids and to iteratively partition groups in high-dimensional space. Through this token clustering and center selection procedure, we successfully reduce computation costs by removing redundant visual tokens. The method further enhances segment-level semantic alignment between video and text representations, enforcing spatio-temporal interactions among tokens from within-segment frames. Our method, coined CenterCLIP, surpasses the existing state of the art by a large margin on typical text-video benchmarks, while reducing training memory cost by 35% and accelerating inference by 14% in the best case. The code is available at https://github.com/mzhaoshuai/CenterCLIP.
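The abstract describes the core procedure: split a video's frames into segments, cluster the visual tokens within each segment, and keep one representative (medoid) token per cluster. The sketch below is a hypothetical, simplified re-implementation of that idea in numpy, not the authors' code; the function name, the k-means-style update, and all parameters are assumptions for illustration. See the official repository for the actual algorithms.

```python
import numpy as np

def segment_token_clustering(tokens, num_segments, num_centers, iters=10):
    """Hypothetical sketch of segment-level token clustering.

    tokens: array of shape (frames, tokens_per_frame, dim) from a ViT.
    Returns the concatenated medoid tokens from every segment, shape
    (num_segments * num_centers, dim).
    """
    f, _, d = tokens.shape
    seg_len = f // num_segments
    centers = []
    for s in range(num_segments):
        # flatten all tokens of this segment's frames into one pool
        seg = tokens[s * seg_len:(s + 1) * seg_len].reshape(-1, d)
        # initialize cluster centers from the first tokens (simplification)
        c = seg[:num_centers].copy()
        for _ in range(iters):
            # assign every token to its nearest center
            dists = np.linalg.norm(seg[:, None, :] - c[None, :, :], axis=2)
            assign = dists.argmin(axis=1)
            # move each center to the mean of its cluster (skip empty clusters)
            for k in range(num_centers):
                mask = assign == k
                if mask.any():
                    c[k] = seg[mask].mean(axis=0)
        # replace each mean by the closest real token, i.e. the medoid
        medoid_idx = np.linalg.norm(
            c[:, None, :] - seg[None, :, :], axis=2).argmin(axis=1)
        centers.append(seg[medoid_idx])
    # concatenating per-segment centers preserves temporal (segment) order
    return np.concatenate(centers, axis=0)
```

Because clustering runs per segment rather than over the whole video, nearby redundant frames collapse into shared centers while the segment order, and hence coarse temporal structure, is kept.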

Results

Task: Video Retrieval · Model: CenterCLIP (ViT-B/16)

| Dataset | Direction | R@1 | R@5 | R@10 | Median Rank | Mean Rank |
|---|---|---|---|---|---|---|
| MSR-VTT-1kA | text-to-video | 48.4 | 73.8 | 82 | 2 | 13.8 |
| MSR-VTT-1kA | video-to-text | 47.7 | 75 | 83.3 | 2 | 10.2 |
| ActivityNet | text-to-video | 46.2 | 77 | 87.6 | 2 | 5.7 |
| ActivityNet | video-to-text | 46.7 | 77.1 | 88 | 2 | 5.5 |
| LSMDC | text-to-video | 24.2 | 46.2 | 55.9 | 8 | 47.3 |
| LSMDC | video-to-text | 24.5 | 46.4 | 55.8 | 7 | 41.3 |
| MSVD | text-to-video | 50.6 | 80.3 | 88.4 | 1 | 8.4 |
| MSVD | video-to-text | 68.4 | 90.1 | 95 | 1 | 3 |

Related Papers

- Tri-Learn Graph Fusion Network for Attributed Graph Clustering (2025-07-18)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
- A Survey of Context Engineering for Large Language Models (2025-07-17)
- MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
- Ranking Vectors Clustering: Theory and Applications (2025-07-16)
- Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
- Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)