TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/TS2-Net: Token Shift and Selection Transformer for Text-Vi...

TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval

Yuqi Liu, Pengfei Xiong, Luhui Xu, Shengming Cao, Qin Jin

2022-07-16Video RetrievalRetrieval
PaperPDFCode(official)

Abstract

Text-Video retrieval is a task of great practical value and has received increasing attention, among which learning spatial-temporal video representation is one of the research hotspots. The video encoders in the state-of-the-art video retrieval models usually directly adopt the pre-trained vision backbones with the network structure fixed, they therefore can not be further improved to produce the fine-grained spatial-temporal video representation. In this paper, we propose Token Shift and Selection Network (TS2-Net), a novel token shift and selection transformer architecture, which dynamically adjusts the token sequence and selects informative tokens in both temporal and spatial dimensions from input video samples. The token shift module temporally shifts the whole token features back-and-forth across adjacent frames, to preserve the complete token representation and capture subtle movements. Then the token selection module selects tokens that contribute most to local spatial semantics. Based on thorough experiments, the proposed TS2-Net achieves state-of-the-art performance on major text-video retrieval benchmarks, including new records on MSRVTT, VATEX, LSMDC, ActivityNet, and DiDeMo.

Results

TaskDatasetMetricValueModel
VideoMSR-VTT-1kAtext-to-video R@154TS2-Net
VideoMSR-VTT-1kAtext-to-video R@1087.4TS2-Net
VideoMSR-VTT-1kAtext-to-video R@579.3TS2-Net
VideoVATEXtext-to-video R@159.1TS2-Net
VideoVATEXtext-to-video R@1095.2TS2-Net
Video RetrievalMSR-VTT-1kAtext-to-video R@154TS2-Net
Video RetrievalMSR-VTT-1kAtext-to-video R@1087.4TS2-Net
Video RetrievalMSR-VTT-1kAtext-to-video R@579.3TS2-Net
Video RetrievalVATEXtext-to-video R@159.1TS2-Net
Video RetrievalVATEXtext-to-video R@1095.2TS2-Net

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL2025-07-17HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals2025-07-17A Survey of Context Engineering for Large Language Models2025-07-17MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval2025-07-17Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker2025-07-16Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos2025-07-16Context-Aware Search and Retrieval Over Erasure Channels2025-07-16Seq vs Seq: An Open Suite of Paired Encoders and Decoders2025-07-15