

Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations

Peng Jin, Jinfa Huang, Fenglin Liu, Xian Wu, Shen Ge, Guoli Song, David A. Clifton, Jie Chen

Published: 2022-11-21
Tasks: Video Retrieval, Representation Learning, Video Question Answering, Video Captioning, Contrastive Learning, Retrieval, Visual Question Answering (VQA)

Abstract

Most video-and-language representation learning approaches employ contrastive learning, e.g., CLIP, to project the video and text features into a common latent space according to the semantic similarities of text-video pairs. However, such learned shared latent spaces are often not optimal, and the modality gap between visual and textual representations cannot be fully eliminated. In this paper, we propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations. Specifically, we use the Expectation-Maximization algorithm to find a compact set of bases for the latent space, in which the features can be concisely represented as linear combinations of these bases. Such a feature decomposition of video-and-language representations reduces the rank of the latent space, resulting in increased representational power for the semantics. Extensive experiments on three benchmark text-video retrieval datasets show that our EMCL can learn more discriminative video-and-language representations than previous methods, significantly outperforming the previous state of the art across all metrics. More encouragingly, the proposed method can boost the performance of existing approaches either as a jointly trained layer or as an out-of-the-box inference module with no extra training, making it easy to incorporate into any existing method.
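The core idea, finding a small set of bases by EM and re-expressing features as linear combinations of them, can be illustrated with a short sketch. This is a minimal, hypothetical illustration assuming a soft-assignment EM scheme over stacked video and text features; the function name, random basis initialization, and hyperparameters are our assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def emcl_decompose(feats, num_bases=32, num_iters=5, temperature=0.05):
    """Hypothetical sketch: EM-style low-rank decomposition of features.

    feats: (N, D) stacked video and text features in the shared space.
    Returns an (N, D) reconstruction of rank at most `num_bases`.
    """
    # Random basis initialization (illustrative; the paper may differ).
    bases = F.normalize(torch.randn(num_bases, feats.size(1),
                                    device=feats.device), dim=-1)
    for _ in range(num_iters):
        # E-step: soft-assign each feature to the bases via scaled similarity.
        resp = (feats @ bases.t() / temperature).softmax(dim=-1)  # (N, K)
        # M-step: update each basis as a responsibility-weighted sum of features.
        bases = F.normalize(resp.t() @ feats, dim=-1)             # (K, D)
    # Reconstruct features as linear combinations of the K bases.
    return resp @ bases                                           # (N, D)
```

Because the reconstruction is a combination of K bases, its rank is at most K, which is the "compact" latent space the abstract refers to.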

Results

Task | Dataset | Metric | Value | Model
Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 51.6 | EMCL-Net++
Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 78.1 | EMCL-Net++
Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 85.3 | EMCL-Net++
Video Retrieval | MSR-VTT-1kA | text-to-video Mean Rank | 1 | EMCL-Net++
Video Retrieval | MSR-VTT-1kA | video-to-text R@1 | 51.8 | EMCL-Net++
Video Retrieval | MSR-VTT-1kA | video-to-text R@5 | 80.2 | EMCL-Net++
Video Retrieval | MSR-VTT-1kA | video-to-text R@10 | 88.0 | EMCL-Net++
Video Retrieval | MSR-VTT-1kA | video-to-text Mean Rank | 1 | EMCL-Net++
Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 46.8 | EMCL-Net
Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 73.1 | EMCL-Net
Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 83.1 | EMCL-Net
Video Retrieval | MSR-VTT-1kA | text-to-video Mean Rank | 2 | EMCL-Net
Video Retrieval | MSR-VTT-1kA | video-to-text R@1 | 46.5 | EMCL-Net
Video Retrieval | MSR-VTT-1kA | video-to-text R@5 | 73.5 | EMCL-Net
Video Retrieval | MSR-VTT-1kA | video-to-text R@10 | 83.5 | EMCL-Net
Video Retrieval | MSR-VTT-1kA | video-to-text Mean Rank | 2 | EMCL-Net
Video Retrieval | ActivityNet | text-to-video R@1 | 50.6 | EMCL-Net++
Video Retrieval | ActivityNet | text-to-video R@5 | 78.7 | EMCL-Net++
Video Retrieval | ActivityNet | text-to-video R@50 | 98.1 | EMCL-Net++
Video Retrieval | ActivityNet | text-to-video Mean Rank | 1 | EMCL-Net++
Video Retrieval | ActivityNet | video-to-text R@1 | 50.6 | EMCL-Net++
Video Retrieval | ActivityNet | video-to-text R@5 | 78.9 | EMCL-Net++
Video Retrieval | ActivityNet | video-to-text R@50 | 98.4 | EMCL-Net++
Video Retrieval | ActivityNet | video-to-text Mean Rank | 1 | EMCL-Net++
Video Retrieval | ActivityNet | text-to-video R@1 | 41.2 | EMCL-Net
Video Retrieval | ActivityNet | text-to-video R@5 | 72.7 | EMCL-Net
Video Retrieval | ActivityNet | text-to-video Mean Rank | 2 | EMCL-Net
Video Retrieval | ActivityNet | video-to-text R@1 | 42.7 | EMCL-Net
Video Retrieval | ActivityNet | video-to-text R@5 | 74.0 | EMCL-Net
Video Retrieval | ActivityNet | video-to-text R@50 | 98.3 | EMCL-Net
Video Retrieval | ActivityNet | video-to-text Mean Rank | 2 | EMCL-Net
Video Retrieval | LSMDC | text-to-video R@1 | 25.9 | EMCL-Net++
Video Retrieval | LSMDC | text-to-video R@5 | 46.4 | EMCL-Net++
Video Retrieval | LSMDC | text-to-video R@10 | 53.7 | EMCL-Net++
Video Retrieval | LSMDC | text-to-video Mean Rank | 8 | EMCL-Net++
Video Retrieval | LSMDC | video-to-text R@1 | 26.7 | EMCL-Net++
Video Retrieval | LSMDC | video-to-text R@5 | 44.7 | EMCL-Net++
Video Retrieval | LSMDC | video-to-text R@10 | 54.4 | EMCL-Net++
Video Retrieval | LSMDC | video-to-text Mean Rank | 8 | EMCL-Net++
Video Retrieval | LSMDC | text-to-video R@1 | 23.9 | EMCL-Net
Video Retrieval | LSMDC | text-to-video R@5 | 42.4 | EMCL-Net
Video Retrieval | LSMDC | text-to-video R@10 | 50.9 | EMCL-Net
Video Retrieval | LSMDC | video-to-text R@1 | 22.2 | EMCL-Net
Video Retrieval | LSMDC | video-to-text R@5 | 40.6 | EMCL-Net
Video Retrieval | LSMDC | video-to-text R@10 | 49.2 | EMCL-Net
Video Retrieval | LSMDC | video-to-text Mean Rank | 12 | EMCL-Net
Visual Question Answering (VQA) | MSRVTT-QA | Accuracy | 0.458 | EMCL-Net
Video Question Answering | MSRVTT-QA | Accuracy | 45.8 | EMCL-Net
Video Captioning | MSR-VTT | BLEU-4 | 45.3 | EMCL-Net
Video Captioning | MSR-VTT | CIDEr | 54.6 | EMCL-Net
Video Captioning | MSR-VTT | METEOR | 30.2 | EMCL-Net
Video Captioning | MSR-VTT | ROUGE-L | 63.2 | EMCL-Net
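For reference, the retrieval metrics above are computed from a query-by-candidate similarity matrix: R@K is the percentage of queries whose ground-truth match ranks in the top K, and Mean Rank is the average rank of the ground-truth match. A small illustrative sketch (names and structure are our assumptions, not from the paper):

```python
import torch

def retrieval_metrics(sim):
    """sim: (num_queries, num_candidates) similarity matrix, where the
    ground-truth candidate for query i sits at column i."""
    # Rank of the ground-truth item for each query (1 = best).
    order = sim.argsort(dim=-1, descending=True)
    gt = torch.arange(sim.size(0)).unsqueeze(-1)
    ranks = (order == gt).float().argmax(dim=-1) + 1
    return {
        "R@1": (ranks <= 1).float().mean().item() * 100,
        "R@5": (ranks <= 5).float().mean().item() * 100,
        "R@10": (ranks <= 10).float().mean().item() * 100,
        "MeanR": ranks.float().mean().item(),
    }
```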

Related Papers

- Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
- Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
- Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
- SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
- Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
- SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation (2025-07-17)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)