Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Align and Attend: Multimodal Summarization with Dual Contrastive Losses

Bo He, Jun Wang, JieLin Qiu, Trung Bui, Abhinav Shrivastava, Zhaowen Wang

2023-03-13 · CVPR 2023 · Extractive Text Summarization · Supervised Video Summarization · Video Summarization

Abstract

The goal of multimodal summarization is to extract the most important information from different modalities to form output summaries. Unlike unimodal summarization, the multimodal summarization task explicitly leverages cross-modal information to help generate more reliable, higher-quality summaries. However, existing methods fail to leverage the temporal correspondence between different modalities and ignore the intrinsic correlation between different samples. To address this issue, we introduce Align and Attend Multimodal Summarization (A2Summ), a unified multimodal transformer-based model that can effectively align and attend to the multimodal input. In addition, we propose two novel contrastive losses to model both inter-sample and intra-sample correlations. Extensive experiments on two standard video summarization datasets (TVSum and SumMe) and two multimodal summarization datasets (Daily Mail and CNN) demonstrate the superiority of A2Summ, which achieves state-of-the-art performance on all datasets. Moreover, we collected a large-scale multimodal summarization dataset, BLiSS, which contains livestream videos and transcribed texts with annotated summaries. Our code and dataset are publicly available at https://boheumd.github.io/A2Summ/.
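The abstract does not spell out the form of the two contrastive losses. As a rough illustration of what an inter-sample contrastive objective can look like, the sketch below implements a generic InfoNCE-style loss over paired video and text embeddings in plain Python. This is a standard technique swapped in for illustration, not the A2Summ loss itself; the function name `info_nce`, the `temperature` value, and the toy embeddings are all assumptions.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(video_embs, text_embs, temperature=0.1):
    """Mean InfoNCE loss in the video-to-text direction.

    Row i of video_embs and row i of text_embs are treated as a
    positive pair; every other text row in the batch is a negative.
    Hypothetical helper for illustration, not the paper's loss.
    """
    v = [normalize(x) for x in video_embs]
    t = [normalize(x) for x in text_embs]
    losses = []
    for i, vi in enumerate(v):
        # Cosine similarity to every text embedding, scaled by temperature.
        logits = [dot(vi, tj) / temperature for tj in t]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching index i as the target.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy batch: two samples with 2-D embeddings (made-up values).
aligned_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch, the loss is close to zero when each video embedding matches its own text embedding and large when the pairs are swapped, which is the behavior a contrastive objective is meant to reward.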

Results

Task                           Dataset           Metric                Value  Model
Video                          TvSum             F1-score (Canonical)  63.4   A2Summ
Video                          TvSum             Kendall's Tau         0.137  A2Summ
Video                          TvSum             Spearman's Rho        0.165  A2Summ
Video                          SumMe             F1-score (Canonical)  55     A2Summ
Video                          SumMe             Kendall's Tau         0.108  A2Summ
Video                          SumMe             Spearman's Rho        0.129  A2Summ
Video Summarization            TvSum             F1-score (Canonical)  63.4   A2Summ
Video Summarization            TvSum             Kendall's Tau         0.137  A2Summ
Video Summarization            TvSum             Spearman's Rho        0.165  A2Summ
Video Summarization            SumMe             F1-score (Canonical)  55     A2Summ
Video Summarization            SumMe             Kendall's Tau         0.108  A2Summ
Video Summarization            SumMe             Spearman's Rho        0.129  A2Summ
Text Summarization             CNN / Daily Mail  ROUGE-1               44.11  A2Summ
Text Summarization             CNN / Daily Mail  ROUGE-2               20.31  A2Summ
Text Summarization             CNN / Daily Mail  ROUGE-L               35.92  A2Summ
Extractive Text Summarization  CNN / Daily Mail  ROUGE-1               44.11  A2Summ
Extractive Text Summarization  CNN / Daily Mail  ROUGE-2               20.31  A2Summ
Extractive Text Summarization  CNN / Daily Mail  ROUGE-L               35.92  A2Summ
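The Kendall's Tau and Spearman's Rho rows are rank correlations, which in video summarization are typically computed between predicted importance scores and human annotations. The sketch below shows both statistics in plain Python on made-up scores. It is not the benchmark's evaluation script: `kendall_tau_a` is the simple tau-a variant without tie correction (library implementations often default to tau-b), and all function names and inputs are illustrative assumptions.

```python
import math
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up frame importance scores: model predictions vs. one annotator.
predicted = [0.1, 0.2, 0.3, 0.4]
annotated = [1, 3, 2, 4]
tau = kendall_tau_a(predicted, annotated)
rho = spearman_rho(predicted, annotated)
```

Both statistics range from -1 (reversed ordering) to 1 (identical ordering), so values such as 0.137 or 0.165 indicate a weak positive agreement in ranking rather than a percentage score.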

Related Papers

TRIM: A Self-Supervised Video Summarization Framework Maximizing Temporal Relative Information and Representativeness (2025-06-25)
MF2Summ: Multimodal Fusion for Video Summarization with Temporal Alignment (2025-06-12)
Prompts to Summaries: Zero-Shot Language-Guided Video Summarization (2025-06-12)
Enhancing Video Memorability Prediction with Text-Motion Cross-modal Contrastive Loss and Its Application in Video Summarization (2025-06-10)
TriPSS: A Tri-Modal Keyframe Extraction Framework Using Perceptual, Structural, and Semantic Representations (2025-06-03)
Unsupervised Transcript-assisted Video Summarization and Highlight Detection (2025-05-29)
REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing (2025-05-24)
SD-VSum: A Method and Dataset for Script-Driven Video Summarization (2025-05-06)