Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Video Joint Modelling Based on Hierarchical Transformer for Co-summarization

Li Haopeng, Ke Qiuhong, Gong Mingming, Zhang Rui

2021-12-27
Tasks: Video Retrieval, Supervised Video Summarization, Video Summarization, Video Understanding, Retrieval

Abstract

Video summarization aims to automatically generate a summary (storyboard or video skim) of a video, which can facilitate large-scale video retrieval and browsing. Most of the existing methods perform video summarization on individual videos, which neglects the correlations among similar videos. Such correlations, however, are also informative for video understanding and video summarization. To address this limitation, we propose Video Joint Modelling based on Hierarchical Transformer (VJMHT) for co-summarization, which takes into consideration the semantic dependencies across videos. Specifically, VJMHT consists of two layers of Transformer: the first layer extracts semantic representation from individual shots of similar videos, while the second layer performs shot-level video joint modelling to aggregate cross-video semantic information. By this means, complete cross-video high-level patterns are explicitly modelled and learned for the summarization of individual videos. Moreover, Transformer-based video representation reconstruction is introduced to maximize the high-level similarity between the summary and the original video. Extensive experiments are conducted to verify the effectiveness of the proposed modules and the superiority of VJMHT in terms of F-measure and rank-based evaluation.
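The two-level structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-head scaled dot-product attention with identity projections in place of learned Transformer layers, and mean pooling to turn frame features into one vector per shot. The function names (`encode_shots`, `joint_model`) are hypothetical and chosen only for illustration.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(seq):
    # Scaled dot-product self-attention with identity Q/K/V projections
    # (an assumption: a real Transformer layer learns these projections).
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, seq)) for i in range(d)])
    return out

def mean_pool(seq):
    # Average a list of vectors into a single vector.
    n = len(seq)
    return [sum(v[i] for v in seq) / n for i in range(len(seq[0]))]

def encode_shots(video):
    # Level 1: attend over the frames inside each shot, then pool to one
    # vector per shot (mean pooling is an assumption of this sketch).
    return [mean_pool(self_attention(shot)) for shot in video]

def joint_model(videos):
    # Level 2: attend over the shot vectors of all similar videos at once,
    # so each shot representation can draw on shots from other videos.
    shot_vecs, owners = [], []
    for vid_idx, video in enumerate(videos):
        for vec in encode_shots(video):
            shot_vecs.append(vec)
            owners.append(vid_idx)
    mixed = self_attention(shot_vecs)
    result = [[] for _ in videos]
    for vec, vid_idx in zip(mixed, owners):
        result[vid_idx].append(vec)
    return result

# Two toy videos: each video is a list of shots, each shot a list of
# 2-dimensional frame features (made-up numbers).
videos = [
    [[[0.1, 0.9], [0.2, 0.8]], [[0.7, 0.3], [0.6, 0.4], [0.5, 0.5]]],
    [[[0.9, 0.1], [0.8, 0.2]]],
]
shot_representations = joint_model(videos)
```

The output keeps one vector per shot, grouped by video, so a downstream scorer could rate shots of each video individually while having seen the other videos in the group.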

Results

Task                 Dataset  Metric                 Value  Model
Video                TvSum    F1-score (Augmented)   61.9   VJMHT
Video                TvSum    F1-score (Canonical)   60.9   VJMHT
Video                TvSum    Kendall's Tau          0.097  VJMHT
Video                TvSum    Spearman's Rho         0.105  VJMHT
Video                SumMe    F1-score (Augmented)   51.7   VJMHT
Video                SumMe    F1-score (Canonical)   50.6   VJMHT
Video                SumMe    Kendall's Tau          0.106  VJMHT
Video                SumMe    Spearman's Rho         0.108  VJMHT
Video Summarization  TvSum    F1-score (Augmented)   61.9   VJMHT
Video Summarization  TvSum    F1-score (Canonical)   60.9   VJMHT
Video Summarization  TvSum    Kendall's Tau          0.097  VJMHT
Video Summarization  TvSum    Spearman's Rho         0.105  VJMHT
Video Summarization  SumMe    F1-score (Augmented)   51.7   VJMHT
Video Summarization  SumMe    F1-score (Canonical)   50.6   VJMHT
Video Summarization  SumMe    Kendall's Tau          0.106  VJMHT
Video Summarization  SumMe    Spearman's Rho         0.108  VJMHT
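
For readers unfamiliar with the metrics in the table, the sketch below shows one common way to compute an F-measure between a predicted and a reference summary, and Kendall's tau and Spearman's rho between predicted and human importance scores. The page does not state the exact evaluation protocol, so this is an illustration under assumptions: summaries are binary per-frame masks, and the rank correlations use the simple no-ties formulas. All input values are made up.

```python
def f1_score(pred, truth):
    # F-measure between two binary per-frame selection masks (1 = selected).
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred)
    recall = overlap / sum(truth)
    return 2 * precision * recall / (precision + recall)

def kendall_tau(x, y):
    # Kendall's tau-a: (concordant - discordant) / number of pairs.
    # Assumes no tied values in x or y.
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)

def ranks(values):
    # Rank positions starting at 1; assumes no tied values.
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

def spearman_rho(x, y):
    # Spearman's rho via the rank-difference formula; assumes no ties.
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Toy example: a 4-frame video.
predicted_mask = [1, 1, 0, 0]
reference_mask = [1, 0, 1, 0]
predicted_scores = [0.1, 0.4, 0.3, 0.9]
human_scores = [0.2, 0.3, 0.5, 0.8]

f1 = f1_score(predicted_mask, reference_mask)
tau = kendall_tau(predicted_scores, human_scores)
rho = spearman_rho(predicted_scores, human_scores)
```

F-measure rewards overlap with the reference summary, while the two rank correlations measure how well the ordering of predicted importance scores agrees with human annotations.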