MAST: Multimodal Abstractive Summarization with Trimodal Hierarchical Attention

Aman Khullar, Udit Arora

2020-10-15
EMNLP (nlpbt) 2020 11
Tasks: Multimodal Abstractive Text Summarization, Abstractive Text Summarization, Text Summarization


Abstract

This paper presents MAST, a new model for Multimodal Abstractive Text Summarization that utilizes information from all three modalities -- text, audio and video -- in a multimodal video. Prior work on multimodal abstractive text summarization utilized information from only the text and video modalities. We examine the usefulness and challenges of deriving information from the audio modality and present a sequence-to-sequence trimodal hierarchical attention-based model that overcomes these challenges by letting the model pay more attention to the text modality. MAST outperforms the current state-of-the-art model (video-text) by 2.51 points in terms of Content F1 score and 1.00 points in terms of ROUGE-L score on the How2 dataset for multimodal language understanding.
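The idea of hierarchical attention over three modalities can be illustrated with a small sketch. This is not the MAST implementation: it is a generic two-level dot-product attention in plain Python, assuming all modalities share one vector dimension and omitting the learned projections and the sequence-to-sequence decoder the abstract refers to. The names `attend` and `hierarchical_attend` and the toy vectors are hypothetical. The first level attends over the encoder states within each modality; the second level attends over the resulting per-modality context vectors.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, states):
    # Dot-product attention: returns (context vector, attention weights).
    weights = softmax([dot(query, s) for s in states])
    dim = len(states[0])
    context = [sum(w * s[i] for w, s in zip(weights, states)) for i in range(dim)]
    return context, weights

def hierarchical_attend(query, modalities):
    # modalities: dict mapping a modality name to its list of state vectors.
    # Level 1: one context vector per modality.
    names = list(modalities)
    contexts = [attend(query, modalities[n])[0] for n in names]
    # Level 2: attention over the per-modality context vectors.
    fused, modality_weights = attend(query, contexts)
    return fused, dict(zip(names, modality_weights))

# Toy inputs (made up for illustration; not data from the paper).
query = [1.0, 0.0]
modalities = {
    "text": [[2.0, 0.0], [0.0, 1.0]],
    "audio": [[0.1, 0.3], [0.0, 0.2]],
    "video": [[0.5, 0.5], [0.4, 0.1]],
}
fused, weights = hierarchical_attend(query, modalities)
print(weights)
```

In this toy setup the text modality receives the largest second-level weight only because of the chosen vectors; in a trained model those weights would come from learned parameters.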

Results

Task	Dataset	Metric	Value	Model
Text Summarization	How2 300h	ROUGE-L	43.23	MAST
Abstractive Text Summarization	How2 300h	ROUGE-L	43.23	MAST
