Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Sparsifying Transformer Models with Trainable Representation Pooling

Michał Pietruszka, Łukasz Borchmann, Łukasz Garncarek

2020-09-10 · ACL 2022
Tasks: Text Summarization · Summarization · Document Summarization
Links: Paper · PDF · Code (official)

Abstract

We propose a novel method to sparsify attention in the Transformer model by learning to select the most informative token representations during training, thus focusing on the task-specific parts of the input. A robust trainable top-$k$ operator reduces the quadratic time and memory complexity to sublinear. Our experiments on a challenging long document summarization task show that even our simple baseline performs comparably to the current SOTA, and that with trainable pooling we retain its top quality while being $1.8\times$ faster during training, $4.5\times$ faster during inference, and up to $13\times$ more computationally efficient in the decoder.
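The core idea above — score every token representation and keep only the top-$k$ before the expensive attention layers — can be sketched as follows. This is a minimal illustration, not the paper's method: it assumes a hypothetical learned linear scorer (`scorer_w`) and uses hard `argsort` selection, whereas the paper trains a differentiable (soft) top-$k$ operator so that gradients can flow through the selection.

```python
import numpy as np

def topk_pool(tokens, scorer_w, k):
    """Keep the k highest-scoring token representations.

    tokens:   (seq_len, d) array of token representations
    scorer_w: (d,) weights of a hypothetical learned linear scorer
    k:        number of tokens to retain
    """
    scores = tokens @ scorer_w      # one relevance score per token
    keep = np.argsort(scores)[-k:]  # indices of the k best-scoring tokens
    keep.sort()                     # preserve the original token order
    return tokens[keep]             # (k, d) pooled sequence

rng = np.random.default_rng(0)
seq = rng.normal(size=(512, 64))    # a 512-token input
w = rng.normal(size=64)
pooled = topk_pool(seq, w, k=128)
print(pooled.shape)                 # (128, 64)
```

Downstream layers then attend over the $k$ pooled tokens instead of the full sequence, which is what turns the quadratic attention cost into a sublinear one when $k$ grows slower than the input length.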

Results

Task               | Dataset                     | Metric  | Value | Model
Text Summarization | arXiv Summarization Dataset | ROUGE-1 | 46.85 | Blockwise (baseline)
Text Summarization | arXiv Summarization Dataset | ROUGE-2 | 19.39 | Blockwise (baseline)
Text Summarization | Pubmed                      | ROUGE-1 | 47.81 | DeepPyramidion
Text Summarization | Pubmed                      | ROUGE-2 | 21.14 | DeepPyramidion

Related Papers

- LRCTI: A Large Language Model-Based Framework for Multi-Step Evidence Retrieval and Reasoning in Cyber Threat Intelligence Credibility Verification (2025-07-15)
- GenerationPrograms: Fine-grained Attribution with Executable Programs (2025-06-17)
- Arctic Long Sequence Training: Scalable And Efficient Training For Multi-Million Token Sequences (2025-06-16)
- On-the-Fly Adaptive Distillation of Transformer to Dual-State Linear Attention (2025-06-11)
- Improving large language models with concept-aware fine-tuning (2025-06-09)
- Improving Fairness of Large Language Models in Multi-document Summarization (2025-06-09)
- MaCP: Minimal yet Mighty Adaptation via Hierarchical Cosine Projection (2025-05-29)
- ARC: Argument Representation and Coverage Analysis for Zero-Shot Long Document Summarization with Instruction Following LLMs (2025-05-29)