Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

MuLD: The Multitask Long Document Benchmark

G Thomas Hudson, Noura Al Moubayed

2022-02-15 · LREC 2022 6
Tasks: Text Classification, Question Answering, Summarization, Style change detection, Translation

Abstract

The impressive progress in NLP techniques has been driven by the development of multi-task benchmarks such as GLUE and SuperGLUE. While these benchmarks focus on tasks for one or two input sentences, there has been exciting work in designing efficient techniques for processing much longer inputs. In this paper, we present MuLD: a new long document benchmark consisting of only documents over 10,000 tokens. By modifying existing NLP tasks, we create a diverse benchmark which requires models to successfully model long-term dependencies in the text. We evaluate how existing models perform, and find that our benchmark is much more challenging than their "short document" equivalents. Furthermore, by evaluating both regular and efficient transformers, we show that models with increased context length are better able to solve the tasks presented, suggesting that future improvements in these models are vital for solving similar long document problems. We release the data and code for baselines to encourage further research on efficient NLP models.
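The length criterion in the abstract (only documents over 10,000 tokens) can be illustrated with a minimal sketch. It assumes whitespace tokenization, since the abstract does not say which tokenizer defines the threshold. The names `count_tokens` and `keep_long_documents` are hypothetical and do not come from the released MuLD code.

```python
from typing import Iterable, List

# Threshold quoted in the abstract; how tokens are counted is an assumption here.
MIN_TOKENS = 10_000


def count_tokens(text: str) -> int:
    """Count whitespace-separated tokens (an assumption, not the paper's tokenizer)."""
    return len(text.split())


def keep_long_documents(docs: Iterable[str], min_tokens: int = MIN_TOKENS) -> List[str]:
    """Return only the documents with strictly more than `min_tokens` tokens."""
    return [doc for doc in docs if count_tokens(doc) > min_tokens]


if __name__ == "__main__":
    short_doc = "word " * 500      # 500 tokens, filtered out
    long_doc = "word " * 12_000    # 12,000 tokens, kept
    kept = keep_long_documents([short_doc, long_doc])
    print(f"kept {len(kept)} of 2 documents")
```

A subword tokenizer would typically produce more tokens per document than whitespace splitting, so the set of documents that pass the threshold depends on the tokenizer chosen.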

Results

Task                 Dataset                Metric   Value  Model
Question Answering   MuLD (NarrativeQA)     BLEU-1   19.84  Longformer
Question Answering   MuLD (NarrativeQA)     BLEU-4   62     Longformer
Question Answering   MuLD (NarrativeQA)     METEOR   4.52   Longformer
Question Answering   MuLD (NarrativeQA)     Rouge-L  22.09  Longformer
Question Answering   MuLD (NarrativeQA)     BLEU-1   17.67  T5
Question Answering   MuLD (NarrativeQA)     BLEU-4   55     T5
Question Answering   MuLD (NarrativeQA)     METEOR   3.36   T5
Question Answering   MuLD (NarrativeQA)     Rouge-L  19.03  T5
Question Answering   MuLD (HotpotQA)        BLEU-1   30.38  Longformer
Question Answering   MuLD (HotpotQA)        BLEU-4   16.76  Longformer
Question Answering   MuLD (HotpotQA)        METEOR   4.98   Longformer
Question Answering   MuLD (HotpotQA)        Rouge-L  30.49  Longformer
Question Answering   MuLD (HotpotQA)        BLEU-1   28.11  T5
Question Answering   MuLD (HotpotQA)        BLEU-4   13.63  T5
Question Answering   MuLD (HotpotQA)        METEOR   4.46   T5
Question Answering   MuLD (HotpotQA)        Rouge-L  27.61  T5
Summarization        MuLD (VLSP)            BLEU-1   46.74  Longformer
Summarization        MuLD (VLSP)            BLEU-4   3.05   Longformer
Summarization        MuLD (VLSP)            METEOR   9.58   Longformer
Summarization        MuLD (VLSP)            Rouge-L  19.52  Longformer
Summarization        MuLD (VLSP)            BLEU-1   28.85  T5
Summarization        MuLD (VLSP)            BLEU-4   84     T5
Summarization        MuLD (VLSP)            METEOR   7.98   T5
Summarization        MuLD (VLSP)            Rouge-L  16.55  T5
Text Classification  MuLD (Character Type)  F1       82.58  Longformer
Text Classification  MuLD (Character Type)  F1       54.01  T5
Classification       MuLD (Character Type)  F1       82.58  Longformer
Classification       MuLD (Character Type)  F1       54.01  T5
Translation          MuLD (OpenSubtitles)   BLEU-1   34.07  T5
Translation          MuLD (OpenSubtitles)   BLEU-4   1.63   T5
Translation          MuLD (OpenSubtitles)   METEOR   38.53  T5
Translation          MuLD (OpenSubtitles)   Rouge-L  35.35  T5
Translation          MuLD (OpenSubtitles)   BLEU-1   22.74  Longformer
Translation          MuLD (OpenSubtitles)   BLEU-4   20     Longformer
Translation          MuLD (OpenSubtitles)   METEOR   22.95  Longformer
Translation          MuLD (OpenSubtitles)   Rouge-L  22.17  Longformer
