Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Agent-as-Judge for Factual Summarization of Long Narratives

Yeonseok Jeong, Minsoo Kim, Seung-won Hwang, Byung-Hak Kim

2025-01-17 | Long-Form Narrative Summarization

Abstract

Large Language Models (LLMs) have demonstrated near-human performance in summarization tasks based on traditional metrics such as ROUGE and BERTScore. However, these metrics do not adequately capture critical aspects of summarization quality, such as factual accuracy, particularly for long narratives (>100K tokens). Recent advances, such as LLM-as-a-Judge, address the limitations of metrics based on lexical similarity but still exhibit factual inconsistencies, especially in understanding character relationships and states. In this work, we introduce NarrativeFactScore, a novel "Agent-as-a-Judge" framework for evaluating and refining summaries. By leveraging a Character Knowledge Graph (CKG) extracted from input and generated summaries, NarrativeFactScore assesses the factual consistency and provides actionable guidance for refinement, such as identifying missing or erroneous facts. We demonstrate the effectiveness of NarrativeFactScore through a detailed workflow illustration and extensive validation on widely adopted benchmarks, achieving superior performance compared to competitive methods. Our results highlight the potential of agent-driven evaluation systems to improve the factual reliability of LLM-generated summaries.
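
As a rough illustration of the graph-based idea described in the abstract, the sketch below compares character-relationship triples taken from a source text against those taken from a summary. This is not the NarrativeFactScore implementation: the `Triple` format, the function name `triple_consistency`, and the scoring rule (fraction of summary triples that also appear in the source) are all assumptions made for illustration, and the triples are written by hand rather than extracted by an LLM.

```python
from typing import List, NamedTuple, Tuple


class Triple(NamedTuple):
    """Hypothetical (subject, relation, object) edge of a character graph."""
    subject: str
    relation: str
    obj: str


def normalize(t: Triple) -> Triple:
    # Lowercase and trim so that "Anna" and "anna " compare equal.
    return Triple(t.subject.strip().lower(),
                  t.relation.strip().lower(),
                  t.obj.strip().lower())


def triple_consistency(
    source: List[Triple], summary: List[Triple]
) -> Tuple[float, List[Triple], List[Triple]]:
    """Return (score, unsupported, missing).

    score       fraction of summary triples found in the source
    unsupported summary triples absent from the source (possible errors)
    missing     source triples absent from the summary (possible omissions)
    """
    src = {normalize(t) for t in source}
    summ = {normalize(t) for t in summary}
    if not summ:
        # No claims to verify; treated as 0.0 by convention in this sketch.
        return 0.0, [], sorted(src)
    unsupported = sorted(summ - src)
    missing = sorted(src - summ)
    score = len(summ & src) / len(summ)
    return score, unsupported, missing


# Hand-written toy data (hypothetical characters and relations).
source_triples = [
    Triple("Anna", "sister_of", "Ben"),
    Triple("Ben", "works_for", "Clara"),
    Triple("Clara", "enemy_of", "Anna"),
]
summary_triples = [
    Triple("anna", "sister_of", "ben"),
    Triple("Ben", "married_to", "Clara"),
]

score, unsupported, missing = triple_consistency(source_triples, summary_triples)
print(score, unsupported, missing)
```

The `unsupported` and `missing` lists stand in for the kind of actionable feedback the abstract mentions (erroneous and missing facts); exact set matching is a deliberate simplification.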

Results

Task                Dataset   Metric          Value  Model
Text Summarization  MENSA     BERTScore (F1)  60.22  Hierarchically Merging and Agent Refinement
Text Summarization  MENSA     ROUGE-1         31.31  Hierarchically Merging and Agent Refinement
Text Summarization  MENSA     ROUGE-2          8.81  Hierarchically Merging and Agent Refinement
Text Summarization  MENSA     ROUGE-L         18.62  Hierarchically Merging and Agent Refinement
Text Summarization  MovieSum  BERTScore (F1)  59.32  Hierarchically Merging and Agent Refinement
Text Summarization  MovieSum  ROUGE-1         31.31  Hierarchically Merging and Agent Refinement
Text Summarization  MovieSum  ROUGE-2          8.81  Hierarchically Merging and Agent Refinement
Text Summarization  MovieSum  ROUGE-L         18.62  Hierarchically Merging and Agent Refinement
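
For readers unfamiliar with the ROUGE-1 metric reported above, here is a minimal sketch of a unigram-overlap F1 score. It assumes whitespace tokenization and lowercasing; the page does not state which ROUGE implementation, tokenizer, or stemming settings produced the reported values, so this will not reproduce them exactly. The function name `rouge1_f1` is illustrative, and the result is on a 0 to 1 scale (the table values appear to be on a 0 to 100 scale, which is also an assumption).

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate summary."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Toy sentences, not taken from MENSA or MovieSum.
example_score = rouge1_f1("the hero saves the city", "the hero leaves the city")
print(round(example_score, 2))
```

Because the score only counts shared words, a summary can score well while misstating who did what, which is the gap the abstract points to.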

Related Papers

NexusSum: Hierarchical LLM Agents for Long-Form Narrative Summarization (2025-05-30)
End-to-End Long Document Summarization using Gradient Caching (2025-01-03)
MovieSum: An Abstractive Summarization Dataset for Movie Screenplays (2024-08-12)
Chain of Agents: Large Language Models Collaborating on Long-Context Tasks (2024-06-04)
Select and Summarize: Scene Saliency for Movie Script Summarization (2024-04-04)
BOOKSUM: A Collection of Datasets for Long-form Narrative Summarization (2022-01-16)
BookSum: A Collection of Datasets for Long-form Narrative Summarization (2021-05-18)