


ChuLo: Chunk-Level Key Information Representation for Long Document Processing

Yan Li, Soyeon Caren Han, Yue Dai, Feiqi Cao

2024-10-14
Tasks: Keyphrase Extraction, Token Classification, Document Understanding, Named Entity Recognition, Document Classification, Classification, Chunking, Multilabel Text Classification

Abstract

Transformer-based models have achieved remarkable success in various Natural Language Processing (NLP) tasks, yet their ability to handle long documents is constrained by computational limitations. Traditional approaches, such as truncating inputs, sparse self-attention, and chunking, attempt to mitigate these issues, but they often lead to information loss and hinder the model's ability to capture long-range dependencies. In this paper, we introduce ChuLo, a novel chunk representation method for long document classification that addresses these limitations. ChuLo groups input tokens using unsupervised keyphrase extraction, emphasizing semantically important keyphrase-based chunks to retain core document content while reducing input length. This approach minimizes information loss and improves the efficiency of Transformer-based models. Preserving all tokens is particularly important in long document understanding, and above all in token classification tasks, so that fine-grained annotations, which depend on the entire sequence context, are not lost. We evaluate our method on multiple long document classification tasks and long document token classification tasks, demonstrating its effectiveness through comprehensive qualitative and quantitative analyses.
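
The chunking idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the frequency-based keyword scorer, the fixed chunk size, the weights `key_weight` and `other_weight`, and the toy embedding function are all assumptions chosen for illustration. It only shows the general pattern of keeping every token, grouping tokens into chunks, and letting keyphrase tokens dominate each chunk vector so that the sequence passed to a Transformer is shorter.

```python
from collections import Counter

# Assumption: a tiny stopword list, used only by the toy keyword scorer below.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}


def toy_keywords(tokens, top_k=2):
    """Hypothetical stand-in for unsupervised keyphrase extraction:
    returns the top_k most frequent non-stopword tokens (lowercased)."""
    counts = Counter(t.lower() for t in tokens if t.lower() not in STOPWORDS)
    return {word for word, _ in counts.most_common(top_k)}


def chunk_tokens(tokens, chunk_size):
    """Split the token list into consecutive chunks; no token is dropped."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def toy_embed(token, dim=4):
    """Deterministic toy embedding (assumption; a real model would use
    learned token embeddings)."""
    code = sum(ord(c) for c in token)
    return [((code * (d + 1)) % 7) / 7.0 for d in range(dim)]


def chunk_representation(chunk, keywords, embed=toy_embed,
                         key_weight=0.8, other_weight=0.1):
    """Weighted average of token embeddings in one chunk, where keyword
    tokens receive a larger (assumed) weight than the remaining tokens."""
    weights = [key_weight if t.lower() in keywords else other_weight
               for t in chunk]
    total = sum(weights)
    vectors = [embed(t) for t in chunk]
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) / total
            for d in range(dim)]


def chunk_level_inputs(tokens, chunk_size=4, top_k=2):
    """Map a long token sequence to one vector per chunk."""
    keywords = toy_keywords(tokens, top_k)
    return [chunk_representation(c, keywords)
            for c in chunk_tokens(tokens, chunk_size)]


text = ("the model reads long documents and the model groups "
        "tokens into chunks of tokens")
tokens = text.split()
reps = chunk_level_inputs(tokens)
print(len(tokens), "tokens ->", len(reps), "chunk vectors")
```

With a chunk size of `n`, the sequence seen by the downstream model shrinks by roughly a factor of `n`, while every original token still contributes to some chunk vector.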

Results

Task                     Dataset                        Metric    Value  Model
Text Classification      Hyperpartisan News Detection   Accuracy  95.38  ChuLo
Text Classification      LUN                            Accuracy  64.4   ChuLo
Document Classification  Hyperpartisan News Detection   Accuracy  95.38  ChuLo
Document Classification  LUN                            Accuracy  64.4   ChuLo
Classification           Hyperpartisan News Detection   Accuracy  95.38  ChuLo
Classification           LUN                            Accuracy  64.4   ChuLo
