SpreadsheetLLM: Encoding Spreadsheets for Large Language Models

Yuzhang Tian, Jianbo Zhao, Haoyu Dong, Junyu Xiong, Shiyu Xia, Mengyu Zhou, Yun Lin, José Cambronero, Yeye He, Shi Han, Dongmei Zhang

2024-07-12 | Table Detection

Abstract

Spreadsheets, with their extensive two-dimensional grids, various layouts, and diverse formatting options, present notable challenges for large language models (LLMs). In response, we introduce SpreadsheetLLM, pioneering an efficient encoding method designed to unleash and optimize LLMs' powerful understanding and reasoning capability on spreadsheets. Initially, we propose a vanilla serialization approach that incorporates cell addresses, values, and formats. However, this approach is limited by LLMs' token constraints, making it impractical for most applications. To tackle this challenge, we develop SheetCompressor, an innovative encoding framework that compresses spreadsheets effectively for LLMs. It comprises three modules: structural-anchor-based compression, inverse index translation, and data-format-aware aggregation. It significantly improves performance on the spreadsheet table detection task, outperforming the vanilla approach by 25.6% in GPT4's in-context learning setting. Moreover, an LLM fine-tuned with SheetCompressor achieves an average compression ratio of 25 times while reaching a state-of-the-art 78.9% F1 score, surpassing the best existing models by 12.3%. Finally, we propose Chain of Spreadsheet for downstream tasks of spreadsheet understanding and validate it on a new and demanding spreadsheet QA task. We methodically leverage the inherent layout and structure of spreadsheets, demonstrating that SpreadsheetLLM is highly effective across a variety of spreadsheet tasks.
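
To make the contrast between the vanilla serialization and a compressed encoding concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "inverse index translation" means mapping each distinct non-empty cell value to the list of addresses that hold it; the function names `vanilla_encode` and `inverted_index_encode`, the `|` and `,` separators, and the sample cells are all hypothetical, and cell formats are omitted for brevity.

```python
from collections import defaultdict

def vanilla_encode(cells):
    """Serialize every cell, including empty ones, as 'address,value'.

    Assumed layout: cells are (address, value) pairs in row-major order.
    """
    return "|".join(f"{addr},{value}" for addr, value in cells)

def inverted_index_encode(cells):
    """Map each distinct non-empty value to the addresses that contain it.

    Assumption: empty cells are dropped and repeated values are stored once,
    which is where the token savings would come from.
    """
    index = defaultdict(list)
    for addr, value in cells:
        if value != "":
            index[value].append(addr)
    return dict(index)

# Hypothetical 3x3 sheet with an empty column and a repeated value.
cells = [
    ("A1", "Year"), ("B1", "Region"), ("C1", ""),
    ("A2", "2023"), ("B2", "East"),   ("C2", ""),
    ("A3", "2023"), ("B3", "West"),   ("C3", ""),
]

print(vanilla_encode(cells))
print(inverted_index_encode(cells))
```

In this toy sheet the vanilla form emits nine entries, three of them for empty cells, while the inverted form keeps five keys and lists the repeated value "2023" once with both of its addresses. The other two modules named in the abstract (structural-anchor-based compression and data-format-aware aggregation) are not sketched here.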
