Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

KnowCoder: Coding Structured Knowledge into LLMs for Universal Information Extraction

Zixuan Li, Yutao Zeng, Yuxin Zuo, Weicheng Ren, Wenxuan Liu, Miao Su, Yucan Guo, Yantao Liu, Xiang Li, Zhilei Hu, Long Bai, Wei Li, Yidan Liu, Pan Yang, Xiaolong Jin, Jiafeng Guo, Xueqi Cheng

2024-03-12 | Large Language Model, Code Generation, UIE, Language Modelling

Abstract

In this paper, we propose KnowCoder, a Large Language Model (LLM) that performs Universal Information Extraction (UIE) via code generation. KnowCoder aims to develop a unified schema representation that LLMs can easily understand and an effective learning framework that encourages LLMs to follow schemas and extract structured knowledge accurately. To achieve these goals, KnowCoder introduces a code-style schema representation method that uniformly transforms different schemas into Python classes, so that complex schema information, such as constraints among tasks in UIE, can be captured in an LLM-friendly manner. We further construct a code-style schema library covering over $\textbf{30,000}$ types of knowledge, which is, to the best of our knowledge, the largest one for UIE. To ease the learning process of LLMs, KnowCoder uses a two-phase learning framework that enhances its schema understanding ability via code pretraining and its schema following ability via instruction tuning. After code pretraining on around $1.5$B automatically constructed data, KnowCoder already attains remarkable generalization ability and achieves a relative improvement of $\textbf{49.8\%}$ F1 over LLaMA2 under the few-shot setting. After instruction tuning, KnowCoder further exhibits strong generalization ability on unseen schemas and achieves improvements of up to $\textbf{12.5\%}$ and $\textbf{21.9\%}$ over state-of-the-art baselines under the zero-shot setting and the low-resource setting, respectively. Additionally, based on our unified schema representations, various human-annotated datasets can be used simultaneously to refine KnowCoder, which achieves significant improvements of up to $\textbf{7.5\%}$ under the supervised setting.
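To make the idea of a code-style schema more concrete, the following is a minimal sketch of how entity and relation types might be written as Python classes, with a type constraint between them. It is an illustration only, not the authors' actual schema library: the class names (`Entity`, `Person`, `Location`, `Relation`, `PlaceOfBirth`), the field names, and the docstrings are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Base class for every entity type in this sketch (hypothetical)."""
    name: str


class Person(Entity):
    """A human being (hypothetical type description)."""


class Location(Entity):
    """A geographic place (hypothetical type description)."""


@dataclass(frozen=True)
class Relation:
    """Base class for every relation type in this sketch (hypothetical)."""
    head: Entity
    tail: Entity


@dataclass(frozen=True)
class PlaceOfBirth(Relation):
    """Links a person to the location where they were born."""
    head: Person
    tail: Location

    def __post_init__(self):
        # Dataclass annotations are not enforced at runtime,
        # so the schema constraint is checked explicitly here.
        if not isinstance(self.head, Person):
            raise TypeError("head must be a Person")
        if not isinstance(self.tail, Location):
            raise TypeError("tail must be a Location")


# An extraction result is then expressed as instantiated objects.
results = [PlaceOfBirth(Person("Marie Curie"), Location("Warsaw"))]
```

In a representation like this, inheritance can express a type hierarchy and the annotated fields can express constraints between tasks, for example that a relation's arguments must be entities of particular types.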

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Enhancement | ACE 2005-RE | F1 score | 64.5 | KnowCoder-7b-IE |
| Image Enhancement | MIT Movie | F1 score | 90.6 | KnowCoder-7b-IE |
| Image Enhancement | ncbi_disease | F1 score | 83.8 | KnowCoder-7b-IE |
| Image Enhancement | ACE 2005-ED | F1 score | 74.2 | KnowCoder-7b-IE |
| Image Enhancement | ACE 2005-NER | F1 score | 86.1 | KnowCoder-7b-IE |
| Image Enhancement | CoNLL 2003 | F1 score | 95.1 | KnowCoder-7b-IE |
| Image Enhancement | SciERC | F1 score | 40 | KnowCoder-7b-IE |
| Image Enhancement | FabNER | F1 score | 82.9 | KnowCoder-7b-IE |
| Image Enhancement | ACE 2004 | F1 score | 86.2 | KnowCoder-7b-IE |
| Image Enhancement | Broad Twitter | F1 score | 78.3 | KnowCoder-7b-IE |
| Image Enhancement | BC5CDR | F1 score | 89.3 | KnowCoder-7b-IE |
| Image Enhancement | CoNLL 2004 | F1 score | 73.3 | KnowCoder-7b-IE |
| Image Enhancement | WNUT 2017 | F1 score | 66.4 | KnowCoder-7b-IE |
| Image Enhancement | GIDS | F1 score | 78 | KnowCoder-7b-IE |
| Image Enhancement | semeval RE | F1 score | 66.3 | KnowCoder-7b-IE |
| Image Enhancement | MultiNERD | F1 score | 96.1 | KnowCoder-7b-IE |
| Image Enhancement | GENIA | F1 score | 76.7 | KnowCoder-7b-IE |
| Image Enhancement | FindVehicle | F1 score | 99.4 | KnowCoder-7b-IE |
| Image Enhancement | kbp37 | F1 score | 73.2 | KnowCoder-7b-IE |
| Image Enhancement | DIANN | F1 score | 94.7 | KnowCoder-7b-IE |
| Image Enhancement | ACE 2005-EAE | F1 score | 70.3 | KnowCoder-7b-IE |
| Image Enhancement | ADE Corpus | F1 score | 84.3 | KnowCoder-7b-IE |
| Image Enhancement | NYT | F1 score | 93.7 | KnowCoder-7b-IE |
| Image Enhancement | BC2GM | F1 score | 82 | KnowCoder-7b-IE |
| Image Enhancement | WikiANN | F1 score | 87 | KnowCoder-7b-IE |
| Image Enhancement | OntoNotes 5.0 | F1 score | 88.2 | KnowCoder-7b-IE |
| Image Enhancement | MIT Restaurant | F1 score | 81.3 | KnowCoder-7b-IE |
| Image Enhancement | AnatEM | F1 score | 86.4 | KnowCoder-7b-IE |
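The page does not say how the F1 scores above are computed. As a rough guide to reading them, here is a minimal sketch of micro-averaged F1 over exact-match extractions, a common convention for information extraction benchmarks. The function name `micro_f1`, the tuple format of the items, and the sample data are assumptions for illustration; the paper's evaluation protocol may differ.

```python
def micro_f1(predicted: list[set], gold: list[set]) -> float:
    """Micro-averaged F1 over per-sentence sets of extracted items.

    An item counts as correct only if it matches a gold item exactly
    (assumed convention; not confirmed by the source).
    """
    true_positives = sum(len(p & g) for p, g in zip(predicted, gold))
    n_predicted = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / n_predicted
    recall = true_positives / n_gold
    return 2 * precision * recall / (precision + recall)


# Hypothetical data: one sentence, two gold entities, one of two predictions correct.
gold = [{("Marie Curie", "Person"), ("Warsaw", "Location")}]
predicted = [{("Marie Curie", "Person"), ("Paris", "Location")}]

score = micro_f1(predicted, gold)  # precision = 0.5, recall = 0.5, so F1 = 0.5
```

The table reports values on a 0 to 100 scale, so a score of 0.5 from this function would correspond to 50 there.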
