Maksym Tarnavskyi, Artem Chernodub, Kostiantyn Omelianchuk
In this paper, we investigate improvements to the GEC sequence tagging architecture with a focus on ensembling recent cutting-edge Transformer-based encoders in their Large configurations. We encourage ensembling models by majority vote on span-level edits because this approach is tolerant to model architecture and vocabulary size. Our best ensemble achieves a new SOTA result with an $F_{0.5}$ score of 76.05 on BEA-2019 (test), even without pre-training on synthetic datasets. In addition, we perform knowledge distillation with a trained ensemble to generate new synthetic training datasets, "Troy-Blogs" and "Troy-1BW". Our best single sequence tagging model, pretrained on the generated Troy-datasets in combination with the publicly available synthetic PIE dataset, achieves a near-SOTA result with an $F_{0.5}$ score of 73.21 on BEA-2019 (test); to the best of our knowledge, this single model is surpassed only by the much heavier T5 model. The code, datasets, and trained models are publicly available.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Grammatical Error Correction | BEA-2019 (test) | F0.5 | 76.05 | DeBERTa + RoBERTa + XLNet |
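The abstract describes ensembling by majority vote on span-level edits. The sketch below illustrates that idea under simplifying assumptions: each model's output is reduced to a set of (start, end, replacement) span edits over the source tokens, and only edits proposed by a minimum number of models are kept. The edit representation, the `min_votes` threshold, and the helper names are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of majority-vote ensembling over span-level edits.
# Assumption: each model's corrections are represented as
# (start, end, replacement) tuples over the source token sequence.
from collections import Counter
from typing import Iterable, List, Tuple

Edit = Tuple[int, int, str]  # (span start, span end, replacement text)


def ensemble_edits(per_model_edits: Iterable[List[Edit]],
                   min_votes: int) -> List[Edit]:
    """Keep only edits proposed by at least `min_votes` models."""
    votes = Counter()
    for edits in per_model_edits:
        # Count each distinct edit at most once per model.
        votes.update(set(edits))
    kept = [edit for edit, n in votes.items() if n >= min_votes]
    # Sort by span position so edits can be applied left to right.
    return sorted(kept, key=lambda e: (e[0], e[1]))


def apply_edits(tokens: List[str], edits: List[Edit]) -> List[str]:
    """Apply non-overlapping span edits to a token sequence."""
    out, prev = [], 0
    for start, end, repl in edits:
        out.extend(tokens[prev:start])
        if repl:
            out.extend(repl.split())
        prev = end
    out.extend(tokens[prev:])
    return out


if __name__ == "__main__":
    source = "She go to school yesterday .".split()
    # Hypothetical outputs from three single models.
    model_a = [(1, 2, "went")]
    model_b = [(1, 2, "went"), (3, 4, "the school")]
    model_c = [(1, 2, "went")]
    agreed = ensemble_edits([model_a, model_b, model_c], min_votes=2)
    print(" ".join(apply_edits(source, agreed)))
    # -> "She went to school yesterday ."
```

Because voting happens on span edits rather than on output token distributions, models with different encoders and vocabularies (e.g., DeBERTa, RoBERTa, XLNet) can be combined without aligning their subword vocabularies.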