Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters

Hicham El Boukkouri, Olivier Ferret, Thomas Lavergne, Hiroshi Noji, Pierre Zweigenbaum, Junichi Tsujii

2020-10-20 · COLING 2020

Tasks: Relation Extraction, Natural Language Inference, Semantic Similarity, Clinical Concept Extraction, Drug–Drug Interaction Extraction

Abstract

Due to the compelling improvements brought by BERT, many recent representation models adopted the Transformer architecture as their main building block, consequently inheriting the wordpiece tokenization system despite it not being intrinsically linked to the notion of Transformers. While this system is thought to achieve a good balance between the flexibility of characters and the efficiency of full words, using predefined wordpiece vocabularies from the general domain is not always suitable, especially when building models for specialized domains (e.g., the medical domain). Moreover, adopting a wordpiece tokenization shifts the focus from the word level to the subword level, making the models conceptually more complex and arguably less convenient in practice. For these reasons, we propose CharacterBERT, a new variant of BERT that drops the wordpiece system altogether and uses a Character-CNN module instead to represent entire words by consulting their characters. We show that this new model improves the performance of BERT on a variety of medical domain tasks while at the same time producing robust, word-level and open-vocabulary representations.
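The Character-CNN idea described above can be sketched in a few lines: embed each character of a word, run 1-D convolutions of several widths over the character sequence, max-pool over positions, and concatenate the pooled features into a single word-level vector. The sketch below is a minimal, untrained illustration with random weights (the layer sizes, byte-level character vocabulary, and function name are assumptions for illustration, not CharacterBERT's actual configuration); it only demonstrates why the encoder is open-vocabulary — any character string maps to a fixed-size vector.

```python
import numpy as np

def char_cnn_word_embedding(word, char_dim=16, kernel_sizes=(2, 3),
                            n_filters=8, max_len=20, seed=0):
    """Toy Character-CNN word encoder: char embeddings -> multi-width
    1-D convolutions -> max-pooling -> concatenated word vector.
    Weights are random here; a real model would learn them."""
    rng = np.random.default_rng(seed)
    # Byte-level "character vocabulary" (an assumption for this sketch),
    # truncated to max_len characters.
    chars = list(word.encode("utf-8")[:max_len])
    char_table = rng.standard_normal((256, char_dim))  # char embedding table
    x = char_table[chars]                              # (n_chars, char_dim)
    pooled = []
    for k in kernel_sizes:
        W = rng.standard_normal((k * char_dim, n_filters))  # width-k filter bank
        # Slide a width-k window over the character sequence.
        windows = np.stack([x[i:i + k].ravel()
                            for i in range(len(chars) - k + 1)])
        # Convolve, apply nonlinearity, max-pool over positions.
        pooled.append(np.tanh(windows @ W).max(axis=0))
    return np.concatenate(pooled)  # fixed-size word-level vector

vec = char_cnn_word_embedding("chloroquine")
```

Because the encoder consumes raw characters, misspellings and out-of-vocabulary medical terms still receive a representation of the same dimensionality, with no wordpiece vocabulary involved.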

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Relation Extraction | ChemProt | Micro F1 | 73.44 | CharacterBERT (base, medical) |
| Natural Language Inference | MedNLI | Accuracy | 84.95 | CharacterBERT (base, medical) |
| Language Modelling | ClinicalSTS | Pearson Correlation | 85.62 | CharacterBERT (base, medical, ensemble) |
| Information Extraction | DDI extraction 2013 corpus | Micro F1 | 80.38 | CharacterBERT (base, medical) |
| Sentence Pair Modeling | ClinicalSTS | Pearson Correlation | 85.62 | CharacterBERT (base, medical, ensemble) |
| Clinical Concept Extraction | 2010 i2b2/VA | Exact Span F1 | 89.24 | CharacterBERT (base, medical) |
| Semantic Similarity | ClinicalSTS | Pearson Correlation | 85.62 | CharacterBERT (base, medical, ensemble) |

Related Papers

- SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
- LRCTI: A Large Language Model-Based Framework for Multi-Step Evidence Retrieval and Reasoning in Cyber Threat Intelligence Credibility Verification (2025-07-15)
- DocIE@XLLM25: In-Context Learning for Information Extraction using Fully Synthetic Demonstrations (2025-07-08)
- DS@GT at CheckThat! 2025: Evaluating Context and Tokenization Strategies for Numerical Fact Verification (2025-07-08)
- SARA: Selective and Adaptive Retrieval-augmented Generation with Context Compression (2025-07-08)
- FA: Forced Prompt Learning of Vision-Language Models for Out-of-Distribution Detection (2025-07-06)
- LineRetriever: Planning-Aware Observation Reduction for Web Agents (2025-06-30)
- ARAG: Agentic Retrieval Augmented Generation for Personalized Recommendation (2025-06-27)