
CamemBERT: a Tasty French Language Model

Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, Benoît Sagot

Published 2019-11-10 · ACL 2020
Tasks: Natural Language Inference · Part-Of-Speech Tagging · Named Entity Recognition (NER) · Dependency Parsing · Language Modelling
Links: Paper · PDF · Code (official implementation, plus community implementations)

Abstract

Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes practical use of such models, in all languages other than English, very limited. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating our language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks. We show that the use of web-crawled data is preferable to the use of Wikipedia data. More surprisingly, we show that a relatively small web-crawled dataset (4GB) leads to results that are as good as those obtained using larger datasets (130+GB). Our best-performing model, CamemBERT, reaches or improves the state of the art in all four downstream tasks.
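CamemBERT is trained with a BERT-style masked language modelling objective over subword tokens (the NER results below report a "subword masking" variant). A minimal sketch of the standard 80/10/10 subword-masking scheme is shown here; the function name, token IDs, and hyperparameters are illustrative, not the paper's actual pipeline.

```python
import random

def mask_subwords(token_ids, vocab_size, mask_id, mask_prob=0.15, seed=0):
    """BERT-style subword masking: each token is independently selected with
    probability mask_prob; of the selected tokens, 80% are replaced by the
    [MASK] id, 10% by a random token, and 10% are kept unchanged. Returns
    (inputs, labels), where labels are -100 (ignored by the loss) at
    positions that were not selected."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                inputs.append(mask_id)                     # 80%: [MASK]
            elif r < 0.9:
                inputs.append(rng.randrange(vocab_size))   # 10%: random token
            else:
                inputs.append(tok)                         # 10%: unchanged
        else:
            inputs.append(tok)
            labels.append(-100)  # not selected; excluded from the loss
    return inputs, labels

# Illustrative usage on dummy token ids (vocab size and mask id are made up).
ids, labels = mask_subwords(list(range(20)), vocab_size=32000, mask_id=4, seed=42)
```

Whole-word masking, by contrast, selects all subword pieces of a word together; the paper's NER table below is labelled with the subword-masking variant.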

Results

Task | Dataset | Metric | Value | Model
Part-Of-Speech Tagging | Spoken Corpus | UPOS | 96.68 | CamemBERT
Part-Of-Speech Tagging | French GSD | UPOS | 98.19 | CamemBERT
Part-Of-Speech Tagging | Sequoia Treebank | UPOS | 99.21 | CamemBERT
Part-Of-Speech Tagging | ParTUT | UPOS | 97.63 | CamemBERT
Natural Language Inference | XNLI French | Accuracy | 85.7 | CamemBERT (large)
Natural Language Inference | XNLI French | Accuracy | 81.2 | CamemBERT (base)
Dependency Parsing | Spoken Corpus | LAS | 81.37 | CamemBERT
Dependency Parsing | Spoken Corpus | UAS | 86.05 | CamemBERT
Dependency Parsing | ParTUT | LAS | 92.9 | CamemBERT
Dependency Parsing | ParTUT | UAS | 95.21 | CamemBERT
Dependency Parsing | French GSD | LAS | 92.47 | CamemBERT
Dependency Parsing | French GSD | UAS | 94.82 | CamemBERT
Dependency Parsing | Sequoia Treebank | LAS | 94.39 | CamemBERT
Dependency Parsing | Sequoia Treebank | UAS | 95.56 | CamemBERT
Named Entity Recognition (NER) | French Treebank | F1 | 87.93 | CamemBERT (subword masking)
Named Entity Recognition (NER) | French Treebank | Precision | 88.35 | CamemBERT (subword masking)
Named Entity Recognition (NER) | French Treebank | Recall | 87.46 | CamemBERT (subword masking)
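The NER rows report precision, recall, and F1 separately; F1 is the harmonic mean of precision and recall. Recomputing it from the rounded P (88.35) and R (87.46) gives roughly 87.90, close to the reported 87.93 (the exact value depends on the unrounded entity counts). A minimal sanity check:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Recompute F1 from the rounded P/R reported for CamemBERT (subword masking).
print(round(f1_score(88.35, 87.46), 2))  # about 87.90
```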

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
- The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
- Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
- Assay2Mol: large language model-based drug design using BioAssay context (2025-07-16)
- Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
- InstructFLIP: Exploring Unified Vision-Language Model for Face Anti-spoofing (2025-07-16)