Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

FQuAD: French Question Answering Dataset

Martin d'Hoffschmidt, Wacim Belblidia, Tom Brendlé, Quentin Heinrich, Maxime Vidal

2020-02-14 · Findings of the Association for Computational Linguistics 2020
Tasks: Reading Comprehension · Question Answering · Machine Reading Comprehension · Language Modelling
Links: Paper · PDF

Abstract

Recent advances in the field of language modeling have improved state-of-the-art results on many Natural Language Processing tasks. Among them, Reading Comprehension has made significant progress over the past few years. However, most results are reported in English since labeled resources available in other languages, such as French, remain scarce. In the present work, we introduce the French Question Answering Dataset (FQuAD). FQuAD is a French Native Reading Comprehension dataset of questions and answers on a set of Wikipedia articles that consists of 25,000+ samples for the 1.0 version and 60,000+ samples for the 1.1 version. We train a baseline model which achieves an F1 score of 92.2 and an exact match ratio of 82.1 on the test set. In order to track the progress of French Question Answering models we propose a leader-board and we have made the 1.0 version of our dataset freely available at https://illuin-tech.github.io/FQuAD-explorer/.

Results

Task                 Dataset   Metric   Value   Model
Question Answering   FQuAD     EM       82.1    CamemBERT-Large
Question Answering   FQuAD     F1       92.2    CamemBERT-Large
Question Answering   FQuAD     EM       79      XLM-RoBERTa-Large
Question Answering   FQuAD     F1       89.5    XLM-RoBERTa-Large
Question Answering   FQuAD     EM       78.4    CamemBERT-Base
Question Answering   FQuAD     F1       88.4    CamemBERT-Base
Question Answering   FQuAD     EM       75.3    XLM-RoBERTa-Base
Question Answering   FQuAD     F1       85.9    XLM-RoBERTa-Base
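The EM and F1 numbers above follow the usual extractive-QA evaluation style popularized by SQuAD: exact match compares the normalized predicted span to the normalized gold span, and F1 measures token overlap between the two. The sketch below is a minimal illustration of those two metrics, not the official FQuAD evaluation script; in particular, the article-stripping rule shown is the English one ("a/an/the"), whereas a French adaptation would handle French articles instead.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and (English) articles, collapse whitespace.
    This mirrors SQuAD-style answer normalization; a French variant would
    strip French articles ('le', 'la', 'les', ...) instead."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction, gold):
    """Token-overlap F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting "Eiffel Tower" against the gold answer "the Eiffel Tower in Paris" scores 0 on exact match but a partial F1 credit, since two of the four normalized gold tokens are matched.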

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)