CoNLL 2003

Texts · Unknown · Introduced 2003-06-12

CoNLL-2003 is a named entity recognition dataset released as part of the CoNLL-2003 shared task on language-independent named entity recognition. The data consists of eight files covering two languages: English and German. For each language there is a training file, a development file, a test file, and a large file of unannotated data.
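The description above does not spell out the file layout. The following is a minimal sketch, assuming the commonly distributed column format: one token per line with whitespace-separated fields where the word comes first and the named entity tag comes last, blank lines between sentences, and `-DOCSTART-` lines marking article boundaries. The function name `read_conll` and the sample lines are hypothetical and only illustrate the layout; they are not copied from the data files.

```python
from typing import Iterable, Iterator, List, Tuple

Sentence = List[Tuple[str, str]]  # (word, NER tag) pairs


def read_conll(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield sentences as lists of (word, ner_tag) pairs.

    Assumes the word is the first field and the NER tag is the last,
    so the same reader works regardless of how many columns sit between.
    """
    sentence: Sentence = []
    for raw in lines:
        line = raw.strip()
        # Blank lines end a sentence; -DOCSTART- lines mark a new article.
        if not line or line.startswith("-DOCSTART-"):
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split()
        sentence.append((fields[0], fields[-1]))
    if sentence:  # flush the last sentence if the input lacks a trailing blank line
        yield sentence


# Illustrative input only (assumed layout: word, POS tag, chunk tag, NER tag).
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
. . O O
""".splitlines()

sentences = list(read_conll(sample))
for word, tag in sentences[0]:
    print(word, tag)
```

Reading only the first and last fields keeps the sketch independent of the number of intermediate annotation columns, which may differ between the English and German files.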

The English data was taken from the Reuters Corpus, which consists of Reuters news stories published between August 1996 and August 1997. The training and development sets use ten days' worth of data from the files representing the end of August 1996. The test set texts are from December 1996. The preprocessed raw data covers the month of September 1996.

The text for the German data was taken from the ECI Multilingual Text Corpus, which consists of texts in many languages. The portion used for this task was extracted from the German newspaper Frankfurter Rundschau. The training, development and test sets were all taken from articles written in one week at the end of August 1992. The raw data was taken from the months of September to December 1992.

| English data    | Articles | Sentences | Tokens  | LOC   | MISC  | ORG   | PER   |
|-----------------|----------|-----------|---------|-------|-------|-------|-------|
| Training set    | 946      | 14,987    | 203,621 | 7,140 | 3,438 | 6,321 | 6,600 |
| Development set | 216      | 3,466     | 51,362  | 1,837 | 922   | 1,341 | 1,842 |
| Test set        | 231      | 3,684     | 46,435  | 1,668 | 702   | 1,661 | 1,617 |

Number of articles, sentences, tokens and entities (locations, miscellaneous, organizations, and persons) in English data files.

| German data     | Articles | Sentences | Tokens  | LOC   | MISC  | ORG   | PER   |
|-----------------|----------|-----------|---------|-------|-------|-------|-------|
| Training set    | 553      | 12,705    | 206,931 | 4,363 | 2,288 | 2,427 | 2,773 |
| Development set | 201      | 3,068     | 51,444  | 1,181 | 1,010 | 1,241 | 1,401 |
| Test set        | 155      | 3,160     | 51,943  | 1,035 | 670   | 773   | 1,195 |

Number of articles, sentences, tokens and entities (locations, miscellaneous, organizations, and persons) in German data files.
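The counts in the two tables can be turned into a few derived figures, such as total entities per split and entity density. The sketch below copies the numbers from the tables; the names `STATS`, `ENTITY_TYPES` and `summarize` are hypothetical, and the derived ratios are simple arithmetic on the published counts rather than figures reported by the task organizers.

```python
# Counts copied from the English and German tables:
# (articles, sentences, tokens, LOC, MISC, ORG, PER)
FIELDS = ("articles", "sentences", "tokens", "LOC", "MISC", "ORG", "PER")
ENTITY_TYPES = ("LOC", "MISC", "ORG", "PER")

RAW = {
    ("English", "train"): (946, 14987, 203621, 7140, 3438, 6321, 6600),
    ("English", "dev"): (216, 3466, 51362, 1837, 922, 1341, 1842),
    ("English", "test"): (231, 3684, 46435, 1668, 702, 1661, 1617),
    ("German", "train"): (553, 12705, 206931, 4363, 2288, 2427, 2773),
    ("German", "dev"): (201, 3068, 51444, 1181, 1010, 1241, 1401),
    ("German", "test"): (155, 3160, 51943, 1035, 670, 773, 1195),
}
STATS = {key: dict(zip(FIELDS, values)) for key, values in RAW.items()}


def summarize(row: dict) -> dict:
    """Derive entity totals and simple density ratios from one table row."""
    total = sum(row[t] for t in ENTITY_TYPES)
    return {
        "entities": total,
        "tokens_per_sentence": row["tokens"] / row["sentences"],
        "entities_per_100_tokens": 100 * total / row["tokens"],
    }


for (language, split), row in STATS.items():
    s = summarize(row)
    print(
        f"{language:8s}{split:6s}"
        f"entities={s['entities']:6d}  "
        f"tokens/sentence={s['tokens_per_sentence']:.1f}  "
        f"entities/100 tokens={s['entities_per_100_tokens']:.1f}"
    )
```

Comparing the two training rows this way shows that the German training set has a similar token count to the English one but roughly half as many annotated entities.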

Benchmarks

- Chunking: AUC, Accuracy, F1, Precision, Recall
- Cross-Lingual: Spanish, German, Dutch
- Cross-Lingual Transfer: Spanish, German, Dutch
- Event Extraction: AUC, Accuracy, F1, Precision, Recall
- Image Enhancement: F1 score
- Information Extraction: AUC, Accuracy, F1, Precision, Recall
- Named Entity Recognition (NER): AUC, Accuracy, F1, Precision, Recall
- Open Information Extraction: AUC, Accuracy, F1, Precision, Recall
- Shallow Syntax: AUC, Accuracy, F1, Precision, Recall
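The benchmark listing names precision, recall and F1 but does not say how they are computed. As a minimal sketch, assuming exact-match entity-level scoring (a predicted entity counts as correct only if both its span and its type match the gold annotation), the three metrics can be computed from sets of spans. The names `Span` and `prf1` and the example spans are hypothetical.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start token, end token exclusive, entity type)


def prf1(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Entity-level precision, recall and F1 under exact span-and-type match."""
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up spans: two predictions match exactly, one has the wrong type,
# and one is spurious.
gold = {(0, 1, "ORG"), (2, 3, "MISC"), (5, 7, "PER")}
pred = {(0, 1, "ORG"), (2, 3, "LOC"), (5, 7, "PER"), (8, 9, "LOC")}

precision, recall, f1 = prf1(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Working on span sets rather than per-token tags keeps the sketch independent of the tagging scheme used in a particular distribution of the files.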

Related Benchmarks

- CONLL 2003 Dutch/Information Extraction/F1 score
- CONLL 2003 German/Information Extraction/F1 score
- CoNLL 2003 (English)/Chunking/F1
- CoNLL 2003 (English)/Named Entity Recognition (NER)/F1
- CoNLL 2003 (English)/Shallow Syntax/F1
- CoNLL 2003 (German)/Chunking/F1
- CoNLL 2003 (German)/Named Entity Recognition (NER)/F1
- CoNLL 2003 (German)/Shallow Syntax/F1
- CoNLL 2003 (German) Revised/Named Entity Recognition (NER)/F1
- Conll 2003 Spanish/Information Extraction/F1 score
