
HiNER: A Large Hindi Named Entity Recognition Dataset

Rudra Murthy, Pallab Bhattacharjee, Rahul Sharnagat, Jyotsana Khatri, Diptesh Kanojia, Pushpak Bhattacharyya

Published: 2022-04-28 | Venue: LREC 2022 | Task: Named Entity Recognition (NER)

Abstract

Named Entity Recognition (NER) is a foundational NLP task that aims to provide class labels like Person, Location, Organisation, Time, and Number to words in free text. Named Entities can also be multi-word expressions where the additional I-O-B annotation information helps label them during the NER annotation process. While English and European languages have considerable annotated data for the NER task, Indian languages lack on that front -- both in terms of quantity and following annotation standards. This paper releases a significantly sized standard-abiding Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens, annotated with 11 tags. We discuss the dataset statistics in all their essential detail and provide an in-depth analysis of the NER tag-set used with our data. The statistics of tag-set in our dataset show a healthy per-tag distribution, especially for prominent classes like Person, Location and Organisation. Since the proof of resource-effectiveness is in building models with the resource and testing the model on benchmark data and against the leader-board entries in shared tasks, we do the same with the aforesaid data. We use different language models to perform the sequence labelling task for NER and show the efficacy of our data by performing a comparative evaluation with models trained on another dataset available for the Hindi NER task. Our dataset helps achieve a weighted F1 score of 88.78 with all the tags and 92.22 when we collapse the tag-set, as discussed in the paper. To the best of our knowledge, no available dataset meets the standards of volume (amount) and variability (diversity), as far as Hindi NER is concerned. We fill this gap through this work, which we hope will significantly help NLP for Hindi. We release this dataset with our code and models at https://github.com/cfiltnlp/HiNER
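The I-O-B annotation mentioned in the abstract can be illustrated with a minimal sketch that groups B-/I- tagged tokens into multi-word entity spans. The function name `bio_to_spans`, the English example sentence, and the tag names `PERSON` and `LOCATION` are assumptions made for illustration; they are not taken from the HiNER code or data.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group B-/I- tagged tokens into (entity_type, text) spans.

    Illustrative helper, not part of the HiNER release.
    """
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []

    def close_span() -> None:
        # Emit the open span, if any, and reset the state.
        nonlocal current_type, current_tokens
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
        current_type = None
        current_tokens = []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity.
            close_span()
            current_type = tag[2:]
            current_tokens = [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span
            # (such a stray I- token is dropped in this sketch).
            close_span()
    close_span()
    return spans


# Hypothetical example sentence and tags.
tokens = ["Asha", "Verma", "lives", "in", "New", "Delhi"]
tags = ["B-PERSON", "I-PERSON", "O", "O", "B-LOCATION", "I-LOCATION"]
print(bio_to_spans(tokens, tags))
# [('PERSON', 'Asha Verma'), ('LOCATION', 'New Delhi')]
```

The B- prefix is what lets two adjacent entities of the same type be separated; without it, consecutive Person tokens would be indistinguishable from one long name.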

Results

Task | Dataset | Metric | Value | Model
Named Entity Recognition (NER) | HiNER-collapsed | F1-score (Weighted) | 92.22 | cfilt/HiNER-collapsed-xlm-roberta-large
Named Entity Recognition (NER) | HiNER-collapsed | F1-score (Weighted) | 92.11 | cfilt/HiNER-collapsed-muril-base-cased
Named Entity Recognition (NER) | HiNER-original | F1-score (Weighted) | 88.78 | cfilt/HiNER-original-xlm-roberta-large
Named Entity Recognition (NER) | HiNER-original | F1-score (Weighted) | 88.27 | cfilt/HiNER-original-muril-base-cased
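
The weighted F1 metric reported in the results is commonly defined as the average of per-class F1 scores, with each class weighted by its number of gold instances. The sketch below follows that common definition on toy labels. Whether the paper scores at the token level or the entity level is not stated on this page, so the flat label lists, the function name `weighted_f1`, and the labels `PER`, `LOC`, and `O` are assumptions made for illustration.

```python
from collections import Counter
from typing import List


def weighted_f1(gold: List[str], pred: List[str]) -> float:
    """Support-weighted average of per-class F1 (illustrative sketch)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0

    support = Counter(gold)  # number of gold instances per class
    total = len(gold)
    score = 0.0
    for label, count in support.items():
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        # Each class contributes in proportion to its gold support.
        score += f1 * count / total
    return score


# Toy labels, not drawn from HiNER.
gold = ["PER", "PER", "LOC", "O"]
pred = ["PER", "LOC", "LOC", "O"]
print(round(weighted_f1(gold, pred), 4))
```

In this toy case PER and LOC each score an F1 of 2/3 and O scores 1, so the support-weighted result is approximately 0.75. Because frequent classes dominate the average, a weighted score can stay high even when rare tags are predicted poorly.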

Related Papers

CogniSQL-R1-Zero: Lightweight Reinforced Reasoning for Efficient SQL Generation (2025-07-08)
Flippi: End To End GenAI Assistant for E-Commerce (2025-07-08)
Selecting and Merging: Towards Adaptable and Scalable Named Entity Recognition with Large Language Models (2025-06-28)
LLMs in Coding and their Impact on the Commercial Software Engineering Landscape (2025-06-19)