# MILU: Multi-task Indic Language Understanding Benchmark

## Overview
MILU (Multi-task Indic Language Understanding Benchmark) is a comprehensive evaluation dataset designed to assess the performance of Large Language Models (LLMs) across 11 Indic languages. It spans 8 domains and 42 subjects, reflecting both general and culturally specific knowledge from India.
## Key Features
- Languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, Telugu, and English
- Domains: 8 diverse domains including Arts & Humanities, Social Sciences, STEM, and more
- Subjects: 42 subjects covering a wide range of topics
- Questions: ~85,000 multiple-choice questions
- Cultural Relevance: Incorporates India-specific knowledge from regional and state-level examinations
## Dataset Statistics

| Language  | Total Questions | Translated Questions | Avg. Words per Question |
|-----------|-----------------|----------------------|-------------------------|
| Bengali   | 7138            | 1601                 | 15.72                   |
| Gujarati  | 5327            | 2755                 | 16.69                   |
| Hindi     | 15450           | 115                  | 20.63                   |
| Kannada   | 6734            | 1522                 | 12.83                   |
| Malayalam | 4670            | 1534                 | 12.82                   |
| Marathi   | 7424            | 1235                 | 18.80                   |
| Odia      | 5025            | 1452                 | 15.63                   |
| Punjabi   | 4363            | 2341                 | 19.90                   |
| Tamil     | 7059            | 1524                 | 13.32                   |
| Telugu    | 7847            | 1298                 | 16.13                   |
| English   | 14036           | -                    | 22.01                   |
| **Total** | 85073           | 15377                | 16.77 (avg.)            |
## Dataset Structure

### Test Set

The test set is the MILU benchmark itself: approximately 85,000 multiple-choice questions across the 11 languages listed above.
### Validation Set
The dataset includes a separate validation set of 9,157 samples that can be used for few-shot examples during evaluation. This validation set was created by sampling from each of the 42 subject tags, which were then condensed into 8 broader domains. This approach ensures a balanced representation across subjects and domains, allowing for consistent few-shot prompting across different models and experiments.
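For instance, a balanced exemplar set can be drawn by grouping validation examples per domain. A minimal sketch (the `domain` field name is an assumption; inspect a record to confirm the actual schema):

```python
import random
from collections import defaultdict

from datasets import load_dataset

# Load the validation split for one language (gated dataset: requires HF_TOKEN).
val = load_dataset("ai4bharat/MILU", data_dir="Hindi", split="validation", token=True)

# Bucket validation examples by domain so each prompt can draw one exemplar
# per domain. The 'domain' field name is assumed; check val[0] for the
# real column names.
by_domain = defaultdict(list)
for example in val:
    by_domain[example["domain"]].append(example)

random.seed(0)
few_shot = [random.choice(examples) for examples in by_domain.values()]
print(f"Drew {len(few_shot)} exemplars, one per domain")
```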
## Subjects spanning MILU

| Domain | Subjects |
|--------|----------|
| Arts & Humanities | Architecture and Design, Arts and Culture, Education, History, Language Studies, Literature and Linguistics, Media and Communication, Music and Performing Arts, Religion and Spirituality |
| Business Studies | Business and Management, Economics, Finance and Investment |
| Engineering & Tech | Energy and Power, Engineering, Information Technology, Materials Science, Technology and Innovation, Transportation and Logistics |
| Environmental Sciences | Agriculture, Earth Sciences, Environmental Science, Geography |
| Health & Medicine | Food Science, Health and Medicine |
| Law & Governance | Defense and Security, Ethics and Human Rights, Law and Ethics, Politics and Governance |
| Math and Sciences | Astronomy and Astrophysics, Biology, Chemistry, Computer Science, Logical Reasoning, Mathematics, Physics |
| Social Sciences | Anthropology, International Relations, Psychology, Public Administration, Social Welfare and Development, Sociology, Sports and Recreation |
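When aggregating per-subject scores up to the domain level, a plain lookup table built from the rows above is enough. A minimal sketch, abbreviated to two domains (extend with the remaining rows as needed):

```python
# Domain -> subjects mapping, following the table above.
# Abbreviated to two domains for brevity; fill in the remaining rows.
DOMAIN_TO_SUBJECTS = {
    "Health & Medicine": ["Food Science", "Health and Medicine"],
    "Business Studies": ["Business and Management", "Economics", "Finance and Investment"],
}

# Invert it so each subject tag resolves to its broader domain.
SUBJECT_TO_DOMAIN = {
    subject: domain
    for domain, subjects in DOMAIN_TO_SUBJECTS.items()
    for subject in subjects
}

assert SUBJECT_TO_DOMAIN["Economics"] == "Business Studies"
```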
## Usage

Since this is a gated dataset, once your access request has been approved, set your Hugging Face token:

```bash
export HF_TOKEN=YOUR_TOKEN_HERE
```

To load the MILU dataset for a language:
```python
from datasets import load_dataset

language = "Hindi"
# Use the 'test' split for evaluation and the 'validation' split for few-shot examples
split = "test"

language_data = load_dataset("ai4bharat/MILU", data_dir=language, split=split, token=True)
print(language_data[0])
```
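The same call works for every language directory. As a quick sketch, to load the full benchmark (the sizes should match the statistics table above):

```python
from datasets import load_dataset

LANGUAGES = [
    "Bengali", "Gujarati", "Hindi", "Kannada", "Malayalam", "Marathi",
    "Odia", "Punjabi", "Tamil", "Telugu", "English",
]

# Load the test split for each language and report its size.
for lang in LANGUAGES:
    data = load_dataset("ai4bharat/MILU", data_dir=lang, split="test", token=True)
    print(f"{lang}: {len(data)} questions")
```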
## Evaluation
We evaluated 45 different LLMs on MILU, including:
- Closed proprietary models (e.g., GPT-4o, Gemini-1.5)
- Open-source multilingual models
- Language-specific fine-tuned models
Key findings:
- GPT-4o achieved the highest average accuracy at 72%
- Open multilingual models outperformed language-specific fine-tuned models
- Models performed better in high-resource languages compared to low-resource ones
- Performance was lower in culturally relevant areas (e.g., Arts & Humanities) compared to general fields like STEM
For detailed results and analysis, please refer to our paper.
## Citation
If you use MILU in your research, please cite our paper:
```bibtex
@misc{verma2024milumultitaskindiclanguage,
      title={MILU: A Multi-task Indic Language Understanding Benchmark},
      author={Sshubam Verma and Mohammed Safi Ur Rahman Khan and Vishwajeet Kumar and Rudra Murthy and Jaydeep Sen},
      year={2024},
      eprint={2411.02538},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.02538},
}
```
## License
This dataset is released under the [MIT License](https://opensource.org/licenses/MIT).
## Contact
For any questions or feedback, please contact:
- Sshubam Verma (sshubamverma@ai4bharat.org)
- Mohammed Safi Ur Rahman Khan (safikhan@ai4bharat.org)
- Rudra Murthy (rmurthyv@in.ibm.com)
- Vishwajeet Kumar (vishk024@in.ibm.com)
## Links
- [GitHub Repository](https://github.com/AI4Bharat/MILU)
- [Paper](https://arxiv.org/abs/2411.02538)
- [Hugging Face Dataset](https://huggingface.co/datasets/ai4bharat/MILU)