MILU

Multi-task Indic Language Understanding Benchmark

Texts, MIT license, introduced 2024-11-04

Overview

MILU (Multi-task Indic Language Understanding Benchmark) is a comprehensive evaluation dataset designed to assess the performance of Large Language Models (LLMs) across 11 languages: 10 Indic languages and English. It spans 8 domains and 42 subjects, reflecting both general and culturally specific knowledge from India.

Key Features

  • Languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, Telugu, and English
  • Domains: 8 diverse domains including Arts & Humanities, Social Sciences, STEM, and more
  • Subjects: 42 subjects covering a wide range of topics
  • Questions: ~85,000 multiple-choice questions
  • Cultural Relevance: Incorporates India-specific knowledge from regional and state-level examinations

Dataset Statistics

| Language | Total Questions | Translated Questions | Avg Words Per Question |
|----------|-----------------|----------------------|------------------------|
| Bengali | 7138 | 1601 | 15.72 |
| Gujarati | 5327 | 2755 | 16.69 |
| Hindi | 15450 | 115 | 20.63 |
| Kannada | 6734 | 1522 | 12.83 |
| Malayalam | 4670 | 1534 | 12.82 |
| Marathi | 7424 | 1235 | 18.8 |
| Odia | 5025 | 1452 | 15.63 |
| Punjabi | 4363 | 2341 | 19.9 |
| Tamil | 7059 | 1524 | 13.32 |
| Telugu | 7847 | 1298 | 16.13 |
| English | 14036 | - | 22.01 |
| Total | 85073 | 15377 | 16.77 (avg) |
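The share of translated questions varies widely between languages, which is easier to see as a percentage than as raw counts. The sketch below recomputes it from the statistics table above. It treats the "-" in the English row as zero translated questions, which is an assumption, and the names `STATS` and `translated_share` are introduced here for illustration only.

```python
# Per-language counts copied from the statistics table: (total, translated).
# Assumption: the "-" in the English row means no translated questions (0).
STATS = {
    "Bengali": (7138, 1601),
    "Gujarati": (5327, 2755),
    "Hindi": (15450, 115),
    "Kannada": (6734, 1522),
    "Malayalam": (4670, 1534),
    "Marathi": (7424, 1235),
    "Odia": (5025, 1452),
    "Punjabi": (4363, 2341),
    "Tamil": (7059, 1524),
    "Telugu": (7847, 1298),
    "English": (14036, 0),
}


def translated_share(total: int, translated: int) -> float:
    """Return the percentage of questions that were translated."""
    return 100.0 * translated / total


shares = {lang: translated_share(t, tr) for lang, (t, tr) in STATS.items()}
grand_total = sum(t for t, _ in STATS.values())
grand_translated = sum(tr for _, tr in STATS.values())

# Print languages from the most to the least translation-dependent.
for lang, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang:<10} {share:5.1f}%")
print(f"{'Overall':<10} {translated_share(grand_total, grand_translated):5.1f}%")
```

The two sums reproduce the table's Total row (85073 questions, 15377 translated), which is a quick check that the counts were copied correctly.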

Dataset Structure

Test Set

The test set is the MILU (Multi-task Indic Language Understanding) benchmark itself, which contains approximately 85,000 multiple-choice questions across 11 languages (10 Indic languages and English).

Validation Set

The dataset includes a separate validation set of 9,157 samples that can be used for few-shot examples during evaluation. This validation set was created by sampling from each of the 42 subject tags, which were then condensed into 8 broader domains. This approach ensures a balanced representation across subjects and domains, allowing for consistent few-shot prompting across different models and experiments.
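As a rough illustration of how validation samples might be turned into few-shot examples, the sketch below builds a prompt from the first `k` validation questions that share the test question's subject. Picking examples in a fixed order keeps the prompt identical across models. The `MCQ` record layout, its field names, and the prompt template are all hypothetical; they are not taken from the MILU data or the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class MCQ:
    """Hypothetical record layout; the real MILU columns may differ."""
    question: str
    options: List[str]
    answer: str   # option letter such as "B"; unused for the test item
    subject: str


def format_mcq(item: MCQ, include_answer: bool) -> str:
    """Render one question with lettered options and an answer slot."""
    lines = [item.question]
    for letter, option in zip("ABCD", item.options):
        lines.append(f"{letter}. {option}")
    lines.append(f"Answer: {item.answer}" if include_answer else "Answer:")
    return "\n".join(lines)


def build_few_shot_prompt(validation: Sequence[MCQ], test_item: MCQ, k: int = 5) -> str:
    """Prepend up to k same-subject validation examples to the test question."""
    # Fixed order (no random sampling) so every model sees the same shots.
    shots = [ex for ex in validation if ex.subject == test_item.subject][:k]
    blocks = [format_mcq(ex, include_answer=True) for ex in shots]
    blocks.append(format_mcq(test_item, include_answer=False))
    return "\n\n".join(blocks)


# Toy data standing in for real validation and test records.
validation = [
    MCQ("Toy question 1?", ["w", "x", "y", "z"], "A", "History"),
    MCQ("Toy question 2?", ["w", "x", "y", "z"], "B", "Physics"),
    MCQ("Toy question 3?", ["w", "x", "y", "z"], "C", "History"),
]
test_item = MCQ("Toy question 4?", ["w", "x", "y", "z"], "", "History")

prompt = build_few_shot_prompt(validation, test_item, k=2)
print(prompt)
```

If a subject has fewer than `k` validation examples, this sketch simply uses the ones that exist. A real setup might fall back to examples from the same domain instead.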

Subjects spanning MILU

| Domain | Subjects |
|--------|----------|
| Arts & Humanities | Architecture and Design, Arts and Culture, Education, History, Language Studies, Literature and Linguistics, Media and Communication, Music and Performing Arts, Religion and Spirituality |
| Business Studies | Business and Management, Economics, Finance and Investment |
| Engineering & Tech | Energy and Power, Engineering, Information Technology, Materials Science, Technology and Innovation, Transportation and Logistics |
| Environmental Sciences | Agriculture, Earth Sciences, Environmental Science, Geography |
| Health & Medicine | Food Science, Health and Medicine |
| Law & Governance | Defense and Security, Ethics and Human Rights, Law and Ethics, Politics and Governance |
| Math and Sciences | Astronomy and Astrophysics, Biology, Chemistry, Computer Science, Logical Reasoning, Mathematics, Physics |
| Social Sciences | Anthropology, International Relations, Psychology, Public Administration, Social Welfare and Development, Sociology, Sports and Recreation |

Usage

MILU is a gated dataset. Once your access request has been accepted, set your Hugging Face token in the environment:

```bash
export HF_TOKEN=YOUR_TOKEN_HERE
```

To load the MILU data for a single language:

```python
from datasets import load_dataset

language = 'Hindi'

# Use 'test' split for evaluation & 'validation' split for few-shot
split = 'test'

language_data = load_dataset("ai4bharat/MILU", data_dir=language, split=split, token=True)

print(language_data[0])
```
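To work with several languages at once, the same call can be wrapped in a small helper. The sketch below is one possible way to do this, not part of the MILU tooling: `hf_loader` and `load_splits` are hypothetical names, and only `'Hindi'` is confirmed as a `data_dir` value by the snippet above. The other directory names are assumed to match the language names in the Key Features list.

```python
from typing import Any, Callable, Dict, Sequence

# Language names from the Key Features list. Assumption: each one is also a
# valid data_dir value, as 'Hindi' is in the loading snippet.
MILU_LANGUAGES = [
    "Bengali", "Gujarati", "Hindi", "Kannada", "Malayalam", "Marathi",
    "Odia", "Punjabi", "Tamil", "Telugu", "English",
]


def hf_loader(language: str, split: str) -> Any:
    """Load one split of one language from the Hub (requires granted access)."""
    # Imported lazily so load_splits can be exercised without the library.
    from datasets import load_dataset
    return load_dataset("ai4bharat/MILU", data_dir=language, split=split, token=True)


def load_splits(
    languages: Sequence[str],
    loader: Callable[[str, str], Any] = hf_loader,
    splits: Sequence[str] = ("validation", "test"),
) -> Dict[str, Dict[str, Any]]:
    """Return {language: {split: dataset}} for the requested languages."""
    return {
        lang: {split: loader(lang, split) for split in splits}
        for lang in languages
    }
```

With access granted and `HF_TOKEN` set, `load_splits(["Hindi", "Tamil"])["Hindi"]["test"][0]` would return the first Hindi test record. Passing a different `loader` makes it possible to try the helper offline with stand-in data.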

Evaluation

We evaluated 45 different LLMs on MILU, including:

  • Closed proprietary models (e.g., GPT-4o, Gemini-1.5)
  • Open-source multilingual models
  • Language-specific fine-tuned models

Key findings:

  • GPT-4o achieved the highest average accuracy at 72%
  • Open multilingual models outperformed language-specific fine-tuned models
  • Models performed better in high-resource languages compared to low-resource ones
  • Performance was lower in culturally relevant areas (e.g., Arts & Humanities) compared to general fields like STEM
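The findings above are stated as accuracies, averaged over languages or compared between them. A minimal sketch of that kind of scoring follows, assuming each result is a `(language, predicted, gold)` triple of option letters. This layout and the function names are illustrative, and the exact aggregation used in the paper is not specified here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_language(results: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Compute accuracy per language from (language, predicted, gold) triples."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for language, predicted, gold in results:
        total[language] += 1
        correct[language] += int(predicted == gold)
    return {lang: correct[lang] / total[lang] for lang in total}


def macro_average(per_language: Dict[str, float]) -> float:
    """Unweighted mean over languages, so each language counts equally."""
    return sum(per_language.values()) / len(per_language)


# Toy predictions, not real model outputs.
results = [
    ("Hindi", "A", "A"), ("Hindi", "B", "C"),
    ("Tamil", "D", "D"), ("Tamil", "A", "A"),
    ("Odia", "B", "A"), ("Odia", "C", "C"), ("Odia", "D", "D"), ("Odia", "A", "B"),
]

per_language = accuracy_by_language(results)
average = macro_average(per_language)
for lang, acc in per_language.items():
    print(f"{lang}: {acc:.2f}")
print(f"Macro average: {average:.3f}")
```

A macro average weights every language equally regardless of how many questions it has. Because the per-language question counts in MILU differ by more than a factor of three, this can differ noticeably from accuracy computed over all questions pooled together.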

For detailed results and analysis, please refer to our paper.

Citation

If you use MILU in your research, please cite our paper:

```bibtex
@misc{verma2024milumultitaskindiclanguage,
      title={MILU: A Multi-task Indic Language Understanding Benchmark},
      author={Sshubam Verma and Mohammed Safi Ur Rahman Khan and Vishwajeet Kumar and Rudra Murthy and Jaydeep Sen},
      year={2024},
      eprint={2411.02538},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.02538},
}
```

## License

This dataset is released under the [MIT License](https://opensource.org/licenses/MIT).

## Contact

For any questions or feedback, please contact:
- Sshubam Verma (sshubamverma@ai4bharat.org)
- Mohammed Safi Ur Rahman Khan (safikhan@ai4bharat.org)
- Rudra Murthy (rmurthyv@in.ibm.com)
- Vishwajeet Kumar (vishk024@in.ibm.com)

## Links

- [GitHub Repository](https://github.com/AI4Bharat/MILU)
- [Paper](https://arxiv.org/abs/2411.02538)
- [Hugging Face Dataset](https://huggingface.co/datasets/ai4bharat/MILU)