Multi Lingual Bug Reports

Graphs | Images | Texts | CC BY 4.0 | Introduced 2025-02-20

Dataset Description

The dataset used in this study comprises bug reports extracted from the Visual Studio Code GitHub repository, restricted to those labeled with the english-please tag. This label indicates that the original submission was written in a language other than English, so it serves as a clear signal for multilingual content. The reports span a period of roughly five years (March 2019 to June 2024) and cover a range of bug types, user environments, and technical contexts.
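
The selection criteria above can be sketched as a small filter over issue records. This is a minimal illustration, not the authors' extraction script, which the description does not give. It assumes the issues have already been downloaded as dictionaries in the shape returned by the GitHub REST API (a `labels` list of objects with a `name` field, an ISO 8601 `created_at` timestamp, and a `pull_request` key on entries that are pull requests). The exact day boundaries of the window are also an assumption, since the description names only months.

```python
from datetime import datetime, timezone

LABEL = "english-please"

# Assumed window boundaries: the description gives months only, not days.
START = datetime(2019, 3, 1, tzinfo=timezone.utc)   # inclusive
END = datetime(2024, 7, 1, tzinfo=timezone.utc)     # exclusive


def parse_created_at(value: str) -> datetime:
    """Parse a GitHub-style timestamp such as '2019-03-15T12:00:00Z'."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def select_multilingual(issues):
    """Keep issues that carry the label and fall inside the date window."""
    selected = []
    for issue in issues:
        # The issues endpoint also lists pull requests; skip those.
        if "pull_request" in issue:
            continue
        names = {label["name"] for label in issue.get("labels", [])}
        if LABEL not in names:
            continue
        created = parse_created_at(issue["created_at"])
        if START <= created < END:
            selected.append(issue)
    return selected
```

Keeping the filter separate from any download step makes it easy to check against a handful of hand-written records before running it on the full issue list.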

Characteristics

The dataset contains 1,381 multilingual bug reports, each consisting of:

  • The original bug report written in a non-English language.
  • A translated version in English.
  • Metadata such as issue number, creation date, labels, and status.
  • Categorization into functional, UI, and performance-related issues based on the content.
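
One way to picture a single record is as a small typed structure. The sketch below is illustrative only: the field names (`original_text`, `english_text`, `category`, and so on) and the category strings are hypothetical, since the description lists the kinds of information stored but not the actual column names or file format.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical spellings for the three categories named in the description.
CATEGORIES = ("functional", "ui", "performance")


@dataclass
class BugReport:
    """One multilingual bug report; all field names are assumptions."""
    issue_number: int
    created_at: str        # creation date, e.g. an ISO 8601 string
    status: str            # e.g. "open" or "closed"
    original_text: str     # report as submitted, in a non-English language
    english_text: str      # English translation of the report
    category: str          # one of CATEGORIES
    labels: List[str] = field(default_factory=list)

    def __post_init__(self):
        # Reject categories outside the three listed in the description.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
```

A record could then be built as `BugReport(issue_number=1, created_at="2020-01-01", status="closed", original_text="...", english_text="...", category="ui", labels=["english-please"])`, where every value shown is a placeholder rather than real data.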

Motivation & Summary

This dataset is motivated by the need to improve multilingual bug tracking and translation evaluation. Given the increasing globalization of software development, developers and QA teams frequently encounter bug reports in languages they do not understand. By providing a structured corpus of translated bug reports, this dataset facilitates:

  • Comparative translation evaluation (e.g., ChatGPT vs AWS Translate vs DeepL).
  • Linguistic analysis of technical bug reporting across different languages.
  • Insights into common software issues encountered by diverse users.
  • Improvements to multilingual issue tracking through automated labeling and categorization.

Potential Use Cases

This dataset can be beneficial for various research and development applications, including:

  • Machine Translation Benchmarking: Evaluating the performance of translation models in a technical domain.
  • Natural Language Processing (NLP) Tasks: Training classifiers to categorize bug reports based on their content.
  • Software Engineering Research: Understanding trends in bug reporting, issue resolution, and localization challenges.
  • Automated Bug Triage: Developing AI-driven solutions for assigning and prioritizing bug reports in multilingual repositories.
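
As a rough illustration of the machine translation benchmarking use case, the sketch below ranks candidate translations of one report against a reference English translation. It uses a simple token-overlap F1 score as a stand-in for established metrics such as BLEU; the description does not say which metric is intended. It also assumes that a trusted reference translation is available for each report. The system names in the usage note are placeholders.

```python
from collections import Counter


def token_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between a candidate and a reference translation."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def rank_systems(reference: str, candidates: dict) -> list:
    """Return (system name, score) pairs sorted from best to worst."""
    scores = {name: token_f1(text, reference) for name, text in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Calling `rank_systems(reference, {"system_a": text_a, "system_b": text_b})` returns the systems ordered by score. Whitespace tokenization is crude for technical text containing code identifiers and file paths, so any real comparison would likely need a tokenizer and metric chosen for that domain.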