HateBR

Introduced 2021-03-27

HateBR is a corpus for studying offensive language and hate speech detection in Brazilian Portuguese. Its key details are:

  1. Collection and Annotation:

    • The HateBR dataset was collected from Brazilian Instagram comments related to politicians.
    • Specialists manually annotated each comment.
    • The dataset consists of 7,000 documents.
  2. Annotation Layers:

    • The HateBR dataset includes annotations at three different levels:
      • Binary Classification: Comments are labeled as either offensive or non-offensive.
      • Offensiveness Levels: Comments are categorized as highly, moderately, or slightly offensive.
      • Hate Speech Targets: Comments are further classified into nine specific hate speech categories:
        • Xenophobia
        • Racism
        • Homophobia
        • Sexism
        • Religious intolerance
        • Partyism
        • Apology for the dictatorship
        • Antisemitism
        • Fatphobia
  3. Inter-Annotator Agreement:

    • Each comment was annotated by three different annotators to ensure reliability.
    • The dataset achieved high inter-annotator agreement.
  4. Baseline Performance:

    • Baseline experiments with machine learning models reached an F1-score of 85%, outperforming existing baselines on Portuguese-language hate speech datasets.
  5. Corpus and Models:

    • The GitHub repository (franciellevargas/HateBR) provides the corpus of annotated comments.
    • The same repository contains the best models presented in the associated research paper.
  6. File Format:

    • The HateBr.csv file provides four columns:
      • 1st column: Instagram comments.
      • 2nd column: Offensive language classification (offensive vs. non-offensive).
      • 3rd column: Offensiveness level (highly, moderately, slightly offensive).
      • 4th column: Hate speech classification (nine different targets).
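
The four-column layout described above can be read with a short script. The following is a minimal sketch, not the authors' loading code: the column names in `COLUMNS`, the label strings, and the sample rows are all hypothetical placeholders, since the source describes the columns only by position and does not state how labels are encoded or whether the file has a header row.

```python
import csv
import io
from collections import Counter

# Hypothetical names for the four columns; the source lists them by position only.
COLUMNS = ["comment", "offensive", "offensiveness_level", "hate_speech_target"]


def read_rows(handle, has_header=False):
    """Read four-column CSV rows and map them to the hypothetical names above."""
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # discard the header row if the file has one
    rows = []
    for record in reader:
        if len(record) != len(COLUMNS):
            continue  # skip malformed rows rather than guess their layout
        rows.append(dict(zip(COLUMNS, record)))
    return rows


def label_counts(rows, column):
    """Count how often each label value occurs in one annotation layer."""
    return Counter(row[column] for row in rows)


# Placeholder rows standing in for HateBr.csv. The comment text and the
# label encodings are invented for illustration, not taken from the dataset.
SAMPLE = io.StringIO(
    "placeholder comment A,offensive,highly,sexism\n"
    "placeholder comment B,non-offensive,,\n"
    "placeholder comment C,offensive,slightly,partyism\n"
)

rows = read_rows(SAMPLE)
binary_counts = label_counts(rows, "offensive")
print(binary_counts)
```

To run this against the real file, replace `SAMPLE` with `open("HateBr.csv", newline="", encoding="utf-8")` and check the repository for the actual header row and label encoding before relying on the counts.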

Source: Conversation with Bing, 3/16/2024
  (1) HateBR - Offensive Language and Hate Speech Dataset in ... - GitHub. https://github.com/franciellevargas/HateBR
  (2) ruanchaves/hatebr · Datasets at Hugging Face. https://huggingface.co/datasets/ruanchaves/hatebr
  (3) Papers with Code - HateBR: Large expert annotated corpus of Brazilian .... https://paperswithcode.com/paper/hatebr-large-expert-annotated-corpus-of