PT Hate Speech
Introduced 2019-08-01
PT Hate Speech is a dataset of Portuguese-language tweets annotated for hate speech. Its key details are:
Composition:
- The dataset consists of 5,668 tweets written in Portuguese.
- The tweets were annotated under two different schemes, each applied by annotators with a different level of expertise.
Annotation Schemes:
- Non-experts initially annotated the tweets using binary labels: either 'hate' or 'no-hate'.
- Expert annotators then classified the tweets using a fine-grained, hierarchical multi-label scheme with 81 hate speech categories in total.
Hierarchical Annotation Scheme:
- The hierarchical approach allows for identifying different types of hate speech and their intersections.
- Inter-annotator agreement varied from category to category, reflecting that some types of hate speech are subtle and that detecting them depends on personal perception.
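One way to picture a hierarchical multi-label scheme is as a graph in which each category points to its parent categories, so that a fine-grained label implies all of its ancestors and a category with two parents represents an intersection. The sketch below is a minimal illustration under stated assumptions: the category names and the `PARENTS`, `expand`, and `cohen_kappa` helpers are hypothetical and do not reproduce the dataset's actual 81-category taxonomy, and Cohen's kappa is used only as one common agreement measure, not necessarily the one the authors computed.

```python
# Hypothetical toy hierarchy (child -> parents).
# This is NOT the dataset's real 81-category taxonomy.
PARENTS = {
    "hate_speech": [],
    "sexism": ["hate_speech"],
    "body": ["hate_speech"],
    "women": ["sexism"],
    "fat_people": ["body"],
    # A category with two parents models an intersection of hate speech types.
    "fat_women": ["women", "fat_people"],
}


def expand(labels):
    """Return the given labels plus all of their ancestors in the hierarchy."""
    seen = set()
    stack = list(labels)
    while stack:
        label = stack.pop()
        if label not in seen:
            seen.add(label)
            stack.extend(PARENTS[label])
    return seen


def cohen_kappa(a, b):
    """Cohen's kappa for two equally long lists of binary (0/1) judgments."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    # Agreement expected by chance if the two annotators labeled independently.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return 1.0  # both annotators used the same single label throughout
    return (observed - expected) / (1 - expected)


# A tweet tagged with one fine-grained label carries every ancestor as well.
print(sorted(expand({"fat_women"})))

# Agreement for a single category: one 0/1 judgment per tweet and annotator.
annotator_1 = [1, 1, 0, 0, 1, 0]
annotator_2 = [1, 0, 0, 0, 1, 0]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

Computing the agreement separately for each category, as in the last lines, is what makes category-to-category differences visible.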
Usefulness and Baseline Experiment:
- To demonstrate the dataset's usefulness, the authors ran a baseline classification experiment on the binary-labeled data using pre-trained word embeddings and an LSTM model.
- The authors describe the outcome of this baseline as state-of-the-art.
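A baseline of this kind usually starts by turning each tweet into a sequence of integer token ids, padded to a common length, which an embedding layer and an LSTM can then consume. The sketch below shows only that input-preparation step. The regex tokenizer, the reserved `PAD` and `UNK` indices, the example sentences, and the helper names are all assumptions for illustration. It does not reproduce the authors' preprocessing or model.

```python
import re

PAD, UNK = 0, 1  # reserved ids: padding and out-of-vocabulary tokens


def tokenize(text):
    """Lowercase and split on runs of word characters (a simplification)."""
    return re.findall(r"\w+", text.lower())


def build_vocab(texts, min_count=1):
    """Map each sufficiently frequent token to an integer id starting at 2."""
    counts = {}
    for text in texts:
        for tok in tokenize(text):
            counts[tok] = counts.get(tok, 0) + 1
    vocab = {}
    for tok in sorted(counts):  # sorted for a deterministic id assignment
        if counts[tok] >= min_count:
            vocab[tok] = len(vocab) + 2
    return vocab


def encode(text, vocab, max_len):
    """Convert text to exactly max_len ids, truncating or padding as needed."""
    ids = [vocab.get(tok, UNK) for tok in tokenize(text)][:max_len]
    return ids + [PAD] * (max_len - len(ids))


# Made-up example sentences, not taken from the dataset.
train_texts = ["Bom dia a todos", "bom jogo"]
vocab = build_vocab(train_texts)
print(encode("Bom dia", vocab, max_len=4))
print(encode("boa noite a todos amigos", vocab, max_len=4))
```

In a full pipeline, the ids would index rows of a pre-trained embedding matrix, with tokens missing from the embedding vocabulary mapped to `UNK`.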
Source: A Hierarchically-Labeled Portuguese Hate Speech Dataset. https://aclanthology.org/W19-3510/ (PDF: https://aclanthology.org/W19-3510.pdf; Papers with Code: https://paperswithcode.com/paper/a-hierarchically-labeled-portuguese-hate)