A Large Self-Annotated Corpus for Sarcasm

Mikhail Khodak, Nikunj Saunshi, Kiran Vodrahalli

2017-04-19
LREC 2018 5
Sarcasm Detection

Abstract

We introduce the Self-Annotated Reddit Corpus (SARC), a large corpus for sarcasm research and for training and evaluating systems for sarcasm detection. The corpus has 1.3 million sarcastic statements -- 10 times more than any previous dataset -- and many times more instances of non-sarcastic statements, allowing for learning in both balanced and unbalanced label regimes. Each statement is furthermore self-annotated -- sarcasm is labeled by the author, not an independent annotator -- and provided with user, topic, and conversation context. We evaluate the corpus for accuracy, construct benchmarks for sarcasm detection, and evaluate baseline methods.

Results

Task	Dataset	Metric	Value	Model
Sarcasm Detection	SARC (pol-unbal)	Avg F1	27	Bag-of-Words
Sarcasm Detection	SARC (all-bal)	Accuracy	75.8	Bag-of-Bigrams
Sarcasm Detection	SARC (pol-bal)	Accuracy	76.5	Bag-of-Bigrams
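
The table reports accuracy on the balanced splits and F1 on the unbalanced one. The sketch below illustrates why that pairing is common. It is not the authors' evaluation code. The toy label lists and the always-non-sarcastic predictor are hypothetical, and the F1 here is the plain positive-class F1, which may differ from the "Avg F1" averaging used in the table.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive (sarcastic) class; 0.0 when there are no true positives."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def always_negative(labels):
    """Hypothetical trivial predictor: labels everything non-sarcastic (0)."""
    return [0] * len(labels)


# Hypothetical toy label sets (1 = sarcastic, 0 = non-sarcastic), not SARC data.
balanced = [1] * 50 + [0] * 50
unbalanced = [1] * 5 + [0] * 95

for name, labels in [("balanced", balanced), ("unbalanced", unbalanced)]:
    preds = always_negative(labels)
    print(name, accuracy(labels, preds), f1_score(labels, preds))
```

On the unbalanced toy set the trivial predictor reaches high accuracy while its F1 stays at zero, so accuracy alone would overstate performance when sarcastic statements are rare.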

Related Papers

CAF-I: A Collaborative Multi-Agent Framework for Enhanced Irony Detection with Large Language Models	2025-06-10
Leveraging Large Language Models for Sarcastic Speech Annotation in Sarcasm Detection	2025-06-01
IRONIC: Coherence-Aware Reasoning Chains for Multi-Modal Sarcasm Detection	2025-05-22
Nek Minit: Harnessing Pragmatic Metacognitive Prompting for Explainable Sarcasm Detection of Australian and Indian English	2025-05-21
Token-free Models for Sarcasm Detection	2025-05-02
Assessing how hyperparameters impact Large Language Models' sarcasm detection performance	2025-04-08
Commander-GPT: Fully Unleashing the Sarcasm Detection Capability of Multi-Modal Large Language Models	2025-03-24
Intermediate-Task Transfer Learning: Leveraging Sarcasm Detection for Stance Detection	2025-03-05