We introduce the Self-Annotated Reddit Corpus (SARC), a large corpus for sarcasm research and for training and evaluating sarcasm detection systems. The corpus contains 1.3 million sarcastic statements, 10 times more than any previous dataset, and many times more non-sarcastic statements, allowing for learning in both balanced and unbalanced label regimes. Each statement is self-annotated (sarcasm is labeled by the author, not by an independent annotator) and comes with user, topic, and conversation context. We evaluate the corpus for accuracy, construct benchmarks for sarcasm detection, and evaluate baseline methods.

| Task | Dataset | Metric | Value (%) | Model |
|---|---|---|---|---|
| Sarcasm Detection | SARC (pol-unbal) | Avg F1 | 27 | Bag-of-Words |
| Sarcasm Detection | SARC (all-bal) | Accuracy | 75.8 | Bag-of-Bigrams |
| Sarcasm Detection | SARC (pol-bal) | Accuracy | 76.5 | Bag-of-Bigrams |
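
The table reports accuracy on the balanced subsets and F1 on the unbalanced one, with bag-of-words and bag-of-bigrams baselines. The following is a minimal sketch of those ingredients, not the authors' implementation: the whitespace tokenizer, the function names `bag_of_bigrams`, `accuracy`, and `f1_score`, and the choice of label 1 for "sarcastic" are all assumptions made for illustration. How the table's "Avg F1" is averaged is not specified here, so the sketch computes plain F1 for the positive class.

```python
from collections import Counter


def bag_of_bigrams(text):
    """Count adjacent token pairs (assumed tokenizer: lowercase + whitespace split)."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class (assumed here: 1 = sarcastic)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy unbalanced labels: one sarcastic statement among ten.
    y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    always_negative = [0] * 10
    print(accuracy(y_true, always_negative))  # high despite finding no sarcasm
    print(f1_score(y_true, always_negative))  # zero: no sarcasm was detected
```

The toy example suggests why accuracy suits the balanced setting while F1 is the more informative metric in the unbalanced one: a classifier that never predicts sarcasm still scores high accuracy when sarcastic statements are rare.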