Papers With Code 2 | ML Benchmarks, SotA Results & Code

We introduce misinfo-general, a benchmark dataset for evaluating misinformation models’ ability to perform out-of-distribution generalisation. Misinformation changes rapidly, much quicker than moderators can annotate at scale, resulting in a shift between the training and inference data distributions. As a result, misinformation models need to be able to perform out-of-distribution generalisation, an understudied problem in existing datasets.

Constructed on top of the various NELA corpora (2017, 2018, 2019, 2020, 2021, 2022), misinfo-general is a large, diverse dataset consisting of news articles from reliable and unreliable publishers. Unlike NELA, we apply several rounds of deduplication and filtering to ensure all articles are of reasonable quality.

We use distant labelling to provide each publisher with rich metadata annotations. These annotations allow for simulating various generalisation splits that misinformation models are confronted with during deployment. We focus on 6 such splits-time, event, topic, publisher, political bias, misinformation type-but more are possible.

By releasing this dataset publicly, we hope to encourage future works that design misinformation models specifically with out-of-distribution generalisation in mind.