BEAMetrics (Benchmark to Evaluate Automatic Metrics) is a resource that makes it easier to develop and evaluate new metrics for generated language. BEAMetrics users can quickly compare existing and new metrics against human judgements across a diverse set of tasks, quality dimensions (fluency vs. coherence vs. informativeness, etc.), and languages.
Multitask learning has led to significant advances in Natural Language Processing, including the decaNLP benchmark, where question answering is used to frame 10 natural language understanding tasks in a single model. PQ-decaNLP is a crowd-sourced corpus of paraphrased questions, annotated with paraphrase phenomena. It enables analysis of how transformations such as swapping class labels and changing sentence modality lead to large performance degradations.
CLUES (Constrained Language Understanding Evaluation Standard) is a benchmark for evaluating the few-shot learning capabilities of NLU models.
A collection of long-running (80+ episodes) science fiction TV show synopses, scraped from Fandom.com wikis. Collected Nov 2017. Each episode is considered a "story".
Official dataset of Decrypting Cryptic Crosswords: Semantically Complex Wordplay Puzzles as a Target for NLP.
The first annotated corpus for multilingual analysis of potentially unfair clauses in online Terms of Service. The dataset comprises a total of 100 contracts, obtained from 25 documents annotated in four languages: English, German, Italian, and Polish. For each contract, clauses that are potentially unfair to the consumer are annotated across nine unfairness categories.
SubSumE is a dataset for subjective document summarization. See the paper and the talk for details on dataset creation. Also check out our work SuDocu on example-based document summarization.
AnswerSumm is a dataset of 4,631 CQA threads for answer summarization, curated by professional linguists.
Digital Edition: Essays from Hannah Arendt. An NER dataset created from the digital edition "Sechs Essays" ("Six Essays") by Hannah Arendt. It consists of 23 documents from the period 1932-1976, which are available online as TEI files (see https://hannah-arendt-edition.net/3p.html?lang=de).
Digital Edition: Sturm Edition. Source: Schrade, Torsten: "Startseite", in: DER STURM. A digital source edition on the history of the international avant-garde, compiled and edited by Marjam Trautmann and Torsten Schrade. Mainz, Akademie der Wissenschaften und der Literatur, Version 1 of 16 Jul. 2018.
DataCLUE is the first data-centric benchmark in the NLP field.
A freely licensed dataset with warrants for 2,000 authentic arguments from news comments. On this basis, we present a new and challenging task: argument reasoning comprehension. Given an argument consisting of a claim and a premise, the goal is to choose the correct implicit warrant from two options. Both warrants are plausible and lexically close, but lead to contradicting claims.
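The task format described above can be sketched as a small data structure. This is an illustrative sketch only: the field names, example texts, and schema below are hypothetical, not the dataset's actual format.

```python
from dataclasses import dataclass

@dataclass
class ArgumentInstance:
    """One hypothetical argument reasoning comprehension instance:
    a claim, a premise, two candidate warrants, and the gold label."""
    claim: str
    premise: str
    warrant0: str
    warrant1: str
    label: int  # index (0 or 1) of the correct implicit warrant

# Invented example content, purely for illustration.
example = ArgumentInstance(
    claim="Comment sections should be moderated.",
    premise="Unmoderated comments are often abusive.",
    warrant0="Abusive comments drive readers away.",
    warrant1="Abusive comments increase engagement.",
    label=0,
)

# A model must pick the warrant that licenses the inference
# from premise to claim; the distractor leads to the opposite claim.
correct_warrant = [example.warrant0, example.warrant1][example.label]
print(correct_warrant)
```

The key property of the task is that both warrants are plausible in isolation, so lexical overlap alone cannot distinguish them.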
WikiContradiction is a novel dataset for detecting self-contradictions within Wikipedia articles.
Product Page is a large-scale and realistic dataset of webpages. The dataset contains 51,701 manually labeled product pages from 8,175 real e-commerce websites. The pages can be rendered entirely in a web browser and are suitable for computer vision applications. This makes it substantially richer and more diverse than other datasets proposed for element representation learning, classification and prediction on the web.
The ComMA Dataset v0.2 is a multilingual dataset annotated with a hierarchical, fine-grained tagset marking different types of aggression and the "context" in which they occur. The context, here, is defined by the conversational thread in which a specific comment occurs and also the "type" of discursive role that the comment performs with respect to the previous comment. The initial dataset, discussed here (and made available as part of the ComMA@ICON shared task), consists of a total of 15,000 annotated comments in four languages - Meitei, Bangla, Hindi, and Indian English - collected from various social media platforms such as YouTube, Facebook, Twitter, and Telegram. As is usual on social media, a large number of these comments are multilingual, mostly code-mixed with English.
Inspired by Wang et al. (2021), we use the top-voted and well-documented Kaggle notebooks to construct the notebookCDG dataset.
The released GIF Reply dataset contains 1,562,701 real text-GIF conversation turns on Twitter. In these conversations, 115,586 unique GIFs are used. Metadata, including OCR extracted text, annotated tags, and object names, are also available for some GIFs in this dataset.
ShadowLink dataset is designed to evaluate the impact of entity overshadowing on the task of entity disambiguation. Paper: "Robustness Evaluation of Entity Disambiguation Using Prior Probes: the Case of Entity Overshadowing" by Vera Provatorova, Svitlana Vakulenko, Samarth Bhargav, Evangelos Kanoulas. EMNLP 2021.
We release various types of word embeddings for multiple Indian languages. Please note that for the majority of our work, the corpora were transliterated into the Devanagari script. We provide word embedding models using FastText and ELMo, as well as cross-lingual models based on an orthogonal alignment of the monolingual models for all pairs of these languages.
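The orthogonal alignment mentioned above has a closed-form (Procrustes) solution; the toy sketch below illustrates the technique on synthetic data, not on the released embeddings themselves. All dimensions and matrices here are invented for illustration.

```python
import numpy as np

# Toy Procrustes alignment: given source-language vectors X and
# target-language vectors Y for a seed dictionary, find the orthogonal
# matrix W minimising ||XW - Y||_F.
rng = np.random.default_rng(0)
d = 5                                    # toy embedding dimension
X = rng.standard_normal((20, d))         # "source language" seed vectors

# Construct Y as a rotated copy of X so a perfect alignment exists.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
Y = X @ Q

# Closed-form solution: W = U V^T, where U S V^T is the SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

# W is orthogonal by construction, and X @ W recovers Y here.
print(np.allclose(X @ W, Y, atol=1e-8))
```

Because W is constrained to be orthogonal, the mapping preserves distances and angles in the monolingual space, which is why this family of methods is attractive for cross-lingual embedding alignment.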
CLIPS, i.e. Corpora e Lessici dell'Italiano Parlato e Scritto (Corpora and Lexicons of Spoken and Written Italian), is one of the eight projects (Project No. 2) of Cluster C18 "COMPUTATIONAL LINGUISTICS: MONOLINGUAL AND MULTILINGUAL RESEARCH" (Law 488), funded by the Italian Ministry of Education, Universities and Research (MIUR).