Datasets

3,148 machine learning datasets

DiS-ReX

DiS-ReX is a multilingual dataset for distantly supervised (DS) relation extraction (RE). The dataset has over 1.5 million instances, spanning 4 languages (English, Spanish, German and French). The dataset has 36 positive relation types + 1 no relation (NA) class.

4 papers | 0 benchmarks | Texts

Concadia

Concadia is a publicly available Wikipedia-based corpus, which consists of 96,918 images with corresponding English-language descriptions, captions, and surrounding context.

4 papers | 0 benchmarks | Images, Texts

WikiCLIR

WikiCLIR is a large-scale (German-English) retrieval data set for Cross-Language Information Retrieval (CLIR). It contains a total of 245,294 German single-sentence queries with 3,200,393 automatically extracted relevance judgments for 1,226,741 English Wikipedia articles as documents. Queries are well-formed natural language sentences that allow large-scale training of (translation-based) ranking models.

4 papers | 0 benchmarks | Texts

KazakhTTS

KazakhTTS is an open-source speech synthesis dataset for Kazakh, a low-resource language spoken by over 13 million people worldwide. The dataset consists of about 91 hours of transcribed audio recordings spoken by two professional speakers (female and male). It is the first publicly available large-scale dataset developed to promote Kazakh text-to-speech (TTS) applications in both academia and industry.

4 papers | 0 benchmarks | Speech, Texts

Italian disinformation

This is a large-scale dataset of tweets associated with thousands of news articles published on Italian disinformation websites in the context of the 2019 European elections.

4 papers | 0 benchmarks | Texts

NELA-GT-2020

NELA-GT-2020 is an updated version of the NELA-GT-2019 dataset. NELA-GT-2020 contains nearly 1.8M news articles from 519 sources collected between January 1st, 2020 and December 31st, 2020. Just as with NELA-GT-2018 and NELA-GT-2019, these sources come from a wide range of mainstream news sources and alternative news sources. Included in the dataset are source-level ground truth labels from Media Bias/Fact Check (MBFC) covering multiple dimensions of veracity. Additionally, new in the 2020 dataset are the Tweets embedded in the collected news articles, adding an extra layer of information to the data.

4 papers | 0 benchmarks | Texts

SciDuet

SciDuet is a dataset for training and benchmarking models for automating document-to-slides generation. It consists of pairs of papers and their corresponding slide decks from recent years' NLP and ML conferences (e.g., ACL). This dataset contains 1,088 papers and 10,034 slides.

4 papers | 0 benchmarks | Texts

SoftAttributes (SoftAttributes: Relative movie attribute dataset for soft attributes)

The dataset consists of sets of movie titles, with each set annotated with a single English soft attribute (subjective descriptive property, such as 'confusing' or 'romantic') and a reference movie. For each set, a crowd worker has placed the movies into three sets: more, equally, and less than the reference movie. There are 5,991 such sets, from which one can infer approximately 250,000 pairwise preferences over movies for the 60 distinct soft attributes studied.

4 papers | 0 benchmarks | Texts
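
As a rough illustration of how pairwise preferences might be derived from one SoftAttributes set, the sketch below expands a reference movie and its three groups (more, equally, less) into ordered pairs. The function name, the argument layout, and the movie titles are hypothetical; the dataset's actual file format and the authors' exact inference procedure are not described here, so this is only one plausible reading.

```python
from itertools import product


def pairwise_preferences(reference, more, equal, less):
    """Return a set of (a, b) pairs meaning 'a has more of the attribute than b'.

    Assumption: movies in `equal` are treated as tied with the reference,
    so no pair is produced between them.
    """
    prefs = set()
    # Movies judged "more" than the reference rank above it.
    prefs.update((m, reference) for m in more)
    # The reference ranks above movies judged "less".
    prefs.update((reference, l) for l in less)
    # Ordering implied across the three groups.
    prefs.update(product(more, equal))
    prefs.update(product(more, less))
    prefs.update(product(equal, less))
    return prefs


# Hypothetical placeholder titles, not taken from the dataset.
prefs = pairwise_preferences(
    reference="Movie R",
    more=["Movie A"],
    equal=["Movie B"],
    less=["Movie C"],
)
```

Under this reading, a single annotated set with several movies per group yields many pairs, which is consistent with a few thousand sets expanding to a much larger number of pairwise preferences.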

VANiLLa

VANiLLa is a dataset for Question Answering over Knowledge Graphs (KGQA) offering answers in natural language sentences. The answer sentences in this dataset are syntactically and semantically closer to the question than to the triple fact. The dataset consists of over 100k simple questions adapted from the CSQA and SimpleQuestionsWikidata datasets and generated using a semi-automatic framework.

4 papers | 0 benchmarks | Texts

JobStack

JobStack is a new corpus for de-identification of personal data in job vacancies on Stackoverflow. De-identification is the task of detecting privacy-related entities in text, such as person names, emails and contact data.

4 papers | 0 benchmarks | Texts

CoDesc

CoDesc is a large parallel dataset of 4.2M Java source code samples paired with their natural language descriptions, drawn from code search and code summarization studies.

4 papers | 2 benchmarks | Texts

FacetSum

FacetSum is a faceted summarization dataset for scientific documents. FacetSum has been built on Emerald journal articles, covering a diverse range of domains. Different from traditional document-summary pairs, FacetSum provides multiple summaries, each targeted at specific sections of a long document, including the purpose, method, findings, and value.

4 papers | 1 benchmark | Texts

CoNaLa-Ext (CoNaLa Extended With Question Text)

CoNaLa Extended With Question Text is an extension of the original CoNaLa dataset, proposed in the NLP4Prog workshop paper "Reading StackOverflow Encourages Cheating: Adding Question Text Improves Extractive Code Generation". The key addition is that every example now includes the full question body from its respective StackOverflow question.

4 papers | 1 benchmark | Texts

CI-MNIST (Correlated and Imbalanced MNIST)

CI-MNIST (Correlated and Imbalanced MNIST) is a variant of the MNIST dataset that introduces different types of correlations between attributes, dataset features, and an artificial eligibility criterion. For an input image $x$, the label $y \in \{1, 0\}$ indicates eligibility or ineligibility, respectively, depending on whether the digit shown in $x$ is even or odd. The dataset defines the background color as the protected or sensitive attribute $s \in \{0, 1\}$, where blue denotes the unprivileged group and red denotes the privileged group. The dataset was designed to evaluate bias-mitigation approaches in challenging setups and to allow control over different dataset configurations.

4 papers | 0 benchmarks | Images, Tabular, Texts
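
The CI-MNIST labeling rule can be sketched in a few lines. This is a minimal illustration, not the dataset's own code: the function names are hypothetical, and the numeric encoding of the sensitive attribute (blue as 0, red as 1) is an assumption, since the description only says that blue is unprivileged and red is privileged.

```python
def eligibility_label(digit: int) -> int:
    """Return y = 1 (eligible) for an even digit, y = 0 (ineligible) for an odd digit."""
    return 1 if digit % 2 == 0 else 0


def sensitive_attribute(background: str) -> int:
    """Return s for a background color.

    Assumed encoding: blue (unprivileged) -> 0, red (privileged) -> 1.
    """
    encoding = {"blue": 0, "red": 1}
    return encoding[background]


# Example: a digit 4 on a blue background is eligible but in the unprivileged group.
y = eligibility_label(4)
s = sensitive_attribute("blue")
```

Correlation between $y$ and $s$ would then be controlled by how often each background color is assigned to even versus odd digits when a configuration is generated.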

ZhihuRec

The ZhihuRec dataset is collected from Zhihu, a knowledge-sharing platform. It comprises around 100M interactions collected within 10 days, 798K users, 165K questions, 554K answers, 240K authors, 70K topics, and more than 501K user query logs. It also includes anonymized descriptions of users, answers, questions, authors, and topics. To the best of our knowledge, this is the largest real-world interaction dataset for personalized recommendation.

4 papers | 0 benchmarks | Texts

Goal

Goal is a novel dataset of football (or 'soccer') highlights videos with transcribed live commentaries in English. As the course of a game is unpredictable, so are commentaries, which makes them a unique resource to investigate dynamic language grounding.

4 papers | 0 benchmarks | Texts, Videos

HKR (Handwritten Kazakh and Russian (HKR) Database for Text Recognition)

The database is written in Cyrillic; Kazakh and Russian share the same 33 characters. Besides these characters, the Kazakh alphabet also contains 9 additional specific characters. This dataset is a collection of forms. The sources of all the forms were generated with LaTeX and subsequently filled out by people in their own handwriting. The database consists of more than 1,400 filled forms. There are approximately 63,000 sentences and more than 715,699 symbols produced by approximately 200 different writers. We utilized three different datasets, described as follows:

4 papers | 2 benchmarks | Images, Texts

PELD

PELD is a text-based emotional dialog dataset with personality traits for speakers.

4 papers | 0 benchmarks | Texts

MultiSubs (MultiSubs: A Large-scale Multimodal and Multilingual Dataset)

MultiSubs is a dataset of multilingual subtitles gathered from the OPUS OpenSubtitles dataset, which in turn was sourced from opensubtitles.org. We have supplemented some text fragments (visually salient nouns in this release) within the subtitles with web images, where the word sense of the fragment has been disambiguated using a cross-lingual approach. We have introduced a fill-in-the-blank task and a lexical translation task to demonstrate the utility of the dataset. Please refer to our paper for a more detailed description of the dataset and tasks. MultiSubs will benefit research on visual grounding of words, especially in the context of free-form sentences.

4 papers | 2 benchmarks | Images, Texts

XWINO

XWINO is a multilingual collection of Winograd Schemas in six languages that can be used for evaluation of cross-lingual commonsense reasoning capabilities.

4 papers | 1 benchmark | Texts
