3,148 machine learning datasets
A high-quality large-scale dataset consisting of 49,000+ data samples for the task of Chinese query-based document summarization.
RadioTalk is a corpus of speech recognition transcripts sampled from talk radio broadcasts in the United States between October 2018 and March 2019. The corpus is intended for use by researchers in the fields of natural language processing, conversation analysis, and the social sciences. It encompasses approximately 2.8 billion words of automatically transcribed speech from 284,000 hours of radio, together with metadata about the speech, such as geographical location, speaker turn boundaries, gender, and radio program information.
The RotoWire-Modified dataset is a cleaned extension of the RotoWire dataset, with writer information about each document. It contains 2,705 samples for training, 532 for validation, and 497 for testing.
scb-mt-en-th-2020 is an English-Thai machine translation dataset with over 1 million segment pairs, curated from various sources, namely news, Wikipedia articles, SMS messages, task-based dialogs, web-crawled data and government documents.
Search4Code is a large-scale web-query-based dataset of code search queries for C# and Java. The Search4Code data is mined from Microsoft Bing's anonymized search query logs using a weak supervision technique.
The SG-NLG dataset is a pre-processed version of the DSTC8 Schema-Guided Dialogue (SGD) dataset, designed specifically for data-to-text Natural Language Generation (NLG). The original DSTC8 SGD contains ~20,000 dialogues spanning ~20 domains.
The simply-CLEVR dataset aims to provide a benchmark that can be used for transparent quantitative evaluation of explanation methods (i.e., heatmap/XAI methods). It is made of simple Visual Question Answering (VQA) questions derived from the original CLEVR task, where each question is accompanied by two ground-truth masks that serve as a basis for evaluating explanations on the input image.
SQuAD-it is derived from the SQuAD dataset through semi-automatic translation into Italian. It is a large-scale dataset for open question answering on factoid questions in Italian, containing more than 60,000 question/answer pairs derived from the original English dataset.
A large-scale Japanese video caption dataset consisting of 79,822 videos and 399,233 captions. Each caption in the dataset describes a video in the form of "who does what and where."
TextWorld KG is a dynamic Knowledge Graph (KG) extraction dataset. It is based on a set of text-based games generated using the TextWorld framework, which allows extracting the underlying partial KG for every state, i.e., the subgraph that represents the agent's partial knowledge of the world – what it has observed so far. All games share the same overarching theme: the agent finds itself hungry in a simple modern house with the goal of gathering ingredients and cooking a meal.
TinySocial is a dataset to enable research on Social Visual Question Answering.
The Twitter Cyberthreat Detection Dataset contains tweets from two sets of accounts related to cybersecurity. The tweets are annotated with information such as whether they contain security-related content, as well as named entities.
This dataset contains two subsets of flood images from Twitter: the Harz17 dataset comprises images from tweets containing flood-related keywords during a flood in the Harz region of Germany in July 2017. Similarly, the Rhine18 dataset comprises images related to a flood of the river Rhine in January 2018.
The TWT16 dataset contains ~30k conversations on Twitter, collected from January to June 2016.
This dataset comprises over 26,000 full names annotated with genders.
Urban Dict spelling variant is a variant spelling dataset for NLP research in the informal domain. It consists of around 25k variant spelling pairs from UrbanDictionary.
The Visual Discriminative Question Generation (VDQG) dataset contains 11,202 ambiguous image pairs collected from Visual Genome. Each image pair is annotated with an average of 4.6 discriminative and 5.9 non-discriminative questions.
Wiki-en is an annotated English dataset for domain detection extracted from Wikipedia. It includes texts from 7 different domains: “Business and Commerce” (BUS), “Government and Politics” (GOV), “Physical and Mental Health” (HEA), “Law and Order” (LAW), “Lifestyle” (LIF), “Military” (MIL), and “General Purpose” (GEN).
The Wiki-Flickr Event dataset is a well-labelled but weakly-aligned dataset collected for cross-modal event retrieval. The dataset consists of 28,825 images from Flickr and 11,960 text articles from hundreds of social media sources, belonging to 82 categories of events.
The goal of this dataset is to understand how people experience sexism and sexual harassment in the workplace by discovering themes in 2,362 experiences posted on the Everyday Sexism Project's website.