3,148 machine learning datasets
A high-quality large-scale dataset consisting of 49,000+ data samples for the task of Chinese query-based document summarization.
RadioTalk is a corpus of speech recognition transcripts sampled from talk radio broadcasts in the United States between October 2018 and March 2019. The corpus is intended for use by researchers in the fields of natural language processing, conversation analysis, and the social sciences. It encompasses approximately 2.8 billion words of automatically transcribed speech from 284,000 hours of radio, together with metadata about the speech, such as geographical location, speaker turn boundaries, gender, and radio program information.
The RotoWire-Modified dataset is a cleaned extension of the RotoWire dataset, with writer information about each document. It contains 2,705 samples for training, 532 for validation, and 497 for testing.
scb-mt-en-th-2020 is an English-Thai machine translation dataset with over 1 million segment pairs, curated from various sources, namely news, Wikipedia articles, SMS messages, task-based dialogs, web-crawled data and government documents.
Search4Code is a large-scale web-query-based dataset of code search queries for C# and Java. The Search4Code data is mined from Microsoft Bing's anonymized search query logs using a weak supervision technique.
The SG-NLG dataset is a pre-processed version of the DSTC8 Schema-Guided Dialogue (SGD) dataset, designed specifically for data-to-text Natural Language Generation (NLG). The original DSTC8 SGD contains ~20,000 dialogues spanning ~20 domains.
The simply-CLEVR dataset aims to provide a benchmark that can be used for transparent quantitative evaluation of explanation methods (i.e., heatmap/XAI methods). It is made of simple Visual Question Answering (VQA) questions derived from the original CLEVR task, where each question is accompanied by two ground-truth masks that serve as a basis for evaluating explanations on the input image.
SQuAD-it is derived from the SQuAD dataset through semi-automatic translation into Italian. It is a large-scale dataset for open question answering on factoid questions in Italian, containing more than 60,000 question/answer pairs derived from the original English dataset.
A large-scale Japanese video caption dataset consisting of 79,822 videos and 399,233 captions. Each caption in the dataset describes a video in the form of "who does what and where."
TextWorld KG is a dynamic Knowledge Graph (KG) extraction dataset. It is based on a set of text-based games generated using the TextWorld framework, which allows extracting the underlying partial KG for every state, i.e., the subgraph that represents the agent's partial knowledge of the world – what it has observed so far. All games share the same overarching theme: the agent finds itself hungry in a simple modern house with the goal of gathering ingredients and cooking a meal.
TinySocial is a dataset to enable research on Social Visual Question Answering.
The Twitter Cyberthreat Detection Dataset contains tweets from two sets of accounts related to cybersecurity. The tweets are annotated with information such as whether they contain security-related content, as well as named entities.
This dataset contains two subsets of flood images from Twitter: the Harz17 dataset comprises images from tweets containing flood-related keywords during a flood in the Harz region of Germany in July 2017. Similarly, the Rhine18 dataset comprises images related to a flood of the river Rhine in January 2018.
The TWT16 dataset contains ~30k conversations on Twitter, collected from January to June 2016.
This dataset comprises over 26,000 full names annotated with genders.
Urban Dict spelling variant is a variant spelling dataset for NLP research in the informal domain. It consists of around 25k variant spelling pairs from UrbanDictionary.
The Visual Discriminative Question Generation (VDQG) dataset contains 11,202 ambiguous image pairs collected from Visual Genome. Each image pair is annotated with an average of 4.6 discriminative and 5.9 non-discriminative questions.
Wiki-en is an annotated English dataset for domain detection extracted from Wikipedia. It includes texts from 7 different domains: “Business and Commerce” (BUS), “Government and Politics” (GOV), “Physical and Mental Health” (HEA), “Law and Order” (LAW), “Lifestyle” (LIF), “Military” (MIL), and “General Purpose” (GEN).
The Wiki-Flickr Event dataset is a well-labelled but weakly-aligned dataset collected for cross-modal event retrieval. The dataset consists of 28,825 images from Flickr and 11,960 text articles from hundreds of social media sources, belonging to 82 categories of events.
The goal of this dataset is to understand how people experience sexism and sexual harassment in the workplace by discovering themes in 2,362 experiences posted on the Everyday Sexism Project's website.