Datasets

19,997 machine learning datasets

PANDORA

PANDORA is the first large-scale dataset of Reddit comments labeled with three personality models (including the well-established Big 5 model) and demographics (age, gender, and location) for more than 10k users.

14 papers | 0 benchmarks

WNLI (Winograd NLI)

The WNLI dataset is a part of the GLUE benchmark used for Natural Language Inference (NLI). It contains pairs of sentences, and the task is to determine whether the second sentence is an entailment of the first one or not. The dataset is used to train and evaluate models on their ability to understand these relationships between sentences.

14 papers | 1 benchmark

Adverse Drug Events (ADE) Corpus

A benchmark corpus developed to support the automatic extraction of drug-related adverse effects from medical case reports.

14 papers | 6 benchmarks

AVSD (Audio-Visual Scene-Aware Dialog)

The Audio-Visual Scene-Aware Dialog (AVSD) dataset, or DSTC7 Track 3, is an audio-visual dataset for dialogue understanding. The goal of the dataset and track was to design systems that generate responses in a dialog about a video, given the dialog history and the audio-visual content of the video.

14 papers | 1 benchmark | Audio, Texts, Videos

RecipeNLG

14 papers | 3 benchmarks

Linux (Linux Program Dependence Graphs)

The LINUX dataset consists of 48,747 Program Dependence Graphs (PDGs) generated from the Linux kernel. Each graph represents a function, where a node represents one statement and an edge represents a dependency between two statements.

14 papers | 0 benchmarks | Graphs

KITTI-Depth

The KITTI-Depth dataset includes depth maps from projected LiDAR point clouds that were matched against the depth estimation from the stereo cameras. The depth images are highly sparse: only 5% of the pixels have depth values, and the rest are missing. The dataset has 86k training images, 7k validation images, and 1k test images held on the benchmark server, with no access to the ground truth.

14 papers | 0 benchmarks | Images, Point cloud

TAU Urban Acoustic Scenes 2019

The TAU Urban Acoustic Scenes 2019 development dataset consists of 10-second audio segments from 10 acoustic scenes: airport, indoor shopping mall, metro station, pedestrian street, public square, street with medium level of traffic, travelling by tram, travelling by bus, travelling by underground metro, and urban park. Each acoustic scene has 1440 segments (240 minutes of audio). In total, the dataset contains 40 hours of audio.

14 papers | 2 benchmarks | Audio
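
The durations quoted in the TAU Urban Acoustic Scenes 2019 entry can be cross-checked with a little arithmetic. The sketch below is only a sanity check on the stated figures (10 scenes, 1440 segments per scene, 10 seconds per segment); the constant and function names are illustrative and are not part of the dataset or any official tooling.

```python
# Figures quoted in the TAU Urban Acoustic Scenes 2019 entry.
SEGMENT_SECONDS = 10        # length of one audio segment
SEGMENTS_PER_SCENE = 1440   # segments recorded per acoustic scene
NUM_SCENES = 10             # number of acoustic scenes


def minutes_per_scene(segments=SEGMENTS_PER_SCENE, seconds=SEGMENT_SECONDS):
    """Audio per scene, in minutes."""
    return segments * seconds / 60


def total_hours(scenes=NUM_SCENES):
    """Audio across all scenes, in hours."""
    return scenes * minutes_per_scene() / 60


print(minutes_per_scene())  # 240.0
print(total_hours())        # 40.0
```

Both results agree with the entry: 240 minutes per scene and 40 hours overall.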

MedHop

With the same format as WikiHop, the MedHop dataset is based on research paper abstracts from PubMed, and the queries are about interactions between pairs of drugs. The correct answer has to be inferred by combining information from a chain of reactions of drugs and proteins.

14 papers | 0 benchmarks | Texts

Dakshina

The Dakshina dataset is a collection of text in both Latin and native scripts for 12 South Asian languages. For each language, the dataset includes a large collection of native script Wikipedia text, a romanization lexicon which consists of words in the native script with attested romanizations, and some full sentence parallel data in both a native script of the language and the basic Latin alphabet.

14 papers | 0 benchmarks | Texts

4DFAB

4DFAB is a large-scale database of dynamic high-resolution 3D faces. It consists of recordings of 180 subjects captured in four different sessions spanning a five-year period (2012-2017), resulting in a total of over 1,800,000 3D meshes. It contains 4D videos of subjects displaying both spontaneous and posed facial behaviours. The database can be used for both face and facial expression recognition, as well as behavioural biometrics. It can also be used to learn very powerful blendshapes for parametrising facial behaviour.

14 papers | 0 benchmarks | Images, Videos

SOREL-20M (Sophos/ReversingLabs-20 Million)

SOREL-20M is a large-scale dataset consisting of nearly 20 million files with pre-extracted features and metadata, high-quality labels derived from multiple sources, information about vendor detections of the malware samples at the time of collection, and additional “tags” related to each malware sample to serve as additional targets.

14 papers | 0 benchmarks | Texts

MAP (Maybe Ambiguous Pronoun)

Maybe Ambiguous Pronoun (MAP) is a dataset similar to the GAP dataset, but without binary gender constraints.

14 papers | 0 benchmarks

AmazonQA

AmazonQA consists of 923k questions, 3.6M answers and 14M reviews across 156k products. Building on the well-known Amazon dataset, additional annotations are collected, marking each question as either answerable or unanswerable based on the available reviews.

14 papers | 0 benchmarks | Texts

ACL ARC

ACL Anthology Reference Corpus (ACL ARC) is a collection of 10,920 academic papers from the ACL Anthology. ACL ARC is cleaned to remove:

14 papers | 0 benchmarks

CADP

A novel dataset for traffic accident analysis.

14 papers | 0 benchmarks

CoDraw

The Collaborative Drawing game (CoDraw) dataset contains ~10K dialogs consisting of ~138K messages exchanged between human players in the CoDraw game. The game involves two players: a Teller and a Drawer. The Teller sees an abstract scene containing multiple clip art pieces in a semantically meaningful configuration, while the Drawer tries to reconstruct the scene on an empty canvas using available clip art pieces. The two players communicate with each other using natural language.

14 papers | 0 benchmarks | Texts

DEFT Corpus

A SemEval shared task in which participants must extract definitions from free text using a term-definition pair corpus that reflects the complex reality of definitions in natural language.

14 papers | 0 benchmarks | Texts

DIPS

The dataset contains biases but is two orders of magnitude larger than those used previously.

14 papers | 0 benchmarks

Fashionpedia

Fashionpedia consists of two parts: (1) an ontology built by fashion experts containing 27 main apparel categories, 19 apparel parts, 294 fine-grained attributes and their relationships; (2) a dataset with everyday and celebrity event fashion images annotated with segmentation masks and their associated per-mask fine-grained attributes, built upon the Fashionpedia ontology.

14 papers | 0 benchmarks
