Datasets

19,997 machine learning datasets

19,997 dataset results

VAST (VAried Stance Topics)

VAST consists of a large range of topics covering broad themes, such as politics (e.g., ‘a Palestinian state’), education (e.g., ‘charter schools’), and public health (e.g., ‘childhood vaccination’). In addition, the data includes a wide range of similar expressions (e.g., ‘guns on campus’ versus ‘firearms on campus’). This variation captures how humans might realistically describe the same topic and contrasts with the lack of variation in existing datasets.

18 papers1 benchmarksImages, Texts

3D60

Collects high quality 360 datasets with ground truth depth annotations, by re-using recently released large scale 3D datasets and re-purposing them to 360 via rendering.

18 papers0 benchmarks

ActivityNet Entities

ActivityNet-Entities, augments the challenging ActivityNet Captions dataset with 158k bounding box annotations, each grounding a noun phrase. This allows training video description models with this data, and importantly, evaluate how grounded or "true" such model are to the video they describe.

18 papers0 benchmarksVideos

Agriculture-Vision

A large-scale aerial farmland image dataset for semantic segmentation of agricultural patterns. Collects 94,986 high-quality aerial images from 3,432 farmlands across the US, where each image consists of RGB and Near-infrared (NIR) channels with resolution as high as 10 cm per pixel.

18 papers0 benchmarks

CCD (Car Crash Dataset)

Car Crash Dataset (CCD) is collected for traffic accident analysis. It contains real traffic accident videos captured by dashcam mounted on driving vehicles, which is critical to developing safety-guaranteed self-driving systems. CCD is distinguished from existing datasets for diversified accident annotations, including environmental attributes (day/night, snowy/rainy/good weather conditions), whether ego-vehicles involved, accident participants, and accident reason descriptions.

18 papers2 benchmarksVideos

DoQA

A dataset with 2,437 dialogues and 10,917 QA pairs. The dialogues are collected from three Stack Exchange sites using the Wizard of Oz method with crowdsourcing.

18 papers0 benchmarks

DWIE (Deutsche Welle corpus for Information Extraction)

The 'Deutsche Welle corpus for Information Extraction' (DWIE) is a multi-task dataset that combines four main Information Extraction (IE) annotation sub-tasks: (i) Named Entity Recognition (NER), (ii) Coreference Resolution, (iii) Relation Extraction (RE), and (iv) Entity Linking. DWIE is conceived as an entity-centric dataset that describes interactions and properties of conceptual entities on the level of the complete document.

18 papers6 benchmarksTexts

FDST (Fudan-ShanghaiTech)

The Fudan-ShanghaiTech dataset (FDST) is a dataset for video crowd counting. It contains 15K frames with about 394K annotated heads captured from 13 different scenes

18 papers0 benchmarksVideos

Groningen Meaning Bank

Groningen Meaning Bank is a semantic resource that anyone can edit and that integrates various semantic phenomena, including predicate-argument structure, scope, tense, thematic roles, animacy, pronouns, and rhetorical relations.

18 papers0 benchmarks

InfoTabS

InfoTabS comprises of human-written textual hypotheses based on premises that are tables extracted from Wikipedia info-boxes.

18 papers0 benchmarksTexts

KorNLI

KorNLI is a Korean Natural Language Inference (NLI) dataset. The dataset is constructed by automatically translating the training sets of the SNLI, XNLI and MNLI datasets. To ensure translation quality, two professional translators with at least seven years of experience who specialize in academic papers/books as well as business contracts post-edited a half of the dataset each and cross-checked each other’s translation afterward. It contains 942,854 training examples translated automatically and 7,500 evaluation (development and test) examples translated manually

18 papers0 benchmarksTexts

LaMem

An annotated image memorability dataset to date (with 60,000 labeled images from a diverse array of sources).

18 papers0 benchmarks

LS3D-W

A 3D facial landmark dataset of around 230,000 images.

18 papers0 benchmarks

LSHTC

LSHTC is a dataset for large-scale text classification. The data used in the LSHTC challenges originates from two popular sources: the DBpedia and the ODP (Open Directory Project) directory, also known as DMOZ. DBpedia instances were selected from the english, non-regional Extended Abstracts provided by the DBpedia site. The DMOZ instances consist of either Content vectors, Description vectors or both. A Content vectors is obtained by directly indexing the web page using standard indexing chain (preprocessing, stemming/lemmatization, stop-word removal).

18 papers0 benchmarks

MaskedFace-Net

Proposes three types of masked face detection dataset; namely, the Correctly Masked Face Dataset (CMFD), the Incorrectly Masked Face Dataset (IMFD) and their combination for the global masked face detection (MaskedFace-Net).

18 papers0 benchmarks

MED (Monotonicity Entailment Dataset)

MED is a new evaluation dataset that covers a wide range of monotonicity reasoning that was created by crowdsourcing and collected from linguistics publications. The dataset was constructed by collecting naturally-occurring examples by crowdsourcing and well-designed ones from linguistics publications. It consists of 5,382 examples.

18 papers1 benchmarksTexts

MMD (Multimodal Dialogs)

The MMD (MultiModal Dialogs) dataset is a dataset for multimodal domain-aware conversations. It consists of over 150K conversation sessions between shoppers and sales agents, annotated by a group of in-house annotators using a semi-automated manually intense iterative process.

18 papers0 benchmarksImages, Texts

ORCAS

ORCAS is a click-based dataset. It covers 1.4 million of the TREC DL documents, providing 18 million connections to 10 million distinct queries.

18 papers0 benchmarks

SALSA

A novel dataset facilitating multimodal and Synergetic sociAL Scene Analysis.

18 papers3 benchmarks

United Nations Parallel Corpus

The first parallel corpus composed from United Nations documents published by the original data creator. The parallel corpus presented consists of manually translated UN documents from the last 25 years (1990 to 2014) for the six official UN languages, Arabic, Chinese, English, French, Russian, and Spanish.

18 papers0 benchmarksTexts

PreviousPage 109 of 1000Next