19,997 machine learning datasets
An in-the-wild stereo image dataset, comprising 49,368 image pairs contributed by users of the Holopix mobile social platform.
An independent component (IC) dataset containing spatiotemporal measures for over 200,000 ICs from more than 6,000 EEG recordings.
A new large dataset for illumination estimation. The dataset, called INTEL-TAU, contains 7,022 images in total, making it the largest available high-resolution dataset for illumination estimation research.
IRS is an open dataset for indoor robotics vision tasks, especially disparity and surface normal estimation. It contains 103,316 samples in total, covering a wide range of indoor scenes such as homes, offices, stores, and restaurants.
Contains 60 female and 30 male actors performing a collection of 20 predefined everyday actions and sports movements, and one self-chosen movement.
A dataset that contains 25,017 reading comprehension style examples curated from an existing corpus of 115 website privacy policies. PolicyQA provides 714 human-annotated questions written for a wide range of privacy practices.
RareAct is a video dataset of unusual actions, including actions like “blend phone”, “cut keyboard” and “microwave shoes”. It aims at evaluating the zero-shot and few-shot compositionality of action recognition models for unlikely compositions of common action verbs and object nouns. It contains 122 different actions which were obtained by combining verbs and nouns rarely co-occurring together in the large-scale textual corpus from HowTo100M, but that frequently appear separately.
RICE is a remote sensing image dataset for cloud removal. The dataset consists of two parts: RICE1 contains 500 image pairs, each pairing a cloudy image with its cloud-free counterpart at 512×512 resolution; RICE2 contains 450 image sets, each comprising three 512×512 images: a cloud-free reference image, a cloudy image, and the corresponding cloud mask.
An 'in the wild' dataset of 20,580 dog images for which 2D joint and silhouette annotations were collected.
TJU-DHD is a high-resolution dataset for object detection and pedestrian detection. The dataset contains 115,354 high-resolution images (52% images have a resolution of 1624×1200 pixels and 48% images have a resolution of at least 2,560×1,440 pixels) and 709,330 labelled objects in total with a large variance in scale and appearance.
The UCFRep dataset contains 526 annotated repetitive action videos. This dataset is built from the action recognition dataset UCF101.
Consists of over 39,000 images originating from people who are blind, each paired with five captions.
MuST-Cinema is a Multilingual Speech-to-Subtitles corpus ideal for building subtitle-oriented machine and speech translation systems. It comprises audio recordings from English TED Talks, which are automatically aligned at the sentence level with their manual transcriptions and translations.
Aesthetic Visual Analysis is a dataset for aesthetic image assessment that contains over 250,000 images along with a rich variety of meta-data including a large number of aesthetic scores for each image, semantic labels for over 60 categories as well as labels related to photographic style.
Along with the COVID-19 pandemic we are also fighting an 'infodemic'. Fake news and rumors are rampant on social media, and believing in rumors can cause significant harm, which is further exacerbated during a pandemic. To tackle this, we curate and release a manually annotated dataset of 10,700 social media posts and articles of real and fake news on COVID-19. We benchmark the annotated dataset with four machine learning baselines: Decision Tree, Logistic Regression, Gradient Boost, and Support Vector Machine (SVM). We obtain the best performance, a 93.46% F1-score, with the SVM.
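The SVM baseline mentioned above can be sketched as a standard text-classification pipeline (a minimal illustration using scikit-learn with TF-IDF features; the toy posts and labels below are placeholders, not drawn from the actual dataset, and the paper's exact feature setup is not specified here):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder posts and labels (1 = real, 0 = fake); the actual
# benchmark uses the 10,700 annotated COVID-19 posts and articles.
posts = [
    "Vaccines approved after completing clinical trials",
    "Garlic water cures the virus overnight",
    "Health ministry publishes daily case counts",
    "5G towers spread the infection",
]
labels = [1, 0, 1, 0]

# TF-IDF vectorizer feeding a linear SVM classifier
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(posts, labels)

preds = model.predict(posts)
score = f1_score(labels, preds)  # F1 on the toy training data
print(score)
```

On real data one would of course evaluate on a held-out split rather than the training examples.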
The Cornell eRulemaking Corpus – CDCP is an argument mining corpus annotated with argumentative structure information capturing the evaluability of arguments. The corpus consists of 731 user comments on the Consumer Debt Collection Practices (CDCP) rule by the Consumer Financial Protection Bureau (CFPB); the resulting dataset contains 4,931 elementary unit and 1,221 support relation annotations. It is a resource for building argument mining systems that can not only extract arguments from unstructured text, but also identify what additional information is necessary for readers to understand and evaluate a given argument. Immediate applications include providing real-time feedback to commenters, specifying which types of support for which propositions can be added to construct better-formed arguments.
The HRWSI dataset consists of about 21K diverse high-resolution RGB-D image pairs derived from Web stereo images. It also provides sky segmentation masks, instance segmentation masks, and invalid pixel masks.
The latest CosmoFlow dataset includes around 10,000 cosmological N-body dark matter simulations. The simulations are run using MUSIC to generate the initial conditions, and are evolved with pyCOLA, a multithreaded Python/Cython N-body code. The output of these simulations is then binned into a 3D histogram of particle counts in a cube of size 512×512×512, which is sampled at 4 different redshifts.
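The binning step described above can be sketched with NumPy (an illustrative sketch only: random points stand in for the pyCOLA particle positions, and a 64³ grid stands in for the actual 512³ grid to keep memory small):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in particle positions in a periodic box of side L
# (the real pipeline bins pyCOLA simulation outputs instead).
L = 100.0
positions = rng.uniform(0.0, L, size=(10_000, 3))

# Bin particle counts into a 3D histogram over the box.
grid_size = 64  # 512 in the actual CosmoFlow dataset
edges = [np.linspace(0.0, L, grid_size + 1)] * 3
hist, _ = np.histogramdd(positions, bins=edges)

print(hist.shape)       # (64, 64, 64)
print(int(hist.sum()))  # 10000: every particle falls in exactly one voxel
```

Repeating this at each of the 4 redshift snapshots yields one histogram volume per snapshot.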
ADAM is organized as a half-day challenge, a satellite event of the ISBI 2020 conference in Iowa City, Iowa, USA.
FoolMeTwice (FM2 for short) is a large dataset of challenging entailment pairs collected through a fun multi-player game. Gamification encourages adversarial examples, drastically lowering the number of examples that can be solved using "shortcuts" compared to other popular entailment datasets. Players are presented with two tasks. The first task asks the player to write a plausible claim based on the evidence from a Wikipedia page. The second one shows two plausible claims written by other players, one of which is false, and the goal is to identify it before the time runs out. Players "pay" to see clues retrieved from the evidence pool: the more evidence the player needs, the harder the claim. Game-play between motivated players leads to diverse strategies for crafting claims, such as temporal inference and diverting to unrelated evidence, and results in higher quality data for the entailment and evidence retrieval tasks.