19,997 machine learning datasets
19,997 dataset results
This mouse cerebellar atlas can be used for mouse cerebellar morphometry.
The H01 dataset is a 1.4 petabyte rendering of a small sample of human brain tissue, released by a collaboration between the Lichtman Laboratory at Harvard University and Google. The H01 sample was imaged at 4nm-resolution by serial section electron microscopy, reconstructed and annotated by automated computational techniques, and analyzed for preliminary insights into the structure of the human cortex.
A dataset for fine-grained art attribute recognition introduced in the 6th FGVC Workshop at CVPR 2019. It is a high-quality artwork image dataset with professional photographs of artworks from The Metropolitan Museum of Art and attribute labels curated or verified by experts.
Tweets and items from psychological scales for sexism detection with counterfactual examples.
UIT-ViSFD is a Vietnamese Smartphone Feedback Dataset as a new benchmark corpus built based on strict annotation schemes for evaluating aspect-based sentiment analysis, consisting of 11,122 human-annotated comments for mobile e-commerce, which is freely available for research purposes.
Schema2QA is the first large question answering dataset over real-world Schema.org data. It covers 6 common domains: restaurants, hotels, people, movies, books, and music, based on crawled Schema.org metadata from 6 different websites (Yelp, Hyatt, LinkedIn, IMDb, Goodreads, and last.fm.). In total, there are over 2,000,000 examples for training, consisting of both augmented human paraphrase data and high-quality synthetic data generated by Genie. All questions are annotated with executable virtual assistant programming language ThingTalk.
The Herbarium Half-Earth dataset is a large and diverse dataset of herbarium specimens to date for automatic taxon recognition. The Herbarium 2021: Half-Earth Challenge dataset includes more than 2.5M images representing nearly 65,000 species from the Americas and Oceania that have been aligned to a standardized plant list.
We crawled 5000 paper, slide pairs from conference proceeding websites. (e.g. acl.org and usenix.org).
CiteWorth is a a large, contextualized, rigorously cleaned labelled dataset for cite-worthiness detection built from a massive corpus of extracted plain-text scientific documents.
IndiaPoliceEvents is a corpus of 21,391 sentences from 1,257 English-language Times of India articles about events in the state of Gujarat during March 2002. This dataset is used for automated event extraction.
MultiOpEd is a corpus of multi-perspective news editorials. It is an open-domain news editorial corpus that supports various tasks pertaining to the argumentation structure in news editorials, focusing on automatic perspective discovery. News editorial is a genre of persuasive text, where the argumentation structure is usually implicit. However, the arguments presented in an editorial typically center around a concise, focused thesis, which we refer to as their perspective. MultiOpEd aims at supporting the study of multiple tasks relevant to automatic perspective discovery, where a system is expected to produce a single-sentence thesis statement summarizing the arguments presented.
Rent3D++ is an extension of the Rent3D floorplans + photos dataset. The floorplans are annotated with room outline polygons, doors/windows as line segments, object-icons as axis-aligned bounding boxes, room-door-room connectivity graphs, and photo-room assignments. We have extracted rectified surface crops from architectural surfaces in photos, and these can drive interior texturing/material modeling tasks. This dataset can be used with our paper Plan2Scene to generate textured 3D mesh models of houses using floorplans and photos.
Thumb Index 1000 (TI1K) is a dataset of 1000 hand images with the hand bounding box, and thumb and index fingertip positions. The dataset includes the natural movement of the thumb and index fingers making it suitable for mixed reality (MR) applications.
This collection includes datasets from 20 subjects with primary newly diagnosed glioblastoma who were treated with surgery and standard concomitant chemo-radiation therapy (CRT) followed by adjuvant chemotherapy. Two MRI exams are included for each patient: within 90 days following CRT completion and at progression (determined clinically, and based on a combination of clinical performance and/or imaging findings, and punctuated by a change in treatment or intervention). All image sets are in DICOM format and contain T1w (pre and post-contrast agent), FLAIR, T2w, ADC, normalized cerebral blood flow, normalized relative cerebral blood volume, standardized relative cerebral blood volume, and binary tumor masks (generated using T1w images). The perfusion images were generated from dynamic susceptibility contrast (GRE-EPI DSC) imaging following a preload of contrast agent. All of the series are co-registered with the T1+C images. The intent of this dataset is for assessing deep learnin
Python Programming Puzzles (P3) is an open-source dataset where each puzzle is defined by a short Python program , and the goal is to find an input which makes output "True". The puzzles are objective in that each one is specified entirely by the source code of its verifier, so evaluating is all that is needed to test a candidate solution. They do not require an answer key or input/output examples, nor do they depend on natural language understanding.
Titanic Dataset Description Overview The data is divided into two groups: - Training set (train.csv): Used to build machine learning models. It includes the outcome (also called the "ground truth") for each passenger, allowing models to predict survival based on “features” like gender and class. Feature engineering can also be applied to create new features. - Test set (test.csv): Used to evaluate model performance on unseen data. The ground truth is not provided; the task is to predict survival for each passenger in the test set using the trained model.
This dataset contains 510 focal stacks (49 different focal distances) from in-the-wild scenes with calculated depth from SFM. This dataset was designed for research on Autofocus but can be used for any research which is interested in focal stacks, defocus cues, or depth signals (particularly for interest in close depth).
RyanSpeech is a speech corpus for research on automated text-to-speech (TTS) systems. This dataset contains textual materials from real-world conversational settings. These materials contain over 10 hours of a professional male voice actor's speech recorded at 44.1 kHz.
DisKnE is a benchmark for Disease Knowledge Evaluation built from MedNLI and MEDIQA-NLI. This benchmark is constructed to specifically test the medical reasoning capabilities of ML models, such as mapping symptoms to diseases.