JHU-CROWD++ is a large-scale unconstrained crowd counting dataset with 4,372 images and 1.51 million annotations. The dataset is collected under a variety of diverse scenarios and environmental conditions. In addition, it provides a comparatively richer set of annotations, such as dots, approximate bounding boxes, and blur levels.
The DNS Challenge at INTERSPEECH 2020 was intended to promote collaborative research in single-channel speech enhancement aimed at maximizing the perceptual quality and intelligibility of the enhanced speech. The challenge evaluated speech quality using the online subjective evaluation framework ITU-T P.808, and it provides large datasets for training noise suppressors.
SRD is a dataset for shadow removal that contains 3088 shadow and shadow-free image pairs.
The VOiCES corpus is a dataset to promote speech and signal processing research on speech recorded by far-field microphones in noisy room conditions.
Avenue Dataset contains 16 training and 21 testing video clips. The videos are captured on the CUHK campus avenue, with 30,652 frames in total (15,328 training, 15,324 testing).
This is a synthetic dataset containing 1,000 graphs divided into two classes according to the motif they contain: either a “house” motif or a five-node cycle.
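For reference, the two motifs are standard small graphs. Below is a minimal, hypothetical sketch using networkx; it is not the dataset's actual generation script (in practice such motifs are typically attached to larger base graphs), and only illustrates the two motif classes:

```python
import networkx as nx

def make_example(label):
    """Toy labeled graph: label 1 -> a 'house' motif, label 0 -> a five-node cycle.

    The real dataset embeds such motifs in larger graphs; here we only
    illustrate the two motif classes themselves.
    """
    motif = nx.house_graph() if label == 1 else nx.cycle_graph(5)
    return motif, label

# Toy stand-in for the 1,000-graph, two-class dataset
graphs = [make_example(i % 2) for i in range(10)]
print([(g.number_of_nodes(), g.number_of_edges(), y) for g, y in graphs])
```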
Reddit TIFU is a newly collected Reddit dataset, where TIFU denotes the name of the /r/tifu subreddit. There are 122,933 text-summary pairs in total.
MedleyDB is a dataset of annotated, royalty-free multitrack recordings. It was curated primarily to support research on melody extraction. For each song, melody f₀ annotations are provided, as well as instrument activations for evaluating automatic instrument recognition. The original dataset consists of 122 multitrack songs, of which 108 include melody annotations.
The Question Answering by Search And Reading (QUASAR) benchmark is a large-scale dataset consisting of QUASAR-S and QUASAR-T. Each of these datasets is built to evaluate systems that understand a natural language query, search a large corpus of text, and extract an answer to the question from the corpus. Specifically, QUASAR-S comprises 37,012 fill-in-the-gap questions collected from the popular website Stack Overflow using entity tags. QUASAR-T contains 43,012 open-domain questions collected from various internet sources. The candidate documents for each question in this dataset are retrieved from an Apache Lucene based search engine built on top of the ClueWeb09 dataset.
BillSum is the first dataset for summarization of US Congressional and California state bills.
ECHR is an English legal judgment prediction dataset of cases from the European Court of Human Rights (ECHR). The dataset contains ~11.5k cases, including the raw text.
CityFlow is a city-scale traffic camera dataset consisting of more than 3 hours of synchronized HD videos from 40 cameras across 10 intersections, with the longest distance between two simultaneous cameras being 2.5 km. The dataset contains more than 200K annotated bounding boxes covering a wide range of scenes, viewing angles, vehicle models, and urban traffic flow conditions.
WildDash is a benchmark with an evaluation method that uses meta-information to calculate the robustness of a given algorithm with respect to individual hazards.
UAV-Human is a large dataset for human behavior understanding with UAVs. It contains 67,428 multi-modal video sequences and 119 subjects for action recognition, 22,476 frames for pose estimation, 41,290 frames and 1,144 identities for person re-identification, and 22,263 frames for attribute recognition. The dataset was collected by a flying UAV in multiple urban and rural districts, in both daytime and nighttime, over three months, and thus covers extensive diversity with respect to subjects, backgrounds, illumination, weather, occlusions, camera motion, and UAV flying attitudes. The dataset can be used for UAV-based human behavior understanding, including action recognition, pose estimation, re-identification, and attribute recognition.
Berkeley Deep Drive-X (eXplanation) is a dataset composed of over 77 hours of driving within 6,970 videos. The videos are taken in diverse driving conditions, e.g. day/night, highway/city/countryside, summer/winter, etc. Each video is around 40 seconds long on average and contains around 3-4 actions, e.g. speeding up, slowing down, turning right, etc., all of which are annotated with a description and an explanation. The dataset contains over 26K activities in over 8.4M frames.
Gait3D is a large-scale 3D representation-based gait recognition dataset. It contains 4,000 subjects and over 25,000 sequences extracted from 39 cameras in an unconstrained indoor scene.
UBnormal is a new supervised open-set benchmark composed of multiple virtual scenes for video anomaly detection. Unlike existing datasets, it introduces abnormal events annotated at the pixel level at training time, enabling the use of fully-supervised learning methods for abnormal event detection for the first time. To preserve the typical open-set formulation, the dataset includes disjoint sets of anomaly types in the training and test collections of videos.
MultiCoNER is a large multilingual dataset (11 languages) for Named Entity Recognition. It is designed to represent some of the contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities such as movie titles, and long-tail entity distributions.
The IMPACT dataset contains 50 human-created prompts for each category, 200 in total, to test the general writing ability of LLMs. Instructed LLMs demonstrate promising ability in writing-based tasks, such as composing letters or ethical debates. The dataset consists of prompts across 4 diverse usage scenarios:
- Informative Writing: user queries such as self-help advice or explanations of various concepts
- Professional Writing: suggestions, presentations, or emails in a business setting
- Argumentative Writing: debate positions on ethical and societal questions
- Creative Writing: diverse writing formats such as stories, poems, and songs
The Elephant MIL dataset is a benchmark used in multiple instance learning (MIL), which falls under the broader categories of image classification and content-based image retrieval. The task is to determine if an image contains an elephant. Each image is treated as a "bag," and within each bag, the image is segmented into various regions called "instances," represented by feature vectors that capture visual characteristics like color, texture, and shape. A bag is labeled as positive if at least one instance contains an elephant, and negative if none of the instances do. The dataset includes 200 images (bags) with a total of 1,220 segments (instances), averaging ~6.1 segments per image. The challenge is that only some segments in a positive image might actually show an elephant, so the goal is to correctly classify the entire image based on these segments. This dataset is widely used to evaluate MIL algorithms, especially in cases where only parts of the data might contain the relevant information.
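The bag-level labeling rule described above (positive if at least one instance is positive) is easy to state in code. Below is a minimal, hypothetical sketch of aggregating instance-level scores into a bag-level prediction under the standard MIL assumption; the array shapes, feature dimensionality, and placeholder instance scorer are illustrative assumptions, not the dataset's official format or a reference implementation.

```python
import numpy as np

def bag_label_from_instances(instance_labels):
    """Standard MIL assumption: a bag is positive iff at least one instance is positive."""
    return int(any(instance_labels))

def predict_bag(instance_features, instance_classifier):
    """Aggregate per-instance scores into a bag prediction via max pooling over instances."""
    # instance_features: (n_instances, n_features) array of region descriptors
    # instance_classifier: any callable returning a per-instance score in [0, 1]
    scores = np.array([instance_classifier(x) for x in instance_features])
    return scores.max()  # the bag is as positive as its most positive instance

# Toy usage: ~6 segments per image (as in the dataset), with an assumed feature size
rng = np.random.default_rng(0)
bag = rng.normal(size=(6, 230))                   # hypothetical region feature vectors
clf = lambda x: 1.0 / (1.0 + np.exp(-x.mean()))   # placeholder instance scorer
print(predict_bag(bag, clf))
```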