19,997 machine learning datasets
19,997 dataset results
The quality of AI-generated images has rapidly increased, leading to concerns of authenticity and trustworthiness.
MIMIC-CXR-LT. We construct a single-label, long-tailed version of MIMIC-CXR in a similar manner. MIMIC-CXR is a multi-label classification dataset with over 200,000 chest X-rays labeled with 13 pathologies and a “No Findings” class. The resulting MIMIC-CXR-LT dataset contains 19 classes, of which 10 are head classes, 6 are medium classes, and 3 are tail classes. MIMIC-CXR-LT contains 111,792 images labeled with one of 18 diseases, with 87,493 training images and 23,550 test set images. The validation and balanced test sets contain 15 and 30 images per class, respectively.
Bike flow data of New York City with grid 16x8.
Bike flow data of New York City.
ARMBench is a large-scale, object-centric benchmark dataset for robotic manipulation in the context of a warehouse. ARMBench contains images, videos, and metadata that corresponds to 235K+ pick-and-place activities on 190K+ unique objects. The data is captured at different stages of manipulation, i.e., pre-pick, during transfer, and after placement.
Extension of the PASTIS benchmark with radar and optical image time series.
The Edinburgh International Accents of English Corpus (EdAcc) is a new automatic speech recognition (ASR) dataset composed of 40 hours of English dyadic conversations between speakers with a diverse set of accents. EdAcc includes a wide range of first and second-language varieties of English and a linguistic background profile of each speaker.
Accurate 3D human pose estimation is essential for sports analytics, coaching, and injury prevention. However, existing datasets for monocular pose estimation do not adequately capture the challenging and dynamic nature of sports movements. In response, we introduce SportsPose, a large-scale 3D human pose dataset consisting of highly dynamic sports movements. With more than 176,000 3D poses from 24 different subjects performing 5 different sports activities, SportsPose provides a diverse and comprehensive set of 3D poses that reflect the complex and dynamic nature of sports movements. Contrary to other markerless datasets we have quantitatively evaluated the precision of SportsPose by comparing our poses with a commercial marker-based system and achieve a mean error of 34.5 mm across all evaluation sequences. This is comparable to the error reported on the commonly used 3DPW dataset. We further introduce a new metric, local movement, which describes the movement of the wrist and ankle
Databricks Dolly 15k is a dataset containing 15,000 high-quality human-generated prompt / response pairs specifically designed for instruction tuning large language models. It is authored by more than 5,000 Databricks employees during March and April of 2023. The training records are natural, expressive and designed to represent a wide range of the behaviors, from brainstorming and content generation to information extraction and summarization.
During the MILAN research project (MachIne Learning for AstroNomy), the research team uses Stellina observation stations to collect raw images of deep sky objects. Thus, the research team built a dataset that represents what can be obtained during classical Electronically Assisted Astronomy sessions in the Luxembourg Greater Region. The dataset is composed of ZIP files -- and each zip file contains raw images in FITS format (Flexible Image Transport System): data comes directly from the Sony IMX178 sensor of the Stellina observation station (no debayerisation and no post processing). Funding: This work was funded by the Luxembourg National Research Fund (FNR -- https://www.fnr.lu/) .
The French National Institute of Geographical and Forest Information (IGN) has the mission to document and measure land-cover on French territory and provides referential geographical datasets, including high-resolution aerial images and topographic maps. The monitoring of land-cover plays a crucial role in land management and planning initiatives, which can have significant socio-economic and environmental impact. Together with remote sensing technologies, artificial intelligence (IA) promises to become a powerful tool in determining land-cover and its evolution. IGN is currently exploring the potential of IA in the production of high-resolution land cover maps. Notably, deep learning methods are employed to obtain a semantic segmentation of aerial images. However, territories as large as France imply heterogeneous contexts: variations in landscapes and image acquisition make it challenging to provide uniform, reliable and accurate results across all of France.
CDS2K is a benchmark for Concealed scene understanding (CSU), which is a hot computer vision topic aiming to perceive objects with camouflaged properties. It is a concealed defect segmentation dataset from the five well-known defect segmentation databases. It contains five sub-databases: MVTecAD, NEU, CrackForest, KolektorSDD, and MagneticTile. The defective regions are highlighted with red rectangles.
The LIMUC dataset is the largest publicly available labeled ulcerative colitis dataset that compromises 11276 images from 564 patients and 1043 colonoscopy procedures. Three experienced gastroenterologists were involved in the annotation process, and all images are labeled according to the Mayo endoscopic score (MES).
SpeechInstruct is a large-scale cross-modal speech instruction dataset. It contains 37,969 quadruplets composed of speech instructions, text instructions, text responses, and speech responses.
Dynamic Human Bodies dataset (DHB), containing 10 point cloud sequences from the MITAMA dataset and 4 from the 8IVFB dataset. The sequences in DHB record 3D human motions with large and non-rigid deformation in real world. The overall dataset contains more than 3000 point cloud frames. And each frame has 1024 points.
SPACE is a large-scale opinion summarization benchmark for the evaluation of unsupervised summarizers. SPACE is built on TripAdvisor hotel reviews and includes a training set of approximately 1.1 million reviews for over 11 thousand hotels.
OVQA contains 19,020 medical visual question and answer pairs generated from 2,001 medical images collected from 2,212 EMRs in Orthopedics.
A Game Of Sorts is a collaborative image ranking task. Players are asked to rank a set of images based on a given sorting criterion. The game provides a framework for the evaluation of visually grounded language understanding and generation of referring expressions in multimodal dialogue settings.
Youku-mPLUG is a large Chinese high-quality video-language dataset which is collected from Youku.com, a well-known Chinese video-sharing website, with strict criteria of safety, diversity, and quality. It contains 10 million video-text pairs for pre-training and 0.3 millon videos for downstream benchmarks covering Video-Text Retrieval, Video Captioning and Video Category Classification.
We generate epistemic reasoning problems using modal logic to target theory of mind (tom) in natural language processing models.