19,997 machine learning datasets
Weibo21 is a benchmark dataset for multi-domain fake news detection (MFND) with annotated domain labels. It consists of 4,488 fake news items and 4,640 real news items from 9 different domains.
CrossMoDA is a large and multi-class benchmark for unsupervised cross-modality Domain Adaptation. The goal of the challenge is to segment two key brain structures involved in the follow-up and treatment planning of vestibular schwannoma (VS): the VS and the cochleas. Currently, the diagnosis and surveillance in patients with VS are commonly performed using contrast-enhanced T1 (ceT1) MR imaging.
The Zillow Indoor Dataset (ZInD) provides extensive visual data that covers a real world distribution of unfurnished residential homes. It consists of primary 360° panoramas with annotated room layouts, windows, doors and openings (W/D/O), merged rooms, secondary localized panoramas, and final 2D floor plans. The figure above illustrates the various representations (from left to right beyond capture): Room layout with W/D/O annotations, merged layouts, 3D textured mesh, and final 2D floor plan.
SeaDronesSee is a large-scale data set aimed at helping develop systems for Search and Rescue (SAR) using Unmanned Aerial Vehicles (UAVs) in maritime scenarios. Building highly complex autonomous UAV systems that aid in SAR missions requires robust computer vision algorithms to detect and track objects or persons of interest. This data set provides three sets of tracks: object detection, single-object tracking and multi-object tracking. Each track consists of its own data set and leaderboard.
The Endomapper dataset is the first collection of complete endoscopy sequences acquired during regular medical practice, including slow and careful screening explorations, making secondary use of medical data. Its original purpose is to facilitate the development and evaluation of VSLAM (Visual Simultaneous Localization and Mapping) methods in real endoscopy data. The first release of the dataset is composed of 50 sequences with a total of more than 13 hours of video. It is also the first endoscopic dataset that includes both the computed geometric and photometric endoscope calibration and the original calibration videos. Meta-data and annotations associated with the dataset range from anatomical landmark and procedure description labeling to tool segmentation masks, COLMAP 3D reconstructions, simulated sequences with ground truth, and meta-data related to special cases, such as sequences from the same patient. This information will improve the research in endoscopic VSLAM, a
The RSBlur dataset provides pairs of real and synthetic blurred images with ground truth sharp images. The dataset enables the evaluation of deblurring methods and blur synthesis methods on real-world blurred images. Training, validation, and test sets consist of 8,878, 1,120, and 3,360 blurred images, respectively.
This is a dataset for the video frame interpolation task. It contains 1920×1080 videos: 240 FPS footage captured with an iPhone 11 and 120 FPS gaming content captured with OBS.
The gender-biased FFHQ dataset (bFFHQ) has age as the target label and gender as a correlated bias; its images are taken from the FFHQ dataset. The training data is dominated by images of young women (i.e., aged 10-29) and old men (i.e., aged 40-59).
Node classification on PubMed with 60%/20%/20% random splits for training/validation/test.
Node classification on Wisconsin with 60%/20%/20% random splits for training/validation/test.
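A 60%/20%/20% random split of node indices, as used in the two node classification entries above, can be sketched as follows. This is a minimal illustration only: the function name `random_split`, the seed, and the node count of 1,000 are hypothetical and are not taken from either benchmark, and the listing does not say how the original splits were generated.

```python
import random


def random_split(num_nodes, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle node indices and cut them into train/validation/test parts.

    Hypothetical helper: the test part receives whatever remains after the
    train and validation parts are taken.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    indices = list(range(num_nodes))
    rng.shuffle(indices)

    n_train = int(train_frac * num_nodes)
    n_val = int(val_frac * num_nodes)

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


# Illustrative node count (an assumption, not the size of PubMed or Wisconsin).
train_idx, val_idx, test_idx = random_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Reported results on such benchmarks are typically averaged over several random splits, so in practice the helper would be called with a different seed for each run.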
The dataset was created for the video quality assessment problem. It consists of 36 clips from Vimeo, selected from 18,000+ open-source clips with high bitrate (CC BY or CC0 license).
BURST is a benchmark suite built upon TAO that requires tracking and segmenting multiple objects from camera video. The benchmark contains 6 different sub-tasks divided into 2 groups that all share the same data for training/validation/testing.
CMB is a comprehensive, multi-level Medical Benchmark in Chinese. It encompasses 280,839 multiple-choice questions and 74 complex case consultation questions, covering all clinical medical specialties and various professional levels. The platform aims to holistically evaluate a model's medical knowledge and clinical consultation capabilities.
The heavily occluded scene text (HOST) dataset contains images of text with occlusions. It is used to improve the recognition performance of occluded text in machine vision applications. The dataset is composed of 4,832 images that are manually occluded to a weak or heavy degree.
A holistic approach to cross-channel image noise modeling and its application to image denoising
SWE-bench is a dataset that tests systems’ ability to solve GitHub issues automatically. The dataset collects 2,294 Issue-Pull Request pairs from 12 popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution.
Regression dataset for molecular docking scores (predicted molecule-protein binding affinity). Contains ~250,000 molecules against 58 protein targets.
ODEX is an open-domain 📖, multilingual 🌍, execution-based 🛠 natural language to code generation 💻 data benchmark. ODEX has 945 NL-Code pairs spanning 79 diverse libraries, along with 1,707 human-written test cases for execution. The NL-Code pairs are harvested from StackOverflow forums to encourage natural and practical coding queries. Moreover, ODEX supports four natural languages as intents: English, español (Spanish), 日本語 (Japanese), and Русский (Russian).
The AggreFact dataset is a benchmark for evaluating the factuality of summaries generated by different summarization models. It aggregates factuality error annotations from nine existing datasets and stratifies them according to the underlying summarization model.
The KinFaceW-II dataset consists of 1,000 pairs of facial images of individuals with a kin relation. It considers four common kin relations: father and daughter (F-D), father and son (F-S), mother and daughter (M-D), and mother and son (M-S). Unlike the KinFaceW-I database, the positive pairs in this dataset are taken from the same photo.