19,997 machine learning datasets
Alibaba Cluster Trace captures detailed statistics for co-located long-running and batch workloads over the course of 24 hours. The trace consists of three parts: (1) statistics of the studied homogeneous cluster of 1,313 machines, including each machine’s hardware configuration and its runtime CPU, memory, and disk usage for a duration of 12 hours (the second half of the 24-hour period); (2) long-running job workloads, including a trace of all container deployment requests and actions, and a resource usage trace covering 12 hours; (3) co-located batch job workloads, including a trace of all batch job requests and actions, and a trace of per-instance resource usage over the full 24 hours.
The TON_IoT datasets are new generations of Internet of Things (IoT) and Industrial IoT (IIoT) datasets for evaluating the fidelity and efficiency of different cybersecurity applications based on Artificial Intelligence (AI). The datasets are called ‘ToN_IoT’ because they include heterogeneous data sources: telemetry datasets of IoT and IIoT sensors, operating system datasets of Windows 7 and 10 as well as Ubuntu 14 and 18 LTS, and network traffic datasets. The data were collected from a realistic, large-scale network designed at the IoT Lab of UNSW Canberra Cyber, School of Engineering and Information Technology (SEIT), UNSW Canberra @ the Australian Defence Force Academy (ADFA).
GermanQuAD is a Question Answering (QA) dataset of 13,722 extractive question/answer pairs in German.
Ghera is a repository of Android app vulnerabilities.
A cross-source point cloud dataset for the registration task. It includes point clouds from structure from motion (SfM), Kinect, and LiDAR.
Scribble is a new outline dataset consisting of 200 images (150 train, 50 test) for each of 10 classes – basketball, chicken, cookie, cupcake, moon, orange, soccer, strawberry, watermelon, and pineapple. All images have a white background and were collected using search keywords on popular search engines. For each image, we obtain a rough outline: we threshold the image into black and white, find the largest blob, fill the blob's interior holes, and smooth the resulting outline with a Savitzky-Golay filter.
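A minimal sketch of that outline-extraction pipeline in Python is shown below. It is an illustration rather than the authors' released code: the function name, the Otsu threshold, and the smoothing window are all assumptions.

```python
import cv2
import numpy as np
from scipy.signal import savgol_filter

def extract_outline(image_path, window=21, polyorder=3):
    """Threshold an image, keep the largest blob, fill its holes,
    and return a Savitzky-Golay-smoothed outline (plus a filled mask)."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # White background: invert so the object becomes the foreground blob.
    # (Otsu thresholding is an assumption; the description just says "thresholding".)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Keep only the largest blob via its external contour.
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    largest = max(contours, key=cv2.contourArea)
    # Filling the external contour also fills any interior holes.
    mask = np.zeros_like(binary)
    cv2.drawContours(mask, [largest], -1, 255, thickness=cv2.FILLED)
    # Smooth the closed (x, y) contour coordinates; window must be odd.
    pts = largest[:, 0, :].astype(float)
    xs = savgol_filter(pts[:, 0], window, polyorder, mode="wrap")
    ys = savgol_filter(pts[:, 1], window, polyorder, mode="wrap")
    return np.stack([xs, ys], axis=1), mask
```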
This dataset targets the License Plate Character Segmentation (LPCS) problem. The experimental results of the paper Benchmark for License Plate Character Segmentation were obtained using this dataset, which contains 101 on-track vehicles captured during the day. The video was recorded with a static camera in early 2015.
CAMO++ is a dataset for camouflaged object segmentation. It substantially increases the number of images and provides hierarchical pixel-wise ground truths. The authors also provide a benchmark suite for the task of camouflaged instance segmentation.
We learn high-fidelity human depths by leveraging a collection of social media dance videos scraped from the TikTok mobile app, by far one of the most popular video sharing applications across generations, whose short videos (10-15 seconds) feature diverse dance challenges. From TikTok dance challenge compilations we manually select, across months, varieties, and types of dance, more than 300 videos that capture a single person performing moderate dance moves that do not generate excessive motion blur. For each video, we extract RGB images at 30 frames per second, resulting in more than 100K images. We segment these images using the Remove.bg application and compute UV coordinates with DensePose; a sketch of the frame-extraction step follows.
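As a rough illustration of that extraction step, the snippet below samples frames from a video at approximately 30 fps with OpenCV. The function name and paths are hypothetical; the Remove.bg segmentation and DensePose UV computation require their own external tools and are not shown.

```python
import cv2
from pathlib import Path

def extract_frames(video_path, out_dir, target_fps=30):
    """Sample frames from a video at roughly target_fps and save them as PNGs."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(str(video_path))
    src_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps  # fall back if FPS metadata is missing
    step = max(1, round(src_fps / target_fps))         # keep every step-th frame
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(str(out / f"frame_{saved:06d}.png"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```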
DUO is a dataset for underwater object detection in robot-picking scenarios. The dataset contains a collection of diverse underwater images with more rational annotations.
CHIP Semantic Textual Similarity, a dataset for sentence similarity in the non-i.i.d. (non-independent and identically distributed) setting, is used for the CHIP-STS task. Specifically, the task targets transfer learning across disease types on Chinese disease question-and-answer data: given question pairs related to 5 different diseases (the disease types in the training and test sets are disjoint), determine whether the semantics of the two sentences are similar.
CHIP Clinical Diagnosis Normalization, a dataset that aims to standardize terms from the final diagnoses of Chinese electronic medical records, is used for the CHIP-CDN task. Given an original diagnosis phrase, the task is to normalize it to standard terminology based on the International Classification of Diseases (ICD-10) standard, Beijing Clinical Edition v601.
The standard digital image database of chest images with and without lung nodules (JSRT database) was created by the Japanese Society of Radiological Technology (JSRT) in cooperation with the Japanese Radiological Society (JRS) in 1998. Since then, the JSRT database has been used by researchers around the world for various purposes such as image processing, image compression, evaluation of image display, computer-aided diagnosis (CAD), picture archiving and communication systems (PACS), and training and testing.
DocNLI is a large-scale dataset for document-level NLI. DocNLI is transformed from a broad range of NLP problems and covers multiple genres of text. The premises always stay at document granularity, whereas the hypotheses vary in length from single sentences to passages of hundreds of words. Additionally, DocNLI contains very few of the annotation artifacts that unfortunately pervade some popular sentence-level NLI datasets.
DISC21 is a benchmark for large-scale image similarity detection, used for the Image Similarity Challenge at NeurIPS'21 (ISC2021). The goal is to determine whether a query image is a modified copy of any image in a reference corpus of 1 million images. The benchmark features a variety of image transformations, including automated transformations, hand-crafted edits, and machine-learning-based manipulations, mimicking real-life cases on social media, for example integrity problems involving misinformation and objectionable content. The strength of the manipulations, and therefore the difficulty of the benchmark, is calibrated against the performance of a set of baseline approaches. Both the query and reference sets consist mostly of "distractor" images that do not match, which corresponds to a real-life needle-in-a-haystack setting, and the evaluation metric reflects that.
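To make the needle-in-a-haystack setting concrete, here is a minimal matching sketch over precomputed image embeddings. The embeddings, threshold, and names are illustrative assumptions; actual challenge submissions used learned descriptors with large-scale nearest-neighbor search.

```python
import numpy as np

def match_queries(query_emb, ref_emb, threshold=0.5):
    """For each query embedding, return (ref_index, score) for its nearest
    reference by cosine similarity, or (None, score) when the best score
    falls below the threshold (i.e., the query is judged a distractor)."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sims = q @ r.T                              # pairwise cosine similarities
    best = sims.argmax(axis=1)                  # nearest reference per query
    scores = sims[np.arange(len(q)), best]
    return [(int(i), float(s)) if s >= threshold else (None, float(s))
            for i, s in zip(best, scores)]
```

Because most queries have no true match, the threshold controls the precision/recall trade-off that the challenge's evaluation metric summarizes.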
WiC (the Word-in-Context dataset) is a reliable benchmark for the evaluation of context-sensitive word embeddings.
XAI-Bench is a suite of synthetic datasets along with a library for benchmarking feature attribution algorithms. Unlike real-world datasets, synthetic datasets allow the efficient computation of conditional expected values that are needed to evaluate ground-truth Shapley values and other metrics. The synthetic datasets released offer a wide variety of parameters that can be configured to simulate real-world data.
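For intuition, ground-truth Shapley values for a small model can be computed by brute-force enumeration of feature coalitions, as in the sketch below. This is not the XAI-Bench API: it estimates coalition values with a marginal (interventional) expectation over background samples, which matches the conditional expectation only when features are independent, whereas XAI-Bench's known synthetic distributions let those conditional expectations be computed exactly.

```python
import itertools
import math
import numpy as np

def exact_shapley(model, x, background):
    """Brute-force Shapley values for one instance x.

    A coalition S is valued as the mean of model(z) over background samples z
    whose features in S are overwritten with x's values (a marginal expectation,
    exact only under feature independence)."""
    n = len(x)

    def value(subset):
        data = background.copy()
        if subset:
            data[:, list(subset)] = x[list(subset)]
        return model(data).mean()

    phi = np.zeros(n)
    for i in range(n):
        rest = [j for j in range(n) if j != i]
        for k in range(n):
            for S in itertools.combinations(rest, k):
                # Standard Shapley weight |S|! (n - |S| - 1)! / n!
                w = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
                phi[i] += w * (value(S + (i,)) - value(S))
    return phi

# Sanity check on a linear model with independent Gaussian features,
# where the exact Shapley values are w_i * (x_i - E[X_i]).
rng = np.random.default_rng(0)
w_lin = np.array([1.0, -2.0, 0.5])
background = rng.normal(size=(10_000, 3))
x = np.array([0.3, -1.2, 2.0])
print(exact_shapley(lambda X: X @ w_lin, x, background))
print(w_lin * (x - background.mean(axis=0)))
```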
PAD (Purpose-driven Affordance Dataset) is a dataset for affordance detection, i.e., identifying the potential action possibilities of objects in an image, an important ability for robot perception and manipulation. The dataset consists of 4K images spanning 31 affordance and 72 object categories.
OpenForensics is a large-scale, highly challenging dataset designed with face-wise rich annotations explicitly for face forgery detection and segmentation. With its rich annotations, the OpenForensics dataset has great potential for research in both deepfake prevention and general human face detection.
SoundingEarth consists of co-located aerial imagery and audio samples from all around the world.