Datasets

19,997 machine learning datasets


satnet-sudoku (SATNet's Sudoku training test)

A set of easy Sudoku instances used in the SATNet paper to train SATNet to solve Sudoku.

rrn-sudoku (RRN sudoku instances dataset)

A set of 180,000 Sudoku grids with a variable number of hints from the minimal number of 17 (extremely hard instances) to 34 (easy instances), with 10,000 instances per level of hardness.
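
The stated total can be checked against the per-level count. The short sketch below assumes one hardness level per hint count from 17 to 34 inclusive; the variable names are illustrative and not part of any dataset tooling.

```python
# Figures taken from the description above.
min_hints = 17                # hardest instances
max_hints = 34                # easiest instances
instances_per_level = 10_000

# Assumption: one hardness level per hint count, both endpoints included.
hint_levels = list(range(min_hints, max_hints + 1))
total_grids = len(hint_levels) * instances_per_level

print(len(hint_levels), total_grids)
```

Under that assumption there are 18 levels, which matches the stated 180,000 grids.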

2 papers · 0 benchmarks · Texts

many-solutions-sudoku (Dataset of Sudoku grids with more than one solution)

A data set of Sudoku grids with more than one solution.

2 papers · 0 benchmarks · Texts

Protein structures Ingraham (Dataset of protein backbones and sequences)

A data set introduced for training on the protein design task.

2 papers · 0 benchmarks · 3D

MeDAL Retina Dataset (MeDAL Retina Dataset)

Our primary objective in creating this dataset is to support researchers in advancing algorithms for keypoint detection and in pretraining large models on retinal images using a self-supervised approach. The keypoints in the dataset were carefully annotated by students from our lab.

2 papers · 0 benchmarks · Images, Medical

TIE (https://github.com/raianand1991/TIE)

The TIE (Technical Indian English) dataset is a massive speech dataset of ~750 GB, consisting of ~9.8K technical lectures in English, along with their transcripts. The lectures were delivered by instructors from all over India and were sourced from the NPTEL website.

2 papers · 0 benchmarks

Rosario Dataset (The Rosario dataset: Multisensor data for localization and mapping in agricultural environments)

Agricultural dataset collected on board our weed-removing robot. The dataset is composed of six different sequences recorded in a soybean field and contains stereo images, IMU measurements, wheel odometry, and GPS-RTK (positional ground truth).

2 papers · 0 benchmarks

SK-VG

SK-VG is a dataset for Scene Knowledge-guided Visual Grounding, where the image content and referring expressions are not sufficient to ground the target objects, requiring models to reason over long-form scene knowledge. To support this task, SK-VG is the first dataset of the fourth type: for each image, we provide human-written knowledge describing its content.

2 papers · 0 benchmarks · Images

CIC IoT Dataset 2022

This project aims to generate a state-of-the-art dataset for profiling, behavioural analysis, and vulnerability testing of different IoT devices with different protocols such as IEEE 802.11, Zigbee-based, and Z-Wave.

2 papers · 0 benchmarks

IoT Traffic Traces (IoT Traffic Traces - Data Collected for IEEE TMC 2018)

IoT traffic traces: data collected for IEEE TMC 2018.

2 papers · 0 benchmarks

IoT Benign and Attack Traces (IoT Benign and Attack Traces - Data Collected for ACM SOSR 2019)

IoT benign and attack traces.

2 papers · 0 benchmarks

Dataset of UAI 2021 Paper "An Unsupervised Video Game Playstyle Metric via State Discretization"

This is part of the dataset for the paper published at UAI 2021 (37th Conference on Uncertainty in Artificial Intelligence).

2 papers · 0 benchmarks

SKILL-102 (SKILL 102 Lifelong Learning Dataset)

SKILL-102 consists of 102 image classification datasets. Each one supports one complex classification task, and the corresponding dataset was obtained from previously published sources (e.g., task 1: classify flowers into 102 classes, such as lily, rose, petunia, etc., using 8,185 train/val/test images (Nilsback & Zisserman, 2008a); task 2: classify 67 types of scenes, such as kitchen, bedroom, gas station, library, etc., using 15,524 images (Quattoni & Torralba, 2009)). In total, SKILL-102 comprises 102 tasks, 5,033 classes, and 2,041,225 training images. To the best of our knowledge, SKILL-102 is the most challenging completely real (not synthesized or permuted) image classification benchmark for LL and SKILL algorithms, with the largest number of tasks, number of classes, and inter-task variance.

2 papers · 0 benchmarks · Images

Marmoset-8K (DeepLabCut multi-animal Marmoset dataset)

All animal procedures are overseen by veterinary staff of the MIT and Broad Institute Department of Comparative Medicine, in compliance with the NIH guide for the care and use of laboratory animals and approved by the MIT and Broad Institute animal care and use committees. Video of common marmosets (Callithrix jacchus) was collected in the laboratory of Guoping Feng at MIT. Marmosets were recorded using Kinect V2 cameras (Microsoft) with a resolution of 1080p and a frame rate of 30 Hz. After acquisition, images to be used for training the network were manually cropped to 1000 x 1000 pixels or smaller. The dataset comprises 7,600 labeled frames from 40 different marmosets collected from 3 different colonies (in different facilities). Each cage contained a pair of marmosets, one of which had light blue dye applied to its tufts. One human annotator labeled the 15 marker points on each animal present in the frame (frames contained either 1 or 2 animals).

2 papers · 4 benchmarks

Fish-100

Schools of inland silversides (Menidia beryllina, n=14 individuals per school) were recorded in the Lauder Lab at Harvard University while swimming at 15 speeds (0.5 to 8 BL/s, where BL is body length, at 0.5 BL/s intervals) in a flow tank with a total working section of 28 x 28 x 40 cm as described in previous work, at a constant temperature (18±1°C) and salinity (33 ppt), at a Reynolds number of approximately 10,000 (based on BL). Dorsal views of steady swimming across these speeds were recorded by high-speed video cameras (FASTCAM Mini AX50, Photron USA, San Diego, CA, USA) at 60 to 125 frames per second (feeding videos at 60 fps, swimming alone at 125 fps). The dorsal view was recorded from above the swim tunnel, and a floating Plexiglas panel at the water surface prevented surface ripples from interfering with the dorsal-view videos. Five keypoints were labeled (tip, gill, peduncle, dorsal fin tip, caudal tip). 100 frames were labeled, making this a real-world-sized laboratory dataset.

2 papers · 4 benchmarks · Images

DEplain-APA-sent

DEplain-APA-sent: A German Parallel Corpus for Sentence Simplification on News Texts. DEplain is a new dataset of parallel, professionally written and manually aligned simplifications in plain German (“plain DE”; in German: “Einfache Sprache”). DEplain consists of four main subcorpora: DEplain-APA-doc, DEplain-APA-sent, DEplain-web-doc, and DEplain-web-sent.

2 papers · 4 benchmarks · Texts

DEplain-web-sent

DEplain-web-sent: A German Parallel Corpus for Sentence Simplification on Web Texts. DEplain is a new dataset of parallel, professionally written and manually aligned simplifications in plain German (“plain DE”; in German: “Einfache Sprache”). DEplain consists of four main subcorpora: DEplain-APA-doc, DEplain-APA-sent, DEplain-web-doc, and DEplain-web-sent.

2 papers · 4 benchmarks · Texts

BioVid (BioVid Heat Pain Database)

To advance methods for pain assessment, in particular automatic assessment methods, the BioVid Heat Pain Database was collected in a collaboration between the Neuro-Information Technology group of the University of Magdeburg and the Medical Psychology group of the University of Ulm. In our study, 90 participants were subjected to experimentally induced heat pain at four intensities. To compensate for varying heat pain sensitivities, the stimulation temperatures were adjusted based on the subject-specific pain threshold and pain tolerance. Each of the four pain levels was applied 20 times in randomized order. For each stimulus, the maximum temperature was held for 4 seconds. The pauses between stimuli were randomized between 8 and 12 seconds. The pain stimulation experiment was conducted twice: once with the face unoccluded and once with facial EMG sensors.

2 papers · 0 benchmarks · Biomedical, Medical, Videos

FinBench

FinBench is a benchmark for evaluating the performance of machine learning models with both tabular data inputs and profile text inputs.

2 papers · 0 benchmarks · Tabular, Texts

ccHarmony

ccHarmony is a color checker (cc) based image harmonization dataset. The dataset contains 350 real images and 426 segmented foregrounds, in which each real image has one or two segmented foregrounds. Each foreground is associated with 10 synthetic composite images. Therefore, our dataset has in total 4260 pairs of synthetic composite images and ground-truth real images. We split all pairs into 3080 training pairs and 1180 test pairs.
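
The pair counts in this description follow from simple arithmetic, which the sketch below reproduces. The variable names are illustrative only, and the train fraction is a derived figure that the description does not state.

```python
# Figures taken from the description above.
num_foregrounds = 426
composites_per_foreground = 10
train_pairs = 3080
test_pairs = 1180

# Each foreground yields 10 synthetic composites, each paired with a real image.
total_pairs = num_foregrounds * composites_per_foreground

# The train/test split should account for every pair.
assert total_pairs == train_pairs + test_pairs

# Derived figure (not stated in the description): share of pairs used for training.
train_fraction = train_pairs / total_pairs
print(total_pairs, round(train_fraction, 3))
```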

2 papers · 0 benchmarks · Images
