Datasets

19,997 machine learning datasets


BMELD

BMELD is a bilingual (English-Chinese) dialogue corpus for neural chat translation.

The Boston Housing Dataset

This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It was obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston) and has been used extensively throughout the literature to benchmark algorithms. However, these comparisons were primarily done outside of Delve and are thus somewhat suspect. The dataset is small, with only 506 cases.

6 papers | 0 benchmarks

SPACE

SPACE is a simulator for physical interactions and causal learning in 3D environments. The SPACE simulator is used to generate the SPACE dataset, a synthetic video dataset in a 3D environment, to systematically evaluate physics-based models on a range of physical causal reasoning tasks. Inspired by daily object interactions, the SPACE dataset comprises videos depicting three types of physical events: containment, stability and contact.

6 papers | 0 benchmarks | 3D, Environment

SWSR (Sina Weibo Sexism Review)

The Sina Weibo Sexism Review (SWSR) dataset is a dataset for research on online sexism in Chinese. The SWSR dataset provides labels at different levels of granularity, including (i) sexism or non-sexism, (ii) sexism category and (iii) target type, which can be used, among other purposes, to build computational methods to identify and investigate finer-grained gender-related abusive language.

6 papers | 0 benchmarks | Texts

HiXray

HiXray is a high-quality X-ray security inspection image dataset, which contains 102,928 common prohibited items of 8 categories. It has been gathered from real-world airport security inspections and annotated by professional security inspectors.

6 papers | 0 benchmarks | Hyperspectral images

Lyra

Lyra is a dataset for code generation that consists of Python code with embedded SQL. This dataset contains 2,000 carefully annotated database manipulation programs from real-usage projects. Each program is paired with both a Chinese comment and an English comment.

6 papers | 0 benchmarks | Texts

BSARD (Belgian Statutory Article Retrieval Dataset)

The Belgian Statutory Article Retrieval Dataset (BSARD) is a French native corpus for studying statutory article retrieval. BSARD consists of more than 22,600 statutory articles from Belgian law and about 1,100 legal questions posed by Belgian citizens and labeled by experienced jurists with relevant articles from the corpus.

6 papers | 3 benchmarks | Texts

ReadingBank

ReadingBank is a benchmark dataset for reading order detection built with weak supervision from Word documents. It contains 500K document images covering a wide range of document types, along with the corresponding reading order information.

6 papers | 2 benchmarks

LiDAR-MOS (LiDAR-based Moving Object Segmentation)

Tasks. In moving object segmentation of point cloud sequences, one has to provide motion labels for each point of the test sequences 11-21. Therefore, the input to all evaluated methods is a list of coordinates of the three-dimensional points along with their remission, i.e., the strength of the reflected laser beam which depends on the properties of the surface that was hit. Each method should then output a label for each point of a scan, i.e., one full turn of the rotating LiDAR sensor. Here, we only distinguish between static and moving object classes.

6 papers | 0 benchmarks | 3D, LiDAR, Point cloud
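
The input/output contract described for LiDAR-MOS can be illustrated with a minimal Python sketch. It is not the benchmark's own tooling: the `LidarPoint` type, the `STATIC`/`MOVING` label values, the `label_scan` helper and the toy remission threshold are all hypothetical names chosen for illustration. It only shows that a method receives one scan of points (coordinates plus remission) and returns exactly one static-or-moving label per point.

```python
from typing import Callable, List, NamedTuple


class LidarPoint(NamedTuple):
    """One LiDAR return: 3D coordinates plus remission (hypothetical type)."""
    x: float
    y: float
    z: float
    remission: float  # strength of the reflected laser beam


# Hypothetical label values; the real benchmark defines its own label ids.
STATIC, MOVING = 0, 1


def label_scan(scan: List[LidarPoint],
               is_moving: Callable[[LidarPoint], bool]) -> List[int]:
    """Return one label per point of a scan, in the same order as the input."""
    return [MOVING if is_moving(p) else STATIC for p in scan]


# Toy scan with two points. The threshold rule below is a placeholder,
# not a real motion-segmentation method.
scan = [LidarPoint(1.0, 2.0, 0.5, 0.30), LidarPoint(4.0, -1.0, 0.2, 0.75)]
labels = label_scan(scan, is_moving=lambda p: p.remission > 0.5)
```

A real method would replace the placeholder predicate with a model that reasons over consecutive scans, since motion cannot be inferred from a single point in isolation.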

MLFW (Masked LFW)

The Masked LFW (MLFW), based on the Cross-Age LFW (CALFW) database, is built using a simple but effective tool that generates masked faces from unmasked faces automatically.

6 papers | 6 benchmarks

GD-VCR

Geo-Diverse Visual Commonsense Reasoning (GD-VCR) is a new dataset to test vision-and-language models' ability to understand cultural and geo-location-specific commonsense.

6 papers | 2 benchmarks | Images, Texts

ISIC 2020 Challenge Dataset (Official dataset of the SIIM-ISIC Melanoma Classification Challenge 2020)

The dataset contains 33,126 dermoscopic training images of unique benign and malignant skin lesions from over 2,000 patients. Each image is associated with one of these individuals using a unique patient identifier. All malignant diagnoses have been confirmed via histopathology, and benign diagnoses have been confirmed using either expert agreement, longitudinal follow-up, or histopathology. A thorough publication describing all features of this dataset is available in the form of a pre-print that has not yet undergone peer review.

6 papers | 2 benchmarks

CAMELS Multifield Dataset

The CAMELS Multifield Dataset (CMD) is a publicly available collection of hundreds of thousands of 2D maps and 3D grids containing different properties of the gas, dark matter, and stars from more than 2,000 different universes. The data has been generated from thousands of state-of-the-art (magneto-)hydrodynamic and gravity-only N-body simulations from the CAMELS project.

6 papers | 0 benchmarks | 3D, Images, Physics

safe-control-gym

safe-control-gym is an open-source benchmark suite that extends OpenAI's Gym API with the ability to (i) specify (and query) symbolic models and constraints and (ii) introduce simulated disturbances in the control inputs, measurements, and inertial properties. We provide implementations for three dynamic systems (the cart-pole and the 1D and 2D quadrotors) and two control tasks (stabilization and trajectory tracking).

6 papers | 0 benchmarks | Environment

FusedChat

FusedChat is an inter-mode dialogue dataset. It contains dialogue sessions fusing task-oriented dialogues (TOD) and open-domain dialogues (ODD). Based on MultiWOZ, FusedChat appends or prepends an ODD to every existing TOD. See more details in the paper.

6 papers | 44 benchmarks | Texts
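
The append-or-prepend construction described for FusedChat can be sketched in a few lines of Python. This is an illustration only, not the dataset's actual build script: the `fuse_dialogues` function, the `mode` argument and the example turns are hypothetical, and each dialogue is assumed to be a plain list of utterance strings.

```python
from typing import List


def fuse_dialogues(tod: List[str], odd: List[str], mode: str) -> List[str]:
    """Fuse a task-oriented dialogue (TOD) with an open-domain dialogue (ODD).

    mode="prepend" places the ODD turns before the TOD turns;
    mode="append" places them after. Neither input list is modified.
    """
    if mode == "prepend":
        return odd + tod
    if mode == "append":
        return tod + odd
    raise ValueError(f"unknown mode: {mode!r}")


# Made-up example turns, not taken from FusedChat or MultiWOZ.
tod_turns = ["I need a hotel in the city centre.", "I have booked one for you."]
odd_turns = ["I just got back from a long trip.", "That sounds tiring!"]

prepended = fuse_dialogues(tod_turns, odd_turns, mode="prepend")
appended = fuse_dialogues(tod_turns, odd_turns, mode="append")
```

Either way, the fused session contains both modes, so a model evaluated on it has to handle the switch between open-domain chat and task-oriented turns.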

MOLD (Marathi Offensive Language Dataset)

MOLD is a Marathi dataset for offensive language identification.

6 papers | 0 benchmarks | Texts

Galaxy Zoo DECaLS

Approximately 300,000 images of galaxies labelled by shape.

6 papers | 0 benchmarks | Images

FMFCC-A

FMFCC-A is a large, publicly available Mandarin dataset for synthetic speech detection. It contains 40,000 synthesized Mandarin utterances generated by 11 Mandarin TTS systems and two Mandarin VC systems, and 10,000 genuine Mandarin utterances collected from 58 speakers. The FMFCC-A dataset is divided into training, development and evaluation sets, which are used for research on detecting synthesized Mandarin speech under various previously unknown speech synthesis systems or audio post-processing operations.

6 papers | 0 benchmarks | Speech

CoDa (The Color Dataset)

The Color Dataset (CoDa) is a probing dataset to evaluate the representation of visual properties in language models. CoDa consists of color distributions for 521 common objects, which are split into 3 groups: Single, Multi, and Any.

6 papers | 0 benchmarks | Texts

Modern Office-31

Modern Office-31 is a refurbished version of the commonly used Office-31 dataset. Modern Office-31 rectifies many of the annotation errors and low-quality images in the Amazon domain of the original Office-31 dataset. Additionally, this dataset adds another synthetic domain based on the Adaptiope dataset.

6 papers | 0 benchmarks | Images
