Datasets

19,997 machine learning datasets

Hansel

Hansel is a human-annotated Chinese entity linking (EL) dataset that focuses on tail entities and emerging entities.

TAS-NIR

TAS-NIR is a VIS+NIR dataset of semantically annotated images in unstructured outdoor environments. It consists of 209 VIS+NIR image pairs with a fine-grained semantic segmentation.

2 papers | 0 benchmarks | Images

JEMMA

JEMMA (an Extensible Java Dataset for ML4Code Applications) is a large-scale dataset targeted at ML4Code. It comes with a considerable amount of pre-processed information, such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties (e.g., metrics, static analysis results), for 50,000 Java projects from the 50KC dataset, with over 1.2 million classes and over 8 million methods.

2 papers | 0 benchmarks | Texts

SPARF

SPARF is a large-scale ShapeNet-based synthetic dataset for novel view synthesis consisting of ~17 million images rendered from nearly 40,000 shapes at high resolution (400×400 pixels).

2 papers | 0 benchmarks | Images

Berlin V2X

The Berlin V2X dataset offers high-resolution GPS-located wireless measurements across diverse urban environments in the city of Berlin for both cellular and sidelink radio access technologies, acquired with up to 4 cars over 3 days. The data thus enables a variety of ML studies on vehicle-to-anything (V2X) communication.

2 papers | 0 benchmarks | Tabular, Time series

PropSegmEnt

PropSegmEnt is a corpus of over 35K propositions annotated by expert human raters. The dataset structure mirrors two tasks: (1) segmenting the sentences within a document into a set of propositions, and (2) classifying the entailment relation of each proposition with respect to a different yet topically aligned document, i.e., a document describing the same event or entity.

2 papers | 0 benchmarks | Texts

Dusha (Dusha Crowd, Dusha Podcast)

Dusha is a dataset for speech emotion recognition (SER) tasks. The corpus contains approximately 350 hours of data: more than 300,000 audio recordings of Russian speech together with their transcripts. It is annotated using a crowd-sourcing platform and includes two subsets: acted and real-life.

2 papers | 0 benchmarks | Audio, Texts

XAlign

XAlign is an extensive, high-quality cross-lingual fact-to-text dataset in 11 languages: Assamese (as), Bengali (bn), Gujarati (gu), Hindi (hi), Kannada (kn), Malayalam (ml), Marathi (mr), Oriya (or), Punjabi (pa), Tamil (ta), and Telugu (te), plus a monolingual dataset in English (en). It is the Wikipedia text <--> Wikidata KG aligned corpus used to train the data-to-text generation model. The train and validation splits are created using distant supervision methods, and the test data is generated through human annotation.

2 papers | 4 benchmarks | Texts

Merger Agreement Understanding Dataset (MAUD)

MAUD is an expert-annotated merger agreement reading comprehension dataset based on the American Bar Association's 2021 Public Target Deal Points study, where lawyers and law students answered 92 questions about 152 merger agreements.

2 papers | 0 benchmarks | Texts

Argoverse 2 Map Change

The Argoverse 2 Map Change Dataset is a collection of 1,000 scenarios with ring camera imagery, lidar, and HD maps. Two hundred of the scenarios include changes in the real-world environment that are not yet reflected in the HD map, such as new crosswalks or repainted lanes. By sharing a map dataset that labels the instances in which there are discrepancies with sensor data, we encourage the development of novel methods for detecting out-of-date map regions.

2 papers | 0 benchmarks | LiDAR, Videos

Causal Triplet

Causal Triplet is a causal representation learning benchmark that features not only visually more complex scenes but also two crucial desiderata commonly overlooked in previous works.

2 papers | 0 benchmarks | Images

VTC (Videos, Titles and Comments)

VTC is a large-scale multimodal dataset containing video-caption pairs (~300k) alongside comments that can be used for multimodal representation learning.

2 papers | 0 benchmarks | Audio, Images, Texts, Videos

ValNov Subtask A

Each conclusion is given two binary labels, one for validity and one for novelty.

2 papers | 12 benchmarks | Texts

High-gamma dataset described in Schirrmeister et al. 2017 (EEG High-Gamma Dataset)

High-gamma dataset described in Schirrmeister et al. 2017.

2 papers | 0 benchmarks | EEG

OLKAVS (An Open Large-Scale Korean Audio-Visual Speech Dataset)

The dataset contains 1,150 hours of transcribed audio from 1,107 Korean speakers in a studio setup with nine different viewpoints and various noise situations. We also provide the pre-trained baseline models for two tasks, audio-visual speech recognition and lip reading.

2 papers | 0 benchmarks

Trinity Speech-Gesture Dataset

The Trinity Speech-Gesture Dataset includes 23 takes, totalling 244 minutes of motion capture and audio of a male native English speaker producing spontaneous speech on different topics. The actor’s motion was captured with 20 Vicon cameras at 59.94 frames per second (fps), and the skeleton includes 69 joints.

2 papers | 5 benchmarks

ASHRAE energy prediction III

Assessing the value of energy efficiency improvements can be challenging, as there is no way to truly know how much energy a building would have used without the improvements. The best we can do is to build counterfactual models. Once a building is overhauled, the new (lower) energy consumption is compared against modeled values for the original building to calculate the savings from the retrofit. More accurate models could support better market incentives and enable lower-cost financing.

2 papers | 0 benchmarks
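
The savings calculation described in the ASHRAE entry above (modeled counterfactual consumption minus metered post-retrofit consumption) can be sketched as follows. This is only an illustrative sketch: the function name `retrofit_savings`, the kWh unit, and the sample readings are assumptions made for the example and are not part of the dataset.

```python
from typing import List, Tuple


def retrofit_savings(
    modeled_baseline_kwh: List[float],
    metered_kwh: List[float],
) -> Tuple[List[float], float]:
    """Return per-period and total savings (counterfactual minus actual).

    modeled_baseline_kwh: what the counterfactual model predicts the
        original, un-retrofitted building would have used in each period.
    metered_kwh: what the retrofitted building actually used.
    """
    if len(modeled_baseline_kwh) != len(metered_kwh):
        raise ValueError("both series must cover the same periods")
    per_period = [m - a for m, a in zip(modeled_baseline_kwh, metered_kwh)]
    return per_period, sum(per_period)


# Hypothetical readings for three periods (not taken from the dataset).
baseline = [120.0, 130.0, 125.0]
actual = [100.0, 105.0, 110.0]
per_period, total = retrofit_savings(baseline, actual)
print(per_period, total)
```

Any error in the modeled baseline carries straight through to the savings estimate, which is why the entry stresses model accuracy.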

ATUE

ATUE is an antibody study benchmark with four real-world supervised tasks covering therapeutic antibody engineering, B cell analysis, and antibody discovery.

2 papers | 0 benchmarks | Biology, Texts

FaceOcc (Face Occlusion Dataset)

FaceOcc is a high-quality face occlusion dataset that contains all mislabeled occlusions in CelebAMask-HQ, supplemented with additional occlusions and textures from the internet. The occlusion types cover sunglasses, spectacles, hands, masks, scarves, microphones, etc.

2 papers | 0 benchmarks | Images

The Copiale Cipher

The Copiale Cipher is a 105-page manuscript containing around 75,000 characters in total. Beautifully bound in green and gold brocade paper and written on high-quality paper with two different watermarks, the manuscript can be dated back to around 1750. Apart from what is obviously an owner's mark (“Philipp 1866”) and a note at the end of the last page (“Copiales 3”), the manuscript is completely encoded. The cipher employed consists of 100 different symbols, ranging from Latin and Greek letters to diacritics and graphic signs such as zodiac and alchemical symbols. Catchwords (preview fragments) of one to three or four characters are written at the bottom of left-hand pages.

2 papers | 0 benchmarks
