Datasets

19,997 machine learning datasets


SYSU-MM01-C

SYSU-MM01-C is an evaluation set that consists of algorithmically generated corruptions applied to the SYSU-MM01 test set. These corruptions consist of Noise: Gaussian, shot, impulse, and speckle; Blur: defocus, frosted glass, motion, zoom, and Gaussian; Weather: snow, frost, fog, brightness, spatter, and rain; Digital: contrast, elastic, pixel, JPEG compression, and saturate. Each of these 20 corruption types has five severity levels, resulting in 100 distinct corruptions.
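The count of 100 follows from 20 corruption types times five severity levels. A minimal Python sketch of that arithmetic is shown here; the dictionary layout, the function name `corruption_grid`, and the 1 to 5 numbering of severity levels are assumptions for illustration and are not taken from the dataset's own tooling.

```python
from itertools import product

# Corruption families and types as listed in the description above.
CORRUPTIONS = {
    "Noise": ["Gaussian", "shot", "impulse", "speckle"],
    "Blur": ["defocus", "frosted glass", "motion", "zoom", "Gaussian"],
    "Weather": ["snow", "frost", "fog", "brightness", "spatter", "rain"],
    "Digital": ["contrast", "elastic", "pixel", "JPEG compression", "saturate"],
}

# Assumption: the five severity levels are numbered 1 through 5.
SEVERITIES = range(1, 6)


def corruption_grid():
    """Return every (family, corruption, severity) combination."""
    return [
        (family, name, severity)
        for family, names in CORRUPTIONS.items()
        for name, severity in product(names, SEVERITIES)
    ]


grid = corruption_grid()
print(len(grid))  # 20 corruption types x 5 severities
```

Keeping the family in each tuple distinguishes Gaussian noise from Gaussian blur, which share a name.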

4 papers | 12 benchmarks | Images

RegDB-C

RegDB-C is an evaluation set that consists of algorithmically generated corruptions applied to the RegDB test set (color images). These corruptions consist of Noise: Gaussian, shot, impulse, and speckle; Blur: defocus, frosted glass, motion, zoom, and Gaussian; Weather: snow, frost, fog, brightness, spatter, and rain; Digital: contrast, elastic, pixel, JPEG compression, and saturate. Each of these 20 corruption types has five severity levels, resulting in 100 distinct corruptions.

4 papers | 0 benchmarks | Images

Beatles

This dataset includes the beat and downbeat annotations for Beatles albums. The annotations are provided by M. E. P. Davies et al. [1].

4 papers | 2 benchmarks | Audio

CoDEx Large

CoDEx comprises a set of knowledge graph completion datasets extracted from Wikidata and Wikipedia that improve upon existing knowledge graph completion benchmarks in scope and level of difficulty. CoDEx comprises three knowledge graphs varying in size and structure, multilingual descriptions of entities and relations, and tens of thousands of hard negative triples that are plausible but verified to be false.

4 papers | 4 benchmarks

REAL-M

REAL-M is a crowd-sourced speech-separation corpus of real-life mixtures. The mixtures are recorded in different acoustic environments using a wide variety of recording devices such as laptops and smartphones, thus more closely reflecting potential application scenarios.

4 papers | 0 benchmarks | Speech

Explor_all

Explor_all font image dataset: https://drive.google.com/file/d/1P2DbNbVw4Q__WcV1YdzE7zsDKilmd3pO/view

4 papers | 2 benchmarks

PhysioNet Challenge 2018 (You Snooze You Win - The PhysioNet Computing in Cardiology Challenge 2018)

Data for this challenge were contributed by the Massachusetts General Hospital’s (MGH) Computational Clinical Neurophysiology Laboratory (CCNL) and the Clinical Data Animation Laboratory (CDAC). The dataset includes 1,985 subjects who were monitored at an MGH sleep laboratory for the diagnosis of sleep disorders. The data were partitioned into balanced training (n = 994) and test (n = 989) sets.

4 papers | 6 benchmarks

Santesteban VTO

Physics-based simulated garments on top of SMPL bodies. The data is generated using a modified version of ARCSim and sequences from the CMU Motion Capture Database converted to SMPL format in SURREAL. Each simulated sequence is stored as a .pkl file that contains the following data:

4 papers | 0 benchmarks

AMA (Articulated Mesh Animation)

Articulated Mesh Animation (AMA) is a real-world dataset containing 10 mesh sequences depicting 3 different humans performing various actions.

4 papers | 0 benchmarks | 3D meshes, Images

ArgKP-2021

Data set covering a set of debatable topics, where for each topic and stance, a set of triplets of the form <argument, KP, label> is provided. The data set is based on the ArgKP data set, which contains arguments contributed by the crowd on 28 debatable topics, split by their stance towards the topic, and KPs written by an expert for those topics. Crowd annotations were collected to determine whether a KP represents an argument, i.e., is a match for an argument. The arguments in ArgKP are a subset of the IBM-ArgQ-Rank-30kArgs data set. For a test set, we extended ArgKP, adding three new debatable topics that were also not part of IBM-ArgQ-Rank-30kArgs. The test set was collected specifically for KPA-2021, and was carefully designed to be similar in various aspects to the training data. For each topic, crowdsourced arguments were collected, expert KPs generated, and match/no match annotations for argument/KP pairs obtained, resulting in a data set compatible with the ArgKP fo

4 papers | 0 benchmarks

LegalNERo (Romanian Named Entity Recognition in the Legal domain)

LegalNERo is a manually annotated corpus for named entity recognition in the Romanian legal domain. It provides gold annotations for organizations, locations, persons, time and legal resources mentioned in legal documents. Additionally, it offers GEONAMES codes for the named entities annotated as locations (where a link could be established).

4 papers | 1 benchmark

RTASC (ROBIN Technical Acquisition Speech Corpus)

The ROBIN Technical Acquisition Speech Corpus (ROBINTASC) was developed within the ROBIN project. Its main purpose was to improve the behaviour of a conversational agent, allowing human-machine interaction in the context of purchasing technical equipment. It contains over 6 hours of read speech in Romanian. We provide text files, associated speech files (WAV, 44.1 kHz, 16-bit, single channel), and annotated text files in CoNLL-U format.
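As a rough illustration of the stated audio format, the sketch below uses Python's standard `wave` module to check whether a WAV file is 44.1 kHz, 16-bit, and single channel. The function names and the synthetic silent file are hypothetical and exist only to make the example self-contained; they are not part of the corpus distribution.

```python
import os
import tempfile
import wave


def matches_stated_format(path):
    """Return True if the WAV file is 44.1 kHz, 16-bit, single channel."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == 44100
            and wav.getsampwidth() == 2  # 2 bytes per sample = 16-bit
            and wav.getnchannels() == 1
        )


def write_silent_wav(path, rate=44100, seconds=1):
    """Write a silent mono 16-bit WAV file (hypothetical stand-in for a corpus file)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * rate * seconds)


# Demonstrate the check on a synthetic file rather than real corpus data.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.wav")
    write_silent_wav(sample)
    print(matches_stated_format(sample))
```

A check like this can be run over a downloaded copy before training, since resampled or stereo files would otherwise go unnoticed.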

4 papers | 0 benchmarks | Speech, Tabular, Texts

ClimART (Climate Atmospheric Radiative Transfer)

Numerical simulations of Earth's weather and climate require substantial amounts of computation. This has led to a growing interest in replacing subroutines that explicitly compute physical processes with approximate machine learning (ML) methods that are fast at inference time. Within weather and climate models, atmospheric radiative transfer (RT) calculations are especially expensive. This has made them a popular target for neural network-based emulators. However, prior work is hard to compare due to the lack of a comprehensive dataset and standardized best practices for ML benchmarking. To fill this gap, we build a large dataset, ClimART, with more than 10 million samples from present, pre-industrial, and future climate conditions, based on the Canadian Earth System Model. ClimART poses several methodological challenges for the ML community, such as multiple out-of-distribution test sets, underlying domain physics, and a trade-off between accuracy and inference speed.

4 papers | 0 benchmarks | Environment, Physics

RISeC (Recipe Instruction Semantics Corpus)

We propose a newly annotated dataset for information extraction on recipes. Unlike previous approaches to machine comprehension of procedural texts, we avoid pre-defining domain-specific predicates to recognize (e.g., the primitive instructions in MILK) and focus on basic understanding of the expressed semantics rather than directly reducing them to a simplified state representation.

4 papers | 0 benchmarks | Texts

DaReCzech (Dataset for text relevance ranking in Czech)

DaReCzech is a dataset for text relevance ranking in Czech. The dataset consists of more than 1.6M annotated query-document pairs, which makes it one of the largest available datasets for this task.

4 papers | 2 benchmarks | Texts

PeopleSansPeople (PeopleSansPeople: A Synthetic Data Generator for Human-Centric Computer Vision)

In recent years, person detection and human pose estimation have made great strides, helped by large-scale labeled datasets. However, these datasets had no guarantees or analysis of human activities, poses, or context diversity. Additionally, privacy, legal, safety, and ethical concerns may limit the ability to collect more human data. An emerging alternative to real-world data that alleviates some of these issues is synthetic data. However, creation of synthetic data generators is incredibly challenging and prevents researchers from exploring their usefulness. Therefore, we release a human-centric synthetic data generator PeopleSansPeople which contains simulation-ready 3D human assets, a parameterized lighting and camera system, and generates 2D and 3D bounding box, instance and semantic segmentation, and COCO pose labels. Using PeopleSansPeople, we performed benchmark synthetic data training using a Detectron2 Keypoint R-CNN variant [1]. We found that pre-training a network using sy

4 papers | 0 benchmarks | Images

DISRPT2019 (DISRPT2019 shared task on Discourse Unit Segmentation and Connective Detection)

The DISRPT 2019 workshop introduces the first iteration of a cross-formalism shared task on discourse unit segmentation. Since all major discourse parsing frameworks imply a segmentation of texts into segments, learning segmentations for and from diverse resources is a promising area for converging methods and insights. We provide training, development and test datasets from all available languages and treebanks in the RST, SDRT and PDTB formalisms, using a uniform format. Because different corpora, languages and frameworks use different guidelines for segmentation, the shared task is meant to promote design of flexible methods for dealing with various guidelines, and help to push forward the discussion of standards for discourse units. For datasets which have treebanks, we will evaluate in two different scenarios: with and without gold syntax, or otherwise using provided automatic parses for comparison.

4 papers | 0 benchmarks | Speech, Texts

eICU-CRD (eICU Collaborative Research Database)

The eICU Collaborative Research Database is a large multi-center critical care database made available by Philips Healthcare in partnership with the MIT Laboratory for Computational Physiology.

4 papers | 0 benchmarks | Medical, Tables, Tabular, Time series

FACTIFY (a dataset on multi-modal fact verification)

FACTIFY is a dataset on multi-modal fact verification. It contains images, textual claims, and reference textual documents and images. The task is to classify the claims into support, not-enough-evidence and refute categories with the help of the supporting data. We aim to combat fake news in the social media era by providing this multi-modal dataset. FACTIFY contains 50,000 claims accompanied by 100,000 images, split into training, validation and test sets.

4 papers | 0 benchmarks | Images, Texts

CUGE

CUGE is a Chinese Language Understanding and Generation Evaluation benchmark with the following features: (1) Hierarchical benchmark framework, where datasets are principally selected and organized with a language capability-task-dataset hierarchy. (2) Multi-level scoring strategy, where different levels of model performance are provided based on the hierarchical framework.

4 papers | 0 benchmarks | Texts
