Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

19,997 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

19,997 dataset results

DAVIS-585

A dataset for interactive segmentation with simulated initial masks.

6 papers · 2 benchmarks

PhysioNet Challenge 2021 (The PhysioNet/Computing in Cardiology Challenge 2021)

The training data contains twelve-lead ECGs. The validation and test data contain twelve-lead, six-lead, four-lead, three-lead, and two-lead ECGs.

6 papers · 29 benchmarks · Biomedical, Time series
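
Because every reduced-lead configuration used in the challenge is a subset of the standard twelve leads, reduced-lead inputs can be derived by indexing a twelve-lead recording. A minimal Python sketch; the `select_leads` helper and the simulated array are illustrative, and the lead subsets follow the challenge description but should be verified against the official materials:

```python
import numpy as np

# Standard twelve-lead order; the reduced-lead sets are subsets of these.
TWELVE_LEADS = ["I", "II", "III", "aVR", "aVL", "aVF",
                "V1", "V2", "V3", "V4", "V5", "V6"]
LEAD_SETS = {
    12: TWELVE_LEADS,
    6:  ["I", "II", "III", "aVR", "aVL", "aVF"],
    4:  ["I", "II", "III", "V2"],
    3:  ["I", "II", "V2"],
    2:  ["I", "II"],
}

def select_leads(recording: np.ndarray, n_leads: int) -> np.ndarray:
    """Keep only the rows of a (12, n_samples) recording for a reduced-lead set."""
    idx = [TWELVE_LEADS.index(name) for name in LEAD_SETS[n_leads]]
    return recording[idx, :]

# Example: a simulated 10 s, 500 Hz twelve-lead recording reduced to two leads.
ecg = np.random.randn(12, 5000)
two_lead = select_leads(ecg, 2)  # shape (2, 5000)
```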

SF-XL test v2 (San Francisco eXtra Large test v2)

Test set version 2 for the San Francisco eXtra Large dataset.

6 papers · 3 benchmarks

tdcommons (Therapeutics Data Commons)

Therapeutics Data Commons is an open-science initiative with AI/ML-ready datasets and AI/ML tasks for therapeutics, spanning the discovery and development of safe and effective medicines. TDC provides an ecosystem of tools, libraries, leaderboards, and community resources, including data functions, strategies for systematic model evaluation, meaningful data splits, data processors, and molecule generation oracles. All resources are integrated via an open Python library.

6 papers · 44 benchmarks
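
A minimal sketch of pulling one dataset through the open Python library (PyTDC); the dataset name and split method below are examples of what the library exposes, not the only options:

```python
# pip install PyTDC
from tdc.single_pred import ADME

# Load one ADME property-prediction dataset from TDC.
data = ADME(name="Caco2_Wang")

# TDC ships meaningful splits; a scaffold split groups molecules by
# Bemis-Murcko scaffold to test generalization to unseen chemotypes.
split = data.get_split(method="scaffold")
train, valid, test = split["train"], split["valid"], split["test"]
print(train.head())  # columns: Drug_ID, Drug (SMILES), Y (label)
```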

GBCU (Gallbladder Cancer Ultrasound Dataset)

GBCU is the first public dataset for gallbladder cancer identification from ultrasound images. GBCU contains a total of 1255 annotated abdominal ultrasound images (432 normal, 558 benign, and 265 malignant) collected from 218 patients. Of the 218 patients, 71, 100, and 47 were from the normal, benign, and malignant classes, respectively. The training and testing sets contain 1133 and 122 images, respectively. To ensure generalization to unseen patients, all images of any particular patient were placed entirely in either the train or the test split. We acquired data samples from patients referred to PGIMER, Chandigarh (a referral hospital in Northern India) for abdominal ultrasound examinations of suspected gallbladder pathologies. The study was approved by the Ethics Committee of PGIMER, Chandigarh. We obtained informed written consent from the patients at the time of recruitment, and protect their privacy by fully anonymizing the data. The images are grayscale B-mode static images, including both sagittal and axial views.

6 papers · 1 benchmark · Images, Medical
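
The patient-disjoint split described above is a standard guard against leakage. A minimal sketch with scikit-learn's `GroupShuffleSplit`; the DataFrame layout and placeholder patient IDs are hypothetical:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical image index: one row per image, tagged with its source patient.
df = pd.DataFrame({
    "image_path": [f"img_{i}.png" for i in range(1255)],
    "patient_id": [i % 218 for i in range(1255)],  # placeholder IDs
})

# Grouping by patient puts every image of a given patient in exactly one
# split, mirroring the dataset's patient-disjoint train/test design.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["patient_id"]))
train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]
assert set(train_df["patient_id"]).isdisjoint(test_df["patient_id"])
```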

NLU++ (NLU++: A Multi-Label, Slot-Rich, Generalisable Dataset for Natural Language Understanding in Task-Oriented Dialogue)

NLU++ is a dataset for natural language understanding (NLU) in task-oriented dialogue (ToD) systems, aiming to provide a much more challenging evaluation environment for dialogue NLU models, up to date with current application and industry requirements. NLU++ is divided into two domains (banking and hotels) and brings several crucial improvements over commonly used NLU datasets. 1) NLU++ provides fine-grained domain ontologies with a large set of challenging multi-intent sentences, introducing and validating the idea of intent modules that can be combined into complex intents conveying complex user goals, together with finer-grained and thus more challenging slot sets. 2) The ontology is divided into domain-specific and generic (i.e., domain-universal) intent modules that overlap across domains, promoting cross-domain reusability of annotated examples. 3) The dataset design has been inspired by the problems observed in industrial ToD systems, and 4) it has been collected and annotated by dialogue NLU experts.

6 papers · 0 benchmarks · Texts
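
Because intent modules can co-occur in a single sentence, a natural target representation is multi-hot rather than one-hot. A toy sketch; the module inventory and the example are illustrative, not the dataset's actual ontology:

```python
# Illustrative intent modules in the spirit of NLU++; not the real ontology.
INTENT_MODULES = ["change", "booking", "date", "time"]

def encode_intents(active: set) -> list:
    """Multi-hot encoding of the active intent modules."""
    return [int(m in active) for m in INTENT_MODULES]

# "Can I move my reservation to 7pm tomorrow?" might activate all four modules.
print(encode_intents({"change", "booking", "date", "time"}))  # [1, 1, 1, 1]
```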

Kompetencer (Danish Job Postings Classification Dataset)

Kompetencer (en: competences) is a Danish job posting dataset annotated for nested spans of competences.

6 papers · 0 benchmarks · Texts

CLAMS (Cross-linguistic Analysis of Models on Syntax)

Targeted syntactic evaluation datasets in 5 languages: English, French, German, Russian, and Hebrew. Data are translated from the targeted syntactic evaluation data of Marvin & Linzen (2018): https://aclanthology.org/D18-1151/ . All stimuli focus on subject-verb agreement.

6 papers · 0 benchmarks · Texts
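
Targeted syntactic evaluation typically scores minimal pairs: a model passes when it assigns higher probability to the grammatical variant. A sketch with an off-the-shelf causal LM; the model choice and the English pair are illustrative, and CLAMS itself provides stimuli in five languages:

```python
# pip install torch transformers
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_logprob(sentence: str) -> float:
    """Total log-probability a causal LM assigns to a sentence."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is mean NLL over the shifted tokens; scale back to a total.
    return -out.loss.item() * (ids.shape[1] - 1)

# A minimal pair in the style of the CLAMS stimuli.
good = "The authors near the window are tall."
bad = "The authors near the window is tall."
print(sentence_logprob(good) > sentence_logprob(bad))  # ideally True
```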

GWA (Geometric-Wave Acoustic)

GWA is a large-scale audio dataset of over 2 million synthetic room impulse responses (IRs) and their corresponding detailed geometric and simulation configurations. The dataset samples acoustic environments from over 6.8K high-quality, diverse, and professionally designed houses represented as semantically labeled 3D meshes.

6 papers · 0 benchmarks
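
The usual way to use such IRs is to convolve them with dry (anechoic) audio to simulate the room's acoustics. A minimal sketch with SciPy; the toy signals stand in for real audio loaded at a matching sample rate:

```python
import numpy as np
from scipy.signal import fftconvolve

def apply_rir(dry: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve a dry signal with a room impulse response."""
    wet = fftconvolve(dry, rir, mode="full")[: len(dry)]
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet  # normalize to avoid clipping

sr = 16000
dry = np.random.randn(sr * 2)  # 2 s of placeholder "speech"
rir = np.exp(-np.linspace(0, 8, sr)) * np.random.randn(sr)  # toy decaying IR
wet = apply_rir(dry, rir)
```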

MFQE v2 (Multi-Frame Quality Enhancement v2 Dataset)

A dataset for compressed video quality enhancement.

6 papers · 2 benchmarks

EmoDB Dataset (Berlin Database of Emotional Speech)

EmoDB is a freely available German emotional speech database created by the Institute of Communication Science at the Technical University of Berlin, Germany. Ten professional speakers (five male and five female) participated in the recordings. The database contains a total of 535 utterances covering seven emotions: 1) anger, 2) boredom, 3) anxiety, 4) happiness, 5) sadness, 6) disgust, and 7) neutral. The data was recorded at a 48 kHz sampling rate and then down-sampled to 16 kHz.

6 papers · 4 benchmarks
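
The 48 kHz to 16 kHz conversion mentioned above is an exact 3:1 ratio, so polyphase resampling applies directly. A minimal sketch with SciPy; the placeholder signal stands in for a loaded EmoDB utterance:

```python
import numpy as np
from scipy.signal import resample_poly

def downsample_48k_to_16k(audio_48k: np.ndarray) -> np.ndarray:
    """Polyphase resampling from 48 kHz down to 16 kHz (exact 3:1 ratio)."""
    return resample_poly(audio_48k, up=1, down=3)

audio_48k = np.random.randn(48000)  # 1 s of placeholder audio at 48 kHz
audio_16k = downsample_48k_to_16k(audio_48k)
assert len(audio_16k) == 16000
```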

BRCA-M2C

Dataset for multi-class cell classification in breast cancer H&E images using dot annotations. The labelled cell classes are lymphocytes, tumor or epithelial cells, and stromal cells.

6 papers · 0 benchmarks

BBC News Summary

This dataset was created from a text-categorization corpus consisting of 2,225 documents from the BBC news website, corresponding to stories in five topical areas from 2004-2005, used in D. Greene and P. Cunningham, "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML 2006. All rights, including copyright, in the content of the original articles are owned by the BBC. More at http://mlg.ucd.ie/datasets/bbc.html.

6 papers · 0 benchmarks

AnoShift (AnoShift: A Distribution Shift Benchmark for Unsupervised Anomaly Detection)

AnoShift is a large-scale anomaly detection benchmark that splits the test data based on its temporal distance to the training set, introducing three testing splits: IID, NEAR, and FAR. This testing scenario captures the performance degradation over time of anomaly detection methods, from classical approaches to masked language models.

6 papers · 8 benchmarks · Tabular, Time series
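
A toy sketch of the split logic: bucket rows by their temporal distance from the training period. The year boundaries below follow the AnoShift setup as I recall it (training on 2006-2010 Kyoto-2006+ traffic, NEAR covering 2011-2013, FAR covering 2014-2015) and should be treated as assumptions:

```python
import pandas as pd

# Toy event log; the column layout is hypothetical.
df = pd.DataFrame({"year": list(range(2006, 2016)) * 3,
                   "feature": range(30)})

train_years = range(2006, 2011)
iid = df[df["year"].isin(train_years)]     # same period as training
near = df[df["year"].between(2011, 2013)]  # moderate temporal shift
far = df[df["year"].between(2014, 2015)]   # strongest shift
```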

Long Video Dataset

We randomly selected three videos from the Internet that are longer than 1.5K frames and whose main objects appear continuously. Each video has 20 uniformly sampled frames manually annotated for evaluation.

6 papers · 9 benchmarks
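
Uniformly sampling a fixed number of frames for annotation is a one-liner; a minimal sketch using the frame count and sample size from the description above:

```python
import numpy as np

def uniform_frame_indices(n_frames: int, n_samples: int = 20) -> np.ndarray:
    """Indices of n_samples frames spread evenly across a video."""
    return np.linspace(0, n_frames - 1, n_samples).round().astype(int)

print(uniform_frame_indices(1500))  # the 20 frames one might annotate
```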

French Timebank

The French TimeBank is a corpus of French annotated in ISO-TimeML.

6 papers · 3 benchmarks · Texts

YouTube-VIS 2022 Validation

Video object segmentation has been studied extensively in the past decade due to its importance in understanding video spatio-temporal structures as well as its value in industrial applications. Recently, data-driven algorithms (e.g., deep learning) have become the dominant approach to computer vision problems, and one of the most important keys to their success is the availability of large-scale datasets. Previously, we presented the first large-scale video object segmentation dataset, YouTubeVOS, and hosted the Large-scale Video Object Segmentation Challenge in conjunction with ECCV 2018, ICCV 2019, and CVPR 2021. This year, we are thrilled to invite you to the 4th Large-scale Video Object Segmentation Challenge in conjunction with CVPR 2022. The benchmark is an augmented version of the YouTubeVOS dataset with more annotations, and some incorrect annotations have been corrected. For more details, check our website for the workshop and challenge.

6 papers · 5 benchmarks · Videos

QM8

The QM8 dataset is a collection of molecular data used for studying quantum mechanical calculations of electronic spectra and excited-state energies of small molecules. It consists of approximately 21,786 molecules drawn from the GDB-17 database, comprising all molecules with up to eight heavy atoms (C, N, O, and F).

6 papers · 2 benchmarks
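
One common way to load QM8 is through DeepChem's MoleculeNet loaders. A minimal sketch; the featurizer and splitter are common choices rather than the only options, and the call signature follows DeepChem 2.x as I recall it:

```python
# pip install deepchem
import deepchem as dc

# MoleculeNet loader for QM8 with a Coulomb-matrix featurization.
tasks, datasets, transformers = dc.molnet.load_qm8(
    featurizer="CoulombMatrix", splitter="random"
)
train, valid, test = datasets
print(len(tasks), train.X.shape)
```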

ClonedPerson

The ClonedPerson dataset is a large-scale synthetic person re-identification dataset introduced in the CVPR 2022 paper "Cloning Outfits from Real-World Images to 3D Characters for Generalizable Person Re-Identification". It is generated with MakeHuman and Unity3D. The characters are created by an automatic approach that directly clones whole outfits from real-world person images onto virtual 3D characters, so that each virtual person appears very similar to its real-world counterpart. The dataset contains 887,766 synthesized person images of 5,621 identities.

6 papers · 20 benchmarks

ZEGGS Dataset

The ZEGGS dataset contains 67 sequences of monologues performed by a female actor speaking English, covering 19 different motion styles.

6 papers · 0 benchmarks
Page 202 of 1000