Datasets

19,997 machine learning datasets

N-Omniglot

N-Omniglot is a neuromorphic dataset for few-shot learning. It contains 1,623 categories of handwritten characters, with only 20 samples per class.

4 papers | 0 benchmarks | Images

FR-FS (Fall Recognition in Figure Skating)

The FR-FS dataset contains 417 videos collected from the FIV dataset and the PyeongChang 2018 Winter Olympic Games. FR-FS contains the critical movements of the athlete’s take-off, rotation, and landing. Among them, 276 are smooth landing videos, and 141 are fall videos. To test the generalization performance of our proposed model, we randomly select 50% of the fall videos and 50% of the landing videos as the training set and use the rest as the testing set.

4 papers | 0 benchmarks | Videos
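
The 50% split described for FR-FS can be sketched as a per-class random split. This is a minimal illustration, not the authors' procedure: the video identifiers, the seed, and the function name split_half_per_class are hypothetical, and it assumes the split is drawn separately within the fall and landing classes.

```python
import random


def split_half_per_class(items_by_class, seed=0):
    """Randomly split each class in half: first half to train, rest to test."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label, items in sorted(items_by_class.items()):
        shuffled = list(items)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        train += [(video, label) for video in shuffled[:half]]
        test += [(video, label) for video in shuffled[half:]]
    return train, test


# Placeholder identifiers (hypothetical); only the class counts come from the text.
videos = {
    "landing": [f"landing_{i:03d}" for i in range(276)],
    "fall": [f"fall_{i:03d}" for i in range(141)],
}

train, test = split_half_per_class(videos)
print(len(train), len(test))  # 208 209
```

Because 141 is odd, the two halves cannot be exactly equal; this sketch puts the extra fall video in the testing set.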

EurekaAlert (Eureka Alert)

This dataset contains around 5,000 scholarly articles and their corresponding easy summaries from the Eureka Alert blog. It can be used for the combined task of summarization and simplification.

4 papers | 3 benchmarks

Incidents1M

Incidents1M is a large-scale multi-label dataset for incident detection which contains 977,088 images, with 43 incident and 49 place categories. It is an evolution of the Incidents dataset that doubles the dataset size and includes more incident labels.

4 papers | 0 benchmarks | Images

CUB-GHA (CUB Gaze-based Human Attention)

CUB-GHA (Gaze-based Human Attention) is a dataset for fine-grained classification with human attention annotations. It was built by collecting human gaze data for the fine-grained classification dataset CUB.

4 papers | 0 benchmarks | Images

Sepehr_RumTel01

The expansion of social networks has accelerated the transmission of information and news across communities. Over the past few years, the number of users, audiences, and publishers on social networks has also increased dramatically. Among the massive amount of information and news reported on these networks are unverified items, which are called “rumors”. Rumors on social networks are identified through rumor detection approaches, and the sheer volume of news and information makes machine learning techniques necessary. The most important problem with automatic detection approaches is the lack of a database of rumors. To address this, this article presents a collection of rumors published on the social network Telegram. The data were gathered from five Persian-language channels that specifically review this issue. The collected data set contains 3283 messages with 2829 attachments, having a volume of over 1.6 gig

4 papers | 1 benchmark

EEG Motor Movement/Imagery Dataset

This data set consists of over 1500 one- and two-minute EEG recordings, obtained from 109 volunteers.

4 papers | 1 benchmark

DAGW (Danish Gigaword)

It’s hard to develop good tools for processing Danish with computers when no large and wide-coverage dataset of Danish text is readily available. To address this, the Danish Gigaword Project (DAGW) maintains a corpus for Danish with over a billion words. The general goals are to create a dataset that is:

4 papers | 0 benchmarks | Texts

DABS (Domain-Agnostic Benchmark for Self-supervised learning)

DABS is a domain-agnostic benchmark for self-supervised learning to encourage research and progress towards domain-agnostic methods.

4 papers | 6 benchmarks | Images

TUH EEG Seizure Corpus (Temple University Hospital (TUH) EEG Corpus)

Our goal is to enable deep learning research in neuroscience by releasing the largest publicly available unencumbered database of EEG recordings. This ongoing project currently includes over 30,000 EEGs spanning the years from 2002 to present. Data collected can be used for both research and commercialization purposes.

4 papers | 1 benchmark

JaQuAD

JaQuAD (Japanese Question Answering Dataset) is a question answering dataset in Japanese that consists of 39,696 extractive question-answer pairs on Japanese Wikipedia articles.

4 papers | 2 benchmarks | Texts

PCFG SET (Probabilistic Context Free Grammar String Edit Task)

The Probabilistic Context Free Grammar String Edit Task (PCFG SET) dataset consists of sequence-to-sequence problems specifically designed to test different aspects of compositional generalisation. In particular, the dataset contains splits to test for systematicity, productivity, substitutivity, localism and overgeneralisation.

4 papers | 0 benchmarks | Texts

VizWiz-VQA-Grounding

The VizWiz-VQA-Grounding dataset visually grounds answers to visual questions asked by people with visual impairments.

4 papers | 0 benchmarks | Images, Texts

NYT10-HRL

A dataset from the paper “A Hierarchical Framework for Relation Extraction with Reinforcement Learning”.

4 papers | 1 benchmark

Something-Something-100

Something-Something-100 is a dataset split created from Something-Something V2. A total of 100 classes are selected, each comprising 100 samples. The 100 classes are split into 64, 12, and 24 non-overlapping classes used as the meta-training, meta-validation, and meta-testing sets, respectively. The exact list of selected samples can be found here: https://github.com/ffmpbgrnn/CMN/tree/master/smsm-100

4 papers | 2 benchmarks | RGB Video
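
The 64/12/24 class partition described for Something-Something-100 can be sketched as follows. This is only an illustration of a non-overlapping class-level split: the class names, the seed, and the function name split_classes are hypothetical, and a random shuffle will not reproduce the official partition, which is fixed by the sample lists at the link above.

```python
import random


def split_classes(class_names, sizes=(64, 12, 24), seed=0):
    """Partition classes into disjoint meta-train / meta-val / meta-test groups."""
    if sum(sizes) != len(class_names):
        raise ValueError("split sizes must add up to the number of classes")
    rng = random.Random(seed)  # fixed seed so the partition is reproducible
    shuffled = list(class_names)
    rng.shuffle(shuffled)
    n_train, n_val, _ = sizes
    meta_train = shuffled[:n_train]
    meta_val = shuffled[n_train:n_train + n_val]
    meta_test = shuffled[n_train + n_val:]
    return meta_train, meta_val, meta_test


# Placeholder class names (hypothetical); the real classes come from Something-Something V2.
classes = [f"class_{i:03d}" for i in range(100)]
meta_train, meta_val, meta_test = split_classes(classes)
print(len(meta_train), len(meta_val), len(meta_test))  # 64 12 24
```

Splitting at the class level, rather than the sample level, is what keeps the meta-testing classes unseen during meta-training in a few-shot setup.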

XLING (XLING BLI Dataset)

The XLING BLI Dataset contains bilingual dictionaries for 28 language pairs. For each language pair, there are 5 dictionary files: 4 training dictionaries of varying sizes (500, 1K, 3K, and 5K translation pairs) and one testing dictionary containing 2K word pairs. All results reported in the associated paper were obtained on the test dictionaries of the respective language pairs.

4 papers | 0 benchmarks

PanLex-BLI (PanLex-based bilingual lexicons for 210 language pairs)

PanLex-based bilingual lexicons for 210 language pairs

4 papers | 0 benchmarks

RR (Review-Rebuttal)

The Review-Rebuttal (RR) dataset was introduced to facilitate the study of argument pair extraction in the peer review and rebuttal domain.

4 papers | 3 benchmarks | Texts

TimberSeg 1.0

The TimberSeg 1.0 dataset is composed of 220 images showing wood logs in various environments and conditions in Canada. The images are densely annotated with segmentation masks for each log instance, as well as the corresponding bounding box and class label. This dataset aims to enable autonomous forestry forwarders and therefore contains nearly 2500 instances of wood logs seen from an operator's point of view. Images were taken in the forest, near the roadside, in lumberyards and above timber-filled trailers. The logs were annotated from a grasping perspective, meaning that only accessible logs on top of the piles are segmented.

4 papers | 0 benchmarks | Images

i2b2 De-identification Dataset (Informatics for Integrating Biology and the Bedside (i2b2) Project — De-identification Dataset)

This dataset contains 1304 de-identified longitudinal medical records describing 296 patients.

4 papers | 2 benchmarks | Texts
