Datasets

19,997 machine learning datasets

19,997 dataset results

In-Shop (In-shop Clothes Retrieval Benchmark)

In-shop Clothes Retrieval Benchmark evaluates the performance of in-shop Clothes Retrieval. This is a large subset of DeepFashion, containing large pose and scale variations. It also has large diversities, large quantities, and rich annotations, including:

154 papers2 benchmarksImages

NCBI Disease

The NCBI Disease corpus consists of 793 PubMed abstracts, which are separated into training (593), development (100) and test (100) subsets. The NCBI Disease corpus is annotated with disease mentions, using concept identifiers from either MeSH or OMIM.

154 papers1 benchmarksTexts

A-OKVQA

A-OKVQA is crowdsourced visual question answering dataset composed of a diverse set of about 25K questions requiring a broad base of commonsense and world knowledge to answer.

154 papers2 benchmarksImages, Texts

dSprites (Disentanglement testing Sprites dataset)

dSprites is a dataset of 2D shapes procedurally generated from 6 ground truth independent latent factors. These factors are color, shape, scale, rotation, x and y positions of a sprite.

153 papers0 benchmarksImages

CC12M (Conceptual 12M)

Conceptual 12M (CC12M) is a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training.

153 papers0 benchmarksImages, Texts

PATTERN

PATTERN is a node classification tasks generated with Stochastic Block Models, which is widely used to model communities in social networks by modulating the intra- and extra-communities connections, thereby controlling the difficulty of the task. PATTERN tests the fundamental graph task of recognizing specific predetermined subgraphs.

153 papers1 benchmarksGraphs

ROCStories

ROCStories is a collection of commonsense short stories. The corpus consists of 100,000 five-sentence stories. Each story logically follows everyday topics created by Amazon Mechanical Turk workers. These stories contain a variety of commonsense causal and temporal relations between everyday events. Writers also develop an additional 3,742 Story Cloze Test stories which contain a four-sentence-long body and two candidate endings. The endings were collected by asking Mechanical Turk workers to write both a right ending and a wrong ending after eliminating original endings of given short stories. Both endings were required to make logical sense and include at least one character from the main story line. The published ROCStories dataset is constructed with ROCStories as a training set that includes 98,162 stories that exclude candidate wrong endings, an evaluation set, and a test set, which have the same structure (1 body + 2 candidate endings) and a size of 1,871.

152 papers5 benchmarksTexts

S2ORC

A large corpus of 81.1M English-language academic papers spanning many academic disciplines. Rich metadata, paper abstracts, resolved bibliographic references, as well as structured full text for 8.1M open access papers. Full text annotated with automatically-detected inline mentions of citations, figures, and tables, each linked to their corresponding paper objects. Aggregated papers from hundreds of academic publishers and digital archives into a unified source, and create the largest publicly-available collection of machine-readable academic text to date.

152 papers3 benchmarksTexts

OLID (Offensive Language Identification Dataset)

The OLID is a hierarchical dataset to identify the type and the target of offensive texts in social media. The dataset is collected on Twitter and publicly available. There are 14,100 tweets in total, in which 13,240 are in the training set, and 860 are in the test set. For each tweet, there are three levels of labels: (A) Offensive/Not-Offensive, (B) Targeted-Insult/Untargeted, (C) Individual/Group/Other. The relationship between them is hierarchical. If a tweet is offensive, it can have a target or no target. If it is offensive to a specific target, the target can be an individual, a group, or some other objects. This dataset is used in the OffensEval-2019 competition in SemEval-2019.

152 papers2 benchmarksTexts

MegaDepth

The MegaDepth dataset is a dataset for single-view depth prediction that includes 196 different locations reconstructed from COLMAP SfM/MVS.

152 papers0 benchmarksImages

NAS-Bench-101

NAS-Bench-101 is the first public architecture dataset for NAS research. To build NASBench-101, the authors carefully constructed a compact, yet expressive, search space, exploiting graph isomorphisms to identify 423k unique convolutional architectures. The authors trained and evaluated all of these architectures multiple times on CIFAR-10 and compiled the results into a large dataset of over 5 million trained models. This allows researchers to evaluate the quality of a diverse range of models in milliseconds by querying the precomputed dataset.

152 papers4 benchmarksTabular

CIFAR100-LT

The Long-tailed Version of CIFAR100

152 papers0 benchmarks

Video-MME

Video-MME stands for Video Multi-Modal Evaluation. It is the first-ever comprehensive evaluation benchmark specifically designed for Multi-modal Large Language Models (MLLMs) in video analysis¹. This benchmark is significant because it addresses the need for a high-quality assessment of MLLMs' performance in processing sequential visual data, which has been less explored compared to their capabilities in static image understanding.

152 papers2 benchmarksTexts, Videos

SQuAD (Stanford Question Answering Dataset)

The Stanford Question Answering Dataset (SQuAD) is a collection of question-answer pairs derived from Wikipedia articles. In SQuAD, the correct answers of questions can be any sequence of tokens in the given text. Because the questions and answers are produced by humans through crowdsourcing, it is more diverse than some other question-answering datasets. SQuAD 1.1 contains 107,785 question-answer pairs on 536 articles. SQuAD2.0 (open-domain SQuAD, SQuAD-Open), the latest version, combines the 100,000 questions in SQuAD1.1 with over 50,000 un-answerable questions written adversarially by crowdworkers in forms that are similar to the answerable ones.

151 papers6 benchmarksTexts

FCE (First Certificate in English)

The Cambridge Learner Corpus First Certificate in English (CLC FCE) dataset consists of short texts, written by learners of English as an additional language in response to exam prompts eliciting free-text answers and assessing mastery of the upper-intermediate proficiency level. The texts have been manually error-annotated using a taxonomy of 77 error types. The full dataset consists of 323,192 sentences. The publicly released subset of the dataset, named FCE-public, consists of 33,673 sentences split into test and training sets of 2,720 and 30,953 sentences, respectively.

151 papers1 benchmarksTexts

VRD (Visual Relationship Detection dataset)

The Visual Relationship Dataset (VRD) contains 4000 images for training and 1000 for testing annotated with visual relationships. Bounding boxes are annotated with a label containing 100 unary predicates. These labels refer to animals, vehicles, clothes and generic objects. Pairs of bounding boxes are annotated with a label containing 70 binary predicates. These labels refer to actions, prepositions, spatial relations, comparatives or preposition phrases. The dataset has 37993 instances of visual relationships and 6672 types of relationships. 1877 instances of relationships occur only in the test set and they are used to evaluate the zero-shot learning scenario.

151 papers7 benchmarksImages, Texts

REDDIT-BINARY

REDDIT-BINARY consists of graphs corresponding to online discussions on Reddit. In each graph, nodes represent users, and there is an edge between them if at least one of them respond to the other’s comment. There are four popular subreddits, namely, IAmA, AskReddit, TrollXChromosomes, and atheism. IAmA and AskReddit are two question/answer based subreddits, and TrollXChromosomes and atheism are two discussion-based subreddits. A graph is labeled according to whether it belongs to a question/answer-based community or a discussion-based community.

150 papers4 benchmarksGraphs

ASDiv (Academia Sinica Diverse MWP Dataset)

We present ASDiv (Academia Sinica Diverse MWP Dataset), a diverse (in terms of both language patterns and problem types) English math word problem (MWP) corpus for evaluating the capability of various MWP solvers. Existing MWP corpora for studying AI progress remain limited either in language usage patterns or in problem types. We thus present a new English MWP corpus with 2,305 MWPs that cover more text patterns and most problem types taught in elementary school. Each MWP is annotated with its problem type and grade level (for indicating the level of difficulty). Furthermore, we propose a metric to measure the lexicon usage diversity of a given MWP corpus, and demonstrate that ASDiv is more diverse than existing corpora. Experiments show that our proposed corpus reflects the true capability of MWP solvers more faithfully.

150 papers0 benchmarks

MMLU-Pro

The MMLU-Pro dataset is an enhanced version of the Massive Multitask Language Understanding (MMLU) benchmark. It's designed to be more robust and challenging, aiming to rigorously benchmark large language models' capabilities in language comprehension and reasoning across diverse domains. Here are some key features of the MMLU-Pro dataset:

150 papers1 benchmarks

MOT16 (Multiple Object Tracking 2016)

The MOT16 dataset is a dataset for multiple object tracking. It a collection of existing and new data (part of the sources are from and ), containing 14 challenging real-world videos of both static scenes and moving scenes, 7 for training and 7 for testing. It is a large-scale dataset, composed of totally 110407 bounding boxes in training set and 182326 bounding boxes in test set. All video sequences are annotated under strict standards, their ground-truths are highly accurate, making the evaluation meaningful.

149 papers6 benchmarksImages, Videos

PreviousPage 23 of 1000Next