Datasets

19,997 machine learning datasets

x-stance

A large-scale stance detection dataset built from comments written by election candidates in Switzerland. The dataset consists of German, French and Italian text, allowing for cross-lingual evaluation of stance detection. It contains 67,000 comments on more than 150 political issues (targets).

16 papers | 0 benchmarks

CryoNuSeg

CryoNuSeg is a fully annotated FS-derived cryosectioned and H&E-stained nuclei instance segmentation dataset. The dataset contains images from 10 human organs that were not exploited in other publicly available datasets, and is provided with three manual mark-ups to allow measuring intra-observer and inter-observer variability.

16 papers | 0 benchmarks | Images

DADA-2000

DADA-2000 is a large-scale benchmark of 2,000 video sequences, laboriously annotated for driver attention (fixation, saccade, focusing time), accident objects and intervals, and accident categories. The authors report thorough evaluations showing performance superior to the state of the art.

16 papers | 0 benchmarks | Images

TextComplexityDE

TextComplexityDE is a dataset of 1,000 German sentences taken from 23 Wikipedia articles in 3 different article genres, intended for developing text-complexity prediction models and automatic text simplification for German. The dataset includes subjective assessments of different text-complexity aspects provided by German learners at levels A and B. In addition, it contains manual simplifications of 250 of those sentences provided by native speakers, and subjective assessments of the simplified sentences by participants from the target group. The subjective ratings were collected using both laboratory studies and a crowdsourcing approach.

16 papers | 1 benchmark | Texts

BG-20k (Background Dataset - 20k)

BG-20k contains 20,000 high-resolution background images that exclude salient objects, which can be used to help generate high-quality synthetic data.

16 papers | 0 benchmarks | Images

SPoC (Pseudocode-to-Code)

Pseudocode-to-Code (SPoC) is a program synthesis dataset, containing 18,356 programs with human-authored pseudocode and test cases.

16 papers | 0 benchmarks | Texts

IDRiD (Indian Diabetic Retinopathy Image Dataset)

The Indian Diabetic Retinopathy Image Dataset (IDRiD) consists of typical diabetic retinopathy lesions and normal retinal structures annotated at the pixel level. The dataset also provides information on the severity of diabetic retinopathy and diabetic macular edema for each image. It is well suited to the development and evaluation of image analysis algorithms for early detection of diabetic retinopathy.

16 papers | 6 benchmarks | Biomedical, Images, Medical

CLEVR-Hans

The CLEVR-Hans data set is a novel confounded visual scene data set, which captures complex compositions of different objects. This data set consists of CLEVR images divided into several classes.

16 papers | 0 benchmarks | Images

SPGISpeech

SPGISpeech (pronounced “speegie-speech”) is a large-scale transcription dataset, freely available for academic research. SPGISpeech is a collection of 5,000 hours of professionally transcribed financial audio. Unlike previous transcription datasets, SPGISpeech contains global English accents, strongly varying audio quality, and both spontaneous and presentation-style speech. Each transcript has been cross-checked by multiple professional editors for high accuracy and is fully formatted, including sentence structure and capitalization.

16 papers | 1 benchmark | Speech

Casual Conversations

The Casual Conversations dataset is designed to help researchers evaluate their computer vision and audio models for accuracy across a diverse set of ages, genders, apparent skin tones and ambient lighting conditions.

16 papers | 0 benchmarks | Audio, Images, Videos

PlasticineLab

PlasticineLab is a differentiable physics benchmark that includes a diverse collection of soft-body manipulation tasks. In each task, the agent uses manipulators to deform the plasticine into the desired configuration. The underlying physics engine supports differentiable elastic and plastic deformation using the DiffTaichi system, posing many under-explored challenges to robotic agents.

16 papers | 0 benchmarks | Environment

NewsCLIPpings

NewsCLIPpings is a dataset for detecting mismatched images and captions. Unlike previous misinformation datasets, in NewsCLIPpings both the images and the captions are unmanipulated, but some of them are mismatched.

16 papers | 0 benchmarks | Images, Texts

SketchyCOCO

SketchyCOCO dataset consists of two parts:

16 papers | 3 benchmarks | Images

e-SNLI-VE

e-SNLI-VE is a large VL (vision-language) dataset with NLEs (natural language explanations) with over 430k instances for which the explanations rely on the image content. It has been built by merging the explanations from e-SNLI and the image-sentence pairs from SNLI-VE.

16 papers | 2 benchmarks | Images, Texts

ILDC (Indian Legal Documents Corpus)

The ILDC dataset (Indian Legal Documents Corpus) is a large corpus of 35k Indian Supreme Court cases annotated with original court decisions. A portion of the corpus (a separate test set) is annotated with gold standard explanations by legal experts. The dataset is used for Court Judgment Prediction and Explanation (CJPE). The task requires an automated system to predict an explainable outcome of a case.

16 papers | 0 benchmarks | Texts

GitTables

GitTables is currently a corpus of 1M relational tables extracted from CSV files on GitHub, covering 96 topics. Table columns in GitTables have been annotated with more than 2K different semantic types from Schema.org and DBpedia. The column annotations consist of semantic types, hierarchical relations, range types, table domain and descriptions.

16 papers | 0 benchmarks | Tables

X-Fact

X-FACT is a large, publicly available multilingual dataset for factual verification of naturally existing real-world claims. The dataset contains short statements in 25 languages and is labeled for veracity by expert fact-checkers. It includes a multilingual evaluation benchmark that measures both the out-of-domain generalization and the zero-shot capabilities of multilingual models.

16 papers | 0 benchmarks | Texts

OntoNotes 4.0 (OntoNotes Release 4.0)

OntoNotes Release 4.0 contains the content of earlier releases (OntoNotes Release 1.0 LDC2007T21, OntoNotes Release 2.0 LDC2008T04 and OntoNotes Release 3.0 LDC2009T24) and adds newswire, broadcast news, broadcast conversation and web data in English and Chinese, and newswire data in Arabic. This cumulative publication consists of 2.4 million words, as follows: 300k words of Arabic newswire; 250k words of Chinese newswire, 250k words of Chinese broadcast news, 150k words of Chinese broadcast conversation and 150k words of Chinese web text; and 600k words of English newswire, 200k words of English broadcast news, 200k words of English broadcast conversation and 300k words of English web text.

16 papers | 0 benchmarks | Texts

ICFG-PEDES (Identity-Centric and Fine-Grained Person Description Dataset)

A large-scale database for text-to-image person re-identification, i.e., text-based person retrieval.

16 papers | 14 benchmarks | Images, Texts

Wiki-One

This dataset is a Wikipedia dump, split by relation, for few-shot knowledge graph completion.

16 papers | 0 benchmarks | Graphs
