Datasets

3,148 machine learning datasets


AVeriTeC (AVeriTeC: A Dataset for Real-world Claim Verification with Evidence from the Web)

AVeriTeC (Automated Verification of Textual Claims) is a dataset of 4,568 real-world claims covering fact-checks by 50 different organizations. Each claim is annotated with question-answer pairs supported by evidence available online, as well as textual justifications explaining how the evidence combines to produce a verdict. The claims in AVeriTeC are classified into four labels: "Supported", "Refuted", "Not Enough Evidence", and "Conflicting Evidence/Cherry-picking". The dataset also contains several metadata fields, such as the speaker of the claim, the publisher of the claim, the date the claim was published, and the location most relevant to the claim. These can be used to support questions, answers, and justifications.

17 papers | 3 benchmarks | Texts

GSM-Plus

GSM-Plus is an adversarial grade-school math dataset created by perturbing the widely used GSM8K dataset. Motivated by the capability taxonomy for solving math problems in Polya's principles, the authors identify five perspectives to guide the development of GSM-Plus.

17 papers | 4 benchmarks | Texts

CMU-MOSI (Multimodal Corpus of Sentiment Intensity)

The Multimodal Corpus of Sentiment Intensity (CMU-MOSI) dataset is a collection of 2,199 opinion video clips. Each opinion video is annotated with sentiment in the range [-3, 3]. The dataset is rigorously annotated with labels for subjectivity and sentiment intensity, per-frame and per-opinion visual features, and per-millisecond audio features.

16 papers | 8 benchmarks | Audio, Texts, Videos

ChemProt

ChemProt consists of 1,820 PubMed abstracts with chemical-protein interactions annotated by domain experts and was used in the BioCreative VI text mining chemical-protein interactions shared task.

16 papers | 2 benchmarks | Biomedical, Texts

ViGGO

The ViGGO corpus is a set of 6,900 pairs of meaning representations and natural language utterances in the video game domain. The meaning representations cover 9 different dialogue acts.

16 papers | 2 benchmarks | Texts

JuICe (JuICe Dataset)

JuICe is a corpus of 1.5 million examples with a curated test set of 3.7K instances based on online programming assignments. Compared with existing contextual code generation datasets, JuICe provides refined human-curated data, open-domain code, and an order of magnitude more training data.

16 papers | 0 benchmarks | Texts

TableBank

To address the need for a standard open-domain table benchmark dataset, the authors propose a novel weak supervision approach to automatically create TableBank, which is orders of magnitude larger than existing human-labeled datasets for table analysis. Unlike traditional weakly supervised training sets, this approach yields training data that is both large in scale and high in quality.

16 papers | 0 benchmarks | Texts

SemArt

SemArt is a multi-modal dataset for semantic art understanding. It is a collection of fine-art painting images in which each image is associated with a number of attributes and a textual artistic comment, such as those that appear in art catalogues or museum collections. It contains 21,384 samples that provide artistic comments along with fine-art paintings and their attributes.

16 papers | 0 benchmarks | Images, Texts

SMHD (Self-reported Mental Health Diagnoses)

SMHD is a large dataset of social media posts from users with one or multiple self-reported mental health conditions, along with matched control users.

16 papers | 0 benchmarks | Images, Texts

TVC (TV show Captions)

TV show Captions (TVC) is a large-scale multimodal captioning dataset containing 261,490 caption descriptions paired with 108,965 short video moments. TVC is unique in that its captions may also describe dialogues/subtitles, whereas captions in other datasets describe only the visual content.

16 papers | 2 benchmarks | Texts, Videos

TextComplexityDE

TextComplexityDE is a dataset of 1,000 German sentences taken from 23 Wikipedia articles in 3 different article genres, intended for developing text-complexity prediction models and automatic text simplification for German. The dataset includes subjective assessments of different text-complexity aspects provided by German learners at levels A and B. In addition, it contains manual simplifications of 250 of those sentences provided by native speakers, along with subjective assessments of the simplified sentences by participants from the target group. The subjective ratings were collected through both laboratory studies and crowdsourcing.

16 papers | 1 benchmark | Texts

SPoC (Pseudocode-to-Code)

Pseudocode-to-Code (SPoC) is a program synthesis dataset, containing 18,356 programs with human-authored pseudocode and test cases.

16 papers | 0 benchmarks | Texts

NewsCLIPpings

NewsCLIPpings is a dataset for detecting mismatched images and captions. Unlike previous misinformation datasets, both the images and the captions in NewsCLIPpings are unmanipulated, but some of them are mismatched.

16 papers | 0 benchmarks | Images, Texts

e-SNLI-VE

e-SNLI-VE is a large vision-language (VL) dataset with natural language explanations (NLEs). It contains over 430k instances for which the explanations rely on the image content. It was built by merging the explanations from e-SNLI with the image-sentence pairs from SNLI-VE.

16 papers | 2 benchmarks | Images, Texts

ILDC (Indian Legal Documents Corpus)

The ILDC dataset (Indian Legal Documents Corpus) is a large corpus of 35k Indian Supreme Court cases annotated with original court decisions. A portion of the corpus (a separate test set) is annotated with gold standard explanations by legal experts. The dataset is used for Court Judgment Prediction and Explanation (CJPE). The task requires an automated system to predict an explainable outcome of a case.

16 papers | 0 benchmarks | Texts

X-Fact

X-FACT is a large, publicly available multilingual dataset for factual verification of naturally occurring real-world claims. The dataset contains short statements in 25 languages, labeled for veracity by expert fact-checkers. It includes a multilingual evaluation benchmark that measures both the out-of-domain generalization and the zero-shot capabilities of multilingual models.

16 papers | 0 benchmarks | Texts

OntoNotes 4.0 (OntoNotes Release 4.0)

OntoNotes Release 4.0 contains the content of earlier releases -- OntoNotes Release 1.0 LDC2007T21, OntoNotes Release 2.0 LDC2008T04 and OntoNotes Release 3.0 LDC2009T24 -- and adds newswire, broadcast news, broadcast conversation and web data in English and Chinese, and newswire data in Arabic. This cumulative publication consists of 2.4 million words as follows: 300k words of Arabic newswire; 250k words of Chinese newswire, 250k words of Chinese broadcast news, 150k words of Chinese broadcast conversation and 150k words of Chinese web text; and 600k words of English newswire, 200k words of English broadcast news, 200k words of English broadcast conversation and 300k words of English web text.
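
As a quick sanity check, the per-language breakdown above can be summed to confirm that it matches the stated 2.4 million words. This is only an illustrative sketch: the dictionary layout and variable names are assumptions made for this example and are not part of the OntoNotes release.

```python
# Word counts in thousands, copied from the description above.
# The structure and names here are illustrative, not an OntoNotes format.
ontonotes_4_words_k = {
    "Arabic": {"newswire": 300},
    "Chinese": {
        "newswire": 250,
        "broadcast news": 250,
        "broadcast conversation": 150,
        "web text": 150,
    },
    "English": {
        "newswire": 600,
        "broadcast news": 200,
        "broadcast conversation": 200,
        "web text": 300,
    },
}

# Sum the genres within each language, then sum across languages.
per_language_k = {
    language: sum(genres.values())
    for language, genres in ontonotes_4_words_k.items()
}
total_k = sum(per_language_k.values())

print(per_language_k)  # {'Arabic': 300, 'Chinese': 800, 'English': 1300}
print(total_k)         # 2400, i.e. 2.4 million words
```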

16 papers | 0 benchmarks | Texts

ICFG-PEDES (Identity-Centric and Fine-Grained Person Description Dataset)

ICFG-PEDES is a large-scale database for text-to-image person re-identification, i.e., text-based person retrieval.

16 papers | 14 benchmarks | Images, Texts

Wukong

Wukong is a large-scale Chinese cross-modal dataset for benchmarking different multi-modal pre-training methods and facilitating vision-language pre-training (VLP). The dataset contains 100 million Chinese image-text pairs from the web. The base query list is filtered according to the frequency of Chinese words and phrases.

16 papers | 0 benchmarks | Images, Texts

MATRES (Multi-Axis Temporal RElations for Start-points)

This is the Multi-Axis Temporal RElations for Start-points (MATRES) dataset.

16 papers | 3 benchmarks | Texts
