VLM²-Bench: Benchmarking Vision-Language Models on Visual Cue Matching. VLM²-Bench is the first comprehensive benchmark designed to evaluate vision-language models' (VLMs) ability to visually link matching cues across multi-image sequences and videos. The benchmark consists of 9 subtasks with over 3,000 test cases, focusing on fundamental visual linking capabilities that humans use daily. A key example is identifying the same person across different photos without prior knowledge of their identity.
The UKP Argument Annotated Essays corpus consists of argument annotated persuasive essays including annotations of argument components and argumentative relations.
QUASAR-S is a large-scale dataset aimed at evaluating systems designed to comprehend a natural language query and extract its answer from a large corpus of text. It consists of 37,362 cloze-style (fill-in-the-gap) queries constructed from definitions of software entity tags on the popular website Stack Overflow. The posts and comments on the website serve as the background corpus for answering the cloze questions. The answer to each question is restricted to be another software entity, from an output vocabulary of 4,874 entities.
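A cloze-style query of this kind can be pictured as a gapped definition plus a gold answer drawn from the entity vocabulary. The sketch below is purely illustrative: the field names, placeholder token, and example entities are assumptions, not QUASAR-S's actual schema.

```python
# Hypothetical sketch of a cloze-style record in the spirit of QUASAR-S.
# Field names and the "@placeholder" convention are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class ClozeQuery:
    query: str        # definition text with the answer replaced by a gap
    answer: str       # a software entity from the output vocabulary
    candidates: list  # subset of the entity vocabulary, for illustration


def fill_cloze(record: ClozeQuery) -> str:
    """Substitute the gold answer back into the gap for inspection."""
    return record.query.replace("@placeholder", record.answer)


example = ClozeQuery(
    query="@placeholder is a JavaScript library for building user interfaces.",
    answer="reactjs",
    candidates=["reactjs", "angularjs", "jquery"],
)
print(fill_cloze(example))
# → reactjs is a JavaScript library for building user interfaces.
```

A system is scored on whether it recovers the gold entity from the background corpus, restricted to the closed output vocabulary.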
News translation is a recurring WMT task. The test set is a collection of parallel corpora consisting of about 1,500 English sentences translated into 7 languages (Chinese, Czech, Estonian, German, Finnish, Russian, Turkish) and an additional 1,500 sentences from each of the 7 languages translated into English. The sentences were selected from dozens of news websites and translated by professional translators.
The official HOList benchmark for automated theorem proving consists of all theorem statements in the core, complex, and flyspeck corpora. The goal of the benchmark is to prove as many theorems as possible in the HOList environment in the order they appear in the database. That is, only theorems that occur before the current theorem are supposed to be used as premises (lemmata) in its proof.
WikiTableT contains Wikipedia article sections and their corresponding tabular data and various metadata. WikiTableT contains millions of instances while covering a broad range of topics and a variety of kinds of generation tasks.
The Arabic Sentiment Twitter Dataset for the Levantine dialect (ArSenTD-LEV) is a dataset of 4,000 tweets with the following annotations: the overall sentiment of the tweet, the target to which the sentiment was expressed, how the sentiment was expressed, and the topic of the tweet.
CMRC 2019 is a Chinese Machine Reading Comprehension dataset that was used in The Third Evaluation Workshop on Chinese Machine Reading Comprehension. Specifically, CMRC 2019 is a sentence cloze-style machine reading comprehension dataset that aims to evaluate the sentence-level inference ability.
CoNLL-2000 is a dataset for dividing text into syntactically related non-overlapping groups of words, so-called text chunking.
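The CoNLL-2000 data uses one token per line (word, POS tag, chunk tag) with blank lines between sentences, and B-/I-/O prefixes marking chunk boundaries. A minimal sketch of grouping such tags into chunks, assuming the standard three-column format:

```python
# Minimal sketch of recovering chunks from CoNLL-2000-style lines
# ("word POS chunk-tag"), where B- opens a chunk, I- continues it, O is outside.
def extract_chunks(lines):
    """Group consecutive B-/I- tokens of the same chunk type into chunks."""
    chunks, current, ctype = [], [], None
    for line in lines:
        word, pos, tag = line.split()
        # A boundary: a new B- chunk, an O token, or an I- of a different type.
        if tag.startswith("B-") or tag == "O" or (
            tag.startswith("I-") and tag[2:] != ctype
        ):
            if current:
                chunks.append((ctype, " ".join(current)))
            current, ctype = [], None
        if tag != "O":
            current.append(word)
            ctype = tag[2:]
    if current:
        chunks.append((ctype, " ".join(current)))
    return chunks


# Opening tokens of a sentence from the CoNLL-2000 training data.
sample = [
    "He PRP B-NP",
    "reckons VBZ B-VP",
    "the DT B-NP",
    "current JJ I-NP",
    "account NN I-NP",
]
print(extract_chunks(sample))
# → [('NP', 'He'), ('VP', 'reckons'), ('NP', 'the current account')]
```

Chunking systems are evaluated by F1 over exactly these (type, span) groups.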
Logic2Text is a large-scale dataset with 10,753 descriptions involving common logic types paired with the underlying logical forms. The logical forms exhibit diverse graph structures over a free schema, which poses great challenges to a model's ability to understand the semantics.
OpenViDial is a large-scale open-domain dialogue dataset with visual contexts. The dialogue turns and visual contexts are extracted from movies and TV series, where each dialogue turn is paired with the corresponding visual context in which it takes place. OpenViDial contains a total number of 1.1 million dialogue turns, and thus 1.1 million visual contexts stored in images.
OPIEC is an Open Information Extraction (OIE) corpus, constructed from the entire English Wikipedia. It contains more than 341M triples. Each triple is accompanied by rich meta-data: NLP annotations (POS tag, NER tag, ...) for each token of the subject, relation, and object; the provenance sentence (along with its dependency parse and its order within the article); the original (golden) links contained in the Wikipedia article; and spatial/temporal qualifiers.
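A triple of this kind can be sketched as a record bundling the annotated tokens with their provenance. The container below only mirrors the meta-data listed above; the class and field names are assumptions, not OPIEC's actual schema.

```python
# Illustrative container for an OPIEC-style triple. Field names are
# hypothetical and only reflect the meta-data described in the text.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    text: str
    pos: str  # POS tag annotation
    ner: str  # NER tag annotation


@dataclass
class Triple:
    subj: list            # annotated tokens of the subject
    rel: list             # annotated tokens of the relation
    obj: list             # annotated tokens of the object
    sentence: str         # provenance sentence
    sentence_order: int   # position of the sentence within the article
    wiki_links: list      # original (golden) Wikipedia links
    space: Optional[str] = None  # spatial qualifier, if any
    time: Optional[str] = None   # temporal qualifier, if any

    def as_text(self):
        """Return the plain-text (subject, relation, object) strings."""
        join = lambda toks: " ".join(t.text for t in toks)
        return (join(self.subj), join(self.rel), join(self.obj))


t = Triple(
    subj=[Token("Barack", "NNP", "PERSON"), Token("Obama", "NNP", "PERSON")],
    rel=[Token("was", "VBD", "O"), Token("born", "VBN", "O"), Token("in", "IN", "O")],
    obj=[Token("Hawaii", "NNP", "LOCATION")],
    sentence="Barack Obama was born in Hawaii.",
    sentence_order=1,
    wiki_links=["Barack_Obama", "Hawaii"],
)
print(t.as_text())
# → ('Barack Obama', 'was born in', 'Hawaii')
```

Keeping token-level annotations alongside the triple is what lets downstream consumers filter or disambiguate without re-running NLP pipelines.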
Peyma is a Persian NER dataset to train and test NER systems. It is constructed by collecting documents from ten news websites.
A dataset consisting of multiple sentences whose clues are arranged by difficulty (from obscure to obvious) and uniquely identify a well-known entity, such as those found on Wikipedia.
SubjQA is a question answering dataset that focuses on subjective (as opposed to factual) questions and answers. The dataset consists of roughly 10,000 questions over reviews from 6 different domains: books, movies, grocery, electronics, TripAdvisor (i.e. hotels), and restaurants. Each question is paired with a review, and a span is highlighted as the answer to the question (with some questions having no answer). Moreover, both questions and answer spans are assigned a subjectivity label by annotators. A question such as "How much does this product weigh?" is factual (i.e., low subjectivity), while "Is this easy to use?" is subjective (i.e., high subjectivity).
TalkDown is a labelled dataset for condescension detection in context. The dataset is derived from Reddit, a set of online communities that is diverse in content and tone. The dataset is built from COMMENT and REPLY pairs in which the REPLY targets a specific quoted span (QUOTED) in the COMMENT as being condescending. The dataset contains 3,255 positive (condescending) samples and 3,255 negative ones.
A new dataset of textual stories describing events.
The Video-based Multimodal Summarization with Multimodal Output (VMSMO) corpus consists of 184,920 document-summary pairs, with 180,000 training pairs, 2,460 validation pairs, and 2,460 test pairs. The task for this dataset is generating an appropriate textual summary of an article and choosing a proper cover frame from the video accompanying the article.
L3CubeMahaSent is a large publicly available Marathi sentiment analysis dataset. It consists of Marathi tweets that are manually labelled.