Datasets

3,148 machine learning datasets

3,148 dataset results

ShapeWorld

ShapeWorld is a new evaluation methodology and framework for multimodal deep learning models, with a focus on formal-semantic style generalization capabilities. In this framework, artificial data is automatically generated according to predefined specifications. This controlled data generation makes it possible to introduce previously unseen instance configurations during evaluation, which consequently require the system to recombine learned concepts in novel ways.

23 papers0 benchmarksImages, Texts

CoSQA (Code Search and Question Answering)

CoSQA (Code Search and Question Answering) It includes 20,604 labels for pairs of natural language queries and codes, each annotated by at least 3 human annotators.

23 papers0 benchmarksTexts

mMARCO

mMARCO is a multilingual version of the MS MARCO passage ranking dataset comprising 8 languages that was created using machine translation.

23 papers0 benchmarksTexts

SituatedQA

SituatedQA is an open-retrieval QA dataset where systems must produce the correct answer to a question given the temporal or geographical context. Answers to the same question may change depending on the extralinguistic contexts (when and where the question was asked).

23 papers0 benchmarksTexts

MSRA CN NER (MSRA CN NER Dataset)

Simplified Chinese dataset for NER in The Third International Chinese Language Processing Bakeoff (2006), provided by Microsoft Research Asia (MSRA).

23 papers0 benchmarksTexts

PIE-Bench (Prompt-based Image Editing Benchmark)

PIE-Bench comprises 700 images featuring 10 distinct editing types. Images are evenly distributed in natural and artificial scenes (e.g., paintings) among four categories: animal, human, indoor, and outdoor. Each image in PIE-Bench includes five annotations: source image prompt, target image prompt, editing instruction, main editing body, and the editing mask. Notably, the editing mask annotation (indicating the anticipated editing region) is crucial in accurate metrics computations as we expect the editing to only occur within a designated area.

23 papers16 benchmarksImages, Texts

BEAT2 (BEAT-SMPLX-FLAME)

We propose EMAGE, a framework to generate full-body human gestures from audio and masked gestures, encompassing facial, local body, hands, and global movements. To achieve this, we first introduce BEAT2 (BEAT-SMPLX-FLAME), a new mesh-level holistic co-speech dataset. BEAT2 combines MoShed SMPLX body with FLAME head parameters and further refines the modeling of head, neck, and finger movements, offering a community-standardized, high-quality 3D motion captured dataset. EMAGE leverages masked body gesture priors during training to boost inference performance. It involves a Masked Audio Gesture Transformer, facilitating joint training on audio-to-gesture generation and masked gesture reconstruction to effectively encode audio and body gesture hints. Encoded body hints from masked gestures are then separately employed to generate facial and body movements. Moreover, EMAGE adaptively merges speech features from the audio's rhythm and content and utilizes four compositional VQ-VAEs to enh

23 papers9 benchmarks3d meshes, Audio, Texts, Time series

NExT-GQA

We study visually grounded VideoQA in response to the emerging trends of utilizing pretraining techniques for video-language understanding. Specifically, by forcing vision-language models (VLMs) to answer questions and simultaneously provide visual evidence, we seek to ascertain the extent to which the predictions of such techniques are genuinely anchored in relevant video content, versus spurious correlations from language or irrelevant visual context. Towards this, we construct NExT-GQA -- an extension of NExT-QA with 10.5K temporal grounding (or location) labels tied to the original QA pairs. With NExT-GQA, we scrutinize a variety of state-of-the-art VLMs. Through post-hoc attention analysis, we find that these models are weak in substantiating the answers despite their strong QA performance. This exposes a severe limitation of these models in making reliable predictions.

23 papers2 benchmarksTexts, Videos

Text8

Desc: About of Text8

22 papers2 benchmarksTexts

Django

The Django dataset is a dataset for code generation comprising of 16000 training, 1000 development and 1805 test annotations. Each data point consists of a line of Python code together with a manually created natural language description.

22 papers2 benchmarksTexts

AMR Bank (Abstract Meaning Representation)

The AMR Bank is a set of English sentences paired with simple, readable semantic representations. Version 3.0 released in 2020 consists of 59,255 sentences.

22 papers0 benchmarksTexts

iVQA (Instructional Video Question Answering)

An open-ended VideoQA benchmark that aims to: i) provide a well-defined evaluation by including five correct answer annotations per question and ii) avoid questions which can be answered without the video.

22 papers2 benchmarksTexts, Videos

KdConv (Knowledge-driven Conversation)

KdConv is a Chinese multi-domain Knowledge-driven Conversation dataset, grounding the topics in multi-turn conversations to knowledge graphs. KdConv contains 4.5K conversations from three domains (film, music, and travel), and 86K utterances with an average turn number of 19.0. These conversations contain in-depth discussions on related topics and natural transition between multiple topics, while the corpus can also used for exploration of transfer learning and domain adaptation.

22 papers0 benchmarksTexts

BookTest

BookTest is a new dataset similar to the popular Children’s Book Test (CBT), however more than 60 times larger.

22 papers0 benchmarksTexts

MOROCO (MOldavian and ROmanian Dialectal COrpus)

The MOldavian and ROmanian Dialectal COrpus (MOROCO) is a corpus that contains 33,564 samples of text (with over 10 million tokens) collected from the news domain. The samples belong to one of the following six topics: culture, finance, politics, science, sports and tech. The data set is divided into 21,719 samples for training, 5,921 samples for validation and another 5,924 samples for testing.

22 papers0 benchmarksTexts

PIT (Paraphrase and Semantic Similarity in Twitter)

Paraphrase and Semantic Similarity in Twitter (PIT) presents a constructed Twitter Paraphrase Corpus that contains 18,762 sentence pairs.

22 papers2 benchmarksTexts

Snopes

Fact-checking (FC) articles which contains pairs (multimodal tweet and a FC-article) from snopes.com.

22 papers1 benchmarksTexts

CxC (Crisscrossed Captions)

Crisscrossed Captions (CxC) contains 247,315 human-labeled annotations including positive and negative associations between image pairs, caption pairs and image-caption pairs.

22 papers1 benchmarksImages, Texts

XGLUE

XGLUE is an evaluation benchmark XGLUE,which is composed of 11 tasks that span 19 languages. For each task, the training data is only available in English. This means that to succeed at XGLUE, a model must have a strong zero-shot cross-lingual transfer capability to learn from the English data of a specific task and transfer what it learned to other languages. Comparing to its concurrent work XTREME, XGLUE has two characteristics: First, it includes cross-lingual NLU and cross-lingual NLG tasks at the same time; Second, besides including 5 existing cross-lingual tasks (i.e. NER, POS, MLQA, PAWS-X and XNLI), XGLUE selects 6 new tasks from Bing scenarios as well, including News Classification (NC), Query-Ad Matching (QADSM), Web Page Ranking (WPR), QA Matching (QAM), Question Generation (QG) and News Title Generation (NTG). Such diversities of languages, tasks and task origin provide a comprehensive benchmark for quantifying the quality of a pre-trained model on cross-lingual natural lan

22 papers2 benchmarksTexts

Douban Conversation Corpus

We release Douban Conversation Corpus, comprising a training data set, a development set and a test set for retrieval based chatbot. The statistics of Douban Conversation Corpus are shown in the following table.

22 papers0 benchmarksTexts

PreviousPage 30 of 158Next