Datasets

3,148 machine learning datasets

QALD-9-Plus

QALD-9-Plus is a dataset for Knowledge Graph Question Answering (KGQA) based on the well-known QALD-9.

3 papers | 1 benchmark | Texts

MuLD (Multitask Long Document Benchmark)

MuLD (Multitask Long Document Benchmark) is a set of 6 NLP tasks where the inputs consist of at least 10,000 words. The benchmark covers a wide variety of task types, including translation, summarization, question answering, and classification. Output lengths also vary widely, from a single-word classification label up to an output longer than the input text.

3 papers | 0 benchmarks | Texts

MCVQA (Multilingual and Code-mixed Visual Question Answering)

The MCVQA dataset consists of 248,349 training questions and 121,512 validation questions for real images in Hindi and Code-mixed. For each Hindi question, we also provide its 10 corresponding answers in Hindi.

3 papers | 0 benchmarks | Images, Texts

KMIR (Knowledge Memorization, Identification, and Reasoning)

KMIR (Knowledge Memorization, Identification, and Reasoning) is a benchmark that covers 3 types of knowledge, including general knowledge, domain-specific knowledge, and commonsense, and provides 184,348 well-designed questions. KMIR can be used for evaluating knowledge memorization, identification and reasoning abilities of language models.

3 papers | 0 benchmarks | Texts

IAM Dataset (A Comprehensive and Large-Scale Dataset for Integrated Argument Mining Tasks)

We introduce a large and comprehensive dataset to facilitate the study of several essential AM tasks in the debating system. In our work, we first review the existing subtasks (claim extraction, stance classification, evidence extraction), and then propose two integrated argument mining tasks: claim extraction with stance classification (CESC) and claim-evidence pair extraction (CEPE).

3 papers | 6 benchmarks | Texts

TRECVID-AVS19 (V3C1)

The dataset has been designed to represent true web videos in the wild, with good visual quality and diverse content characteristics. It is the test video collection for TRECVID-AVS2019 through TRECVID-AVS2021 and contains 1,082,649 web video clips, with even more diverse content, no predominant characteristics, and low self-similarity.

3 papers | 1 benchmark | Texts, Videos

HateScore (HateScore: Human-in-the-Loop and Neutral Korean Multi-label Online Hate Speech Dataset)

The dataset includes 2.2K neutral sentences from Wikipedia, 1.7K additionally labeled sentences generated by the Human-in-the-Loop procedure (based on the Korean Unsmile Dataset Base Model), and 7.1K rule-generated neutral sentences.

3 papers | 0 benchmarks | Texts

The Little Prince (The Little Prince Corpus)

This corpus is an annotation of the novel The Little Prince by Antoine de Saint-Exupéry, published in 1943. We were inspired by the UNL project to include this novel, so that different groups could compare representations on the same text.

3 papers | 2 benchmarks | Graphs, Texts

Twitter-COMMs

Detecting out-of-context media, such as "mis-captioned" images on Twitter, is a relevant problem, especially in domains of high public significance. Twitter-COMMs is a large-scale multimodal dataset with 884k tweets relevant to the topics of Climate Change, COVID-19, and Military Vehicles. This dataset can be used to develop methods to detect misinformation on social media platforms related to these three topics.

3 papers | 0 benchmarks | Images, Texts

ErAConD (Error Annotated Conversational Dialog Dataset for Grammatical Error Correction)

ErAConD is a novel GEC dataset consisting of parallel original and corrected utterances drawn from open-domain chatbot conversations.

3 papers | 0 benchmarks | Texts

Task2Dial

A novel dataset of document-grounded task-based dialogues, where an Information Giver (IG) provides instructions (by consulting a document) to an Information Follower (IF), so that the latter can successfully complete the task. In this unique setting, the IF can ask clarification questions which may not be grounded in the underlying document and require commonsense knowledge to be answered.

3 papers | 0 benchmarks | Texts

BigNews

Contains 3,689,229 English news articles on politics, gathered from 11 United States (US) media outlets covering a broad ideological spectrum.

3 papers | 0 benchmarks | Texts

WikiWiki

WikiWiki is a dataset for understanding entities and their place in a taxonomy of knowledge—their types. It consists of entities and passages from 10M Wikipedia articles linked to the Wikidata knowledge graph with 41K types.

3 papers | 0 benchmarks | Texts

DICE: a Dataset of Italian Crime Event news (from Gazzetta di Modena [2011-2021])

The dataset contains the main components of the news articles published online by the newspaper Gazzetta di Modena (https://gazzettadimodena.gelocal.it/modena): URL of the web page, title, sub-title, text, date of publication, and the crime category assigned to each news article by the author.

3 papers | 0 benchmarks | Texts

Hephaestus (Hephaestus: A large scale multitask dataset towards InSAR understanding)

Hephaestus is the first large-scale InSAR dataset. Driven by volcanic unrest detection, it provides 19,919 unique satellite frames annotated with a diverse set of labels. Moreover, each sample is accompanied by a textual description of its contents. The goal of this dataset is to boost research on the exploitation of interferometric data, enabling the application of state-of-the-art computer vision and NLP methods. Furthermore, the annotated dataset is bundled with a large archive of unlabeled frames to enable large-scale self-supervised learning. The final size of the dataset amounts to 110,573 interferograms.

3 papers | 0 benchmarks | Images, Texts

ConcurrentQA Benchmark

ConcurrentQA is a textual multi-hop QA benchmark that requires concurrent retrieval over multiple data distributions (i.e., Wikipedia and email data). The dataset follows the exact same schema and design as HotpotQA. It can be downloaded from https://github.com/facebookresearch/concurrentqa, which also contains model and result analysis code. This benchmark can also be used to study privacy when reasoning over data distributed across multiple privacy scopes, i.e., Wikipedia in the public domain and emails in the private domain.

3 papers | 0 benchmarks | Texts

Spanish TimeBank 1.0

Spanish TimeBank 1.0 was developed by researchers at Barcelona Media and consists of Spanish texts in the AnCora corpus annotated with temporal and event information according to the TimeML specification language.

3 papers | 3 benchmarks | Texts

Persuasion Strategies

Modeling what makes an advertisement persuasive, i.e., eliciting the desired response from the consumer, is critical to the study of propaganda, social psychology, and marketing. Despite its importance, computational modeling of persuasion in computer vision is still in its infancy, primarily due to the lack of benchmark datasets that can provide persuasion-strategy labels associated with ads. Motivated by persuasion literature in social psychology and marketing, we introduce an extensive vocabulary of persuasion strategies and build the first ad image corpus annotated with persuasion strategies. The dataset also provides image segmentation masks, which label persuasion strategies in the corresponding ad images on the test split.

3 papers | 0 benchmarks | Images, Texts

UzWordnet (The Uzbek Wordnet)

UzWordnet is a lexical-semantic database, or a “word-net”, for the (Northern) Uzbek language (native: O‘zbek tili) compatible with Princeton Wordnet. By releasing it as open source (see License), we aim to motivate, support, and increase the application of database and knowledge graph principles and techniques to the study of computational aspects of the (Northern) Uzbek language and, more generally, the usability of Uzbek within IT applications and the Internet.

3 papers | 0 benchmarks | Texts

MDIA

MDIA is a large-scale multilingual benchmark for dialogue generation. It covers real-life conversations in 46 languages across 19 language families.

3 papers | 0 benchmarks | Texts
