CUP (Context-sitUated Pun) is a dataset containing 4.5k tuples of context words and pun pairs, each labelled with whether they are compatible for composing a pun.
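Each CUP tuple pairs a context word with a pun pair and a compatibility label. The record layout below is a hypothetical sketch for illustration only; the field names and the example values are assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class CupExample:
    """One hypothetical CUP record: a context word, a pun pair,
    and whether they are compatible for composing a pun."""
    context_word: str
    pun_word: str          # surface word carrying the pun
    alternative_word: str  # the latent second sense of the pun pair
    compatible: bool       # can context word + pun pair compose a pun?

# Invented example record for illustration.
ex = CupExample(context_word="bank",
                pun_word="interest",
                alternative_word="interesting",
                compatible=True)
print(ex.compatible)
```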
LEPISZCZE is an open-source comprehensive benchmark for Polish NLP with a continuous-submission leaderboard, consolidating existing and new public Polish datasets for specific tasks.
PcMSP is a materials science information extraction dataset annotated from 305 open-access scientific articles. It simultaneously contains the synthesis sentences extracted from experimental paragraphs, as well as the entity mentions and intra-sentence relations.
HERDPhobia is an annotated hate-speech detection dataset concerning Fulani herders in Nigeria, covering three languages: English, Nigerian Pidgin, and Hausa.
MCSCSet is a large-scale specialist-annotated dataset of about 200k samples, designed for the task of Medical-domain Chinese Spelling Correction. MCSCSet involves: i) extensive real-world medical queries collected from Tencent Yidian, and ii) corresponding misspelled sentences manually annotated by medical specialists.
CREPE is a QA dataset containing a natural distribution of presupposition failures from online information-seeking forums. It consists of 8,400 Reddit questions, each annotated with (1) whether the question contains a false presupposition, and (2) if so, the presupposition and its correction.
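A CREPE question thus carries a binary false-presupposition flag plus, when the flag is set, the presupposition text and a written correction. The record below is a hypothetical sketch; all field names and values are assumptions, not CREPE's actual serialization.

```python
# Hypothetical CREPE-style record; field names and content are illustrative.
crepe_example = {
    "question": "Why do all metals rust when left outside?",
    "has_false_presupposition": True,
    "presupposition": "All metals rust.",
    "correction": "Only iron and some iron alloys rust; other metals "
                  "corrode differently or not at all.",
}

def needs_correction(record: dict) -> bool:
    """A record needs a written correction iff it carries a false presupposition."""
    return record["has_false_presupposition"]

print(needs_correction(crepe_example))
```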
Geoclidean-Elements dataset is derived from definitions in the first book of Euclid’s Elements, which focuses on plane geometry. Geoclidean-Elements includes 17 target concepts and 34 tasks.
General-purpose Visual Understanding Evaluation (G-VUE) is a comprehensive benchmark covering the full spectrum of visual cognitive abilities with four functional domains -- Perceive, Ground, Reason, and Act. The four domains are embodied in 11 carefully curated tasks, from 3D reconstruction to visual reasoning and manipulation.
OIR is a financial-domain dataset for the outbound intent recognition task, which aims to identify the intent of customer responses in outbound call scenarios.
ExHVV is a novel dataset offering natural language explanations of connotative roles for three types of entities -- heroes, villains, and victims -- encompassing 4,680 entities present in 3K memes.
The DialogUSR dataset covers 23 domains and was collected with a multi-step crowd-sourcing procedure. Each multi-intent query comprises 36.7 Chinese characters on average and assembles 3.6 single-intent queries (including initial and follow-up queries); the dataset is designed for the dialogue utterance splitting and reformulation task.
The OCR-IDL dataset comprises OCR annotations for a 26M-page subset of the large-scale IDL document library. These annotations, with a monetary value of over $20,000, are made publicly available to advance the Document Intelligence research field. The motivation is two-fold: first, making these annotations public helps level the playing field between research groups and companies that have large private datasets to pre-train on; second, a commercial OCR engine is used to obtain high-quality annotations, reducing the noise that OCR introduces in pretraining and downstream tasks.
CA4P-483 is a dataset designed to facilitate the sequence labeling tasks and regulation compliance identification between privacy policies and software. It contains 483 Chinese Android application privacy policies, over 11K sentences, and 52K fine-grained annotations.
Hansel is a human-annotated Chinese entity linking (EL) dataset focusing on tail entities and emerging entities.
JEMMA is an Extensible Java Dataset for ML4Code Applications: a large-scale dataset targeted at ML4Code. JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties (e.g., metrics, static analysis results) for 50,000 Java projects from the 50KC dataset, with over 1.2 million classes and over 8 million methods.
PropSegmEnt is a corpus of over 35K propositions annotated by expert human raters. The dataset structure resembles the tasks of (1) segmenting sentences within a document to the set of propositions, and (2) classifying the entailment relation of each proposition with respect to a different yet topically-aligned document, i.e. documents describing the same event or entity.
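The two PropSegmEnt tasks can be pictured as (1) splitting a sentence into its component propositions and (2) labeling each proposition's entailment status against a topically-aligned document. The toy sketch below is a hypothetical illustration; the sentence, labels, and structure are assumptions, not the corpus's actual format.

```python
# Toy sketch of the two PropSegmEnt task outputs; content is illustrative.
sentence = "The company was founded in 1998 and is based in Paris."

# Task 1: segment the sentence into a set of propositions.
propositions = [
    "The company was founded in 1998.",
    "The company is based in Paris.",
]

# Task 2: classify each proposition's entailment relation with respect
# to a different document describing the same entity.
labels = {
    propositions[0]: "entailed",
    propositions[1]: "neutral",
}
print(len(propositions), labels[propositions[0]])
```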
Dusha is a dataset for speech emotion recognition (SER) tasks. The corpus contains approximately 350 hours of data: more than 300,000 audio recordings of Russian speech with their transcripts. It is annotated using a crowd-sourcing platform and includes two subsets: acted and real-life.
The dataset consists of an extensive, high-quality cross-lingual fact-to-text corpus in 11 languages: Assamese (as), Bengali (bn), Gujarati (gu), Hindi (hi), Kannada (kn), Malayalam (ml), Marathi (mr), Oriya (or), Punjabi (pa), Tamil (ta), and Telugu (te), plus a monolingual dataset in English (en). It is the Wikipedia text <--> Wikidata KG aligned corpus used to train the data-to-text generation model. The train and validation splits are created using distant supervision, and the test data is generated through human annotation.
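An aligned instance in such a corpus pairs Wikidata-style facts with the Wikipedia sentence that verbalizes them in the target language. The structure below is a hypothetical sketch; the field names, triples, and language choice are assumptions, not the corpus's actual serialization.

```python
# Hypothetical fact-to-text alignment: (subject, relation, object) triples
# paired with the target-language sentence that verbalizes them.
instance = {
    "lang": "hi",  # one of the 11 target languages
    "facts": [
        ("Taj Mahal", "located in", "Agra"),
        ("Taj Mahal", "inception", "1653"),
    ],
    "text": "<target-language sentence verbalizing the facts>",
}
print(len(instance["facts"]), instance["lang"])
```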
MAUD is an expert-annotated merger agreement reading comprehension dataset based on the American Bar Association's 2021 Public Target Deal Points study, where lawyers and law students answered 92 questions about 152 merger agreements.
VTC is a large-scale multimodal dataset containing video-caption pairs (~300k) alongside comments that can be used for multimodal representation learning.