CUP (Context-sitUated Pun) is a dataset containing 4.5k tuples of context words and pun pairs, each labelled with whether they are compatible for composing a pun.
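Each CUP tuple pairs a context word with a pun pair and a compatibility label. The record layout below is a hypothetical sketch for illustration only; the field names and the example values are assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class CupExample:
    """One hypothetical CUP record: a context word, a pun pair,
    and whether they are compatible for composing a pun."""
    context_word: str
    pun_word: str          # surface word carrying the pun
    alternative_word: str  # the latent second sense of the pun pair
    compatible: bool       # can context word + pun pair compose a pun?

# Invented example record for illustration.
ex = CupExample(context_word="bank",
                pun_word="interest",
                alternative_word="interesting",
                compatible=True)
print(ex.compatible)
```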
LEPISZCZE is an open-source comprehensive benchmark for Polish NLP with a continuous-submission leaderboard, consolidating existing and new public Polish datasets for specific tasks.
PcMSP is a materials science information extraction dataset annotated from 305 open-access scientific articles. It simultaneously contains the synthesis sentences extracted from experimental paragraphs, as well as the entity mentions and intra-sentence relations.
HERDPhobia is an annotated hate-speech detection dataset concerning Fulani herders in Nigeria, covering three languages: English, Nigerian Pidgin, and Hausa.
MCSCSet is a large-scale specialist-annotated dataset of about 200k samples, designed for the task of Medical-domain Chinese Spelling Correction. MCSCSet involves: i) extensive real-world medical queries collected from Tencent Yidian, and ii) corresponding misspelled sentences manually annotated by medical specialists.
CREPE is a QA dataset containing a natural distribution of presupposition failures from online information-seeking forums. It consists of 8,400 Reddit questions, each annotated with (1) whether the question contains a false presupposition, and (2) if so, the presupposition and its correction.
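A CREPE question thus carries a binary false-presupposition flag plus, when the flag is set, the presupposition text and a written correction. The record below is a hypothetical sketch; all field names and values are assumptions, not CREPE's actual serialization.

```python
# Hypothetical CREPE-style record; field names and content are illustrative.
crepe_example = {
    "question": "Why do all metals rust when left outside?",
    "has_false_presupposition": True,
    "presupposition": "All metals rust.",
    "correction": "Only iron and some iron alloys rust; other metals "
                  "corrode differently or not at all.",
}

def needs_correction(record: dict) -> bool:
    """A record needs a written correction iff it carries a false presupposition."""
    return record["has_false_presupposition"]

print(needs_correction(crepe_example))
```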
Geoclidean-Elements dataset is derived from definitions in the first book of Euclid’s Elements, which focuses on plane geometry. Geoclidean-Elements includes 17 target concepts and 34 tasks.
General-purpose Visual Understanding Evaluation (G-VUE) is a comprehensive benchmark covering the full spectrum of visual cognitive abilities with four functional domains -- Perceive, Ground, Reason, and Act. The four domains are embodied in 11 carefully curated tasks, from 3D reconstruction to visual reasoning and manipulation.
OIR is a financial-domain dataset for the outbound intent recognition task, which aims to identify the intent of customer responses in outbound call scenarios.
ExHVV is a novel dataset offering natural language explanations of connotative roles for three types of entities -- heroes, villains, and victims -- encompassing 4,680 entities present in 3K memes.
The DialogUSR dataset covers 23 domains and was collected with a multi-step crowd-sourcing procedure. Each multi-intent query comprises 36.7 Chinese characters on average and assembles 3.6 single-intent queries (including initial and follow-up queries); the dataset is designed for the dialogue utterance splitting and reformulation task.
The OCR-IDL dataset comprises OCR annotations for a 26M-page subset of the large-scale IDL document library. These annotations, with a monetary value of over $20,000, are made publicly available to advance the Document Intelligence research field. The motivation is two-fold: first, making these annotations public helps level the playing field between research groups and companies that have large private datasets to pre-train on; second, a commercial OCR engine is used to obtain high-quality annotations, reducing the noise that OCR introduces in pretraining and downstream tasks.
CA4P-483 is a dataset designed to facilitate the sequence labeling tasks and regulation compliance identification between privacy policies and software. It contains 483 Chinese Android application privacy policies, over 11K sentences, and 52K fine-grained annotations.
Hansel is a human-annotated Chinese entity linking (EL) dataset focusing on tail entities and emerging entities.
JEMMA is an Extensible Java Dataset for ML4Code Applications: a large-scale dataset targeted at ML4Code. JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties (e.g., metrics, static analysis results) for 50,000 Java projects from the 50KC dataset, with over 1.2 million classes and over 8 million methods.
PropSegmEnt is a corpus of over 35K propositions annotated by expert human raters. The dataset structure resembles the tasks of (1) segmenting sentences within a document to the set of propositions, and (2) classifying the entailment relation of each proposition with respect to a different yet topically-aligned document, i.e. documents describing the same event or entity.
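The two PropSegmEnt tasks can be pictured as (1) splitting a sentence into its component propositions and (2) labeling each proposition's entailment status against a topically-aligned document. The toy sketch below is a hypothetical illustration; the sentence, labels, and structure are assumptions, not the corpus's actual format.

```python
# Toy sketch of the two PropSegmEnt task outputs; content is illustrative.
sentence = "The company was founded in 1998 and is based in Paris."

# Task 1: segment the sentence into a set of propositions.
propositions = [
    "The company was founded in 1998.",
    "The company is based in Paris.",
]

# Task 2: classify each proposition's entailment relation with respect
# to a different document describing the same entity.
labels = {
    propositions[0]: "entailed",
    propositions[1]: "neutral",
}
print(len(propositions), labels[propositions[0]])
```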
Dusha is a dataset for speech emotion recognition (SER) tasks. The corpus contains approximately 350 hours of data: more than 300,000 audio recordings of Russian speech with their transcripts. It is annotated using a crowd-sourcing platform and includes two subsets: acted and real-life.
The dataset consists of an extensive, high-quality cross-lingual fact-to-text corpus in 11 languages: Assamese (as), Bengali (bn), Gujarati (gu), Hindi (hi), Kannada (kn), Malayalam (ml), Marathi (mr), Oriya (or), Punjabi (pa), Tamil (ta), and Telugu (te), plus a monolingual dataset in English (en). It is the Wikipedia text <--> Wikidata KG aligned corpus used to train the data-to-text generation model. The train and validation splits are created using distant supervision, and the test data is generated through human annotation.
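An aligned instance in such a corpus pairs Wikidata-style facts with the Wikipedia sentence that verbalizes them in the target language. The structure below is a hypothetical sketch; the field names, triples, and language choice are assumptions, not the corpus's actual serialization.

```python
# Hypothetical fact-to-text alignment: (subject, relation, object) triples
# paired with the target-language sentence that verbalizes them.
instance = {
    "lang": "hi",  # one of the 11 target languages
    "facts": [
        ("Taj Mahal", "located in", "Agra"),
        ("Taj Mahal", "inception", "1653"),
    ],
    "text": "<target-language sentence verbalizing the facts>",
}
print(len(instance["facts"]), instance["lang"])
```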
MAUD is an expert-annotated merger agreement reading comprehension dataset based on the American Bar Association's 2021 Public Target Deal Points study, where lawyers and law students answered 92 questions about 152 merger agreements.
VTC is a large-scale multimodal dataset containing video-caption pairs (~300k) alongside comments that can be used for multimodal representation learning.