3,148 machine learning datasets
Hinglish-TOP is a human-annotated code-switched semantic parsing dataset containing 10k human annotations for Hindi-English (Hinglish) code-switched utterances, and over 170K CST5-generated code-switched utterances from the TOPv2 dataset.
CVE stands for Common Vulnerabilities and Exposures. CVE is a glossary that classifies vulnerabilities. The glossary analyzes vulnerabilities and then uses the Common Vulnerability Scoring System (CVSS) to evaluate the threat level of a vulnerability. A CVE score is often used to prioritize which vulnerabilities to address first.
This dataset, designed for post-OCR text correction, contains around 218K sentences (1.5 million words) from 30 different books.
MM-Locate-News is a dataset for location estimation of news. It consists of 6,395 news articles covering 237 cities and 152 countries across all continents, as well as multiple domains such as health, environment, and politics. The dataset is collected in a weakly-supervised manner, and multiple data cleaning steps are applied to remove articles with potentially inaccurate geolocation information. The acquired dataset addresses drawbacks of other datasets such as BreakingNews, as it considers the multimodal content of news to label the corresponding location.
ec-darkpattern is a dataset for dark pattern detection, accompanied by baseline detection performance from state-of-the-art machine learning methods. The original dataset was obtained from Mathur et al.’s 2019 study [11kScale] and consists of 1,818 dark pattern texts from shopping sites. Negative samples, i.e., non-dark pattern texts, were collected by retrieving texts from the same websites as Mathur et al.’s dataset.
Comet is a dataset which contains 11.5k user-assistant dialogs (totalling 103k utterances), grounded in simulated personal memory graphs.
SummZoo is a benchmark consisting of 8 diverse summarization tasks with multiple sets of few-shot samples for each task, covering both monologue and dialogue domains.
IDK-MRC is an Indonesian Machine Reading Comprehension (MRC) dataset consisting of more than 10K questions in total, including over 5K unanswerable questions with diverse question types.
Chinese Character Stroke Extraction (CCSE) is a benchmark containing two large-scale datasets: Kaiti CCSE (CCSE-Kai) and Handwritten CCSE (CCSE-HW). It is designed for stroke extraction problems.
Kor-Learner is a Korean grammatical error correction (GEC) dataset made from the NIKL learner corpus, which contains essays written by Korean learners and grammatical error correction annotations by their tutors in a morpheme-level XML file format. It contains more than 28K sentence pairs.
Kor-Native is a Korean grammatical error correction (GEC) dataset built from grammatically correct sentences collected from two sources. The correct sentences were read aloud using the Google Text-to-Speech (TTS) system, and members of the general public were tasked with transcribing them from dictation. It contains more than 17K sentence pairs.
Kor-Lang8 is a Korean grammatical error correction (GEC) dataset extracted from the NAIST Lang-8 Learner Corpora by the language label. It contains more than 109K sentence pairs.
ExPUNations is a humor dataset with extensive and fine-grained annotations specifically for puns. The dataset is designed for two new tasks: explanation generation to aid pun classification, and keyword-conditioned pun generation.
Definitions of jargon/terms in computer science, mathematics, and physics
A modification of the ShEMO dataset made with the help of an Automatic Speech Recognition (ASR) system.
Brazilian Protest is a dataset for event filtering that focuses on protests in multi-modal social media data, with most of the text in Portuguese. The dataset contains 4.5 million tweets, of which 155 thousand are associated with a URL to an uncurated article and 370 thousand have associated media content (including the media of the uncurated articles).
NCTE Transcripts consists of 1,660 4th and 5th grade elementary mathematics observations, each 45-60 minutes long, collected by the National Center for Teacher Effectiveness (NCTE) between 2010 and 2013. The anonymized transcripts represent data from 317 teachers across 4 school districts that serve largely historically marginalized students. The transcripts come with rich metadata, including turn-level annotations for dialogic discourse moves, classroom observation scores, demographic information, survey responses, and student test scores.
Spiced is a paraphrase dataset of scientific findings annotated for degree of information change. Spiced contains 6,000 scientific finding pairs extracted from news stories, social media discussions, and full texts of original papers.
IMaSC is a Malayalam text and speech corpus made available by ICFOSS for the purpose of developing speech technology for Malayalam, particularly text-to-speech. The corpus contains 34,473 text-audio pairs of Malayalam sentences spoken by 8 speakers, totalling approximately 50 hours of audio.
ProNCI consists of 22.5K proper noun compounds along with their free-form semantic interpretations. ProNCI is 60 times larger than prior noun compound datasets and also includes non-compositional examples.