Datasets

3,148 machine learning datasets

Hinglish-TOP

Hinglish-TOP is a human-annotated, code-switched semantic parsing dataset containing 10K human annotations of Hindi-English (Hinglish) code-switched utterances and over 170K CST5-generated code-switched utterances derived from the TOPv2 dataset.

1 paper · 0 benchmarks · Texts

CVE (Common Vulnerabilities and Exposures)

CVE stands for Common Vulnerabilities and Exposures. CVE is a glossary that classifies vulnerabilities. Each vulnerability is analyzed and then scored with the Common Vulnerability Scoring System (CVSS) to evaluate its threat level. The resulting score is often used to prioritize which vulnerabilities to address first.

1 paper · 0 benchmarks · Texts

Dataset for Post-OCR text correction in Sanskrit

This dataset, designed for post-OCR text correction, contains around 218K sentences (1.5 million words) drawn from 30 different books.

1 paper · 0 benchmarks · Images, Texts

MM-Locate-News

MM-Locate-News is a dataset for location estimation of news. It consists of 6,395 news articles covering 237 cities and 152 countries across all continents, as well as multiple domains such as health, environment, and politics. The dataset is collected in a weakly supervised manner, and multiple data cleaning steps are applied to remove articles with potentially inaccurate geolocation information. The dataset addresses drawbacks of other datasets such as BreakingNews, as it considers the multimodal content of news to label the corresponding location.

1 paper · 0 benchmarks · Images, Texts

ec-darkpattern

ec-darkpattern is a dataset for dark pattern detection, provided with baseline detection performance established using state-of-the-art machine learning methods. The dark pattern texts were obtained from Mathur et al.'s 2019 study [11kScale] and consist of 1,818 dark pattern texts from shopping sites. Negative samples, i.e., non-dark pattern texts, were collected by retrieving texts from the same websites as Mathur et al.'s dataset.

1 paper · 0 benchmarks · Texts

Comet

Comet is a dataset of 11.5K user-assistant dialogs (totalling 103K utterances) grounded in simulated personal memory graphs.

1 paper · 0 benchmarks · Texts

SummZoo

SummZoo is a benchmark consisting of 8 diverse summarization tasks, with multiple sets of few-shot samples for each task, covering both monologue and dialogue domains.

1 paper · 0 benchmarks · Texts

IDK-MRC

IDK-MRC is an Indonesian Machine Reading Comprehension (MRC) dataset consisting of more than 10K questions in total, over 5K of which are unanswerable, with diverse question types.

1 paper · 0 benchmarks · Texts

CCSE (Chinese Character Stroke Extraction)

Chinese Character Stroke Extraction (CCSE) is a benchmark containing two large-scale datasets: Kaiti CCSE (CCSE-Kai) and Handwritten CCSE (CCSE-HW). It is designed for stroke extraction problems.

1 paper · 0 benchmarks · Images, Texts

Kor-Learner (Korean Learner Corpus)

Kor-Learner is a Korean grammatical error correction (GEC) dataset built from the NIKL learner corpus, which contains essays written by learners of Korean together with grammatical error correction annotations made by their tutors, stored in a morpheme-level XML file format. It contains more than 28K sentence pairs.

1 paper · 0 benchmarks · Texts

Kor-Native (Native Korean Corpus)

Kor-Native is a Korean grammatical error correction (GEC) dataset. Grammatically correct sentences were collected from two sources and read aloud using the Google Text-to-Speech (TTS) system; members of the general public were then tasked with transcribing the sentences as dictated. It contains more than 17K sentence pairs.

1 paper · 0 benchmarks · Texts

Kor-Lang8 (Lang-8 Korean Corpus)

Kor-Lang8 is a Korean grammatical error correction (GEC) dataset extracted from the NAIST Lang-8 Learner Corpora by the language label. It contains more than 109K sentence pairs.

1 paper · 0 benchmarks · Texts

ExPUNations

ExPUNations is a humor dataset with extensive and fine-grained annotations specifically for puns. It is designed for two new tasks: explanation generation to aid pun classification, and keyword-conditioned pun generation.

1 paper · 0 benchmarks · Texts

UJ-CS/Math/Phy

UJ-CS/Math/Phy contains definitions of jargon and terms in computer science, mathematics, and physics.

1 paper · 0 benchmarks · Texts

modified_shemo

A modification of the ShEMO dataset, made with the help of an Automatic Speech Recognition (ASR) system.

1 paper · 0 benchmarks · Speech, Texts

Brazilian Protest

Brazilian Protest is a dataset for event filtering that focuses on protests in multimodal social media data, with most of the text in Portuguese. The dataset contains 4.5 million tweets, of which 155 thousand are associated with a URL to an uncurated article and 370 thousand have associated media content (including the media of the uncurated articles).

1 paper · 0 benchmarks · Texts

NCTE Transcripts

NCTE Transcripts consists of 1,660 4th and 5th grade elementary mathematics observations, each 45 to 60 minutes long, collected by the National Center for Teacher Effectiveness (NCTE) between 2010 and 2013. The anonymized transcripts represent data from 317 teachers across 4 school districts that serve largely historically marginalized students. The transcripts come with rich metadata, including turn-level annotations for dialogic discourse moves, classroom observation scores, demographic information, survey responses, and student test scores.

1 paper · 0 benchmarks · Texts

Spiced

Spiced is a paraphrase dataset of scientific findings annotated for degree of information change. Spiced contains 6,000 scientific finding pairs extracted from news stories, social media discussions, and full texts of original papers.

1 paper · 0 benchmarks · Texts

IMaSC (ICFOSS Malayalam Speech Corpus)

IMaSC is a Malayalam text and speech corpus made available by ICFOSS for developing speech technology for Malayalam, particularly text-to-speech. The corpus contains 34,473 text-audio pairs of Malayalam sentences spoken by 8 speakers, totalling approximately 50 hours of audio.

1 paper · 0 benchmarks · Audio, Texts

ProNCI

ProNCI consists of 22.5K proper noun compounds along with their free-form semantic interpretations. ProNCI is 60 times larger than prior noun compound datasets and also includes non-compositional examples.

1 paper · 0 benchmarks · Texts
