Datasets

3,148 machine learning datasets

ProofNet#

ProofNet# is an evaluation benchmark derived from the original ProofNet, which contains 371 paired examples of informal undergraduate mathematical statements and their corresponding formalizations. Updated for Lean 4, ProofNet# corrects formalization errors and retains the original structure and content.

4 papers · 0 benchmarks · Texts

Math500

math 500

4 papers · 0 benchmarks · Texts

HierarCaps

Images with paired ground-truth caption hierarchies

4 papers · 0 benchmarks · Images, Texts

M$^3$-VOS (Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation)

M$^3$-VOS (Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation) is a new benchmark for verifying the ability of models to understand object phases. It consists of 479 high-resolution videos spanning over 10 distinct everyday scenarios. We collected 205,181 masks, with an average track duration of 14.27s. M$^3$-VOS covers 120+ categories of objects across 6 phases within 14 scenarios, encompassing 23 specific phase transitions.

4 papers · 2 benchmarks · Images, Texts, Videos

ImgEdit

ImgEdit is a large-scale, high-quality image-editing dataset comprising 1.2 million carefully curated edit pairs, which contain both novel and complex single-turn edits, as well as challenging multi-turn tasks.

4 papers · 0 benchmarks · Images, Texts

ACL Title and Abstract Dataset

This dataset gathers 10,874 title and abstract pairs from the ACL Anthology Network (until 2016).

3 papers · 4 benchmarks · Texts

LeNER-Br

LeNER-Br is a dataset for named entity recognition (NER) in Brazilian legal text.

3 papers · 2 benchmarks · Texts

Jamendo Corpus

The Jamendo Corpus is a voice detection dataset consisting of 93 songs released under Creative Commons licenses on the Jamendo free music sharing website. Segments of each song are annotated as “voice” (sung or spoken) or “no-voice”. The songs total about 6 hours of music. The files are all from different artists and represent various genres of mainstream commercial music. The Jamendo audio files are encoded in stereo Vorbis OGG at 44.1kHz with a 112KB/s bitrate. The original split contains 61, 16 and 16 songs in the training, validation and test sets, respectively.

3 papers · 0 benchmarks · Audio, Texts

M-VAD Names (M-VAD Names Dataset)

The dataset contains the annotations of characters' visual appearances, in the form of tracks of face bounding boxes, and the associations with characters' textual mentions, when available. The detection and annotation of the visual appearances of characters in each video clip of each movie was achieved through a semi-automatic approach. The released dataset contains more than 24k annotated video clips, including 63k visual tracks and 34k textual mentions, all associated with their character identities.

3 papers · 1 benchmark · Texts, Videos

FOBIE (Focused Open Biological Information Extraction)

The Focused Open Biological Information Extraction (FOBIE) dataset aims to support information extraction (IE) from Computer-Aided Biomimetics. The dataset contains ~1,500 sentences from scientific biological texts. These sentences are annotated with TRADE-OFFS and syntactically similar relations between unbounded arguments, as well as argument-modifiers.

3 papers · 0 benchmarks · Biology, Texts

COVID-Q

COVID-Q consists of COVID-19 questions which have been annotated into a broad category (e.g. Transmission, Prevention) and a more specific class such that questions in the same class are all asking the same thing.

3 papers · 0 benchmarks · Texts

ClarQ

ClarQ consists of ∼2M examples distributed across 173 domains of Stack Exchange. This dataset is meant for training and evaluating clarification question generation systems.

3 papers · 0 benchmarks · Texts

GitHub Typo Corpus

Are you the kind of person who makes a lot of typos when writing code? Or are you the one who fixes them by making "fix typo" commits? Either way, thank you—you contributed to the state-of-the-art in the NLP field.

3 papers · 0 benchmarks · Texts

Shmoop Corpus

Shmoop Corpus is a dataset of 231 stories paired with detailed multi-paragraph summaries for each individual chapter (7,234 chapters), where each summary is chronologically aligned with its story chapter. From the corpus, a set of common NLP tasks is constructed, including Cloze-form question answering and a simplified form of abstractive summarization, as benchmarks for reading comprehension on stories.

3 papers · 0 benchmarks · Texts

AmbigQA

AmbigQA is a new open-domain question answering task that involves predicting a set of question-answer pairs, where every plausible answer is paired with a disambiguated rewrite of the original question. The dataset covers 14,042 questions from NQ-open, an existing open-domain QA benchmark.

3 papers · 0 benchmarks · Texts

COVID-CQ

COVID-CQ is a stance data set of user-generated content on Twitter in the context of COVID-19.

3 papers · 0 benchmarks · Texts

CS (Chinese Simile)

This dataset is constructed from free-access online fiction tagged with sci-fi, urban novel, love story, youth, etc. It is used for Writing Polishment with Simile (WPS), a task that aims to polish plain text with similes. All similes are extracted with rich regular expressions, and the extraction precision is estimated at 92% by labelling 500 randomly extracted samples. It contains 5M samples for training and 2.5k each for validation and test.

3 papers · 0 benchmarks · Texts

DAWT (Densely Annotated Wikipedia Texts)

The DAWT dataset consists of Densely Annotated Wikipedia Texts across multiple languages. The annotations include labeled text mentions mapping to entities (represented by their Freebase machine IDs) as well as the type of each entity. The dataset contains a total of 13.6M articles, 5.0B tokens, and 13.8M mention-entity co-occurrences. DAWT contains 4.8 times more anchor-text-to-entity links than originally present in the Wikipedia markup. It spans several languages, including English, Spanish, Italian, German, French and Arabic.

3 papers · 0 benchmarks · Texts

DMQA (DeepMind Q&A)

The DeepMind Q&A Dataset consists of two question answering datasets, CNN and DailyMail. The datasets contain about 90k and 197k documents, respectively, and each document is accompanied by approximately 4 questions on average. Each question is a sentence with one missing word or phrase that can be found in the accompanying document/context.

3 papers · 0 benchmarks · Texts

esXNLI

esXNLI is a bilingual NLI dataset. It comprises 2,490 examples from 5 different genres that were originally annotated in Spanish, and translated into English by professional translators. It serves as a counterpoint to XNLI, which was originally annotated in English and translated into 14 other languages, including Spanish. The dataset was conceived to be used in conjunction with the XNLI development set to analyse the effect of translation in cross-lingual transfer learning.

3 papers · 0 benchmarks · Texts
