BugRepo maintains a collection of bug reports that are publicly available for research purposes. Bug reports are a primary data source for NLP-based research in software engineering. The datasets are categorized into several research directions.
The dataset covers five domains (synthetic, document, street view, handwritten, and car license) and contains over five million images.
To automatically generate Python and assembly programs used for security exploits, we curated a large dataset for training neural machine translation (NMT) models. A sample in the dataset consists of a snippet of code from these exploits and its corresponding description in English. We collected exploits from publicly available databases (exploitdb, shellstorm), public repositories (e.g., GitHub), and programming guidelines. In particular, we focused on exploits targeting Linux, the most common OS for security-critical network services, running on IA-32 (i.e., the 32-bit version of the x86 Intel Architecture). The dataset is stored in the folder EVIL/datasets and consists of two parts: i) Encoders: a Python dataset containing the Python code that exploits use to encode the shellcode; ii) Decoders: an assembly dataset containing shellcode and the decoders that revert the encoding.
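A minimal sketch of consuming the paired samples, assuming each part under EVIL/datasets is a CSV file with `description` and `snippet` columns (the actual file names and column layout should be checked against the repository):

```python
import csv
from pathlib import Path

def load_pairs(csv_path):
    """Yield (description, code_snippet) pairs from one dataset part.

    Assumes a CSV with 'description' and 'snippet' columns; adjust the
    column names to whatever the released files actually use.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield row["description"], row["snippet"]

# Hypothetical file names; see EVIL/datasets for the real ones.
encoders = list(load_pairs(Path("EVIL/datasets") / "encoders.csv"))
decoders = list(load_pairs(Path("EVIL/datasets") / "decoders.csv"))
print(len(encoders), "encoder samples,", len(decoders), "decoder samples")
```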
Fashion-MMT is a large-scale bilingual product description dataset containing over 114k noisy and 40k manually cleaned description translations, each paired with multiple product images.
MuCo-VQA consists of large-scale (3.7M) multilingual and code-mixed VQA datasets in five languages: Hindi (hi), Bengali (bn), Spanish (es), German (de), and French (fr), and five code-mixed language pairs: en-hi, en-bn, en-fr, en-de, and en-es.
VGaokao is a verification-style reading comprehension dataset built from questions designed for evaluating native speakers.
EmoCause is a dataset of annotated emotion cause words in emotional situations, drawn from the EmpatheticDialogues validation and test sets. The goal is to recognize emotion cause words in sentences by training only on sentence-level emotion labels, without word-level labels (i.e., weakly supervised emotion cause recognition).
The Saint Gall dataset contains handwritten historical manuscripts written in Latin that date back to the 9th century. It consists of 60 pages, 1,410 text lines, and 11,597 words.
EFO-1-QA is a new benchmark for the combinatorial generalizability of Complex Query Answering (CQA) models, covering 301 different query types, 20 times more than existing datasets.
BiRdQA is a bilingual multiple-choice question answering dataset with 6,614 English riddles and 8,751 Chinese riddles.
The GermEval dataset is a resource for natural language processing (NLP) tasks in German, specifically named entity recognition (NER).
OpenViDial 2.0 is a larger-scale open-domain multi-modal dialogue dataset than the previous version, OpenViDial 1.0. It contains a total of 5.6 million dialogue turns extracted from movies and TV series from different sources, and each dialogue turn is paired with its corresponding visual context.
EDGAR-CORPUS is a novel corpus comprising the annual reports of all publicly traded companies in the US over a period of more than 25 years. All reports are downloaded, split into their corresponding items (sections), and provided in a clean, easy-to-use JSON format.
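A minimal sketch of reading one EDGAR-CORPUS report, assuming each filing is a single JSON object whose keys identify the items (the exact key names, e.g. "item_1" or "item_7", and the per-file layout should be verified against the released corpus):

```python
import json

# Hypothetical file name; the corpus provides one JSON document per annual report.
with open("example_annual_report.json", encoding="utf-8") as f:
    report = json.load(f)

# Assumes each key maps an item (section) identifier to its plain text.
for item, text in report.items():
    print(item, "->", len(str(text).split()), "words")
```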
TLDR9+ is a large-scale summarization dataset containing over 9 million training instances extracted from the Reddit discussion forum. It is specifically gathered for extreme summarization (i.e., generating a one-sentence summary with high compression and abstraction) and is more than twice as large as the previously proposed dataset. With the help of human annotations, a more fine-grained dataset, TLDRHQ, is distilled by sampling high-quality instances from TLDR9+.
The dataset contains training and evaluation data for 12 languages: Vietnamese, Romanian, Latvian, Czech, Polish, Slovak, Irish, Hungarian, French, Turkish, Spanish, and Croatian.
A large-scale machine reading comprehension dataset in the Urdu language.
A version of the CMU Movie Summary Corpus (http://www.cs.cmu.edu/~ark/personas/), originally scraped from plot summaries on Wikipedia, with some cleaning applied; sentences are turned into events and sorted into "genres" via LDA.
The dataset consists of biomedical articles describing randomized controlled trials (RCTs) that compare multiple treatments. Each article has multiple questions, or 'prompts', associated with it. These prompts ask about the relationship between an intervention and a comparator with respect to an outcome, as reported in the trial. For example, a prompt may ask about the reported effect of aspirin, as compared to placebo, on the duration of headaches.
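A minimal sketch of how one such prompt could be represented; the field names, and especially the answer label set, are illustrative assumptions rather than the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Prompt:
    """One question about a trial, tying an intervention and a comparator to an outcome."""
    intervention: str
    comparator: str
    outcome: str
    # Illustrative label; the released data defines its own answer classes.
    reported_effect: str

example = Prompt(
    intervention="aspirin",
    comparator="placebo",
    outcome="duration of headaches",
    reported_effect="significantly decreased",
)
print(example)
```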
SSD (Sub-slot Dialog) dataset: the dataset for the ACL 2022 paper "A Slot Is Not Built in One Utterance: Spoken Language Dialogs with Sub-Slots".
A collection of 385,705 scientific abstracts about Cognitive Control and their GPT-3 embeddings.