BugRepo maintains a collection of bug reports that are publicly available for research purposes. Bug reports are a primary data source for NLP-based research in software engineering. The datasets are categorized into several research directions.
The dataset covers five domains (synthetic, document, street view, handwritten, and car license) and contains over five million images.
To automatically generate Python and assembly programs used for security exploits, we curated a large dataset for training neural machine translation (NMT) models. A sample in the dataset consists of a snippet of code from these exploits and its corresponding description in English. We collected exploits from publicly available databases (exploitdb, shellstorm), public repositories (e.g., GitHub), and programming guidelines. In particular, we focused on exploits targeting Linux, the most common OS for security-critical network services, running on IA-32 (i.e., the 32-bit version of the x86 Intel Architecture). The dataset is stored in the folder EVIL/datasets and consists of two parts: i) Encoders: a Python dataset containing the Python code that exploits use to encode the shellcode; ii) Decoders: an assembly dataset containing shellcode and the decoders that revert the encoding.
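A minimal sketch of consuming the paired samples, assuming each part under EVIL/datasets is a CSV file with `description` and `snippet` columns (the actual file names and column layout should be checked against the repository):

```python
import csv
from pathlib import Path

def load_pairs(csv_path):
    """Yield (description, code_snippet) pairs from one dataset part.

    Assumes a CSV with 'description' and 'snippet' columns; adjust the
    column names to whatever the released files actually use.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield row["description"], row["snippet"]

# Hypothetical file names; see EVIL/datasets for the real ones.
encoders = list(load_pairs(Path("EVIL/datasets") / "encoders.csv"))
decoders = list(load_pairs(Path("EVIL/datasets") / "decoders.csv"))
print(len(encoders), "encoder samples,", len(decoders), "decoder samples")
```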
Fashion-MMT is a large-scale bilingual product description dataset containing over 114k noisy and 40k manually cleaned description translations, each paired with multiple product images.
MuCo-VQA consists of large-scale (3.7M) multilingual and code-mixed VQA datasets in five languages: Hindi (hi), Bengali (bn), Spanish (es), German (de), and French (fr), and five code-mixed language pairs: en-hi, en-bn, en-fr, en-de, and en-es.
VGaokao is a verification-style reading comprehension dataset built from questions designed for evaluating native speakers.
EmoCause is a dataset of annotated emotion cause words in emotional situations, drawn from the EmpatheticDialogues validation and test sets. The goal is to recognize emotion cause words in sentences by training only on sentence-level emotion labels, without word-level labels (i.e., weakly supervised emotion cause recognition).
The Saint Gall dataset contains handwritten historical manuscripts written in Latin that date back to the 9th century. It consists of 60 pages, 1,410 text lines, and 11,597 words.
EFO-1-QA is a new benchmark for the combinatorial generalizability of Complex Query Answering (CQA) models, covering 301 different query types, 20 times more than existing datasets.
BiRdQA is a bilingual multiple-choice question answering dataset with 6,614 English riddles and 8,751 Chinese riddles.
The GermEval dataset is a resource for natural language processing (NLP) tasks in German, specifically named entity recognition (NER).
OpenViDial 2.0 is a larger-scale open-domain multi-modal dialogue dataset than the previous version, OpenViDial 1.0. It contains a total of 5.6 million dialogue turns extracted from movies and TV series from different sources, and each dialogue turn is paired with its corresponding visual context.
EDGAR-CORPUS is a novel corpus comprising the annual reports of all publicly traded companies in the US over a period of more than 25 years. All reports are downloaded, split into their corresponding items (sections), and provided in a clean, easy-to-use JSON format.
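A minimal sketch of reading one EDGAR-CORPUS report, assuming each filing is a single JSON object whose keys identify the items (the exact key names, e.g. "item_1" or "item_7", and the per-file layout should be verified against the released corpus):

```python
import json

# Hypothetical file name; the corpus provides one JSON document per annual report.
with open("example_annual_report.json", encoding="utf-8") as f:
    report = json.load(f)

# Assumes each key maps an item (section) identifier to its plain text.
for item, text in report.items():
    print(item, "->", len(str(text).split()), "words")
```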
TLDR9+ is a large-scale summarization dataset containing over 9 million training instances extracted from the Reddit discussion forum. It is specifically gathered for extreme summarization (i.e., generating a one-sentence summary with high compression and abstraction) and is more than twice as large as the previously proposed dataset. With the help of human annotations, a more fine-grained dataset, TLDRHQ, is distilled by sampling high-quality instances from TLDR9+.
The dataset contains training and evaluation data for 12 languages: Vietnamese, Romanian, Latvian, Czech, Polish, Slovak, Irish, Hungarian, French, Turkish, Spanish, and Croatian.
A large-scale machine reading comprehension dataset in the Urdu language.
A version of the CMU Movie Summary Corpus (http://www.cs.cmu.edu/~ark/personas/), originally scraped from plot summaries on Wikipedia, with some cleaning applied; sentences are turned into events and sorted into "genres" via LDA.
The dataset consists of biomedical articles describing randomized controlled trials (RCTs) that compare multiple treatments. Each article has multiple questions, or 'prompts', associated with it. These prompts ask about the relationship between an intervention and a comparator with respect to an outcome, as reported in the trial. For example, a prompt may ask about the reported effect of aspirin, as compared to placebo, on the duration of headaches.
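A minimal sketch of how one such prompt could be represented; the field names, and especially the answer label set, are illustrative assumptions rather than the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Prompt:
    """One question about a trial, tying an intervention and a comparator to an outcome."""
    intervention: str
    comparator: str
    outcome: str
    # Illustrative label; the released data defines its own answer classes.
    reported_effect: str

example = Prompt(
    intervention="aspirin",
    comparator="placebo",
    outcome="duration of headaches",
    reported_effect="significantly decreased",
)
print(example)
```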
SSD (Sub-slot Dialog) dataset: the dataset for the ACL 2022 paper "A Slot Is Not Built in One Utterance: Spoken Language Dialogs with Sub-Slots".
A collection of 385,705 scientific abstracts about Cognitive Control and their GPT-3 embeddings.