Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

3,148 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

3,148 dataset results

Narvik Road Dataset (DIT4BEARs Smart Road Dataset)

A dataset from the DIT4BEARs internship project at UiT The Arctic University of Norway.

1 paper · 0 benchmarks · Texts

MSJudge

MSJudge is a challenging dataset from real courtrooms for predicting legal judgments from the genuine inputs of a case: the plaintiff's claims and the court debate data. The case facts are recognized automatically by comprehensively understanding the multi-role dialogues of the court debate, and the model then learns to discriminate among the claims and reach the final judgment through multi-task learning.

1 paper · 0 benchmarks · Texts

TFix's Code Patches Data

The dataset contains more than 100k code patch pairs extracted from open-source projects on GitHub. Each pair comprises the erroneous and the fixed version of the corresponding code snippet. Rather than whole files, the snippets are extracted to focus on the problematic region (the error line plus surrounding lines). For each sample, the repository name, the commit id, and the file names are provided so that the complete files can be retrieved if needed.

1 paper · 4 benchmarks · Texts
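For illustration, a record in such a dataset might be handled as sketched below. The field names (`repo`, `commit_id`, `file_name`, `buggy`, `fixed`) are assumptions made for this sketch, not the dataset's actual schema:

```python
from dataclasses import dataclass

# Hypothetical record layout for one patch pair; the released
# dataset's actual field names may differ.
@dataclass
class PatchPair:
    repo: str        # e.g. "owner/project"
    commit_id: str   # commit that introduced the fix
    file_name: str   # path of the patched file within the repo
    buggy: str       # erroneous snippet (error line + context)
    fixed: str       # fixed version of the same snippet

def raw_file_url(pair: PatchPair) -> str:
    """Build a raw.githubusercontent.com URL for the complete file,
    for cases where the snippet's surrounding context is needed."""
    return (f"https://raw.githubusercontent.com/"
            f"{pair.repo}/{pair.commit_id}/{pair.file_name}")

pair = PatchPair("octocat/hello-world", "abc123", "src/app.js",
                 "consol.log(x);", "console.log(x);")
print(raw_file_url(pair))
```

The commit id pins the URL to the exact revision the snippet was taken from, so the retrieved file always matches the pair.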

AP (Adversarial Paraphrase)

This is a paraphrasing dataset created using the adversarial paradigm. The Adversarial Paraphrasing Task (APT) asked annotators to write sentences that mean the same as a given sentence but have syntactic and lexical properties as different from it as possible.

1 paper · 2 benchmarks · Texts

Perla Dataset (Perla Depression Screening Dataset)

This dataset contains the results of a depression screening experiment using two instruments: the PHQ-9 depression screening questionnaire and the chatbot Perla.

1 paper · 0 benchmarks · Texts

IRLCov19

IRLCov19 is a multilingual Twitter dataset related to Covid-19, collected between February 2020 and July 2020 specifically for regional languages in India. It contains more than 13 million tweets.

1 paper · 0 benchmarks · Texts

DUC 2006 (Document Understanding Conferences)

There is currently much interest and activity aimed at building powerful multi-purpose information systems. The agencies involved include DARPA, ARDA and NIST. Their programmes, for example DARPA's TIDES (Translingual Information Detection Extraction and Summarization) programme, ARDA's Advanced Question & Answering Program and NIST's TREC (Text Retrieval Conferences) programme cover a range of subprogrammes. These focus on different tasks requiring their own evaluation designs.

1 paper · 0 benchmarks · Texts

Bambara Language Dataset

A Bambara dialectal dataset dedicated to sentiment analysis, freely available for natural language processing research purposes.

1 paper · 0 benchmarks · Texts

gENder-IT

gENder-IT is an English-Italian challenge set focusing on the resolution of natural gender phenomena. It provides word-level gender tags on the English source side and, where needed, multiple alternative gender translations on the Italian target side.

1 paper · 0 benchmarks · Texts

DocBank-TB (DocBank-Table)

This dataset consists of 500 sets of caption, table, and corresponding paper page, processed from DocBank.

1 paper · 0 benchmarks · Tabular, Texts

MuDoCo_QueryRewrite (The MuDoCo dataset with Query Rewrite Annotations)

A version of the MuDoCo dataset annotated for the joint learning of coreference resolution and query rewriting.

1 paper · 0 benchmarks · Texts

VerbCL

VerbCL is a dataset that consists of the citation graph of court opinions, which cite previously published court opinions in support of their arguments. In particular, it focuses on the verbatim quotes, i.e., where the text of the original opinion is directly reused.

1 paper · 0 benchmarks · Texts

THRED (Two-Hop Relation Extraction Dataset)

This is a two-hop relation extraction dataset derived from the WikiHop dataset [1].

1 paper · 0 benchmarks · Texts

MAST (Multi-Attributed Structured Text-to-face Dataset)

The Multi-Attributed and Structured Text-to-face (MAST) dataset is a data consolidation motivated by the need for a large corpus of high-quality face images with fine-grained, attribute-focused annotations. It combines the benefits of the attribute-oriented approach with the semantics of a textual description.

1 paper · 0 benchmarks · Images, Texts

N15News

N15News is a large-scale multimodal news dataset comprising 200K image-text pairs across 15 categories, exceeding previous news datasets in both the number of categories and the number of samples.

1 paper · 2 benchmarks · Images, Texts

Source Code Tagger Training Set

Ensemble Tagger Training and Testing Set. This data includes two files: the training set used to create the SCANL Ensemble tagger [1], and an "unseen" testing set containing words from systems that do not appear in the training set. Both are derived from a prior dataset of grammar patterns, described in a different paper [2]. Each of these CSV files contains several columns, which are explained below:

1 paper · 0 benchmarks · Texts

catbAbI QA-mode (concatenated-bAbI)

We aim to improve the bAbI benchmark as a means of developing intelligent dialogue agents. To this end, we propose concatenated-bAbI (catbAbI): an infinite sequence of bAbI stories. catbAbI is generated from the bAbI dataset; during training, a random sample/story from any task is drawn without replacement and concatenated to the ongoing story. The preprocessing for catbAbI addresses several issues: it removes the supporting facts, leaves the questions embedded in the story, inserts the correct answer after the question mark, and tokenises the full sample into a single sequence of words. As such, catbAbI is designed to be trained in an autoregressive way, analogous to closed-book question answering.

1 paper · 1 benchmark · Texts
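The preprocessing described above can be sketched roughly as follows. The raw bAbI line format (numbered statements, tab-separated question/answer/support lines) is assumed, and the sampling details are simplified; this is an illustrative sketch, not the authors' implementation:

```python
import random

def preprocess_story(raw_lines):
    """Flatten one raw bAbI story into a token sequence: drop
    supporting-fact indices, keep questions embedded in the story,
    and insert the answer right after the question mark."""
    tokens = []
    for line in raw_lines:
        line = line.split(" ", 1)[1]           # drop the line number
        if "\t" in line:                       # question line
            question, answer, _support = line.split("\t")
            tokens += question.replace("?", " ?").split()
            tokens.append(answer)              # answer follows the "?"
        else:                                  # plain statement
            tokens += line.rstrip(".").split() + ["."]
    return tokens

def catbabi_stream(stories, rng):
    """Endless catbAbI sequence: draw stories without replacement,
    reshuffling once the pool is exhausted, and concatenate them."""
    pool = []
    while True:
        if not pool:
            pool = stories[:]
            rng.shuffle(pool)
        yield from preprocess_story(pool.pop())

story = ["1 Mary moved to the bathroom.",
         "2 John went to the hallway.",
         "3 Where is Mary?\tbathroom\t1"]
stream = catbabi_stream([story], random.Random(0))
first = [next(stream) for _ in range(14)]
print(first)
```

Because the stream never terminates, an autoregressive model can be trained on fixed-length windows cut from it, exactly as with ordinary language-modelling corpora.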

catbAbI LM-mode (concatenated-bAbI)

We aim to improve the bAbI benchmark as a means of developing intelligent dialogue agents. To this end, we propose concatenated-bAbI (catbAbI): an infinite sequence of bAbI stories. catbAbI is generated from the bAbI dataset; during training, a random sample/story from any task is drawn without replacement and concatenated to the ongoing story. The preprocessing for catbAbI addresses several issues: it removes the supporting facts, leaves the questions embedded in the story, inserts the correct answer after the question mark, and tokenises the full sample into a single sequence of words. As such, catbAbI is designed to be trained in an autoregressive way, analogous to closed-book question answering.

1 paper · 1 benchmark · Texts

EVIL-Encoders

This dataset contains samples to generate Python code for security exploits. In order to make the dataset representative of real exploits, it includes code snippets drawn from exploits from public databases. Differing from general-purpose Python code found in previous datasets, the Python code of real exploits entails low-level operations on byte data for obfuscation purposes (i.e., to encode shellcodes). Therefore, real exploits make extensive use of Python instructions for converting data between different encoders, for performing low-level arithmetic and logical operations, and for bit-level slicing, which cannot be found in the previous general-purpose Python datasets. In total, we built a dataset that consists of 1,114 original samples of exploit-tailored Python snippets and their corresponding intent in the English language. These samples include complex and nested instructions, as typical of Python programming. In order to perform more realistic training and for a fair evaluat

1 paper · 0 benchmarks · Texts

EVIL-Decoders

This is an assembly dataset built on top of Shellcode_IA32, a dataset for automatically generating assembly from natural language descriptions that consists of 3,200 assembly instructions, commented in the English language, which were collected from shellcodes for IA-32 and written for the Netwide Assembler (NASM) for Linux. In order to make the data more representative of the code that we aim to generate (i.e., complete exploits, inclusive of decoders to be delivered in the shellcode), we enriched the dataset with further samples of assembly code, drawn from the exploits that we collected from public databases. Different from the previous dataset, the new one includes assembly code from real decoders used in actual exploits. The final dataset contains 3,715 unique pairs of assembly code snippets/English intents. To better support developers in the automatic generation of the assembly programs, we looked beyond a one-to-one mapping between natural language intents and their correspond

1 paper · 0 benchmarks · Texts
Page 112 of 158