The Chinese Stock Policy Retrieval Dataset (CSPRD) contains a Chinese policy corpus of 10,002 articles and 709 prospectus examples from 545 companies listed on China's Science and Technology Innovation Board (STAR Market). CSPRD is bilingual in Chinese and English (translated by ChatGPT) and is annotated by experienced experts from the Shanghai Stock Exchange.
The Failure Mode Classification dataset was released in the paper "MWO2KG and Echidna: Constructing and exploring knowledge graphs from maintenance data" by Stewart et al. The goal is to label a given observation (made by a maintainer) with the corresponding Failure Mode Code.
In this dataset, an upper-torso humanoid robot with a 7-DOF arm explored 100 different objects belonging to 20 different categories using 10 behaviors: Look, Crush, Grasp, Hold, Lift, Drop, Poke, Push, Shake, and Tap.
Verified Smart Contracts is a dataset of real Ethereum smart contracts, containing both Solidity and Vyper source code. It consists of every Ethereum smart contract deployed as of 1 April 2022 that has been verified on Etherscan and has at least one transaction. A total of 186,397 unique smart contracts are provided, filtered down from 2,217,692 smart contracts. The dataset contains 53,843,305 lines of code.
Verified Smart Contracts Code Comments is a dataset of real Ethereum smart contract functions, containing "code, comment" pairs of both Solidity and Vyper source code. The dataset is based on every Ethereum smart contract deployed as of 1 April 2022 that has been verified on Etherscan and has at least one transaction. A total of 1,541,370 smart contract functions are provided, parsed from 186,397 unique smart contracts, filtered down from 2,217,692 smart contracts.
Vulnerable Verified Smart Contracts is a dataset of real vulnerable Ethereum smart contracts, based on the manually labeled Benchmark dataset of Solidity smart contracts. A total of 609 vulnerable contracts are provided, containing 1,117 vulnerabilities.
This dataset was specifically constructed for the library-oriented code generation task introduced in the paper "CodeGen4Libs: A Two-Stage Approach for Library-Oriented Code Generation".
This dataset encompasses 265 speeches (over 200,000 tokens) from the German Bundestag, primarily from the 19th legislative term (2017-2021), given by 195 distinct speakers representing 6 political parties.
The primary data of the SaGA corpus consist of 25 dialogues between 50 interlocutors, who engage in a spatial communication task combining direction-giving and sight description. Six of those dialogues, with data only from the direction giver, are available, including audio (.wav) and video (.mp4) data. The secondary data consist of annotations (*.eaf) of gestures and speech-gesture referents, which have been completely and systematically annotated based on an annotation grid (cf. the SaGA documentation). The corpus comprises 9,881 isolated words and 1,764 isolated gestures. The stimulus is a model of a town presented in a Virtual Reality (VR) environment. Upon finishing a "bus ride" through the VR town along five landmarks, a router explained the route as well as the wayside landmarks to an unknown and naive follower. The SaGA corpus was curated for CLARIN as part of the curation project "Editing and Integration of Multimodal Resources in CLARIN-D" by the CLARIN-D Working Group 6.
The paper presents a study of Clickbait PDFs, which are PDF documents leading to various attacks on the Web. Clickbait PDFs are different from the well-known "MalPDFs", usually found in phishing emails, as they do not contain malware.
This dataset contains news articles posted in the r/Liberal and r/Conservative subreddits, a corpus of 226,010 articles in total. The articles were collected to study political expression through shared news.
LSA-T is the first continuous Argentinian Sign Language (LSA) dataset. It contains 14,880 sentence-level videos of LSA extracted from the CN Sordos YouTube channel, with labels and keypoint annotations for each signer. Videos are at 30 FPS in full HD (1920x1080).
InstructCoder is the first dataset designed to adapt LLMs for general code editing. It consists of over 100k instruction-input-output triplets and covers multiple distinct code editing scenarios, generated by ChatGPT. LLaMA-33B finetuned on InstructCoder performs on par with ChatGPT on a real-world test set derived from GitHub commits.
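The instruction-input-output structure described above can be sketched as follows; the field names and the sample content are illustrative assumptions, not the dataset's actual schema:

```python
# Hypothetical example of an InstructCoder-style code-editing triplet:
# an editing instruction, the original code, and the edited result.
# Field names here are assumptions for illustration only.
example = {
    "instruction": "Rename the variable 'x' to 'count' throughout the function.",
    "input": "def f(x):\n    x += 1\n    return x",
    "output": "def f(count):\n    count += 1\n    return count",
}

for key in ("instruction", "input", "output"):
    print(f"{key}: {example[key]!r}")
```

A model fine-tuned on such triplets learns to map (instruction, input) pairs to the edited output.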
The Creative Visual Storytelling Anthology is a collection of 100 author responses to an improved creative visual storytelling exercise over a sequence of three images. Each item contains four facet entries, corresponding to Entity, Scene, Narrative, and Title.
The Question Answering Sirah Nabawiyah (QASiNa) dataset is a reading comprehension dataset consisting of question-answer pairs drawn from Sirah Nabawiyah literature in the Indonesian language.
This dataset was built to expose the shortcomings of existing benchmarks in evaluating compositional generalization, underscoring the need for datasets tailored to assess compositional generalization in open intent detection tasks.