The Chinese Stock Policy Retrieval Dataset (CSPRD) contains a Chinese policy corpus of 10,002 articles and 709 prospectus examples from 545 companies listed on China's Science and Technology Innovation Board (STAR Market). CSPRD is bilingual in Chinese and English (translated by ChatGPT) and is annotated by experienced experts from the Shanghai Stock Exchange.
The Failure Mode Classification dataset was released in the paper "MWO2KG and Echidna: Constructing and exploring knowledge graphs from maintenance data" by Stewart et al. The goal is to label a given observation (made by a maintainer) with the corresponding Failure Mode Code.
In this dataset, an upper-torso humanoid robot with a 7-DOF arm explored 100 different objects belonging to 20 different categories using 10 behaviors: Look, Crush, Grasp, Hold, Lift, Drop, Poke, Push, Shake, and Tap.
Verified Smart Contracts is a dataset of real Ethereum smart contracts, containing both Solidity and Vyper source code. It consists of every Ethereum smart contract deployed as of 1 April 2022 that has been verified on Etherscan and has at least one transaction. A total of 186,397 unique smart contracts are provided, filtered down from 2,217,692 smart contracts. The dataset contains 53,843,305 lines of code.
Verified Smart Contracts Code Comments is a dataset of real Ethereum smart contract functions, containing "code, comment" pairs of both Solidity and Vyper source code. The dataset is based on every Ethereum smart contract deployed as of 1 April 2022 that has been verified on Etherscan and has at least one transaction. A total of 1,541,370 smart contract functions are provided, parsed from 186,397 unique smart contracts, filtered down from 2,217,692 smart contracts.
Vulnerable Verified Smart Contracts is a dataset of real vulnerable Ethereum smart contracts, based on the manually labeled Benchmark dataset of Solidity smart contracts. A total of 609 vulnerable contracts are provided, containing 1,117 vulnerabilities.
This dataset was specifically constructed for the library-oriented code generation task introduced in the paper "CodeGen4Libs: A Two-Stage Approach for Library-Oriented Code Generation".
This dataset encompasses 265 speeches (over 200,000 tokens) from the German Bundestag, primarily from the 19th legislative term (2017-2021), given by 195 distinct speakers representing 6 political parties.
The primary data of the SaGA corpus consist of 25 dialogues between 50 interlocutors, who engage in a spatial communication task combining direction-giving and sight description. Six of those dialogues, with data only from the direction giver, are available, including audio (.wav) and video (.mp4) data. The secondary data consist of annotations (*.eaf) of gestures and speech-gesture referents, which have been completely and systematically annotated based on an annotation grid (cf. the SaGA documentation). The corpus comprises 9,881 isolated words and 1,764 isolated gestures. The stimulus is a model of a town presented in a Virtual Reality (VR) environment. Upon finishing a "bus ride" through the VR town along five landmarks, a router explained the route as well as the wayside landmarks to an unknown and naive follower. The SaGA corpus was curated for CLARIN as part of the curation project "Editing and Integration of Multimodal Resources in CLARIN-D" by the CLARIN-D Working Group 6.
The paper presents a study of Clickbait PDFs, which are PDF documents leading to various attacks on the Web. Clickbait PDFs are different from the well-known "MalPDFs", usually found in phishing emails, as they do not contain malware.
This dataset contains news articles posted in the r/Liberal and r/Conservative subreddits, a corpus of 226,010 articles in total. The articles were collected to study political expression through shared news.
LSA-T is the first continuous Argentinian Sign Language (LSA) dataset. It contains 14,880 sentence-level videos of LSA extracted from the CN Sordos YouTube channel, with labels and keypoint annotations for each signer. Videos are at 30 FPS in full HD (1920x1080).
InstructCoder is the first dataset designed to adapt LLMs for general code editing. It consists of over 100k instruction-input-output triplets and covers multiple distinct code editing scenarios, generated by ChatGPT. LLaMA-33B finetuned on InstructCoder performs on par with ChatGPT on a real-world test set derived from GitHub commits.
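The instruction-input-output structure described above can be sketched as follows; the field names and the sample content are illustrative assumptions, not the dataset's actual schema:

```python
# Hypothetical example of an InstructCoder-style code-editing triplet:
# an editing instruction, the original code, and the edited result.
# Field names here are assumptions for illustration only.
example = {
    "instruction": "Rename the variable 'x' to 'count' throughout the function.",
    "input": "def f(x):\n    x += 1\n    return x",
    "output": "def f(count):\n    count += 1\n    return count",
}

for key in ("instruction", "input", "output"):
    print(f"{key}: {example[key]!r}")
```

A model fine-tuned on such triplets learns to map (instruction, input) pairs to the edited output.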
The Creative Visual Storytelling Anthology is a collection of 100 author responses to an improved creative visual storytelling exercise over a sequence of three images. Each item contains four facet entries, corresponding to Entity, Scene, Narrative, and Title.
The Question Answering Sirah Nabawiyah (QASiNa) dataset is a reading comprehension dataset consisting of question-answer pairs drawn from Sirah Nabawiyah literature in the Indonesian language.
This dataset was built to expose the shortcomings of existing benchmarks in evaluating compositional generalization, underscoring the need for datasets tailored to assess compositional generalization in open intent detection tasks.