Draper VDISC Dataset
reza
Draper VDISC Dataset - Vulnerability Detection in Source Code
The dataset consists of the source code of 1.27 million functions mined from open-source software, each labeled by static analysis for potential vulnerabilities. For details on the dataset and benchmark results, see https://arxiv.org/abs/1807.04320.
The data is provided in three HDF5 files corresponding to an 80:10:10 train/validate/test split, matching the splits used in our paper. The combined file size is roughly 1 GB. Each function's raw source code, starting from the function name, is stored as a variable-length UTF-8 string. Each function carries five binary 'vulnerability' labels: one for each of the four most common CWEs in our data, plus one covering all other CWEs:
- CWE-120 (3.7% of functions)
- CWE-119 (1.9% of functions)
- CWE-469 (0.95% of functions)
- CWE-476 (0.21% of functions)
- CWE-other (2.7% of functions)
A function may have more than one detected CWE, so the five labels are not mutually exclusive.
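As a rough illustration of the layout described above, the following is a minimal sketch of reading one split and summarizing its labels. It assumes the third-party h5py package, and it assumes the source strings are stored under a dataset named "functionSource" and the labels under datasets named after the CWE categories listed above. These key names, and the helper names load_split and label_summary, are illustrative assumptions; check them against the keys in the files you download.

```python
# Assumed dataset keys; verify against the actual files with list(f.keys()).
LABEL_KEYS = ["CWE-119", "CWE-120", "CWE-469", "CWE-476", "CWE-other"]


def load_split(path, source_key="functionSource", label_keys=LABEL_KEYS):
    """Read function sources and binary labels from one HDF5 split file."""
    import h5py  # third-party; imported here so label_summary works without it

    with h5py.File(path, "r") as f:
        # Variable-length strings may come back as bytes or str,
        # depending on the h5py version.
        sources = [
            s.decode("utf-8") if isinstance(s, bytes) else s
            for s in f[source_key][:]
        ]
        labels = {k: f[k][:].astype(bool) for k in label_keys}
    return sources, labels


def label_summary(labels):
    """Return per-label positive rates and the number of multi-label functions."""
    keys = list(labels)
    n = len(labels[keys[0]])
    rates = {k: sum(bool(v) for v in labels[k]) / n for k in keys}
    # A function counts as multi-label when more than one of its labels is set.
    multi = sum(
        1
        for row in zip(*(labels[k] for k in keys))
        if sum(bool(v) for v in row) > 1
    )
    return rates, multi
```

With a hypothetical file name, usage would look like `sources, labels = load_split("train.hdf5")` followed by `rates, multi = label_summary(labels)`. Because the labels are not mutually exclusive, the per-label rates need not sum to the fraction of functions with at least one label.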
Please cite our paper if you use this dataset in a publication: https://arxiv.org/abs/1807.04320