This dataset comprises the dynamic analysis reports generated by CAPEv2 for both malware and goodware. We source the goodware as in Dambra et al. (https://arxiv.org/abs/2307.14657), who use the community-maintained packages of Chocolatey to create a dataset that spans 2012 to 2020. The malware is sourced from VirusTotal, namely the Portable Executable (PE) samples from 2017 to 2020 that VirusTotal releases for academic purposes. In total, the dataset we assembled contains 26,200 PE samples: 8,600 (33%) goodware and 17,675 (67%) malware.

All samples were executed in a series of Windows 7 VMware virtual machines, without network connectivity because of policy and ethical constraints. The virtual machines were orchestrated through CAPEv2, a maintained open-source successor of Cuckoo. This framework is widely used for analyzing and detecting potentially malicious binaries: it executes them in an isolated environment and observes their behavior without risking the security of the host system. Every sample was executed for 150 seconds, an empirical lower bound on the time required to gather the full behavior of samples.

We tried to mitigate evasive checks by running the VMwareCloak script, which removes well-known artifacts introduced by VMware. Moreover, we populated the filesystem with documents and other common file types, so that it resembles a legitimate desktop workstation that malware may identify as a valuable target. The dynamic analysis output is a detailed report with information on the syscalls invoked by the binary, together with their associated flags and arguments, as well as all interactions with the file system, registry, network, and other key elements of the operating system.
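
As a rough illustration of how such a report could be consumed, the sketch below walks the logged calls of one report and counts how often each API appears. It assumes the Cuckoo-style JSON layout that CAPEv2 reports typically follow (behavior, then processes, then calls, each call carrying "api", "category", and a list of name/value "arguments"); the exact fields in this dataset may differ. SAMPLE_REPORT, iter_calls, and api_histogram are hypothetical names introduced only for this example.

```python
import json
from collections import Counter

# Tiny stand-in report. The key layout below is an ASSUMPTION based on the
# Cuckoo-style JSON that CAPEv2 usually emits; it is not taken from the dataset.
SAMPLE_REPORT = json.loads("""
{
  "behavior": {
    "processes": [
      {
        "process_name": "sample.exe",
        "calls": [
          {"api": "NtCreateFile", "category": "filesystem",
           "arguments": [{"name": "FileName", "value": "C:\\\\Users\\\\a\\\\doc.txt"}]},
          {"api": "RegOpenKeyExW", "category": "registry",
           "arguments": [{"name": "SubKey", "value": "Software\\\\Example"}]},
          {"api": "NtCreateFile", "category": "filesystem",
           "arguments": [{"name": "FileName", "value": "C:\\\\Temp\\\\x.bin"}]}
        ]
      }
    ]
  }
}
""")


def iter_calls(report):
    """Yield (process_name, api_name, arguments) for every logged call."""
    for process in report.get("behavior", {}).get("processes", []):
        process_name = process.get("process_name", "")
        for call in process.get("calls", []):
            # Flatten the list of {"name": ..., "value": ...} pairs into a dict.
            arguments = {
                arg.get("name"): arg.get("value")
                for arg in call.get("arguments", [])
            }
            yield process_name, call.get("api", ""), arguments


def api_histogram(report):
    """Count how many times each API name occurs in the report."""
    return Counter(api for _, api, _ in iter_calls(report))


if __name__ == "__main__":
    for api, count in api_histogram(SAMPLE_REPORT).most_common():
        print(api, count)
```

To apply the same functions to a report on disk, load it with json.load and pass the resulting dictionary in place of SAMPLE_REPORT. The defensive .get calls are a deliberate choice: reports for samples that crash early or evade analysis may lack some sections.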

The final dataset contains only the dynamic analysis reports, without the original binaries.

AutoRobust