The paper presents a study of Clickbait PDFs, which are PDF documents leading to various attacks on the Web. Clickbait PDFs are different from the well-known "MalPDFs", usually found in phishing emails, as they do not contain malware.

The study leverages a dataset of PDF files that we receive from two industrial collaborators, Cisco and InQuest Labs. Because this is paid data that the companies collect as part of their business operations, we are not allowed to share it. We are also not allowed to share the data we obtain via the VirusTotal Public API.

Nonetheless, we share the PDF file hashes so that the files can be retrieved from VirusTotal. Moreover, we share the screenshots of the first pages and the URLs extracted from the PDFs. We focus on the URLs relevant to our hypotheses (the total number of extracted URLs is around four million). In addition, we share the language of the text in the PDFs and the search engine rankings of the PDFs distributed via SEO attacks.
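As a rough illustration of how the shared hashes might be used, the sketch below reads hashes from a CSV file and builds the corresponding VirusTotal API v3 file-lookup URLs. The column name `sha256` and the helper names `read_hashes` and `vt_lookup_url` are assumptions made for this example; the actual column names are given in the file descriptions of the dataset. The sketch only constructs URLs and does not contact VirusTotal.

```python
import csv
import io
import re

# VirusTotal identifies files by MD5, SHA-1, or SHA-256 hex digests.
_HEX_DIGEST = re.compile(r"^(?:[0-9a-f]{32}|[0-9a-f]{40}|[0-9a-f]{64})$")

# File report endpoint of the VirusTotal API v3.
VT_FILE_ENDPOINT = "https://www.virustotal.com/api/v3/files/{file_hash}"


def read_hashes(csv_file, column="sha256"):
    """Yield normalized, de-duplicated hashes from an open CSV file.

    The default column name "sha256" is a hypothetical placeholder;
    replace it with the column name used in the shared CSV files.
    """
    seen = set()
    for row in csv.DictReader(csv_file):
        value = (row.get(column) or "").strip().lower()
        # Skip empty cells, malformed digests, and repeated hashes.
        if _HEX_DIGEST.match(value) and value not in seen:
            seen.add(value)
            yield value


def vt_lookup_url(file_hash):
    """Return the API v3 lookup URL for a single file hash."""
    return VT_FILE_ENDPOINT.format(file_hash=file_hash)


if __name__ == "__main__":
    # In-memory stand-in for one of the shared CSV files.
    sample = io.StringIO("sha256\n" + "a" * 64 + "\nnot-a-hash\n")
    for file_hash in read_hashes(sample):
        print(vt_lookup_url(file_hash))
```

An actual request to these URLs requires a VirusTotal API key sent in the `x-apikey` header, and what can be retrieved depends on the access level of the account.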

Part of our experiments involves developing and training a deep learning model (based on DeepCluster). We created an additional GitHub repository containing scripts that help reproduce the clustering procedure.

The data is shared as several CSV files, a folder of .png images, and a .npy file containing an intermediate result of our deep learning model. Each file also has its own description in the "Data Explorer" tab of the dataset. This notebook contains the documentation and the information necessary to run the experiments presented in our paper.