Social Media Messages for Early Cyberattack Detection on Blockchain

Texts · Introduced 2025-03-19

ELTEX-Blockchain: A Domain-Specific Dataset for Cybersecurity

🔐 12k Synthetic Social Media Messages for Early Cyberattack Detection on Blockchain

Dataset Statistics

| Category | Samples | Description |
|--------------|-------------|-----------------|
| Cyberattack | 6,941 | Early warning signals and indicators of cyberattacks |
| General | 4,507 | Regular blockchain discussions (non-security related) |
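
The two classes are not evenly balanced, which may matter when choosing a decision threshold or class weights. As a minimal sketch, the split can be computed from the counts in the table; the variable and function names here are illustrative and not part of the dataset.

```python
# Sample counts copied from the Dataset Statistics table.
CLASS_COUNTS = {"cyberattack": 6941, "general": 4507}


def class_shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each class's fraction of the total number of samples."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


total_samples = sum(CLASS_COUNTS.values())
shares = class_shares(CLASS_COUNTS)

for label, share in shares.items():
    print(f"{label}: {CLASS_COUNTS[label]} samples ({share:.1%})")
print(f"total: {total_samples}")
```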

Dataset Structure

Each entry in the dataset contains:

  • message_id: Unique identifier for each message
  • content: The text content of the social media message
  • topic: Classification label ("cyberattack" or "general")
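
As an illustration of this schema, the sketch below validates one record and maps its label to a binary target. The three field names and the two label values come from the list above. Everything else is an assumption: the `Message` class, the helper names, the choice to store `message_id` as a string (the card does not state its type), and the sample record, which is invented for the example and not taken from the dataset.

```python
from dataclasses import dataclass

# The two label values listed for the `topic` field.
VALID_TOPICS = {"cyberattack", "general"}
REQUIRED_FIELDS = {"message_id", "content", "topic"}


@dataclass(frozen=True)
class Message:
    """One dataset entry (hypothetical container, not an official class)."""
    message_id: str  # assumption: the card does not state the id type
    content: str
    topic: str


def parse_message(record: dict) -> Message:
    """Check that a raw record has the expected fields and a known label."""
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if record["topic"] not in VALID_TOPICS:
        raise ValueError(f"unexpected topic: {record['topic']!r}")
    return Message(
        message_id=str(record["message_id"]),
        content=str(record["content"]),
        topic=record["topic"],
    )


def to_binary_label(message: Message) -> int:
    """Map the topic to 1 (cyberattack) or 0 (general)."""
    return 1 if message.topic == "cyberattack" else 0


# Invented record, for illustration only.
example = parse_message(
    {"message_id": 1, "content": "Withdrawals paused, anyone else?", "topic": "cyberattack"}
)
print(example.message_id, to_binary_label(example))
```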

Performance

Gemma-2b-it fine-tuned on this dataset:

  • Achieves a Brier score of 0.16 using only synthetic data in our social media threat detection task
  • Performs competitively on this task against general-purpose models known for their cybersecurity capabilities, such as granite-3.2-2b-instruct, and against cybersecurity-focused LLMs trained on Primus
  • Demonstrates promising results for smaller models on this specific task, with our best hybrid model achieving an F1 score of 0.81 on our blockchain threat detection test set, though GPT-4o maintains superior overall accuracy (0.84) and calibration (Brier 0.10)
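
For readers unfamiliar with the metrics above, the sketch below shows the standard binary definitions of the Brier score (the mean squared difference between predicted probability and the 0/1 outcome, lower is better) and the F1 score. It is not the authors' evaluation code. The 0.5 decision threshold and the toy predictions are assumptions made for the example.

```python
def brier_score(probs: list[float], labels: list[int]) -> float:
    """Mean squared error between predicted P(cyberattack) and 0/1 labels."""
    if not probs or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and equal in length")
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def f1_score(probs: list[float], labels: list[int], threshold: float = 0.5) -> float:
    """F1 for the positive class after thresholding the probabilities."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy predictions, invented for illustration (1 = cyberattack, 0 = general).
toy_probs = [0.9, 0.2, 0.6, 0.4]
toy_labels = [1, 0, 0, 1]

print(f"Brier: {brier_score(toy_probs, toy_labels):.4f}")
print(f"F1:    {f1_score(toy_probs, toy_labels):.2f}")
```

Because the Brier score is computed on probabilities and F1 on thresholded predictions, a model can rank well on one and less well on the other, which is why both are reported.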

Attack Type Distribution

| Attack Vectors | Seed Examples |
|-------------------|------------------------|
| Social Engineering & Phishing | Credential theft, wallet phishing |
| Smart Contract Exploits | Token claim vulnerabilities, flash loans |
| Exchange Security Breaches | Hot wallet compromises, key theft |
| DeFi Protocol Attacks | Liquidity pool manipulation, bridge exploits |

For more on cyberattack vectors, see the Attack Vectors Wiki.

Citation

@misc{razmyslovich2025eltexframeworkdomaindrivensynthetic,
      title={ELTEX: A Framework for Domain-Driven Synthetic Data Generation}, 
      author={Arina Razmyslovich and Kseniia Murasheva and Sofia Sedlova and Julien Capitaine and Eugene Dmitriev},
      year={2025},
      eprint={2503.15055},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.15055}, 
}