Shellcode_IA32: A Dataset for Automatic Shellcode Generation

Pietro Liguori, Erfan Al-Hossami, Domenico Cotroneo, Roberto Natella, Bojan Cukic, Samira Shaikh

2021-04-27 | ACL (NLP4Prog) 2021 | Machine Translation, NMT, Translation, Code Generation

Abstract

We take the first step to address the task of automatically generating shellcodes, i.e., small pieces of code used as a payload in the exploitation of a software vulnerability, starting from natural language comments. We assemble and release a novel dataset (Shellcode_IA32), consisting of challenging but common assembly instructions with their natural language descriptions. We experiment with standard methods in neural machine translation (NMT) to establish baseline performance levels on this task.

Results

Task	Dataset	Metric	Value	Model
Code Generation	Shellcode_IA32	BLEU-4	62.97	LSTM-based Sequence to Sequence
Code Generation	Shellcode_IA32	Exact Match Accuracy	51.55	LSTM-based Sequence to Sequence
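To make the Exact Match Accuracy row concrete, here is a minimal sketch of how such a score could be computed over predicted and reference assembly snippets. The function names, the whitespace-only normalization, and the four sample pairs are all assumptions made for illustration; they are not taken from the paper or from the Shellcode_IA32 dataset, and the authors' evaluation script may differ.

```python
from typing import Sequence


def normalize(snippet: str) -> str:
    """Collapse runs of whitespace so spacing differences do not count as errors.

    Assumption: only whitespace is normalized; case and operand syntax are kept.
    """
    return " ".join(snippet.split())


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Return the percentage of predictions identical to their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return 100.0 * matches / len(references)


# Hypothetical snippets for illustration only (not drawn from the dataset).
references = ["mov eax, 1", "xor ebx, ebx", "int 0x80", "push eax"]
predictions = ["mov eax, 1", "xor  ebx, ebx", "int 0x80", "pop eax"]

score = exact_match_accuracy(predictions, references)
print(f"Exact match accuracy: {score:.2f}")
```

Exact match is a strict criterion: a snippet that differs from the reference in a single token counts as wrong, which is one reason it is reported alongside an n-gram overlap metric such as BLEU-4.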
