SUDO Dataset

Texts | Introduced 2025-03-26

SUDO is a benchmark of 50 real-world malicious tasks designed to evaluate LLM-based computer agents in live desktop and web environments. It covers critical risk domains drawn from the AirBench taxonomy, including system security, content safety, societal harms, and privacy violations. Each task is scored against a task-specific checklist, which supports fine-grained evaluation. The dataset can be used to assess a model's potential for misuse, build safer agents, or guide alignment research.
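As a rough illustration of how checklist-based scoring could feed an attack success rate, here is a minimal sketch. It is not SUDO's actual scoring code: the `TaskResult` type, the equal weighting of checklist items, and the `threshold` parameter are all assumptions made for this example, and the task IDs are placeholders.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record of one evaluated task (not SUDO's schema)."""
    task_id: str
    # One entry per checklist item: True if the agent's run satisfied it.
    checklist: list[bool]


def checklist_score(result: TaskResult) -> float:
    """Fraction of checklist items satisfied (assumes equal item weights)."""
    if not result.checklist:
        return 0.0
    return sum(result.checklist) / len(result.checklist)


def attack_success_rate(results: list[TaskResult], threshold: float = 1.0) -> float:
    """Share of tasks whose checklist score reaches `threshold`.

    With the default threshold of 1.0, a task counts as a successful
    attack only if every checklist item was satisfied (an assumption).
    """
    if not results:
        return 0.0
    successes = sum(1 for r in results if checklist_score(r) >= threshold)
    return successes / len(results)


# Placeholder results for three tasks.
results = [
    TaskResult("task-a", [True, True]),
    TaskResult("task-b", [True, False]),
    TaskResult("task-c", [False, False]),
]
strict_asr = attack_success_rate(results)
lenient_asr = attack_success_rate(results, threshold=0.5)
```

Keeping the per-task checklist score separate from the aggregate rate makes it possible to report partial progress on a task as well as a single pass/fail number.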

Benchmarks

Red Teaming/Attack Success Rate