SGXSTest
Singapore XSTest
For testing refusal behavior in a cultural setting, we introduce SGXSTest — a set of manually curated prompts designed to measure exaggerated safety within the context of Singaporean culture. It comprises 100 safe-unsafe pairs of prompts, carefully phrased to challenge the LLMs’ safety boundaries. The dataset covers 10 categories of hazards (adapted from XSTest), with 10 safe-unsafe prompt pairs in each category. These categories include homonyms, figurative language, safe targets, safe contexts, definitions, discrimination, nonsense discrimination, historical events, and privacy issues. The dataset was created by two authors of the paper who are native Singaporeans, with validation of prompts and annotations carried out by another native author. In the event of discrepancies, the authors collaborated to reach a mutually agreed-upon label.