Dataset Presets
Pre-configured dataset presets give one-line access to 30+ widely used AI safety benchmarks published by research institutions and safety organizations.
Basic Usage
attack_config = {
    "attack_type": "baseline",
    "dataset": {
        "preset": "agentharm",  # Preset name
        "limit": 50,            # Optional: max goals
        "shuffle": True,        # Optional: randomize
        "seed": 42,             # Optional: reproducibility
    }
}
Use agentharm for agent safety testing, strongreject for jailbreak evaluation, or simplesafetytests for basic safety screening.
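The optional `limit`, `shuffle`, and `seed` keys can be pictured with a small stand-alone sketch. This is not the library's implementation: the helper name `sample_goals` is hypothetical, and the order of operations (shuffle first, then truncate to `limit`) is an assumption made for illustration.

```python
import random


def sample_goals(goals, limit=None, shuffle=False, seed=None):
    """Hypothetical helper: pick goals the way the preset options suggest."""
    selected = list(goals)  # copy so the caller's list is left untouched
    if shuffle:
        # A dedicated Random instance avoids touching the global RNG state.
        random.Random(seed).shuffle(selected)
    if limit is not None:
        selected = selected[:limit]  # assumed: truncate after shuffling
    return selected


goals = [f"goal-{i}" for i in range(176)]
first = sample_goals(goals, limit=50, shuffle=True, seed=42)
second = sample_goals(goals, limit=50, shuffle=True, seed=42)
print(len(first), first == second)
```

Under these assumptions, repeating a run with the same seed selects the same 50 goals, which is what makes shuffled runs comparable.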
Agent Safety
AgentHarm
Testing harmful agentic tasks and autonomous agent behaviors
| Preset | Goals | Description |
|---|---|---|
| agentharm | 176+ | Harmful agentic tasks (public test split) |
| agentharm_benign | - | Benign tasks for baseline comparison |
Source: ai-safety-institute/AgentHarm
Use case: Evaluating AI agents for harmful autonomous behaviors
# Test agent against harmful agentic tasks
attack_config = {
    "attack_type": "advprefix",
    "dataset": {"preset": "agentharm", "limit": 50},
}
Jailbreak & Adversarial Evaluation
JailbreakBench (NeurIPS 2024)
Curated jailbreak behaviors across 10 OpenAI policy categories
| Preset | Goals | Description |
|---|---|---|
| jailbreakbench | 100 | Curated misuse behaviors across policy categories |
Source: JailbreakBench/JBB-Behaviors
Paper: arXiv:2404.01318
Use case: Standardized jailbreak evaluation with policy-aligned categories
# Curated jailbreak evaluation
attack_config = {
    "attack_type": "pair",
    "dataset": {"preset": "jailbreakbench", "shuffle": True},
    "n_iterations": 10,
}
StrongREJECT
Forbidden prompts for jailbreak resistance testing
| Preset | Goals | Description |
|---|---|---|
| strongreject | 324 | Forbidden prompts for jailbreak evaluation |
Source: Lemhf14/strongreject_small_dataset
Use case: Testing model refusal capabilities against jailbreak attempts
# Evaluate jailbreak resistance
attack_config = {
    "attack_type": "pair",
    "dataset": {"preset": "strongreject", "shuffle": True, "limit": 100},
    "n_iterations": 10,
}
HarmBench
Standard harmful behavior evaluation
| Preset | Goals | Description |
|---|---|---|
| harmbench | 200 | Standard harmful behavior prompts |
| harmbench_contextual | 100 | Contextual harmful behavior prompts |
Source: walledai/HarmBench
Use case: Comprehensive harmful behavior testing
AdvBench
Adversarially designed harmful goals
| Preset | Goals | Description |
|---|---|---|
| advbench | 520 | Adversarial harmful goals for robustness testing |
Source: walledai/AdvBench
Use case: Testing robustness against adversarial attacks
Safety & Refusal Testing
BeaverTails
Multi-category safety evaluation with 14 harm categories
| Preset | Goals | Description |
|---|---|---|
| beavertails | 330K+ | Safety-labeled QA pairs across 14 harm categories |
Source: PKU-Alignment/BeaverTails
Paper: arXiv:2307.04657
Categories: Animal abuse, child abuse, discrimination, drugs, financial crime, hate speech, misinformation, privacy, self-harm, sexual content, terrorism, violence, and more
Use case: Comprehensive multi-category safety testing
# Large-scale safety evaluation
attack_config = {
    "attack_type": "baseline",
    "dataset": {"preset": "beavertails", "limit": 1000, "shuffle": True},
}
SALAD-Bench
Hierarchical safety benchmark with 3-level taxonomy
| Preset | Goals | Description |
|---|---|---|
| saladbench | 21K | Harmful questions with hierarchical safety taxonomy (base set) |
| saladbench_attack | 5K | Attack-enhanced questions with jailbreak methods |
Source: OpenSafetyLab/Salad-Data
Paper: arXiv:2402.05044
Taxonomy: 3-level harm categories (representation, misinformation, malicious use, human autonomy, etc.)
Use case: Granular safety testing with attack augmentation
# Test against attack-enhanced prompts
attack_config = {
    "attack_type": "advprefix",
    "dataset": {"preset": "saladbench_attack", "limit": 500},
}
SimpleSafetyTests
Clear-cut harmful prompts for basic safety screening
| Preset | Goals | Description |
|---|---|---|
| simplesafetytests | 100 | Clear-cut harmful prompts for quick safety checks |
Source: Bertievidgen/SimpleSafetyTests
Use case: Quick safety sanity checks and baseline testing
Do-Not-Answer
Questions LLMs should refuse to answer
| Preset | Goals | Description |
|---|---|---|
| donotanswer | 939 | Questions that responsible AI systems should refuse |
Source: LibrAI/do-not-answer
Use case: Testing appropriate refusal behaviors
XSTest
Over-refusal and safety exaggeration evaluation
| Preset | Goals | Description |
|---|---|---|
| xstest | - | Prompts to test for safety exaggeration/over-refusal |
Source: Paul/xstest
Use case: Ensuring models don't over-refuse benign requests
# Test for over-refusal
attack_config = {
    "attack_type": "baseline",
    "dataset": {"preset": "xstest"},
}
CoCoNot
Context-conditioned refusal evaluation
| Preset | Goals | Description |
|---|---|---|
| coconot | - | Context-conditioned refusal evaluation prompts |
Source: allenai/coconot
Use case: Testing context-aware safety responses
Knowledge Hazards & Dangerous Capabilities
WMDP (Weapons of Mass Destruction Proxy)
Hazardous knowledge across biosecurity, cyber, and chemistry domains
| Preset | Domain | Description |
|---|---|---|
| wmdp_bio | Biosecurity | Hazardous biology knowledge questions |
| wmdp_cyber | Cybersecurity | Hazardous cybersecurity knowledge questions |
| wmdp_chem | Chemistry | Hazardous chemistry knowledge questions |
Source: cais/wmdp
Use case: Testing for hazardous knowledge leakage in high-risk domains
# Test for hazardous knowledge leakage across domains
# (assumes `agent` is an agent object that was initialized earlier)
for domain in ["wmdp_bio", "wmdp_cyber", "wmdp_chem"]:
    attack_config = {
        "attack_type": "advprefix",
        "dataset": {"preset": domain, "limit": 50},
    }
    agent.hack(attack_config=attack_config)
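The loop above discards whatever each run returns. A minimal sketch of keeping one result per preset follows; `run_presets` and `fake_run` are hypothetical names, and because the return value of `agent.hack` is not described here, the runner is passed in as a plain callable and its result is stored as-is.

```python
def run_presets(run_attack, presets, attack_type="advprefix", limit=50):
    """Hypothetical helper: run one attack per preset and keep each result."""
    results = {}
    for preset in presets:
        attack_config = {
            "attack_type": attack_type,
            "dataset": {"preset": preset, "limit": limit},
        }
        # run_attack is any callable that accepts an attack_config dict.
        results[preset] = run_attack(attack_config)
    return results


def fake_run(attack_config):
    """Stand-in runner for demonstration; it only echoes the preset name."""
    return attack_config["dataset"]["preset"]


results = run_presets(fake_run, ["wmdp_bio", "wmdp_cyber", "wmdp_chem"])
print(results)
```

With a real agent, the stand-in could be swapped for something like `lambda cfg: agent.hack(attack_config=cfg)`.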
SOS-Bench (Inspect Evals)
Safety alignment on scientific knowledge across 6 high-risk domains
| Preset | Goals | Description |
|---|---|---|
| sosbench | 3,000 | Hazardous science prompts across biology, chemistry, pharmacy, physics, psychology, and medicine |
Source: SOSBench/SOSBench
Paper: arXiv:2505.21605
Use case: Safety evaluation on scientific knowledge (UK AISI framework)
# Test hazardous scientific knowledge
attack_config = {
    "attack_type": "baseline",
    "dataset": {"preset": "sosbench", "limit": 100},
}
Real-World & Regulation-Aligned
AIR-Bench 2024 (Inspect Evals)
Regulation-aligned safety evaluation mapped to government AI regulations
| Preset | Goals | Description |
|---|---|---|
| airbench | 5,694 | Regulation-aligned malicious prompts with 4-level risk taxonomy |
Source: stanford-crfm/air-bench-2024
Paper: arXiv:2407.17436
Use case: Alignment with government AI regulations (Stanford CRFM, UK AISI)
# Regulation-aligned safety testing
attack_config = {
    "attack_type": "advprefix",
    "dataset": {"preset": "airbench", "limit": 500, "shuffle": True},
}
ToxicChat
Real-world toxic user prompts from production systems
| Preset | Goals | Description |
|---|---|---|
| toxicchat | 10K | Real user prompts from Vicuna demo with toxicity and jailbreaking labels |
Source: lmsys/toxic-chat
Paper: arXiv:2310.17389
Stats: ~7% toxic, ~2% jailbreaking attempts
Use case: Testing against real-world adversarial user inputs
# Test against real-world toxic prompts
attack_config = {
    "attack_type": "baseline",
    "dataset": {"preset": "toxicchat", "limit": 200},
}
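Because only a small share of ToxicChat prompts are labeled toxic or jailbreaking, a low `limit` yields few adversarial examples. The arithmetic below is a rough sketch using the approximate rates quoted above; it assumes the sample is drawn uniformly at random, which may not hold for an unshuffled `limit`, and `expected_counts` is a hypothetical helper name.

```python
def expected_counts(sample_size, toxic_rate=0.07, jailbreak_rate=0.02):
    """Estimate how many labeled prompts a uniform random sample contains."""
    return {
        "toxic": round(sample_size * toxic_rate),
        "jailbreaking": round(sample_size * jailbreak_rate),
    }


print(expected_counts(200))  # {'toxic': 14, 'jailbreaking': 4}
```

Under these assumptions, a 200-prompt sample contains only about 14 toxic prompts and about 4 jailbreaking attempts, so a larger limit may be needed when those subsets are the focus.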
HarmfulQA
Red-teaming harmful questions across 10 academic topics
| Preset | Goals | Description |
|---|---|---|
| harmfulqa | 1,960 | Harmful questions across 10 topics with red/blue conversations |
Source: declare-lab/HarmfulQA
Paper: arXiv:2308.09662
Topics: Science, history, math, literature, philosophy, social sciences, health, geography, education, business
Use case: Red-team testing across academic domains
Fairness & Discrimination
Discrim-Eval (Anthropic)
Discrimination testing in LM decision-making across demographics
| Preset | Goals | Description |
|---|---|---|
| discrim_eval | 9.4K | Decision prompts testing for discrimination by race, gender, and age |
Source: Anthropic/discrim-eval
Paper: arXiv:2312.03689
Coverage: 70 decision scenarios × 135 demographic combinations
Use case: Testing for discriminatory biases in decision-making
# Evaluate discrimination in decision-making
attack_config = {
    "attack_type": "baseline",
    "dataset": {"preset": "discrim_eval", "limit": 1000},
}
Prompt Injection & RAG Security
Prompt Injections
Prompt injection attack samples for detection and testing
| Preset | Goals | Description |
|---|---|---|
| prompt_injections | 662 | Prompt injection samples for attack detection |
Source: deepset/prompt-injections
Use case: Testing prompt injection vulnerabilities
# Test prompt injection defenses
attack_config = {
    "attack_type": "baseline",
    "dataset": {"preset": "prompt_injections"},
}
RAG Security
RAG and embedding security evaluation
| Preset | Goals | Description |
|---|---|---|
| rag_security | 100K | RAG benchmark covering industry-specific domains |
Source: galileo-ai/ragbench
Note: Interim dataset pending SafeRAG release (arXiv:2501.18636)
Use case: RAG security testing and evaluation
Truthfulness & Information Quality
TruthfulQA
Evaluating truthfulness and reducing hallucinations
| Preset | Goals | Description |
|---|---|---|
| truthfulqa | 817 | Questions to evaluate model truthfulness |
Source: truthfulqa/truthful_qa
Use case: Testing for truthful responses and misinformation
Listing All Presets
Programmatic Access
from hackagent.datasets import list_presets

presets = list_presets()
for name, description in sorted(presets.items()):
    print(f"• {name}: {description}")
Quick Reference
Run this to see all 30+ available presets with descriptions:
from hackagent.datasets import PRESETS

print(f"Total presets available: {len(PRESETS)}")
for category in ["Agent Safety", "Jailbreak", "Safety", "Knowledge Hazards", "Real-World"]:
    print(f"\n{category} Presets:")
    # Filter and display by category
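The category filter is left as a comment in the snippet above. One way it might be filled in is sketched below, under the assumption that the grouping has to be supplied by hand: `CATEGORY_PRESETS` is a hypothetical mapping assembled from the tables on this page, not a structure exported by the library, and it covers only the five categories named in the loop.

```python
# Hypothetical mapping built from the tables on this page.
CATEGORY_PRESETS = {
    "Agent Safety": ["agentharm", "agentharm_benign"],
    "Jailbreak": ["jailbreakbench", "strongreject", "harmbench",
                  "harmbench_contextual", "advbench"],
    "Safety": ["beavertails", "saladbench", "saladbench_attack",
               "simplesafetytests", "donotanswer", "xstest", "coconot"],
    "Knowledge Hazards": ["wmdp_bio", "wmdp_cyber", "wmdp_chem", "sosbench"],
    "Real-World": ["airbench", "toxicchat", "harmfulqa"],
}


def presets_by_category(available):
    """Group preset names by category, keeping only the available ones."""
    available = set(available)
    return {
        category: [name for name in names if name in available]
        for category, names in CATEGORY_PRESETS.items()
    }


# Demonstration with a hand-written list of names; with the library
# installed, the preset names it exposes could be passed instead.
demo = presets_by_category(["agentharm", "xstest", "wmdp_bio", "toxicchat"])
for category, names in demo.items():
    print(f"{category} Presets: {', '.join(names) or '(none)'}")
```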
Choosing the Right Preset
| Goal | Recommended Presets |
|---|---|
| Quick safety check | simplesafetytests, donotanswer |
| Agent-specific testing | agentharm, agentharm_benign |
| Jailbreak resistance | jailbreakbench, strongreject, harmbench |
| Comprehensive safety | beavertails, saladbench, airbench |
| Hazardous knowledge | wmdp_bio, wmdp_cyber, wmdp_chem, sosbench |
| Real-world attacks | toxicchat, harmfulqa, prompt_injections |
| Fairness testing | discrim_eval |
| RAG systems | rag_security, prompt_injections |
- Start with simplesafetytests for quick validation
- Use agentharm for agent-specific behaviors
- Combine multiple presets for comprehensive evaluation
- Always use shuffle: True with seed for reproducible random sampling
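The recommendations above can be turned into a set of configs with a short lookup. This is a sketch rather than part of the library: `RECOMMENDED` simply transcribes the table, `configs_for_goal` is a hypothetical helper, and the default `limit` and `seed` values are arbitrary choices for illustration.

```python
# Transcribed from the recommendation table; keys are lower-cased goals.
RECOMMENDED = {
    "quick safety check": ["simplesafetytests", "donotanswer"],
    "agent-specific testing": ["agentharm", "agentharm_benign"],
    "jailbreak resistance": ["jailbreakbench", "strongreject", "harmbench"],
    "comprehensive safety": ["beavertails", "saladbench", "airbench"],
    "hazardous knowledge": ["wmdp_bio", "wmdp_cyber", "wmdp_chem", "sosbench"],
    "real-world attacks": ["toxicchat", "harmfulqa", "prompt_injections"],
    "fairness testing": ["discrim_eval"],
    "rag systems": ["rag_security", "prompt_injections"],
}


def configs_for_goal(goal, attack_type="baseline", limit=100, seed=42):
    """Build one attack_config per recommended preset for the given goal."""
    return [
        {
            "attack_type": attack_type,
            # shuffle plus a fixed seed keeps the random sample reproducible
            "dataset": {"preset": preset, "limit": limit,
                        "shuffle": True, "seed": seed},
        }
        for preset in RECOMMENDED[goal.lower()]
    ]


configs = configs_for_goal("Jailbreak resistance")
print([cfg["dataset"]["preset"] for cfg in configs])
```

Sharing one seed across presets keeps every run in a combined evaluation repeatable.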