Dataset Presets

Pre-configured dataset presets provide instant access to 30+ popular AI safety benchmarks from leading research institutions and safety organizations.

Basic Usage

attack_config = {
    "attack_type": "baseline",
    "dataset": {
        "preset": "agentharm",  # Preset name
        "limit": 50,            # Optional: max goals
        "shuffle": True,        # Optional: randomize
        "seed": 42,             # Optional: reproducibility
    },
}

Quick Start

Use agentharm for agent safety testing, strongreject for jailbreak evaluation, or simplesafetytests for basic safety screening.
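
The `limit`, `shuffle`, and `seed` options from Basic Usage can be pictured with a small stand-alone sketch. This is not HackAgent's implementation: `select_goals` is a hypothetical helper, and it assumes that shuffling happens before the `limit` cut, which this page does not confirm.

```python
import random


def select_goals(goals, limit=None, shuffle=False, seed=None):
    """Illustrative only: shuffle with a private seeded RNG, then truncate."""
    selected = list(goals)  # copy so the caller's list is not mutated
    if shuffle:
        # A dedicated Random instance keeps the global RNG state untouched.
        random.Random(seed).shuffle(selected)
    if limit is not None:
        selected = selected[:limit]
    return selected


goals = [f"goal-{i}" for i in range(10)]  # stand-in data, not a real preset
first = select_goals(goals, limit=3, shuffle=True, seed=42)
second = select_goals(goals, limit=3, shuffle=True, seed=42)
print(first == second)  # True: the same seed yields the same sample
```

Under these assumptions, omitting `seed` while keeping `shuffle` enabled would give a different subset on each run, which makes results harder to compare.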


🤖 Agent Safety

AgentHarm

Testing harmful agentic tasks and autonomous agent behaviors

| Preset | Goals | Description |
| --- | --- | --- |
| agentharm | 176+ | Harmful agentic tasks (public test split) |
| agentharm_benign | not listed | Benign tasks for baseline comparison |

Source: ai-safety-institute/AgentHarm
Use case: Evaluating AI agents for harmful autonomous behaviors

# Test agent against harmful agentic tasks
attack_config = {
    "attack_type": "advprefix",
    "dataset": {"preset": "agentharm", "limit": 50},
}

๐Ÿ›ก๏ธ Jailbreak & Adversarial Evaluationโ€‹

JailbreakBench (NeurIPS 2024)

Curated jailbreak behaviors across 10 OpenAI policy categories

| Preset | Goals | Description |
| --- | --- | --- |
| jailbreakbench | 100 | Curated misuse behaviors across policy categories |

Source: JailbreakBench/JBB-Behaviors
Paper: arXiv:2404.01318
Use case: Standardized jailbreak evaluation with policy-aligned categories

# Curated jailbreak evaluation
attack_config = {
    "attack_type": "pair",
    "dataset": {"preset": "jailbreakbench", "shuffle": True},
    "n_iterations": 10,
}

StrongREJECT

Forbidden prompts for jailbreak resistance testing

| Preset | Goals | Description |
| --- | --- | --- |
| strongreject | 324 | Forbidden prompts for jailbreak evaluation |

Source: Lemhf14/strongreject_small_dataset
Use case: Testing model refusal capabilities against jailbreak attempts

# Evaluate jailbreak resistance
attack_config = {
    "attack_type": "pair",
    "dataset": {"preset": "strongreject", "shuffle": True, "limit": 100},
    "n_iterations": 10,
}

HarmBench

Standard harmful behavior evaluation

| Preset | Goals | Description |
| --- | --- | --- |
| harmbench | 200 | Standard harmful behavior prompts |
| harmbench_contextual | 100 | Contextual harmful behavior prompts |

Source: walledai/HarmBench
Use case: Comprehensive harmful behavior testing

AdvBench

Adversarially-designed harmful goals

| Preset | Goals | Description |
| --- | --- | --- |
| advbench | 520 | Adversarial harmful goals for robustness testing |

Source: walledai/AdvBench
Use case: Testing robustness against adversarial attacks


🔒 Safety & Refusal Testing

BeaverTails

Multi-category safety evaluation with 14 harm categories

| Preset | Goals | Description |
| --- | --- | --- |
| beavertails | 330K+ | Safety-labeled QA pairs across 14 harm categories |

Source: PKU-Alignment/BeaverTails
Paper: arXiv:2307.04657
Categories: Animal abuse, child abuse, discrimination, drugs, financial crime, hate speech, misinformation, privacy, self-harm, sexual content, terrorism, violence, and more
Use case: Comprehensive multi-category safety testing

# Large-scale safety evaluation
attack_config = {
    "attack_type": "baseline",
    "dataset": {"preset": "beavertails", "limit": 1000, "shuffle": True},
}

SALAD-Bench

Hierarchical safety benchmark with 3-level taxonomy

| Preset | Goals | Description |
| --- | --- | --- |
| saladbench | 21K | Harmful questions with hierarchical safety taxonomy (base set) |
| saladbench_attack | 5K | Attack-enhanced questions with jailbreak methods |

Source: OpenSafetyLab/Salad-Data
Paper: arXiv:2402.05044
Taxonomy: 3-level harm categories (representation, misinformation, malicious use, human autonomy, etc.)
Use case: Granular safety testing with attack augmentation

# Test against attack-enhanced prompts
attack_config = {
    "attack_type": "advprefix",
    "dataset": {"preset": "saladbench_attack", "limit": 500},
}

SimpleSafetyTests

Clear-cut harmful prompts for basic safety screening

| Preset | Goals | Description |
| --- | --- | --- |
| simplesafetytests | 100 | Clear-cut harmful prompts for quick safety checks |

Source: Bertievidgen/SimpleSafetyTests
Use case: Quick safety sanity checks and baseline testing

Do-Not-Answer

Questions LLMs should refuse to answer

| Preset | Goals | Description |
| --- | --- | --- |
| donotanswer | 939 | Questions that responsible AI systems should refuse |

Source: LibrAI/do-not-answer
Use case: Testing appropriate refusal behaviors

XSTest

Over-refusal and safety exaggeration evaluation

| Preset | Goals | Description |
| --- | --- | --- |
| xstest | not listed | Prompts to test for safety exaggeration/over-refusal |

Source: Paul/xstest
Use case: Ensuring models don't over-refuse benign requests

# Test for over-refusal
attack_config = {
    "attack_type": "baseline",
    "dataset": {"preset": "xstest"},
}

CoCoNot

Context-conditioned refusal evaluation

| Preset | Goals | Description |
| --- | --- | --- |
| coconot | not listed | Context-conditioned refusal evaluation prompts |

Source: allenai/coconot
Use case: Testing context-aware safety responses


🧪 Knowledge Hazards & Dangerous Capabilities

WMDP (Weapons of Mass Destruction Proxy)

Hazardous knowledge across biosecurity, cyber, and chemistry domains

| Preset | Domain | Description |
| --- | --- | --- |
| wmdp_bio | Biosecurity | Hazardous biology knowledge questions |
| wmdp_cyber | Cybersecurity | Hazardous cybersecurity knowledge questions |
| wmdp_chem | Chemistry | Hazardous chemistry knowledge questions |

Source: cais/wmdp
Use case: Testing for hazardous knowledge leakage in high-risk domains

# Test for hazardous knowledge leakage across domains
# (`agent` is assumed to be an agent object that was created earlier)
for domain in ["wmdp_bio", "wmdp_cyber", "wmdp_chem"]:
    attack_config = {
        "attack_type": "advprefix",
        "dataset": {"preset": domain, "limit": 50},
    }
    agent.hack(attack_config=attack_config)

SOS-Bench (Inspect Evals)

Safety alignment on scientific knowledge across 6 high-risk domains

| Preset | Goals | Description |
| --- | --- | --- |
| sosbench | 3,000 | Hazardous science prompts across biology, chemistry, pharmacy, physics, psychology, and medicine |

Source: SOSBench/SOSBench
Paper: arXiv:2505.21605
Use case: Safety evaluation on scientific knowledge (UK AISI framework)

# Test hazardous scientific knowledge
attack_config = {
    "attack_type": "baseline",
    "dataset": {"preset": "sosbench", "limit": 100},
}

๐ŸŒ Real-World & Regulation-Alignedโ€‹

AIR-Bench 2024 (Inspect Evals)

Regulation-aligned safety evaluation mapped to government AI regulations

| Preset | Goals | Description |
| --- | --- | --- |
| airbench | 5,690 | Regulation-aligned malicious prompts with 4-level risk taxonomy |

Source: stanford-crfm/air-bench-2024
Paper: arXiv:2407.17436
Use case: Alignment with government AI regulations (Stanford CRFM, UK AISI)

# Regulation-aligned safety testing
attack_config = {
    "attack_type": "advprefix",
    "dataset": {"preset": "airbench", "limit": 500, "shuffle": True},
}

ToxicChat

Real-world toxic user prompts from production systems

| Preset | Goals | Description |
| --- | --- | --- |
| toxicchat | 10K | Real user prompts from Vicuna demo with toxicity and jailbreaking labels |

Source: lmsys/toxic-chat
Paper: arXiv:2310.17389
Stats: ~7% toxic, ~2% jailbreaking attempts
Use case: Testing against real-world adversarial user inputs

# Test against real-world toxic prompts
attack_config = {
    "attack_type": "baseline",
    "dataset": {"preset": "toxicchat", "limit": 200},
}
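
Because only a small share of ToxicChat prompts are labeled toxic or jailbreaking, a limited sample may contain few adversarial inputs. The arithmetic below is a rough estimate that assumes the 200 goals are a uniform random sample and that the approximate rates quoted in the stats hold. `expected_count` is a hypothetical helper, not part of HackAgent.

```python
def expected_count(sample_size, rate):
    """Expected number of matching prompts in a uniform random sample."""
    return round(sample_size * rate)


limit = 200                # the limit used in the ToxicChat example config
toxic_rate = 0.07          # approximate share of toxic prompts
jailbreak_rate = 0.02      # approximate share of jailbreaking attempts

toxic = expected_count(limit, toxic_rate)
jailbreak = expected_count(limit, jailbreak_rate)
print(f"~{toxic} toxic, ~{jailbreak} jailbreaking out of {limit}")
# prints "~14 toxic, ~4 jailbreaking out of 200"
```

If the goals are taken in dataset order rather than shuffled, the actual counts may differ from this estimate.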

HarmfulQA

Red-teaming harmful questions across 10 academic topics

| Preset | Goals | Description |
| --- | --- | --- |
| harmfulqa | 1,960 | Harmful questions across 10 topics with red/blue conversations |

Source: declare-lab/HarmfulQA
Paper: arXiv:2308.09662
Topics: Science, history, math, literature, philosophy, social sciences, health, geography, education, business
Use case: Red-team testing across academic domains


โš–๏ธ Fairness & Discriminationโ€‹

Discrim-Eval (Anthropic)

Discrimination testing in LM decision-making across demographics

| Preset | Goals | Description |
| --- | --- | --- |
| discrim_eval | 9.4K | Decision prompts testing for discrimination by race, gender, and age |

Source: Anthropic/discrim-eval
Paper: arXiv:2312.03689
Coverage: 70 decision scenarios × 135 demographic combinations
Use case: Testing for discriminatory biases in decision-making

# Evaluate discrimination in decision-making
attack_config = {
    "attack_type": "baseline",
    "dataset": {"preset": "discrim_eval", "limit": 1000},
}
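
The coverage figure and the 9.4K goal count are consistent with each other, as the short calculation below shows. It also estimates how much of the dataset a limit of 1000 covers. The variable names are illustrative only.

```python
scenarios = 70                   # decision scenarios
demographic_combinations = 135   # demographic combinations per scenario

total_prompts = scenarios * demographic_combinations
print(total_prompts)  # 9450, listed as 9.4K in the table

limit = 1000
sampled_share = f"{limit / total_prompts:.1%}"
print(sampled_share)  # 10.6%
```

A limit of 1000 therefore covers roughly a tenth of the prompts, so it cannot include every scenario and demographic pairing.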

🔓 Prompt Injection & RAG Security

Prompt Injections

Prompt injection attack samples for detection and testing

| Preset | Goals | Description |
| --- | --- | --- |
| prompt_injections | 662 | Prompt injection samples for attack detection |

Source: deepset/prompt-injections
Use case: Testing prompt injection vulnerabilities

# Test prompt injection defenses
attack_config = {
    "attack_type": "baseline",
    "dataset": {"preset": "prompt_injections"},
}

RAG Security

RAG and embedding security evaluation

| Preset | Goals | Description |
| --- | --- | --- |
| rag_security | 100K | RAG benchmark covering industry-specific domains |

Source: galileo-ai/ragbench
Note: Interim dataset pending SafeRAG release (arXiv:2501.18636)
Use case: RAG security testing and evaluation


📊 Truthfulness & Information Quality

TruthfulQA

Evaluating truthfulness and reducing hallucinations

| Preset | Goals | Description |
| --- | --- | --- |
| truthfulqa | 817 | Questions to evaluate model truthfulness |

Source: truthfulqa/truthful_qa
Use case: Testing for truthful responses and misinformation


📋 Listing All Presets

Programmatic Access

from hackagent.datasets import list_presets

presets = list_presets()
for name, description in sorted(presets.items()):
    print(f"• {name}: {description}")

Quick Reference

Run this to see all 30+ available presets with descriptions:

from hackagent.datasets import PRESETS

print(f"Total presets available: {len(PRESETS)}")
for category in ["Agent Safety", "Jailbreak", "Safety", "Knowledge Hazards", "Real-World"]:
    print(f"\n{category} Presets:")
    # Filter and display by category
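
The filtering step is left as a comment in the snippet above. One way to fill it in is sketched below. It does not read anything from `hackagent`: `PRESET_CATEGORIES` is a hypothetical, hand-written mapping based on the section headings of this page, and `group_by_category` is an illustrative helper.

```python
from collections import defaultdict

# Hypothetical mapping assembled by hand from this page's section headings.
# It is an assumption and is not loaded from hackagent.
PRESET_CATEGORIES = {
    "agentharm": "Agent Safety",
    "agentharm_benign": "Agent Safety",
    "jailbreakbench": "Jailbreak",
    "strongreject": "Jailbreak",
    "simplesafetytests": "Safety",
    "wmdp_bio": "Knowledge Hazards",
    "toxicchat": "Real-World",
}


def group_by_category(preset_categories):
    """Return {category: [preset names sorted alphabetically]}."""
    grouped = defaultdict(list)
    for name, category in sorted(preset_categories.items()):
        grouped[category].append(name)
    return dict(grouped)


grouped = group_by_category(PRESET_CATEGORIES)
for category, names in grouped.items():
    print(f"{category} Presets: {', '.join(names)}")
```

If the library's preset metadata includes a category field, grouping on that field would be preferable to maintaining a separate mapping.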

💡 Choosing the Right Preset

| Goal | Recommended Presets |
| --- | --- |
| Quick safety check | simplesafetytests, donotanswer |
| Agent-specific testing | agentharm, agentharm_benign |
| Jailbreak resistance | jailbreakbench, strongreject, harmbench |
| Comprehensive safety | beavertails, saladbench, airbench |
| Hazardous knowledge | wmdp_bio, wmdp_cyber, wmdp_chem, sosbench |
| Real-world attacks | toxicchat, harmfulqa, prompt_injections |
| Fairness testing | discrim_eval |
| RAG systems | rag_security, prompt_injections |

Best Practices
  1. Start with simplesafetytests for quick validation
  2. Use agentharm for agent-specific behaviors
  3. Combine multiple presets for comprehensive evaluation
  4. Set "shuffle": True together with a fixed "seed" so that random samples are reproducible
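
Practices 3 and 4 can be combined by generating one config per preset with a shared seed. The sketch below only builds dictionaries in the shape used throughout this page. `build_configs` is a hypothetical helper, and its default values are illustrative choices.

```python
def build_configs(presets, attack_type="baseline", limit=50, seed=42):
    """Build one attack config per preset, all sharing the same seed."""
    return [
        {
            "attack_type": attack_type,
            "dataset": {
                "preset": preset,
                "limit": limit,
                "shuffle": True,  # random sample ...
                "seed": seed,     # ... made repeatable by a fixed seed
            },
        }
        for preset in presets
    ]


configs = build_configs(["simplesafetytests", "agentharm", "strongreject"])
for config in configs:
    print(config["dataset"]["preset"], config["dataset"]["seed"])
```

Each resulting config could then be passed to `agent.hack(attack_config=config)`, following the loop shown in the WMDP example.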