
AI Risks

HackAgent structures an evaluation around three elements:

  • Risk Profile: the set of one or more risk macro-categories (or micro-categories) to be tested.
  • Evaluation Campaign: the executable evaluation plan (datasets, attacks, objective, metrics).
  • Results: measured outcomes (for example, attack success rate (ASR) and judge score) used for tracking and comparison.
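
To make the Results element concrete, here is a minimal sketch of how the two example metrics could be aggregated, reading ASR as attack success rate (the fraction of attack attempts judged successful). The `AttemptResult` record, its fields, and both helper functions are hypothetical names chosen for illustration; they are not part of HackAgent's API, and the sample values are made up.

```python
from dataclasses import dataclass


@dataclass
class AttemptResult:
    """Hypothetical record of one attack attempt (not a HackAgent type)."""
    attack: str         # placeholder attack name
    success: bool       # whether the attempt was judged successful
    judge_score: float  # assumed numeric score assigned by a judge


def attack_success_rate(results):
    """Fraction of attempts marked successful; 0.0 when there are none."""
    if not results:
        return 0.0
    return sum(r.success for r in results) / len(results)


def mean_judge_score(results):
    """Arithmetic mean of judge scores; 0.0 when there are none."""
    if not results:
        return 0.0
    return sum(r.judge_score for r in results) / len(results)


# Illustrative, made-up data: two attempts for each of two placeholder attacks.
results = [
    AttemptResult("attack_a", True, 0.9),
    AttemptResult("attack_a", False, 0.2),
    AttemptResult("attack_b", True, 0.7),
    AttemptResult("attack_b", False, 0.2),
]

print(attack_success_rate(results))  # 0.5
print(mean_judge_score(results))
```

Tracking both numbers per campaign run is what makes the comparison across runs possible.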

[Figure: Security Workflow]

Reference Risk Macro-Categories

The current taxonomy includes the following risk macro-categories:

  1. Cybersecurity
  2. Accountability & Transparency
  3. Explainability & Interpretability
  4. Fairness
  5. Validity, Accuracy & Robustness
  6. Data Governance
  7. Data Privacy
  8. People & Social Impact
  9. Safety
  10. Sustainability & Environmental Risk
  11. Maintainability & AI Infrastructure
  12. Third Party Management

Micro-Category Example

  • Jailbreak is a risk micro-category under Cybersecurity.
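
The taxonomy above can be sketched as a small lookup structure. The `MacroCategory` enum, the `MICRO_CATEGORIES` mapping, and `macro_for` are hypothetical names used only for illustration and are not HackAgent identifiers. The mapping lists Jailbreak because it is the only micro-category named on this page.

```python
from enum import Enum


class MacroCategory(Enum):
    """The twelve reference risk macro-categories listed above."""
    CYBERSECURITY = "Cybersecurity"
    ACCOUNTABILITY_TRANSPARENCY = "Accountability & Transparency"
    EXPLAINABILITY_INTERPRETABILITY = "Explainability & Interpretability"
    FAIRNESS = "Fairness"
    VALIDITY_ACCURACY_ROBUSTNESS = "Validity, Accuracy & Robustness"
    DATA_GOVERNANCE = "Data Governance"
    DATA_PRIVACY = "Data Privacy"
    PEOPLE_SOCIAL_IMPACT = "People & Social Impact"
    SAFETY = "Safety"
    SUSTAINABILITY_ENVIRONMENTAL = "Sustainability & Environmental Risk"
    MAINTAINABILITY_INFRASTRUCTURE = "Maintainability & AI Infrastructure"
    THIRD_PARTY_MANAGEMENT = "Third Party Management"


# Hypothetical mapping from micro-category name to its parent macro-category.
MICRO_CATEGORIES = {
    "Jailbreak": MacroCategory.CYBERSECURITY,
}


def macro_for(micro_category):
    """Return the parent macro-category; raises KeyError if unknown."""
    return MICRO_CATEGORIES[micro_category]


print(macro_for("Jailbreak").value)  # Cybersecurity
```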

What Is a Risk Profile

A risk profile is a set of one or more risk macro-categories (or micro-categories) that you want to test.

Note: The defined categories are not intended to be mutually exclusive, nor to form an exhaustive partition of the risk space. Individual risks may span multiple categories, and overlaps are inherent to the taxonomy.

What Is Implemented Today

  • Implemented Risk Macro-Category: Cybersecurity
  • Implemented Risk Micro-Category: Jailbreak (under Cybersecurity)
  • Current quick flow: Evaluation Campaign

Core Concepts

| Concept | Meaning |
|---|---|
| Risk Profile | Defines the scope of the assessment as one or more macro-categories (or micro-categories). |
| Evaluation Campaign | Operationalizes the risk profile into an executable plan: datasets, attacks, objective, and metrics. |
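
The relationship between the two concepts can be sketched with plain data classes. This is a minimal illustration under stated assumptions, not HackAgent's actual data model: the class names, field names, and every dataset, attack, objective, and metric value below are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskProfile:
    """Hypothetical scope of an assessment: at least one category."""
    macro_categories: tuple = ()
    micro_categories: tuple = ()

    def __post_init__(self):
        # A risk profile is a set of one or more categories, so reject empty ones.
        if not (self.macro_categories or self.micro_categories):
            raise ValueError("a risk profile needs at least one category")


@dataclass
class EvaluationCampaign:
    """Hypothetical executable plan derived from a risk profile."""
    profile: RiskProfile
    datasets: list
    attacks: list
    objective: str
    metrics: list


# Placeholder values only; none of these names come from HackAgent.
profile = RiskProfile(
    macro_categories=("Cybersecurity",),
    micro_categories=("Jailbreak",),
)
campaign = EvaluationCampaign(
    profile=profile,
    datasets=["example_prompts"],
    attacks=["attack_a", "attack_b", "attack_c"],
    objective="example objective",
    metrics=["asr", "judge_score"],
)

print(campaign.profile.micro_categories)  # ('Jailbreak',)
```

Keeping the profile separate from the campaign mirrors the split described above: the profile states what is in scope, and the campaign states how that scope is tested.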

For the currently implemented configuration, see Jailbreak.

Documentation Guide

| Page | Description |
|---|---|
| Jailbreak | Cybersecurity risk profile and Jailbreak micro-category reference |
| Evaluation Campaign | Run the three primary jailbreak attacks in sequence |