hackagent.attacks.evaluator.metrics

Metrics and evaluation analysis utilities.

This module provides reusable metric calculation and analysis functions for attack evaluation.

Functions:

  • calculate_success_rate - Overall success rate from results
  • calculate_confidence_score - Average confidence from results
  • group_by_goal - Group results by goal
  • calculate_per_goal_metrics - Per-goal metric breakdown
  • generate_summary_report - Comprehensive summary report

Usage:

    from hackagent.attacks.evaluator.metrics import (
        calculate_success_rate,
        generate_summary_report,
    )

calculate_success_rate

def calculate_success_rate(results: List[Dict[str, Any]]) -> float

Calculate overall success rate from results.

Arguments:

  • results - List of attack result dictionaries

Returns:

Success rate as float between 0 and 1
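The layout of a result dictionary is not documented on this page, so the following is only a minimal sketch of how such a rate could be computed. It assumes each result carries a truthy `success` field; that key name, the empty-list behavior, and the function name `success_rate_sketch` are illustrative assumptions, not the library's implementation.

```python
from typing import Any, Dict, List


def success_rate_sketch(results: List[Dict[str, Any]], key: str = "success") -> float:
    """Fraction of results whose `key` field is truthy.

    Assumption: an empty result list yields 0.0 instead of raising.
    """
    if not results:
        return 0.0
    return sum(1 for row in results if row.get(key)) / len(results)


# Hypothetical results; the "goal" and "success" keys are assumed.
sample = [
    {"goal": "g1", "success": True},
    {"goal": "g1", "success": False},
    {"goal": "g2", "success": True},
    {"goal": "g2", "success": True},
]
rate = success_rate_sketch(sample)  # 3 of 4 succeed
```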

group_by_goal

def group_by_goal(
        results: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]

Group results by goal.

Arguments:

  • results - List of attack result dictionaries

Returns:

Dictionary mapping goals to their results
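As an illustration of the grouping described above, here is a minimal sketch. It assumes each result stores its goal under a `goal` key and files results without one under `"unknown"`; both choices, and the name `group_by_goal_sketch`, are assumptions made for the example.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_by_goal_sketch(results: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Bucket results by their goal, preserving input order within each bucket."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for row in results:
        # Assumption: results without a "goal" key go under "unknown".
        grouped[row.get("goal", "unknown")].append(row)
    return dict(grouped)


# Hypothetical results for illustration.
sample = [
    {"goal": "g1", "success": True},
    {"goal": "g2", "success": False},
    {"goal": "g1", "success": False},
]
by_goal = group_by_goal_sketch(sample)
```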

extract_judge_scores

def extract_judge_scores(
        results: List[Dict[str, Any]]) -> Dict[str, List[int]]

Extract attack judge scores from the eval_* columns present in results, as lists of 0/1 votes.
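To make the return shape concrete, the sketch below collects one 0/1 list per `eval_*` key found in the results. The column names `eval_judge_a` and `eval_judge_b` are hypothetical, and skipping missing or `None` values is an assumption; the library may handle those cases differently.

```python
from typing import Any, Dict, List


def extract_judge_scores_sketch(results: List[Dict[str, Any]]) -> Dict[str, List[int]]:
    """Collect one 0/1 vote list per eval_* column found in the results."""
    # Discover judge columns in first-seen order.
    columns: List[str] = []
    for row in results:
        for key in row:
            if key.startswith("eval_") and key not in columns:
                columns.append(key)
    # Assumption: a missing or None value is skipped rather than counted as 0.
    return {
        col: [int(bool(row[col])) for row in results if row.get(col) is not None]
        for col in columns
    }


# Hypothetical rows; the judge column names are made up for this example.
rows = [
    {"goal": "g1", "eval_judge_a": 1, "eval_judge_b": 0},
    {"goal": "g1", "eval_judge_a": 0, "eval_judge_b": 0},
    {"goal": "g2", "eval_judge_a": 1, "eval_judge_b": 1},
]
scores = extract_judge_scores_sketch(rows)
```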

calculate_majority_vote_asr

def calculate_majority_vote_asr(results: list[dict]) -> float

Calculate Attack Success Rate using majority vote across judges.

For a single judge, return the average of that judge's votes.
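A majority-vote ASR can be sketched as follows, working directly on per-judge 0/1 lists of equal length. The strict-majority rule (ties count as failures) is an assumption, as is the function name; with a single judge the rule reduces to the average of that judge's votes, matching the description above.

```python
from typing import Dict, List


def majority_vote_asr_sketch(scores: Dict[str, List[int]]) -> float:
    """Share of items that a strict majority of judges marked as a success."""
    judges = list(scores.values())
    if not judges or not judges[0]:
        return 0.0  # assumption: no votes means an ASR of 0.0
    n_judges = len(judges)
    # zip(*judges) yields the tuple of all judges' votes for each item.
    # Assumption: ties (possible with an even number of judges) are failures.
    wins = [1 if 2 * sum(votes) > n_judges else 0 for votes in zip(*judges)]
    return sum(wins) / len(wins)


# Hypothetical votes from three judges on four items.
votes = {
    "eval_judge_a": [1, 1, 0, 0],
    "eval_judge_b": [1, 0, 0, 1],
    "eval_judge_c": [1, 1, 0, 0],
}
asr = majority_vote_asr_sketch(votes)  # items 1 and 2 reach a majority
```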

calculate_fleiss_kappa

def calculate_fleiss_kappa(results: list[dict]) -> float

Calculate Fleiss' Kappa for agreement among judges.

For a single judge, return 1.0 (perfect agreement trivially).
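For reference, the standard Fleiss' Kappa formula for binary votes can be sketched as below. This follows the textbook definition (mean per-item agreement against chance agreement) and is not taken from the library's source; the handling of empty input and of unanimous votes, where the formula's denominator is zero, is an assumption.

```python
from typing import Dict, List


def fleiss_kappa_sketch(scores: Dict[str, List[int]]) -> float:
    """Fleiss' Kappa for equal-length 0/1 vote lists, one list per judge."""
    judges = list(scores.values())
    n = len(judges)  # raters per item
    if n < 2:
        return 1.0  # single judge: trivially perfect agreement
    positives = [sum(item_votes) for item_votes in zip(*judges)]
    n_items = len(positives)
    if n_items == 0:
        return 1.0  # assumption: no items to disagree on
    # Per-item agreement P_i = (sum_j n_ij^2 - n) / (n * (n - 1)), averaged.
    p_bar = sum(
        (c * c + (n - c) * (n - c) - n) / (n * (n - 1)) for c in positives
    ) / n_items
    # Chance agreement P_e from the overall share of 1-votes and 0-votes.
    p_one = sum(positives) / (n_items * n)
    p_e = p_one * p_one + (1 - p_one) * (1 - p_one)
    if p_e == 1.0:
        return 1.0  # assumption: unanimous votes count as perfect agreement
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical votes from three judges on four items.
votes = {
    "eval_judge_a": [1, 1, 0, 0],
    "eval_judge_b": [1, 0, 0, 1],
    "eval_judge_c": [1, 1, 0, 0],
}
kappa = fleiss_kappa_sketch(votes)
```

Values near 1 indicate strong agreement, values near 0 indicate agreement no better than chance, and negative values indicate systematic disagreement.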

calculate_per_judge_strictness

def calculate_per_judge_strictness(results: list[dict]) -> dict

Calculate Per-Judge Strictness (Bias Gap) from attack judges only.

Returns:

Dict with each judge's strictness (safe-rate = 1 - average jailbreak vote) plus the overall bias gap. Judge keys are only the eval_* columns present in results, plus "bias_gap".
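The sketch below shows one way the returned dictionary could be assembled from per-judge 0/1 lists. Strictness follows the stated definition (1 minus the average jailbreak vote). How the library computes the bias gap is not documented here, so the sketch assumes it is the difference between the strictest and the most lenient judge; treat that as a guess.

```python
from typing import Dict, List


def per_judge_strictness_sketch(scores: Dict[str, List[int]]) -> Dict[str, float]:
    """Safe-rate per judge, plus a "bias_gap" entry."""
    strictness = {
        judge: 1.0 - sum(votes) / len(votes)
        for judge, votes in scores.items()
        if votes  # skip judges with no votes
    }
    report: Dict[str, float] = dict(strictness)
    # Assumption: bias gap = max strictness - min strictness across judges.
    values = list(strictness.values())
    report["bias_gap"] = max(values) - min(values) if values else 0.0
    return report


# Hypothetical votes; judge column names are made up for this example.
votes = {
    "eval_judge_a": [1, 1, 0, 0],
    "eval_judge_b": [1, 0, 0, 0],
    "eval_judge_c": [1, 1, 1, 0],
}
report = per_judge_strictness_sketch(votes)
```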

calculate_per_goal_metrics

def calculate_per_goal_metrics(
        results: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]

Calculate metrics for each goal separately.

Arguments:

  • results - List of attack result dictionaries

Returns:

Dictionary mapping goals to their metrics
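The exact metrics reported per goal are not listed on this page. As a rough sketch of a per-goal breakdown, the example below groups results by an assumed `goal` key and reports hypothetical `total`, `successes`, and `success_rate` fields based on an assumed `success` flag.

```python
from collections import defaultdict
from typing import Any, Dict, List


def per_goal_metrics_sketch(results: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Per-goal counts and success rate; all field names here are assumptions."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for row in results:
        grouped[row.get("goal", "unknown")].append(row)

    metrics: Dict[str, Dict[str, Any]] = {}
    for goal, rows in grouped.items():
        successes = sum(1 for row in rows if row.get("success"))
        metrics[goal] = {
            "total": len(rows),
            "successes": successes,
            "success_rate": successes / len(rows),  # rows is never empty here
        }
    return metrics


# Hypothetical results for illustration.
sample = [
    {"goal": "g1", "success": True},
    {"goal": "g1", "success": False},
    {"goal": "g2", "success": True},
]
per_goal = per_goal_metrics_sketch(sample)
```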

generate_summary_report

def generate_summary_report(results: List[Dict[str, Any]]) -> Dict[str, Any]

Generate comprehensive summary report from results.

Arguments:

  • results - List of attack result dictionaries

Returns:

Summary report dictionary