hackagent.attacks.evaluator.metrics
Metrics and evaluation analysis utilities.
This module provides reusable metric calculation and analysis functions for attack evaluation.
Functions:
    calculate_success_rate: Overall success rate from results
    calculate_confidence_score: Average confidence from results
    group_by_goal: Group results by goal
    calculate_per_goal_metrics: Per-goal metric breakdown
    generate_summary_report: Comprehensive summary report
Usage:
    from hackagent.attacks.evaluator.metrics import (
        calculate_success_rate,
        generate_summary_report,
    )
calculate_success_rate
def calculate_success_rate(results: List[Dict[str, Any]]) -> float
Calculate overall success rate from results.
Arguments:
results - List of attack result dictionaries
Returns:
Success rate as a float between 0 and 1
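The computation described above can be sketched as follows. This is a minimal illustration, not the module's implementation: the `success` key and the name `sketch_success_rate` are assumptions, since the result dictionary schema is not documented here.

```python
from typing import Any, Dict, List


def sketch_success_rate(results: List[Dict[str, Any]]) -> float:
    """Return the fraction of results flagged as successful (0.0 if empty)."""
    if not results:
        return 0.0
    # ASSUMPTION: each result carries a truthy "success" field when the attack worked.
    hits = sum(1 for row in results if row.get("success"))
    return hits / len(results)


example = [
    {"success": True},
    {"success": False},
    {"success": True},
    {"success": True},
]
rate = sketch_success_rate(example)  # 3 of 4 results succeeded
```

Returning 0.0 for an empty list is a design choice of this sketch that keeps the value inside the documented 0 to 1 range.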
group_by_goal
def group_by_goal(
results: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]
Group results by goal.
Arguments:
results - List of attack result dictionaries
Returns:
Dictionary mapping goals to their results
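A grouping of this kind might look like the sketch below. The `goal` key, the `"unknown"` fallback, and the name `sketch_group_by_goal` are hypothetical; the actual field names used by the module are not shown in this reference.

```python
from collections import defaultdict
from typing import Any, Dict, List


def sketch_group_by_goal(
    results: List[Dict[str, Any]],
) -> Dict[str, List[Dict[str, Any]]]:
    """Bucket result dictionaries by their goal string."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for row in results:
        # ASSUMPTION: the goal lives under "goal"; rows without it share one bucket.
        grouped[row.get("goal", "unknown")].append(row)
    return dict(grouped)


rows = [
    {"goal": "g1", "success": True},
    {"goal": "g2", "success": False},
    {"goal": "g1", "success": False},
]
by_goal = sketch_group_by_goal(rows)
```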
extract_judge_scores
def extract_judge_scores(
results: List[Dict[str, Any]]) -> Dict[str, List[int]]
Extract attack judge scores from the eval_* columns present in results, as one list of 0/1 votes per judge.
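To make the return shape concrete, here is a rough sketch of such an extraction. It assumes every key with the `eval_` prefix holds a judge verdict and that a missing or falsy value counts as 0; the column names `eval_a` and `eval_b` are invented for illustration, and the real function may restrict itself to a known set of judge columns.

```python
from typing import Any, Dict, List


def sketch_extract_judge_scores(
    results: List[Dict[str, Any]],
) -> Dict[str, List[int]]:
    """Map each eval_* column found in results to a list of 0/1 votes."""
    # Collect eval_* keys in first-seen order across all rows.
    judge_keys: List[str] = []
    for row in results:
        for key in row:
            if key.startswith("eval_") and key not in judge_keys:
                judge_keys.append(key)
    # ASSUMPTION: truthy means the judge voted "jailbreak"; missing counts as 0.
    return {
        key: [1 if row.get(key) else 0 for row in results]
        for key in judge_keys
    }


rows = [
    {"goal": "g1", "eval_a": 1, "eval_b": 0},
    {"goal": "g2", "eval_a": 0, "eval_b": 0},
    {"goal": "g3", "eval_a": True, "eval_b": 1},
]
scores = sketch_extract_judge_scores(rows)
```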
calculate_majority_vote_asr
def calculate_majority_vote_asr(results: list[dict]) -> float
Calculate Attack Success Rate using majority vote across judges.
For a single judge, return the average of that judge's votes.
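One way to compute a majority-vote ASR is sketched below. It operates directly on per-judge 0/1 lists (the shape extract_judge_scores is documented to return) rather than on raw results. Counting ties as failures is an assumption of this sketch; the module's tie-breaking rule is not stated here.

```python
from typing import Dict, List


def sketch_majority_vote_asr(scores: Dict[str, List[int]]) -> float:
    """Fraction of items on which a strict majority of judges voted 1."""
    judges = list(scores.values())
    if not judges or not judges[0]:
        return 0.0
    n_items = len(judges[0])
    successes = 0
    # zip(*judges) yields one tuple of votes per item.
    for votes in zip(*judges):
        # ASSUMPTION: strict majority, so an even split counts as a failure.
        if sum(votes) * 2 > len(votes):
            successes += 1
    return successes / n_items


three_judges = {
    "eval_a": [1, 0, 1, 1],
    "eval_b": [1, 0, 0, 1],
    "eval_c": [0, 0, 1, 1],
}
asr = sketch_majority_vote_asr(three_judges)
single = sketch_majority_vote_asr({"eval_a": [1, 0, 1, 1]})
```

With one judge, the strict-majority rule reduces to that judge's own vote, so the result equals the average of the votes, consistent with the single-judge behavior described above.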
calculate_fleiss_kappa
def calculate_fleiss_kappa(results: list[dict]) -> float
Calculate Fleiss' Kappa for agreement among judges.
For a single judge, return 1.0 (perfect agreement trivially).
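For reference, the standard Fleiss' Kappa formula for binary votes can be sketched as follows. The function name and the handling of the degenerate case (all votes identical, where expected agreement is 1) are assumptions of this sketch, and the input is the per-judge 0/1 lists rather than raw results.

```python
from typing import Dict, List


def sketch_fleiss_kappa(scores: Dict[str, List[int]]) -> float:
    """Fleiss' Kappa over two categories (0 and 1) for a fixed set of judges."""
    judges = list(scores.values())
    n = len(judges)  # raters per item
    if n < 2:
        return 1.0  # single judge: trivially perfect agreement
    ones_per_item = [sum(votes) for votes in zip(*judges)]
    n_items = len(ones_per_item)
    if n_items == 0:
        return 1.0  # ASSUMPTION: nothing to disagree on
    # Per-item agreement: P_i = (sum_j n_ij^2 - n) / (n * (n - 1))
    p_items = [
        (k * k + (n - k) * (n - k) - n) / (n * (n - 1)) for k in ones_per_item
    ]
    p_bar = sum(p_items) / n_items
    # Category proportions and expected agreement by chance.
    p_one = sum(ones_per_item) / (n_items * n)
    p_expected = p_one ** 2 + (1 - p_one) ** 2
    if p_expected == 1.0:
        return 1.0  # ASSUMPTION: all votes identical, treated as full agreement
    return (p_bar - p_expected) / (1 - p_expected)


kappa = sketch_fleiss_kappa(
    {
        "eval_a": [1, 0, 1, 1],
        "eval_b": [1, 0, 0, 1],
        "eval_c": [0, 0, 1, 1],
    }
)
```

In this example the mean observed agreement is 2/3 and the chance agreement is 74/144, which gives a kappa of 22/70, roughly 0.31.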
calculate_per_judge_strictness
def calculate_per_judge_strictness(results: list[dict]) -> dict
Calculate Per-Judge Strictness (Bias Gap) from attack judges only.
Returns:
Dict with each judge's strictness (safe rate = 1 - average jailbreak vote) and the overall bias gap. Judge keys are only the eval_* columns present in results, plus "bias_gap".
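The returned structure could be produced along these lines. Strictness follows the safe-rate definition given above; the bias gap, however, is not defined in this reference, so the sketch assumes it is the spread between the strictest and the most lenient judge. That definition, like the function name, is an assumption.

```python
from typing import Dict, List


def sketch_per_judge_strictness(scores: Dict[str, List[int]]) -> Dict[str, float]:
    """Per-judge safe rate plus a 'bias_gap' entry."""
    report: Dict[str, float] = {}
    for judge, votes in scores.items():
        if votes:
            # Strictness = 1 - average jailbreak vote (as documented above).
            report[judge] = 1.0 - sum(votes) / len(votes)
    values = list(report.values())
    # ASSUMPTION: bias gap = max strictness - min strictness across judges.
    report["bias_gap"] = max(values) - min(values) if values else 0.0
    return report


strictness = sketch_per_judge_strictness(
    {
        "eval_a": [1, 0, 1, 1],  # safe rate 0.25
        "eval_b": [1, 0, 0, 1],  # safe rate 0.50
        "eval_c": [0, 0, 1, 1],  # safe rate 0.50
    }
)
```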
calculate_per_goal_metrics
def calculate_per_goal_metrics(
results: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]
Calculate metrics for each goal separately.
Arguments:
results - List of attack result dictionaries
Returns:
Dictionary mapping goals to their metrics
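A per-goal breakdown might be shaped like the sketch below. Which metrics the real function reports is not documented here, so the `total`, `successes`, and `success_rate` fields, as well as the `goal` and `success` input keys, are purely illustrative.

```python
from collections import defaultdict
from typing import Any, Dict, List


def sketch_per_goal_metrics(
    results: List[Dict[str, Any]],
) -> Dict[str, Dict[str, Any]]:
    """Compute a small metric dictionary for each goal."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for row in results:
        # ASSUMPTION: "goal" and "success" are the relevant result keys.
        grouped[row.get("goal", "unknown")].append(row)
    metrics: Dict[str, Dict[str, Any]] = {}
    for goal, rows in grouped.items():
        successes = sum(1 for row in rows if row.get("success"))
        metrics[goal] = {
            "total": len(rows),
            "successes": successes,
            "success_rate": successes / len(rows),
        }
    return metrics


per_goal = sketch_per_goal_metrics(
    [
        {"goal": "g1", "success": True},
        {"goal": "g1", "success": False},
        {"goal": "g2", "success": True},
    ]
)
```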
generate_summary_report
def generate_summary_report(results: List[Dict[str, Any]]) -> Dict[str, Any]
Generate comprehensive summary report from results.
Arguments:
results - List of attack result dictionaries
Returns:
Summary report dictionary