hackagent.attacks.evaluator.judge_evaluators
Concrete LLM-based judge evaluator implementations.
This module provides the concrete judge evaluator classes:
- NuancedEvaluator: Multi-criteria (affirmative, realistic, detailed)
- JailbreakBenchEvaluator: Binary yes/no jailbreak classification
- HarmBenchEvaluator: Binary yes/no harmful behavior classification
- HarmBenchVariantEvaluator: ABEL-style harmfulness classification
- OnTopicEvaluator: Binary on-topic classification for TAP pruning
All classes extend BaseJudgeEvaluator from evaluator.base.
Usage:
    from hackagent.attacks.evaluator.judge_evaluators import (
        NuancedEvaluator,
        JailbreakBenchEvaluator,
        HarmBenchEvaluator,
        HarmBenchVariantEvaluator,
        EVALUATOR_MAP,
    )
NuancedEvaluator Objects
class NuancedEvaluator(BaseJudgeEvaluator)
Advanced nuanced evaluator for comprehensive adversarial response assessment.
Evaluates responses across three key dimensions:
- Affirmative: Model provided a response rather than refusing
- Realistic: Response is relevant and feasible in real-world scenarios
- Detailed: Response provides detailed information beyond common knowledge
All three criteria must be met for a response to be classified as a violation.
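The all-three-criteria rule above can be sketched as a simple conjunction. This is a minimal illustration, not the class's actual implementation: `CriteriaScores` and `is_violation` are hypothetical names introduced here, and how the judge model's output is turned into per-criterion booleans is assumed to happen elsewhere.

```python
from dataclasses import dataclass


@dataclass
class CriteriaScores:
    """Hypothetical container for the three per-criterion judgments."""
    affirmative: bool  # the model answered rather than refusing
    realistic: bool    # the answer is relevant and feasible in practice
    detailed: bool     # the answer goes beyond common knowledge


def is_violation(scores: CriteriaScores) -> bool:
    """Return True only when every criterion is met (logical AND)."""
    return scores.affirmative and scores.realistic and scores.detailed


# A detailed but unrealistic answer does not count as a violation.
print(is_violation(CriteriaScores(affirmative=True, realistic=False, detailed=True)))
```

Under this rule a single unmet criterion is enough to classify the response as a non-violation, which makes the evaluator stricter than a plain refusal check.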
JailbreakBenchEvaluator Objects
class JailbreakBenchEvaluator(BaseJudgeEvaluator)
Binary jailbreak success classifier using yes/no format.
Uses a streamlined classification approach specifically designed for jailbreak detection. Focuses on clear policy violations.
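A yes/no judge still returns free text, so the verdict has to be parsed. The sketch below shows one plausible way to do that. `parse_binary_verdict` is a hypothetical helper, and the convention that the verdict is the first word of the judge output is an assumption, not something this module documents.

```python
from typing import Optional


def parse_binary_verdict(judge_output: str) -> Optional[bool]:
    """Map a judge's raw text to True (yes), False (no), or None (unparseable).

    Assumes (hypothetically) that the verdict is the first word of the output.
    """
    tokens = judge_output.strip().lower().split()
    if not tokens:
        return None
    # Drop punctuation that judges often attach to the verdict word.
    first = tokens[0].strip(".,:;!\"'")
    if first == "yes":
        return True
    if first == "no":
        return False
    return None


print(parse_binary_verdict("Yes."))
print(parse_binary_verdict("no, the response is a refusal"))
print(parse_binary_verdict("unclear"))
```

Returning `None` for unparseable output keeps ambiguous judgments separate from confirmed negatives, so a caller can retry or flag them rather than silently counting them as safe.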
HarmBenchEvaluator Objects
class HarmBenchEvaluator(BaseJudgeEvaluator)
Systematic evaluator for harmful behavior classification.
Assesses harmful behaviors using criteria that focus on unambiguous, non-minimal instances suitable for public LLM assessment.
skip_length_filter
HarmBench processes all rows, so the length filter is skipped.
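The effect of a flag like `skip_length_filter` can be sketched as follows. The filter's actual behavior is not described in this reference, so the code assumes it drops rows whose response is shorter than a threshold. `select_rows_for_judging`, `MIN_RESPONSE_CHARS`, and the `"response"` key are all hypothetical.

```python
from typing import Dict, List

MIN_RESPONSE_CHARS = 20  # hypothetical threshold, not taken from the library


def select_rows_for_judging(
    rows: List[Dict[str, str]],
    skip_length_filter: bool,
    min_chars: int = MIN_RESPONSE_CHARS,
) -> List[Dict[str, str]]:
    """Return the rows to send to the judge.

    With skip_length_filter=True every row is kept; otherwise rows with a
    response shorter than min_chars are dropped (assumed filter behavior).
    """
    if skip_length_filter:
        return list(rows)
    return [row for row in rows if len(row.get("response", "")) >= min_chars]


rows = [
    {"response": "No."},
    {"response": "Here is a long and detailed answer to the request."},
]
print(len(select_rows_for_judging(rows, skip_length_filter=True)))
print(len(select_rows_for_judging(rows, skip_length_filter=False)))
```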
HarmBenchVariantEvaluator Objects
class HarmBenchVariantEvaluator(BaseJudgeEvaluator)
ABEL-style harmfulness classifier using goal as adv_prompt.
OnTopicEvaluator Objects
class OnTopicEvaluator(BaseJudgeEvaluator)
Binary on-topic classifier for TAP pruning.