hackagent.attacks.techniques.pair.attack
PAIR attack implementation.
Implements the Prompt Automatic Iterative Refinement (PAIR) attack using an attacker LLM to iteratively refine jailbreak prompts.
Result Tracking: Uses TrackingCoordinator to manage both pipeline-level StepTracker and per-goal Tracker. The coordinator handles goal lifecycle, crash-safe finalization, and summary logging.
PAIRAttack Objects
class PAIRAttack(BaseAttack)
PAIR (Prompt Automatic Iterative Refinement) attack.
Implements the PAIR algorithm from: Chao et al., "Jailbreaking Black Box Large Language Models in Twenty Queries" (2023) https://arxiv.org/abs/2310.08419
PAIR uses an attacker LLM to iteratively refine an adversarial prompt based on the target model's responses and a scorer score:
- The attacker generates an initial or refined jailbreak prompt.
- The prompt is sent to the target model.
- A scorer rates the response on a 1–10 jailbreak success scale.
- The score and response are fed back to the attacker as context for the next refinement.
- Steps 1–4 repeat for
n_iterationsrounds or until early stop.
Multiple independent n_streams are run in parallel (one per goal);
each stream maintains its own conversation history with the attacker.
The attack requires three separate model roles:
- Attacker (
config["attacker"]) — an LLM that proposes prompt improvements based on feedback. - Target — the victim model reached via
agent_router. - Scorer (
config["scorer"]) — dedicated scorer model using the AutoDAN-Turbo scorer+wrapper protocol.
Attributes:
- ``0 - Merged PAIR configuration dictionary.
- ``1 - Authenticated HackAgent API client.
- ``2 - Router for the victim model.
- ``3 - Router for the attacker LLM.
- ``4 - Router for the scorer LLM.
5 - Loaded :class:6 instance for the configuredobjectivekey.9 - Hierarchical logger athackagent.attacks.pair``.
__init__
def __init__(config: Optional[Dict[str, Any]] = None,
client: Optional[AuthenticatedClient] = None,
agent_router: Optional[AgentRouter] = None)
Initialize PAIR attack.
Arguments:
config- Optional configuration overrides merged into :data:~hackagent.attacks.techniques.pair.config.DEFAULT_PAIR_CONFIG.client- Authenticated HackAgent API client.agent_router- Router for the victim model.
Raises:
ValueError- Ifclientoragent_routerisNone, if the attacker router cannot be initialised, or if the configuredobjectivekey is not in :data:~hackagent.attacks.techniques.pair.config.DEFAULT_PAIR_CONFIG3.
run
@with_tui_logging(logger_name="hackagent.attacks", level=logging.INFO)
def run(goals: List[str]) -> List[Dict[str, Any]]
Execute PAIR attack on goals.
Uses TrackingCoordinator to manage both pipeline-level and per-goal result tracking through a single unified interface.
Arguments:
goals- List of harmful goals to test
Returns:
List of attack results with scores