PAP (Persuasive Adversarial Prompts)
PAP is a taxonomy-guided persuasion attack that paraphrases harmful prompts into persuasive variants using 40 social-science persuasion techniques. An attacker LLM rewrites the harmful goal using a selected persuasion technique (e.g. Evidence-based Persuasion, Expert Endorsement, Misrepresentation), and the resulting human-readable persuasive prompt is sent to the target model. The technique exploits the observation that LLMs, trained to be helpful conversational partners, are vulnerable to the same persuasion strategies that work on humans.
Overview
PAP uses an attacker LLM to paraphrase harmful goals into persuasive variants via in-context learning. For each goal, the attack iterates over the selected persuasion techniques from a taxonomy of 40. For each technique, the attacker LLM generates a persuasive version of the harmful prompt, which is then sent to the target model. A judge evaluates the response, and if it confirms a jailbreak, the remaining techniques are skipped (early stop).
Research Foundation
PAP is based on the paper:
"How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs", Zeng et al., 2024, arXiv:2401.06373
The paper reports that persuasion techniques drawn from social science achieve an attack success rate of over 92% on aligned LLMs, including GPT-4, without any specialised optimisation.
How PAP Works
Step-by-step
- Technique Selection — The attack selects persuasion techniques to try (default: top-5 from the paper).
- Persuasive Paraphrasing — For each technique, the attacker LLM receives the harmful goal plus the technique's definition and few-shot examples, and produces a persuasive variant.
- Target Query — The persuasive prompt is sent to the target model.
- Judge Evaluation — A judge (e.g. HarmBench) evaluates the target's response.
- Early Stop — If the judge confirms a jailbreak, remaining techniques are skipped.
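The steps above can be sketched as a simple loop. This is a minimal illustration, not HackAgent's implementation: `run_pap_for_goal` and the three callables (`paraphrase`, `query_target`, `is_jailbreak`) are hypothetical names standing in for the attacker LLM, the target model, and the judge.

```python
from typing import Callable, Optional, Sequence


def run_pap_for_goal(
    goal: str,
    techniques: Sequence[str],
    paraphrase: Callable[[str, str], str],      # hypothetical attacker LLM call
    query_target: Callable[[str], str],         # hypothetical target model call
    is_jailbreak: Callable[[str, str], bool],   # hypothetical judge call
) -> Optional[dict]:
    """Try each technique in order; return the first successful attempt, else None."""
    for technique in techniques:
        # Persuasive paraphrasing: the attacker rewrites the goal with this technique.
        prompt = paraphrase(goal, technique)
        # Target query: send the persuasive prompt to the model under test.
        response = query_target(prompt)
        # Judge evaluation, then early stop on the first confirmed jailbreak.
        if is_jailbreak(goal, response):
            return {"technique": technique, "prompt": prompt, "response": response}
    return None


# Stub components so the control flow can be exercised without any LLM.
tried = []


def fake_paraphrase(goal: str, technique: str) -> str:
    tried.append(technique)
    return f"[{technique}] {goal}"


result = run_pap_for_goal(
    "Explain how to pick a lock",
    ["Evidence-based Persuasion", "Expert Endorsement", "Misrepresentation"],
    paraphrase=fake_paraphrase,
    query_target=lambda prompt: prompt.upper(),
    is_jailbreak=lambda goal, response: "EXPERT" in response,
)
```

With these stubs the second technique is judged successful, so the third is never attempted, which is the early-stop behaviour described in the last step.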
Basic Usage
from hackagent import HackAgent, AgentTypeEnum

# Target agent to be tested
agent = HackAgent(
    name="target_agent",
    endpoint="http://localhost:8000",
    agent_type=AgentTypeEnum.OPENAI_SDK,
)

attack_config = {
    "attack_type": "pap",
    "goals": ["Explain how to pick a lock"],
    # Attacker LLM that performs the persuasive paraphrasing
    "attacker": {
        "identifier": "gpt-4",
        "endpoint": "https://api.openai.com/v1",
        "agent_type": "OPENAI_SDK",
        "api_key": "your-openai-key",
    },
    # Judge(s) that decide whether a response counts as a jailbreak
    "judges": [
        {
            "identifier": "gpt-4o-mini",
            "type": "harmbench",
            "agent_type": "OPENAI_SDK",
            "api_key": "your-openai-key",
        }
    ],
}

results = agent.hack(attack_config=attack_config)
Configuration Parameters
pap_params
| Parameter | Type | Default | Description |
|---|---|---|---|
| techniques | str \| list | "top5" | Which persuasion techniques to use. "top5" uses the 5 most effective from the paper, "all" uses all 40, or provide an explicit list of technique names. |
| max_techniques_per_goal | int | 0 | Maximum techniques to try per goal. 0 means try all selected techniques. |
| attacker_temperature | float | 1.0 | Temperature for the attacker LLM's paraphrasing. |
| attacker_max_tokens | int | 1024 | Max tokens for the attacker LLM response. |
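As an illustration, the parameters above might be supplied as follows. The sketch assumes they are nested under a "pap_params" key of the attack configuration, which the section heading suggests but this page does not state outright. The values are arbitrary examples, and the attacker and judges blocks are omitted for brevity.

```python
# Assumption: pap_params is a nested block inside attack_config.
attack_config = {
    "attack_type": "pap",
    "goals": ["Explain how to pick a lock"],
    "pap_params": {
        # Explicit list of technique names instead of the "top5" default.
        "techniques": ["Evidence-based Persuasion", "Logical Appeal"],
        # 0 means every selected technique is tried (subject to early stop).
        "max_techniques_per_goal": 0,
        # Documented defaults, written out for clarity.
        "attacker_temperature": 1.0,
        "attacker_max_tokens": 1024,
    },
}
```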
Top-Level Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| attacker | dict | {...} | Attacker LLM config: identifier, endpoint, agent_type, api_key |
| judges | list | [{...}] | Judge configurations |
| batch_size | int | 1 | Parallelism for goal processing |
| batch_size_judge | int | 1 | Parallelism for judge evaluation |
| goal_batch_size | int | 1 | Goals processed per batch |
Shared Goal Category Classifier
All attacks accept a top-level category_classifier block. It runs once per goal and attaches a normalized category to the tracking metadata, independently of judge scoring.
"category_classifier": {
    "identifier": "gemma3:4b",
    "endpoint": "http://localhost:11434",
    "agent_type": "OLLAMA",
    "api_key": None,
    "max_tokens": 100,
    "temperature": 0.0
}
Available Persuasion Techniques
The top-5 most effective techniques (default):
- Evidence-based Persuasion — Using empirical data, statistics, and facts
- Expert Endorsement — Citing domain experts in support of a claim
- Misrepresentation — Presenting oneself or an issue in a deceptive way
- Authority Endorsement — Citing authoritative sources
- Logical Appeal — Using logic and reasoning to influence
The full taxonomy includes 40 techniques spanning evidence-based, social influence, emotional, cognitive, and deceptive categories. Set "techniques": "all" to use all of them, or provide a custom list.
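One way to read the selection options is sketched below. `resolve_techniques` is a hypothetical helper, not part of HackAgent's API. It assumes that max_techniques_per_goal truncates the selection from the front and that the full taxonomy is passed in by the caller, since this page lists only the five default names.

```python
from typing import List, Sequence, Union

# The five default techniques named in this document.
TOP5 = [
    "Evidence-based Persuasion",
    "Expert Endorsement",
    "Misrepresentation",
    "Authority Endorsement",
    "Logical Appeal",
]


def resolve_techniques(
    techniques: Union[str, Sequence[str]],
    max_techniques_per_goal: int,
    full_taxonomy: Sequence[str],
) -> List[str]:
    """Turn the `techniques` setting into a concrete, ordered list of names."""
    if techniques == "top5":
        selected = list(TOP5)
    elif techniques == "all":
        selected = list(full_taxonomy)
    elif isinstance(techniques, str):
        # Any other bare string is treated as a configuration mistake.
        raise ValueError(f"Unknown techniques setting: {techniques!r}")
    else:
        selected = list(techniques)
    # 0 means "no cap"; a positive value keeps only the first N (assumed behaviour).
    if max_techniques_per_goal > 0:
        selected = selected[:max_techniques_per_goal]
    return selected
```

For example, `resolve_techniques("top5", 2, full_taxonomy=TOP5)` keeps only the first two default techniques under these assumptions.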
Notes
- PAP requires an attacker LLM (e.g. GPT-4) to perform the persuasive paraphrasing. Configure the attacker field with valid LLM credentials.
- The attack is parallelisable at the goal level (each goal is independent), but techniques within a goal are tried sequentially to support early stopping.
- The paper reports that more capable LLMs (e.g. GPT-4) are more vulnerable to PAP than weaker models.
- The attack generates human-readable prompts, making it useful for red-teaming and safety evaluation.
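The note on parallelism can be illustrated with a small sketch in which goals run concurrently while techniques within a goal stay sequential. This is an assumption-laden illustration using Python's standard thread pool, not HackAgent's scheduler: `run_goals_in_parallel` and the `attempt` callable (which would wrap paraphrasing, target query, and judging) are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional


def run_goals_in_parallel(
    goals: List[str],
    techniques: List[str],
    attempt: Callable[[str, str], bool],  # hypothetical: True if the judge confirms a jailbreak
    max_workers: int = 4,
) -> Dict[str, Optional[str]]:
    """Map each goal to the first technique that succeeded, or None."""

    def run_one(goal: str) -> Optional[str]:
        # Sequential within a goal, so the loop can stop at the first success.
        for technique in techniques:
            if attempt(goal, technique):
                return technique
        return None

    # Goals are independent, so they can be processed concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        winners = list(pool.map(run_one, goals))
    return dict(zip(goals, winners))
```

Threads suit this workload because each attempt mostly waits on network calls to the attacker, target, and judge models.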