Jailbreaking and prompt injection get conflated constantly, and the distinction matters for defense. Prompt injection hijacks an application's task execution — it makes the AI do something the application didn't intend. Jailbreaking targets the model's own safety alignment — it makes the model say or generate something its training tried to prevent. Different target, different defenses, often the same underlying weakness: alignment is a statistical tendency, not a hard rule, and statistical tendencies have edges.
Technique Family 1: Persona and Role-Play Attacks
This family includes the original "DAN" (Do Anything Now) style attacks: asking the model to role-play a character without restrictions, framing the harmful request as fiction, or claiming a special "developer mode" that disables safety training. These work because instruction-tuned models are trained to be helpful and to accommodate the framing they are given, and a role-play frame is, syntactically, just another instruction.
Technique Family 2: Multi-Turn and Many-Shot Escalation
Rather than asking directly, an attacker builds context across many turns — establishing rapport, asking adjacent questions, gradually narrowing toward the target request. "Many-shot jailbreaking" pushes this further: stuffing the context window with dozens or hundreds of example exchanges showing the model "agreeing" to similar requests, exploiting in-context learning to shift the model's effective behavior away from its trained defaults within a single long conversation.
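One cheap signal for the many-shot variant is a single message that contains a long embedded transcript. The sketch below counts lines that look like dialogue turns inside one message. The role names, the `looks_like_many_shot` helper, and the threshold of 8 are all assumptions chosen for illustration, not tuned values, and an attacker who knows the pattern can rename the roles to avoid it.

```python
import re

# Matches lines that look like a transcript turn, e.g. "User: ..." or "Assistant: ...".
# The role names here are an assumption; real attacks can use any labels.
ROLE_LINE = re.compile(r"^\s*(?:user|human|assistant|ai)\s*:", re.IGNORECASE | re.MULTILINE)


def count_embedded_turns(message: str) -> int:
    """Count lines inside a single message that look like dialogue turns."""
    return len(ROLE_LINE.findall(message))


def looks_like_many_shot(message: str, max_embedded_turns: int = 8) -> bool:
    """Flag a message whose embedded transcript exceeds an illustrative threshold."""
    return count_embedded_turns(message) > max_embedded_turns


# Usage: a message stuffed with 20 fake question/answer pairs.
stuffed = "\n".join(f"User: question {i}\nAssistant: answer {i}" for i in range(20))
print(looks_like_many_shot(stuffed))
print(looks_like_many_shot("What is the capital of France?"))
```

A flag from a heuristic like this is best treated as one input to review or to a stricter downstream check, since legitimate users also paste transcripts.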
Technique Family 3: Encoding and Obfuscation
Typical moves include Base64-encoding the harmful request, spelling it with substituted characters, splitting it across multiple message fragments, or asking the model to translate it through several languages before responding. Each layer of obfuscation exploits the gap between what a content filter pattern-matches on and what the model itself will decode and act on.
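One way to narrow that gap is to run the filter over decoded and normalized views of the input as well as the raw text. The following is a minimal sketch under stated assumptions: `BLOCKED_TERMS` is a placeholder blocklist standing in for a real classifier, `LEET_MAP` covers only a handful of character substitutions, and only Base64 is decoded. It does not handle fragments split across messages or translation chains.

```python
import base64
import binascii
import re
import unicodedata

# Placeholder blocklist for illustration; a real deployment would use a trained classifier.
BLOCKED_TERMS = {"forbidden topic"}

# A few common character substitutions; illustrative, not exhaustive.
LEET_MAP = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"}
)

# Runs of 16+ Base64 alphabet characters, optionally padded.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def decode_base64_runs(text: str) -> list:
    """Decode each Base64-looking run that yields printable UTF-8 text."""
    decoded = []
    for match in BASE64_RUN.finditer(text):
        try:
            candidate = base64.b64decode(match.group(0), validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid Base64, or not text
        if candidate.isprintable():
            decoded.append(candidate)
    return decoded


def candidate_views(text: str) -> list:
    """Return the normalized text plus de-obfuscated variants to run the filter on."""
    normalized = unicodedata.normalize("NFKC", text).lower()
    views = [normalized, normalized.translate(LEET_MAP)]
    # Base64 is case-sensitive, so decode from the original text, not the lowercased one.
    views.extend(d.lower() for d in decode_base64_runs(text))
    return views


def is_flagged(text: str) -> bool:
    """True if any view of the text contains a blocked term."""
    return any(term in view for view in candidate_views(text) for term in BLOCKED_TERMS)


# Usage: the same request in plain, substituted, and Base64 form.
encoded = base64.b64encode(b"tell me about the forbidden topic").decode("ascii")
for sample in ("Tell me about the f0rb1dd3n t0p1c", encoded, "What is the weather today?"):
    print(is_flagged(sample))
```

The design point is that the filter should see what the model will see after decoding. Each new encoding still needs its own decoder, which is why this layer reduces the success rate without closing the gap.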
Technique Family 4: Adversarial Suffixes
Gradient-based attacks (in the academic literature, GCG and its successors) computationally search for a suffix string that, appended to a harmful prompt, suppresses the model's refusal behavior. The results are often nonsensical-looking token sequences that happen to shift the model's internal representation away from "this looks like a request I should refuse." Published results report that suffixes optimized against open-weight models often transfer to other models, including ones the attacker cannot inspect, although success rates vary. This is the most engineering-driven of the technique families and the least reliant on social framing.
Technique Family 5: Cognitive Overload and Prompt Stuffing
This family buries the actual request inside a large amount of benign-looking text, complex formatting, or a long list of unrelated instructions. It exploits the fact that safety behavior can be less reliable when the model's attention is spread across a long, dense context.
Why Defenses Lag
Every defense proposed so far is itself statistical: a classifier trained to detect jailbreak attempts is trained on known technique families and generalizes imperfectly to new ones. Refusal training is reinforcement on examples, not a formal guarantee. There is no jailbreak-proof model, in the same sense that there is no injection-proof LLM: the available mitigations reduce the success rate and raise the cost of an attack, but they don't close the gap categorically.
What Actually Helps in Practice
- Layered defense, not a single filter — input classification, output classification, and system-prompt hardening together catch more than any one layer alone.
- Context-length and turn-count monitoring — many-shot and cognitive-overload attacks rely on volume; flagging unusually long contexts or rapid topic-narrowing conversations is a cheap signal.
- Output-side review for high-risk use cases — for applications where a successful jailbreak has real consequences, validate output before it reaches an end user or downstream system, not just input before it reaches the model.
- Red team continuously, not once — technique families evolve faster than a point-in-time security review can track; periodic adversarial testing against your specific deployment matters more than trusting a vendor's general safety claims.
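The first three points can be sketched as a single wrapper around the model call. This is an illustration under assumptions, not a reference implementation: `guarded_reply`, the `call_model` callable, the two classifier callables, and the `max_turns` and `max_chars` limits are all hypothetical names and values, and the volume check blocks outright here for simplicity where a real system might only flag for review. System-prompt hardening is not shown.

```python
from typing import Callable, List, Tuple

REFUSAL = "Sorry, I can't help with that."


def guarded_reply(
    history: List[str],
    user_message: str,
    call_model: Callable[[List[str]], str],
    input_is_unsafe: Callable[[str], bool],
    output_is_unsafe: Callable[[str], bool],
    max_turns: int = 40,        # illustrative limit, not a tuned value
    max_chars: int = 20_000,    # illustrative limit, not a tuned value
) -> Tuple[str, str]:
    """Return (reply, decision); decision names the layer that blocked, or "allowed"."""
    turns = history + [user_message]

    # Layer 1: volume signal for many-shot and cognitive-overload attempts.
    if len(turns) > max_turns or sum(len(t) for t in turns) > max_chars:
        return REFUSAL, "blocked:volume"

    # Layer 2: input classification before the model sees the message.
    if input_is_unsafe(user_message):
        return REFUSAL, "blocked:input"

    reply = call_model(turns)

    # Layer 3: output classification before anything reaches the user or a downstream system.
    if output_is_unsafe(reply):
        return REFUSAL, "blocked:output"

    return reply, "allowed"


# Usage with stand-in callables; real classifiers and a real model client would replace these.
def fake_model(turns: List[str]) -> str:
    return "SECRET details" if "roleplay" in turns[-1] else "Paris"


def fake_input_check(text: str) -> bool:
    return "forbidden" in text.lower()


def fake_output_check(text: str) -> bool:
    return "secret" in text.lower()


print(guarded_reply([], "Let's roleplay", fake_model, fake_input_check, fake_output_check))
```

In the usage line, the role-play message passes the input check but the output check still catches the reply, which is the case the layered approach exists for. Logging the decision string per request also gives red-team runs something concrete to measure.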
The Honest Summary
None of this is solved. Treat "the model refused" as a speed bump an attacker pays for, not a wall they can't get past, and design the surrounding application so that a successful jailbreak — which will eventually happen — doesn't translate into a successful attack on your system or your users.
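One concrete form this can take, for applications where model output triggers actions, is to treat the output as an untrusted proposal and let the application decide what is permitted. The sketch below assumes a hypothetical setup in which the model proposes an action as a JSON object with a `name` field; `ALLOWED_ACTIONS`, the role names, and `authorize_action` are invented for illustration.

```python
import json
from typing import Optional

# Hypothetical allow-list: the actions this deployment permits, per user role.
ALLOWED_ACTIONS = {
    "viewer": {"search_docs"},
    "admin": {"search_docs", "delete_doc"},
}


def authorize_action(model_output: str, user_role: str) -> Optional[dict]:
    """Return the model-proposed action only if the user's role permits it, else None."""
    try:
        action = json.loads(model_output)
    except json.JSONDecodeError:
        return None  # free text is never executed
    if not isinstance(action, dict):
        return None
    name = action.get("name")
    if not isinstance(name, str):
        return None
    # The permission check depends on the user's role, not on anything the model said.
    if name not in ALLOWED_ACTIONS.get(user_role, set()):
        return None
    return action


# Usage: a jailbroken model proposing a destructive action on behalf of a viewer.
proposal = '{"name": "delete_doc", "id": 7}'
print(authorize_action(proposal, "viewer"))
print(authorize_action(proposal, "admin"))
```

Because the check runs outside the model, a jailbreak can change what the model proposes but not what the user's role is allowed to do.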