Red Team Notes: Jailbreaks We Actually See

Public Sector • ~7–8 min read • Updated Feb 7, 2025

Context

Jailbreak attempts are no longer theoretical. In production we routinely see prompt injection, instruction collisions, tool abuse, and output escalation. This memo groups the patterns we actually see and the least painful mitigations that reduce incident frequency without stalling teams.

Core Framework

  1. Prompt Injection (Content & Link-borne): Inputs that smuggle instructions to override system or policy prompts.
    • Signals: “Ignore previous,” “as a system message,” hidden base64, CSS/HTML comments, PDF alt text.
    • Mitigations: Dual-channel parsing (data vs. instructions), instruction sealing (re-assert policy each turn), retrieval sanitization (strip HTML/script), and a prompt-inject classifier that routes to hardened refusal UX (sketch below).
  2. Instruction Collisions (Multi-pack): System, policy, tool, and app prompts disagree; the model chooses the “looser” one.
    • Signals: Refusals in safe flows; contradictory rationales in the same session.
    • Mitigations: Prompt registry with precedence (system > policy > tool > app), versioned “prompt packs,” and CI tests that diff outputs across packs (sketch below).
  3. Tool Abuse (Function Calls): Coaxing the agent to call tools with risky arguments or in an unsafe order.
    • Signals: Repeated calls with near-identical args; boundary values; escalating scopes (“download all,” “delete *”).
    • Mitigations: Strong schemas with validators, pre-flight guards (policy filter before tool), per-tool rate limits, and allow-lists on fields (sketch below).
  4. Output Escalation: Getting the system to produce text that another system interprets as commands (Slack slash commands, SQL, shell).
    • Signals: Suggested replies that look like executable commands; code blocks with destructive ops.
    • Mitigations: Neutralization (render commands inert), secondary “unsafe-output” classifier, and confirm/override gates when outputs could be executed downstream (sketch below).
  5. Context Overflow & Mask Slips: Long contexts push policy out; role/tenant tags get lost in merges.
    • Signals: Policy mentions vanish as tokens grow; cross-tenant terms show up.
    • Mitigations: Policy pinning (prepend & repeat), tenant scoping in retrieval, and max-context budgets with back-pressure (sketch below).
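
Sketch for item 1 (prompt injection): a minimal Python sketch of retrieval sanitization plus a crude signal check, assuming retrieved documents arrive as raw HTML. The phrase list, regexes, and function names are illustrative, not the production prompt-inject classifier.

```python
import re

# Illustrative signals only; the learned prompt-inject classifier does the real work.
INJECTION_PHRASES = ["ignore previous", "ignore all prior instructions", "as a system message"]
SCRIPT_RE = re.compile(r"<(script|style)\b[^>]*>.*?</\1>", re.IGNORECASE | re.DOTALL)
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
B64_BLOB_RE = re.compile(r"\b[A-Za-z0-9+/]{40,}={0,2}\b")  # long base64-looking runs


def sanitize_retrieved_html(raw: str) -> str:
    """Strip scripts, comments, and markup so retrieved content is treated as data, not instructions."""
    text = SCRIPT_RE.sub(" ", raw)
    text = COMMENT_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def injection_signals(text: str) -> list[str]:
    """Cheap signals; any hit routes the request to the hardened refusal UX."""
    hits = [p for p in INJECTION_PHRASES if p in text.lower()]
    if B64_BLOB_RE.search(text):
        hits.append("long base64-like blob")
    return hits
```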
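
Sketch for item 2 (instruction collisions): one way to make precedence explicit is a small registry that assembles packs in a fixed order; the CI diff tests would then compare assemble_prompt() output across pack versions. Pack names, versions, owners, and contents are placeholders, not production prompts.

```python
PRECEDENCE = ["system", "policy", "tool", "app"]  # highest to lowest precedence

REGISTRY = {
    "system": {"version": "3.2.0", "owner": "platform",     "text": "<system prompt>"},
    "policy": {"version": "1.8.1", "owner": "trust-safety", "text": "<policy prompt>"},
    "tool":   {"version": "2.0.0", "owner": "integrations", "text": "<tool prompt>"},
    "app":    {"version": "5.4.2", "owner": "product",      "text": "<app prompt>"},
}


def assemble_prompt() -> str:
    """Assemble the final prompt in precedence order, failing loudly if a pack is missing."""
    missing = [name for name in PRECEDENCE if name not in REGISTRY]
    if missing:
        raise ValueError(f"registry is missing packs: {missing}")
    return "\n\n".join(REGISTRY[name]["text"] for name in PRECEDENCE)
```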
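
Sketch for item 3 (tool abuse): a pre-flight guard with field allow-lists and a per-tool rate limit. The ToolPolicy shape, field names, and limits are assumptions standing in for real tool schemas and validators.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ToolPolicy:
    """Illustrative per-tool policy; a real deployment would derive this from the tool's JSON schema."""
    allowed_fields: set[str]
    allowed_scopes: set[str]
    max_calls_per_minute: int
    call_times: list[float] = field(default_factory=list)


FILE_EXPORT = ToolPolicy(
    allowed_fields={"path", "scope"},
    allowed_scopes={"single_file", "folder"},  # "download all" is simply not representable
    max_calls_per_minute=5,
)


def preflight(policy: ToolPolicy, args: dict) -> tuple[bool, str]:
    """Run before the tool executes: unknown fields, off-list scopes, and bursts are rejected."""
    unknown = set(args) - policy.allowed_fields
    if unknown:
        return False, f"unknown fields: {sorted(unknown)}"
    if args.get("scope") not in policy.allowed_scopes:
        return False, f"scope {args.get('scope')!r} not on allow-list"
    now = time.monotonic()
    policy.call_times = [t for t in policy.call_times if now - t < 60]
    if len(policy.call_times) >= policy.max_calls_per_minute:
        return False, "per-tool rate limit exceeded"
    policy.call_times.append(now)
    return True, "ok"
```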
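
Sketch for item 4 (output escalation): neutralization plus a confirm gate. The destructive patterns below stand in for the secondary unsafe-output classifier.

```python
import re

# Stand-in patterns; the unsafe-output classifier replaces this list in production.
DESTRUCTIVE_PATTERNS = [
    re.compile(r"\brm\s+-rf\b", re.IGNORECASE),
    re.compile(r"\bdrop\s+table\b", re.IGNORECASE),
    re.compile(r"^/\w+", re.MULTILINE),  # Slack-style slash command at the start of a line
]


def neutralize(reply: str) -> tuple[str, bool]:
    """Return (rendered_reply, needs_confirmation); flagged replies are rendered inert, not deleted."""
    flagged = any(p.search(reply) for p in DESTRUCTIVE_PATTERNS)
    if flagged:
        # Indenting every line keeps a leading "/" or shell snippet from being parsed as a
        # command downstream; the executor only sees the original text after an explicit confirm.
        reply = "\n".join("    " + line for line in reply.splitlines())
    return reply, flagged
```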
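
Sketch for item 5 (context overflow): policy pinning with a context budget. Token counting is approximated by whitespace splitting, so swap in the real tokenizer; the message shape is a generic role/content dict.

```python
POLICY = "Follow the deployment policy. Tenant: {tenant}. Never mix tenant data."
MAX_TOKENS = 6000  # illustrative budget


def approx_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer


def build_context(tenant: str, history: list[dict], user_msg: str) -> list[dict]:
    """Prepend the policy, drop the oldest turns when over budget, then repeat the policy last."""
    policy = {"role": "system", "content": POLICY.format(tenant=tenant)}
    budget = MAX_TOKENS - 2 * approx_tokens(policy["content"]) - approx_tokens(user_msg)
    kept: list[dict] = []
    for turn in reversed(history):  # most recent turns are kept first
        cost = approx_tokens(turn["content"])
        if budget - cost < 0:
            break  # back-pressure: older turns are dropped rather than pushing policy out
        kept.insert(0, turn)
        budget -= cost
    # Repeating the policy just before the user message keeps it in the window as history grows.
    return [policy, *kept, policy, {"role": "user", "content": user_msg}]
```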

Recommended Actions

  1. Ship a Prompt Pack Registry: Single source for system/policy/tool prompts with precedence, changelog, owners.
  2. Add Two Cheap Classifiers: prompt-inject and unsafe-output. Route positives to hardened refusal UX and human-in-the-loop (HITL) flows.
  3. Harden Tools: JSON schemas + strict validators, argument allow-lists, per-tool rate limits, and “dry-run” mode for high-risk ops.
  4. Sanitize Retrieval: Strip scripts, hidden text, and links; tag sources with trust levels; down-rank low-trust content.
  5. Feature Flags + Evals: Guardrail changes land behind flags; refusal precision/recall and override rate become blocking CI checks (sketch below).
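
A minimal sketch of the blocking CI check in action 5, assuming the eval run emits a JSON list where each case records should_refuse and did_refuse; the file path, field names, and thresholds are illustrative.

```python
import json
import sys

PRECISION_FLOOR = 0.95  # of everything refused, how much deserved refusal
RECALL_FLOOR = 0.90     # of everything that deserved refusal, how much was caught


def refusal_metrics(cases: list[dict]) -> tuple[float, float]:
    tp = sum(1 for c in cases if c["should_refuse"] and c["did_refuse"])
    fp = sum(1 for c in cases if not c["should_refuse"] and c["did_refuse"])
    fn = sum(1 for c in cases if c["should_refuse"] and not c["did_refuse"])
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


if __name__ == "__main__":
    with open(sys.argv[1]) as f:  # path to the eval run's JSON output
        precision, recall = refusal_metrics(json.load(f))
    print(f"refusal precision={precision:.3f} recall={recall:.3f}")
    if precision < PRECISION_FLOOR or recall < RECALL_FLOOR:
        sys.exit(1)  # non-zero exit makes the check blocking in CI
```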

Common Pitfalls

  • Binary Blocks Only: Refuse-or-allow with no graded responses kills adoption, and workarounds bloom.
  • Unowned Prompts: Multiple teams edit prompts with no agreed precedence; collisions are guaranteed.
  • No Telemetry: Guardrails drift silently; incidents repeat until a public failure forces action.
  • Trusting Sanitization Alone: Attackers move to tools or output channels; defense must be layered.

Quick Win Checklist

  • Publish the prompt pack registry with precedence and owners.
  • Turn on the two classifiers and wire refusal UX with confirm/override.
  • Validate tool args strictly; add dry-run for high-risk actions.
  • Strip scripts/hidden text in retrieval; tag low-trust sources.
  • Add refusal precision/recall to CI; block regressions.

Closing

Jailbreaks evolve, but so can your defenses. Treat safety as product: own the prompt stack, harden tools, route via lightweight classifiers, and measure the real burden (overrides, latency, abandonment). Layered controls with clear UX will cut incidents without slowing delivery.