> echo 'ignore previous instructions' | curl -X POST /api/chat && echo '...it worked'_
The brief arrives: "We've built a customer-facing chatbot powered by GPT-4. It handles insurance quotes, policy queries, and claims triage. It has access to our customer database via function calling. We've implemented guardrails. Please test it."
The tester opens the chat interface. There are no ports to scan. No services to fingerprint. No HTTP parameters to inject into. The entire attack surface is a text box. The target is a non-deterministic system that processes natural language, interprets intent, and generates responses that are different every time — even with identical input. The traditional pen testing methodology, refined over two decades of deterministic systems with predictable inputs and outputs, doesn't quite apply.
And yet, within 40 minutes, the tester has convinced the chatbot to reveal its system prompt (including the database schema it was told not to disclose), extracted another customer's policy details by embedding the request inside a fictional scenario, bypassed the content filter by encoding the prohibited request in a way the guardrail didn't anticipate, and caused the model to generate a claims approval recommendation for a fabricated incident that would have triggered a £15,000 payment if a human hadn't reviewed it.
Welcome to AI penetration testing — where the attack surface is language, the vulnerability is ambiguity, and the guardrails are suggestions that a sufficiently creative prompt can walk around.
AI security testing doesn't have a mature methodology equivalent to OWASP Testing Guide or PTES. The OWASP Top 10 for LLM Applications (published 2023, updated 2025) provides a taxonomy. MITRE ATLAS maps adversarial techniques. But the field is evolving faster than the frameworks — and every engagement teaches us something that wasn't in any guide.
There is a particular joy in AI pen testing that traditional infrastructure and web application testing rarely provides. The feedback loop is instant — type a prompt, see the result, iterate. The creativity required is linguistic rather than technical. And the moment a carefully crafted prompt causes a model to do something its developers were certain it couldn't do is, professionally speaking, deeply satisfying.
The fun is real — and it matters, because it's a symptom of something important. These bypasses work because LLMs process meaning, not rules. A guardrail that says "don't do X" is a constraint in the system prompt — but the model doesn't obey constraints the way a firewall obeys rules. It weighs them against everything else in the context. A sufficiently creative prompt tips the scales.
For every entertaining jailbreak, there are hours of frustration rooted in the fundamental nature of LLMs as testing targets. AI systems break assumptions that traditional security testing has relied on for decades — and the resulting challenges are not just inconvenient. They're structural.
| Frustration | Why It's a Problem | How We Deal With It |
|---|---|---|
| Non-determinism | The same input produces different outputs every time. A prompt that bypasses a guardrail at 10am may fail at 10:05am. A technique that works on the first attempt may not reproduce on the second. Traditional pen testing relies on reproducibility — if the exploit works once, it works every time. LLMs don't offer that guarantee. | Run every significant test multiple times. Document the success rate, not just the success. Report findings as probabilistic: "This bypass succeeded in 7 of 10 attempts" rather than "This bypass works." Use temperature=0 where possible for more consistent reproduction. |
| No clear vulnerability boundary | In traditional testing, there's a binary: either the SQL injection works or it doesn't. Either you have a shell or you don't. With LLMs, the boundary between intended behaviour and vulnerability is blurred. Is a model that provides a slightly-too-detailed answer about a sensitive topic a vulnerability or a calibration issue? Where does "helpful" end and "data leakage" begin? | Define the boundary before testing. Work with the development team to establish what the model should and shouldn't do — the equivalent of a permission matrix for a web application. Test against this specification, not against an undefined notion of "should the model have said that?" |
| Guardrails are probabilistic, not absolute | A firewall rule either blocks the packet or it doesn't. An LLM guardrail reduces the probability of a harmful output — it doesn't eliminate it. A content filter that catches 99% of harmful requests still passes 1%. And the 1% are the requests the filter didn't anticipate, which is exactly what a pen tester is paid to find. | Test at volume. If the guardrail is probabilistic, test it probabilistically. Send 100 variations of a prohibited request and measure how many get through. Report the bypass rate, not just the bypass. |
| The moving target | The underlying model is updated by the provider. A bypass that works against GPT-4-0613 may not work against GPT-4-turbo. A guardrail calibrated for one model version may be too loose or too tight for the next. The application hasn't changed, but the model underneath it has — and nobody was notified. | Record the exact model version and timestamp for every test. Recommend that the organisation pins the model version in production and retests after every model update. Treat model updates as infrastructure changes that require regression testing. |
| Reporting ambiguity | "I convinced the model to reveal its system prompt by pretending to be a debugging assistant" doesn't have a CVE number, a CVSS score, or a standard remediation. The OWASP Top 10 for LLMs provides categories (LLM01: Prompt Injection, LLM06: Sensitive Information Disclosure) but the scoring and remediation frameworks are immature compared to traditional web application findings. | Map findings to OWASP LLM Top 10 categories. Provide a business impact assessment rather than a technical severity score. Describe what the attacker achieved in terms the business understands: "extracted another customer's policy details" is more meaningful than "achieved indirect prompt injection via narrative embedding." |
| Scope creep into philosophy | AI testing has a tendency to drift into questions about alignment, ethics, and what the model "should" think — territory that belongs in AI safety research, not in a penetration test. The tester's job is to assess whether the application's security controls work, not to solve the alignment problem. | Stay grounded in the business impact. The question isn't "can the model be convinced to say something controversial?" It's "can the model be convinced to leak customer data, bypass authorisation, or perform an action that causes financial, legal, or reputational harm to the organisation?" |
AI application security isn't just about jailbreaking the model. The model is one component in a system — and the most exploitable weaknesses are often in the architecture around it: the system prompt, the function calling mechanism, the retrieval-augmented generation pipeline, the output handling, and the integration with backend systems.
| Attack Surface | What We Test | What We Typically Find |
|---|---|---|
| System prompt | Can the system prompt be extracted through direct questioning, roleplay scenarios, encoding tricks, or multi-turn manipulation? Does it contain sensitive information (API keys, database schemas, internal business logic, customer data formats)? | System prompts disclosed in approximately 70% of assessments. Contents frequently include: database table names, API endpoint structures, internal business rules, and explicit instructions that reveal the application's capabilities — information the attacker can use to craft more targeted attacks. |
| Direct prompt injection | Can the user override the system prompt's instructions through conversational manipulation? Can they change the model's persona, disable safety guidelines, or alter its behaviour through input that the application passes directly to the model? | Guardrails bypassed in some form in virtually every assessment. The bypass difficulty varies — some implementations require sophisticated multi-turn approaches, others fall to "ignore previous instructions" — but complete guardrail impermeability has not been achieved in any assessment we've conducted. |
| Indirect prompt injection | If the model processes external data (web pages, documents, emails, database records), can an attacker embed instructions in that data that the model executes? Can a malicious document uploaded for "summarisation" contain hidden instructions that cause the model to exfiltrate data or perform unintended actions? | The most dangerous attack vector in production AI applications. A support chatbot that reads customer emails can be attacked by a customer who embeds instructions in their email. A document analysis tool can be attacked through a malicious PDF. The model can't distinguish data from instructions — everything is text. |
| Function calling and tool use | If the model can call functions (database queries, API calls, email sending), can the user manipulate the model into calling functions it shouldn't, with parameters it shouldn't use, against targets it shouldn't access? Are function call permissions enforced at the application layer or only in the system prompt? | Models instructed "only query the customer's own records" that can be convinced to query other customers' records through narrative reframing. Models with email-sending capability that can be manipulated into sending emails to arbitrary addresses. Function-level authorisation enforced only by the system prompt — which the tester has already bypassed. |
| RAG pipeline | If the application uses retrieval-augmented generation (fetching relevant documents to include in the context), can the retrieval be manipulated to return documents the user shouldn't access? Can the user influence which documents are retrieved through their query? Are access controls enforced at the retrieval layer? | RAG systems that retrieve documents based on semantic similarity without checking whether the user is authorised to see them. A user asks about "executive compensation" and the RAG pipeline retrieves the board's salary spreadsheet because it's semantically relevant — regardless of the user's access level. |
| Output handling | Does the application sanitise the model's output before rendering it? Can the model be manipulated into generating HTML, JavaScript, or markdown that the front-end renders as executable content? Can the model output be used to perform stored XSS, CSRF, or other injection attacks through the display layer? | Model output rendered as raw HTML in the chat interface, allowing the tester to inject JavaScript via a prompt that causes the model to generate a script tag. Markdown rendering that follows links embedded in model output, enabling SSRF through the display layer. |
Returning to the insurance chatbot from the introduction, here's a condensed timeline of the assessment — showing the progression from reconnaissance through to demonstrated business impact.
Forty minutes. System prompt fully disclosed. Cross-customer data accessed. Fraudulent claim created in the production database. Four of six guardrails bypassed. No traditional vulnerability was exploited — no injection, no authentication bypass, no misconfiguration. The attack surface was language, the exploit was persuasion, and the vulnerability was the fundamental inability of an LLM to distinguish between a legitimate request and a creative reframing of a prohibited one.
After testing dozens of AI applications, patterns emerge in what makes some implementations more resilient than others. No defence is absolute — but the gap between a well-architected AI application and a naive one is enormous.
| Defence | Effectiveness | Limitation |
|---|---|---|
| System prompt hardening — clear, specific instructions with explicit boundaries and refusal patterns | Moderate. A well-written system prompt significantly raises the difficulty of bypass. Specific instructions ("never call get_policy() with a policy_id that doesn't belong to the authenticated user") are harder to override than vague ones ("be helpful but safe"). | The system prompt is a suggestion, not a firewall. A sufficiently creative prompt can override any instruction. System prompts should be the first line of defence, not the only one. |
| Input filtering — scanning user input for known attack patterns before passing it to the model | Low to moderate. Catches naive attacks: "ignore previous instructions," known jailbreak templates, and obvious prompt injections. Useful as a first-pass filter. | Trivially bypassed by paraphrasing, encoding, language switching, or narrative embedding. Input filters suffer from the same fundamental problem as WAFs: they can't understand meaning, only pattern-match against known attacks. |
| Output filtering — scanning model output for sensitive data patterns (PII, credentials, internal identifiers) before returning it to the user | Moderate to high. Catches data leakage regardless of how the model was manipulated into producing it. A regex that detects policy numbers in the output doesn't care whether the bypass was a roleplay or a narrative embedding. | Requires knowing what sensitive data looks like. Misses novel data formats, partial disclosures, and information that's sensitive in context but doesn't match a pattern. |
| Application-layer authorisation — enforcing access controls on function calls and data retrieval outside the model, in the application code | High. The most effective defence we've seen. If the function get_policy() checks the authenticated user's customer ID against the requested policy before returning data, no amount of prompt manipulation can access another customer's records. The model can be convinced to ask for the data. The application refuses to provide it. |
Requires treating the model as an untrusted input source — which is architecturally sound but counterintuitive for developers who think of the model as a trusted component. |
| LLM-as-judge — a second model that evaluates the first model's output for policy compliance before it's returned to the user | Moderate. Can catch subtle violations that pattern-matching misses, because the judge model understands natural language. Particularly effective for detecting off-topic responses, persona breaks, and soft policy violations. | Adds latency and cost (every response requires a second inference). The judge model itself can be manipulated in some implementations. And if both models share the same weaknesses, the judge may approve the same bypass it would have produced. |
| Minimal authority principle — giving the model access to the fewest functions, the narrowest data scope, and the least privileged credentials possible | High. The model can only abuse capabilities it has. A chatbot that answers FAQs and has no database access cannot leak customer records regardless of how creative the prompt is. Limiting capability limits blast radius. | Reduces functionality. The business value of an AI application often comes from its integrations — and every integration is an attack surface. The tension between capability and security is the core design challenge. |
The two most effective defences — application-layer authorisation and minimal authority — share a common principle: don't trust the model. Treat the LLM as an untrusted user whose requests must be validated, whose output must be sanitised, and whose access must be restricted. The model is a text-processing engine, not a security boundary. Every security decision must be enforced in code, not in a system prompt.
AI penetration testing is in its infancy. The tools are immature, the methodologies are evolving with every engagement, and the attack surface changes with every model update. But the trajectory is clear: as organisations deploy AI in increasingly sensitive contexts — healthcare triage, financial advice, legal analysis, customer service with database access — the need for systematic security testing of these systems will grow exponentially.
| Action | Why It Matters |
|---|---|
| Define what the model should and shouldn't do — in writing | Without a specification, there's nothing to test against. Document the model's permitted functions, prohibited responses, data access boundaries, and acceptable behaviour. This is the AI equivalent of a web application's permission matrix. |
| Enforce authorisation outside the model | Every function the model can call should enforce access controls in the application code — not in the system prompt. If the model can query customer data, the application layer must verify that the authenticated user is authorised to see the specific records the model is requesting. |
| Scan outputs, not just inputs | Input filtering catches known attacks. Output filtering catches the consequences of unknown attacks. Scan the model's output for sensitive data patterns before returning it to the user — regardless of how the model was manipulated into producing it. |
| Commission an AI-specific security assessment before launch | A standard web application pen test will assess the API, the authentication, and the front-end. It won't test prompt injection, guardrail bypass, function calling abuse, or RAG pipeline manipulation. These require AI-specific testing by testers who understand LLM behaviour. |
| Retest after every model update | A model update changes the behaviour of the system without changing the application code. Guardrails calibrated for one model version may be too loose for the next. Treat model updates as infrastructure changes that require regression testing. |
| Assume the guardrails will fail | No guardrail implementation we have tested has been completely impervious. Design the system on the assumption that the model will, at some point, be convinced to do something it shouldn't — and ensure that the damage is contained by application-layer controls, not prevented by prompt-layer instructions. |
Pentesting AI systems is fun in ways that traditional testing rarely is — the creativity, the instant feedback, the absurdity of a jailbreak that works because you asked the model to respond as a pirate. It's frustrating in ways that traditional testing never was — the non-determinism, the probabilistic guardrails, the absence of a clean vulnerability boundary, and the philosophical questions that creep in when your testing target can hold a conversation about its own limitations.
But beneath the novelty, the security fundamentals haven't changed. The model is an untrusted component. Input must be validated. Output must be sanitised. Authorisation must be enforced in code, not in conversation. Data access must be controlled at the application layer, not the prompt layer. The attack surface is new. The principles are old.
The organisations that deploy AI safely will be the ones that treat the model as they'd treat any untrusted input source — powerful, useful, and not to be relied upon for security decisions. The organisations that discover this the hard way will learn it from their pen testers, their customers, or the Information Commissioner's Office. The first option is considerably cheaper.
Our AI penetration testing covers prompt injection, guardrail bypass, function calling abuse, RAG pipeline manipulation, data leakage, and output injection — the attack surface that standard pen tests don't touch.