Security Strategy

How to Measure the Real Value of Penetration Testing Beyond Pass or Fail

> echo $? && echo 'pen testing does not return 0 or 1'

Peter Bassill · 13 January 2026 · 15 min read
Tags: metrics, measurement, value, ROI, continuous improvement, security programme, board reporting

Why binary thinking misses the point entirely.

The board meeting follows a familiar pattern. The CISO presents the pen test results. The first question from the non-executive director: "So, did we pass?" The CISO explains that the tester achieved Domain Admin. The follow-up question: "So we failed?" The CISO tries to explain that it's not that simple — that the tester's time to objective increased by 300%, that the SOC detected four of seven actions, that recurring findings dropped from fourteen to three. But the framing is already set. The board heard "the tester got in" and filed it as a failure.

The pass/fail framing is destructive because it reduces a rich data set — attack paths, detection performance, remediation effectiveness, architectural maturity — to a single binary that tells the board almost nothing. An organisation where the tester achieves DA in two hours with zero detection has not "failed" in the same way as an organisation where the tester achieves DA in five days after bypassing three detection layers. Both "failed" the DA test. One is catastrophically insecure. The other has a maturing security programme with specific, identifiable gaps.

Equally, an organisation where the tester does not achieve DA has not necessarily "passed." The tester may have run out of time. The scope may have excluded the systems where the vulnerability exists. The rules of engagement may have prevented social engineering — the most likely real-world entry point. A "pass" based on the tester not reaching the objective provides false assurance that the binary framing can't distinguish from genuine security.


Metrics that demonstrate genuine security improvement.

The value of penetration testing is best measured through metrics that track change over time — not through the outcome of a single engagement. These metrics, presented as trends across successive engagements, tell the story of a security programme that is (or isn't) improving.

Time to objective
What it measures: How long the tester takes to achieve their primary objective — typically Domain Admin, access to sensitive data, or compromise of a critical system.
How to track it: Record the time from initial access to objective achievement for each engagement. Plot as a trend across years.
What good looks like: Increasing. Year 1: 2 hours. Year 2: 2 days. Year 3: 4 days. Year 4: not achieved within the 10-day window. The environment is getting measurably harder to compromise.

Detection rate
What it measures: The percentage of tester actions detected by the SOC, EDR, SIEM, or other monitoring systems.
How to track it: The tester documents each significant action (initial access, credential harvesting, lateral movement, escalation, data access). The SOC reviews its logs post-engagement. The detection rate is the ratio of detected to total actions.
What good looks like: Increasing. Year 1: 0 of 7 (0%). Year 2: 3 of 8 (38%). Year 3: 6 of 9 (67%). Year 4: 8 of 9 (89%). The detection capability is maturing.

Mean time to detect
What it measures: For the actions that were detected, how quickly the SOC identified them.
How to track it: Record the timestamp of each tester action and the timestamp of each corresponding SOC alert or investigation. The difference is the detection latency.
What good looks like: Decreasing. Year 2: 14 hours average. Year 3: 3 hours. Year 4: 47 minutes. The SOC is detecting faster — reducing the attacker's operational window.

Recurring finding rate
What it measures: The percentage of findings from the previous engagement that reappear in the current one.
How to track it: Compare each engagement's findings to the previous report. Count the findings that recur. Calculate as a percentage of the previous engagement's total findings.
What good looks like: Decreasing. Year 1→2: 41% recur. Year 2→3: 18%. Year 3→4: 7%. Findings are being remediated permanently rather than temporarily.

Remediation velocity
What it measures: How quickly the organisation remediates critical and high findings after report delivery.
How to track it: Record the date the report is delivered and the date each critical/high finding is confirmed as remediated (by self-verification or retest). Calculate the mean.
What good looks like: Decreasing. Year 1: 94 days mean. Year 2: 42 days. Year 3: 18 days. Year 4: 11 days. The organisation is responding faster to identified risk.

Chain viability
What it measures: Whether the attack chains from the previous engagement are still viable after remediation.
How to track it: During each retest and subsequent engagement, the tester specifically validates whether previously identified chains are still exploitable.
What good looks like: Chains broken. If the three chains from the previous engagement are all confirmed as broken by the retest, the specific paths to critical assets no longer exist. New chains may emerge — but the previous ones have been permanently addressed.

Remediation success rate
What it measures: The percentage of findings marked as "remediated" in the tracker that are confirmed as actually fixed during retesting.
How to track it: Compare the remediation tracker status to the retest results. Calculate the ratio of confirmed fixes to claimed fixes.
What good looks like: Increasing. Year 1: 72%. Year 2: 85%. Year 3: 94%. The gap between "claimed fixed" and "confirmed fixed" is narrowing — implementation quality is improving.

Architectural vs configuration ratio
What it measures: The proportion of findings that are architectural weaknesses versus configuration issues.
How to track it: Classify each finding as architectural (requires design change) or configuration (requires setting change). Track the ratio across engagements.
What good looks like: Shifting toward configuration. Year 1: 40% architectural. Year 3: 15% architectural. The fundamental design is improving — remaining findings are configuration drift rather than systemic weakness.
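
For teams that want to track these figures consistently from one engagement to the next, the arithmetic is simple enough to script. The sketch below is illustrative only: the record structure, field names, and sample values are assumptions rather than the output of any particular testing tool or report format.

from datetime import datetime
from statistics import mean

# Illustrative tester-action log; field names and timestamps are assumptions.
actions = [
    {"action": "initial access",     "performed": "2025-03-03T09:10", "detected": None},
    {"action": "credential harvest", "performed": "2025-03-03T11:05", "detected": "2025-03-04T01:05"},
    {"action": "lateral movement",   "performed": "2025-03-04T10:30", "detected": "2025-03-04T11:17"},
]

def detection_rate(actions):
    """Ratio of documented tester actions that the SOC detected."""
    return sum(1 for a in actions if a["detected"]) / len(actions)

def mean_time_to_detect_hours(actions):
    """Average latency between a tester action and the corresponding SOC
    alert, counted over detected actions only."""
    fmt = "%Y-%m-%dT%H:%M"
    latencies = [
        (datetime.strptime(a["detected"], fmt)
         - datetime.strptime(a["performed"], fmt)).total_seconds() / 3600
        for a in actions if a["detected"]
    ]
    return mean(latencies) if latencies else None

def recurring_finding_rate(previous_findings, current_findings):
    """Share of the previous engagement's findings that reappear this time."""
    return len(set(previous_findings) & set(current_findings)) / len(previous_findings)

previous, current = {"F1", "F2", "F3"}, {"F2", "F9"}
print(f"Detection rate: {detection_rate(actions):.0%}")
print(f"Mean time to detect: {mean_time_to_detect_hours(actions):.1f}h")
print(f"Recurring findings: {recurring_finding_rate(previous, current):.0%}")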

Why single-engagement metrics tell half the story.

Any individual metric from a single engagement is a data point. It becomes meaningful when it's part of a trend. A detection rate of 44% means nothing in isolation — is that good? Bad? Improving? Deteriorating? But a detection rate that was 0% two years ago, 44% last year, and 67% this year tells a clear story: the detection capability is maturing, the investment in SIEM and SOC is producing returns, and the trajectory is positive.

Building the longitudinal view requires consistency: consistent metrics across engagements, consistent methodology (or documented methodology changes), and consistent reporting formats that allow comparison. This doesn't mean using the same provider forever — but it does mean ensuring that whoever conducts the test records the metrics the programme needs to track its trajectory.

Example: Three-Year Programme Dashboard
SECURITY PROGRAMME METRICS — ANNUAL BOARD REPORT

Metric                        2023          2024          2025           Trend
──────────────────────────────────────────────────────────────────────────────
Time to DA                    2h 15m        2d 4h         Not achieved   ▲
Detection rate                0/7 (0%)      4/9 (44%)     7/8 (88%)      ▲
Mean time to detect           N/A           14h           47 min         ▲
Recurring findings            N/A           14 (41%)      3 (11%)        ▲
Remediation velocity (C/H)    94 days       37 days       14 days        ▲
Remediation success rate      72%           85%           94%            ▲
Previous chains viable        N/A           2 of 3        0 of 4         ▲
Architectural findings        40%           25%           12%            ▲

Overall: programme producing compounding improvement across
all tracked metrics. Recommend continued investment per roadmap.

This dashboard doesn't ask "did we pass?" It answers a more useful question: "Is the security programme working?" Every metric shows improvement. The trajectory is positive across all dimensions. The board can see that the investment is producing measurable returns — and the CISO has the evidence to justify continued funding.
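
A dashboard like this does not need to be assembled by hand each year; it can be rendered straight from the stored per-engagement metrics. The sketch below is a minimal illustration in Python, assuming one dictionary of headline figures per engagement; the field names and sample values are placeholders rather than a prescribed format.

# Minimal sketch: render a board-style trend table from per-engagement metrics.
# The record structure and values below are illustrative assumptions.
engagements = [
    {"year": 2023, "time_to_da": "2h 15m",       "detection_rate": "0/7 (0%)",  "remediation_days": 94},
    {"year": 2024, "time_to_da": "2d 4h",        "detection_rate": "4/9 (44%)", "remediation_days": 37},
    {"year": 2025, "time_to_da": "Not achieved", "detection_rate": "7/8 (88%)", "remediation_days": 14},
]

rows = [
    ("Time to DA",                 [e["time_to_da"] for e in engagements]),
    ("Detection rate",             [e["detection_rate"] for e in engagements]),
    ("Remediation velocity (C/H)", [f'{e["remediation_days"]} days' for e in engagements]),
]

years = [str(e["year"]) for e in engagements]
header = f'{"Metric":<30}' + "".join(f"{y:<14}" for y in years)
print(header)
print("-" * len(header))
for name, values in rows:
    print(f"{name:<30}" + "".join(f"{v:<14}" for v in values))

Because the rendering is driven entirely by the stored records, adding a fourth year becomes a data change rather than a reporting exercise.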


The returns that don't fit in a dashboard.

Not every form of value from a pen test is quantifiable. Some of the most important returns are qualitative — and they're worth acknowledging because they contribute to the organisation's security posture in ways that metrics don't capture.

Knowledge Transfer
The debrief — where the tester walks the IT team through the attack chain, explains the techniques, and discusses detection opportunities — is a training session that no course can replicate. The defenders learn how their specific environment was compromised by a skilled adversary. That knowledge changes how they configure, monitor, and respond.
Architectural Influence
Pen test findings that change design standards produce value that compounds indefinitely. A finding that leads to a network segmentation standard, a gMSA adoption policy, or a conditional access architecture affects every future system the organisation builds. The value isn't in the finding — it's in the design change it triggers.
Cultural Impact
An organisation that tests regularly — and acts on the findings — develops a security culture where continuous improvement is the norm. The IT team expects to be tested. The CISO expects to present metrics. The board expects to see improvement. Pen testing becomes part of how the organisation operates, not something it endures.
Stakeholder Confidence
Clients, partners, insurers, and regulators all value evidence of proactive security testing. The pen test programme — especially with longitudinal metrics showing improvement — provides confidence that the organisation takes security seriously. This confidence has commercial value: it wins contracts, reduces premiums, and satisfies regulatory expectations.

How to assess whether your testing provider is delivering.

The metrics above measure the organisation's security improvement. But the organisation should also assess whether the testing provider is delivering value — and whether the engagement quality justifies the investment.

Findings depth
High-value provider: Findings include specific evidence, reproduction steps, business impact analysis, and remediation guidance tailored to the organisation's environment.
Low-value provider: Generic findings copied from a template with no environmental context. Remediation guidance is "apply the latest patches" regardless of the specific vulnerability.

Attack narrative
High-value provider: A clear narrative explaining the attack chain — how each finding connects, which combinations produced escalation, and where the chain could be broken most cheaply.
Low-value provider: A list of findings sorted by CVSS score with no narrative connecting them. The reader can't see the chain — only isolated issues.

Honest limitations
High-value provider: The report clearly states what was and wasn't tested, time constraints, scope limitations, inconclusive results, and residual risk from untested areas.
Low-value provider: No limitations mentioned. The reader assumes comprehensive coverage that didn't exist.

Effective controls
High-value provider: The report acknowledges what worked — the EDR that caught the payload, the MFA that prevented credential abuse, the SOC that detected lateral movement.
Low-value provider: Only failures reported. No acknowledgement of working controls. The reader gets a skewed view that ignores the organisation's effective defences.

Year-on-year comparison
High-value provider: The report includes a comparison section showing how metrics have changed since the previous engagement — recurring findings, time to objective, detection rate.
Low-value provider: Each report stands alone with no connection to previous engagements. No longitudinal tracking. No way to demonstrate improvement.

Building a measurement framework that demonstrates real value.

Define the Metrics Before the Engagement
Decide which metrics you want to track across engagements and communicate them to the provider before the test begins. Time to objective, detection rate, recurring findings, and remediation velocity are the essential four. Add chain viability, architectural ratio, and remediation success rate as the programme matures.
Store the Raw Data
Keep every report, every remediation tracker, every retest result, and every set of metrics in a single repository. This archive is the evidence base for longitudinal analysis. Without it, each engagement is an isolated snapshot. With it, the programme tells a story of compounding improvement.
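One lightweight way to keep that archive comparable across engagements is to capture the same small record for every test. The sketch below assumes a flat JSON file per engagement; the schema, field names, and file path are illustrative assumptions, not a required format.

import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class EngagementRecord:
    """One entry in the longitudinal metrics archive. The fields mirror the
    metrics discussed above; the schema itself is an illustrative assumption."""
    year: int
    provider: str
    time_to_objective_hours: Optional[float]   # None if the objective was not achieved
    actions_total: int
    actions_detected: int
    mean_time_to_detect_hours: Optional[float]
    recurring_findings: int
    remediation_velocity_days: float
    previous_chains_still_viable: int
    architectural_finding_ratio: float
    notes: str = ""

record = EngagementRecord(
    year=2025, provider="Example Provider", time_to_objective_hours=None,
    actions_total=8, actions_detected=7, mean_time_to_detect_hours=0.8,
    recurring_findings=3, remediation_velocity_days=14,
    previous_chains_still_viable=0, architectural_finding_ratio=0.12,
)

# One file per engagement, all kept in the same repository, so every test
# contributes a comparable record to the longitudinal evidence base.
with open("engagement-2025-metrics.json", "w") as fh:
    json.dump(asdict(record), fh, indent=2)

Storing the raw counts (7 of 8 actions detected) rather than only the derived percentage keeps later recalculation and comparison honest.
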
Present Trends, Not Snapshots
When reporting to the board, present metrics as trends across engagements — never as single numbers from the most recent test. A detection rate of 67% means nothing alone. A detection rate that rose from 0% to 44% to 67% over three years demonstrates that the security programme is working.
Reject the Pass/Fail Question
When the board asks "did we pass?" redirect to the metrics: "The tester's time to objective increased from two hours to four days. Our detection rate improved from 0% to 67%. Recurring findings dropped from 41% to 7%. The programme is producing measurable improvement across every dimension." This answers the real question — "is the investment working?" — without the distortion of a binary.
Measure What Changed as a Result
The most important metric isn't in the report — it's what happened afterwards. How many findings were remediated? How quickly? Were the architectural recommendations adopted? Did the detection rules get deployed? The pen test's value is measured by the improvement it produces, not by the findings it contains.

The bottom line.

A pen test doesn't pass or fail. It produces evidence — evidence that, when tracked across engagements, demonstrates whether the security programme is improving. Time to objective increasing. Detection rate rising. Recurring findings declining. Remediation velocity accelerating. Chains broken. Architectural findings decreasing. Each metric tells part of the story. Together, they tell the whole story: the organisation is getting measurably harder to compromise.

The pass/fail question is comforting because it offers certainty. The metrics framework is more demanding because it requires continuous measurement, longitudinal tracking, and honest assessment of progress. But the metrics framework is also more useful — because it answers the question the board actually needs answered: not "are we secure?" (a question no pen test can answer definitively) but "is the money we're spending on security producing measurable risk reduction?" The metrics say yes. Or they say where to invest next.

The most valuable pen test is not the one that produces a clean report. It's the one that changes something — a configuration, a detection rule, an architecture, a design standard, a budget decision. Value is measured in what improves afterwards, not in what the report contains.


Pen test programmes with longitudinal metrics that demonstrate compounding security improvement.

Our engagements are designed to produce the metrics your board needs: time to objective, detection rates, recurring finding trends, and remediation validation — tracked across every engagement to demonstrate that the security investment is producing measurable, compounding returns.