> ls evidence/ | wc -l && echo 'every finding needs proof'
A penetration test report arrives containing a critical finding: SQL injection in the customer portal's search function, returning all 12,847 customer records. The development team lead opens the finding and sees a title, a severity rating, a two-sentence description, and a remediation recommendation. No screenshot. No request/response pair. No payload. No reproduction steps.
The development team lead responds: "Which search function? There are four. What parameter? What payload did you use? Are you sure it returned all records, or just the ones matching the query? We can't reproduce this." The finding enters a dispute cycle — three emails and a meeting to establish whether it's real, which endpoint is affected, and what the tester actually observed. Two weeks pass before anyone starts working on the fix.
Three desks away, the infrastructure team receives a different finding from the same report: LLMNR poisoning enabled network-wide. This one includes a screenshot of Responder capturing an NTLMv2 hash, a timestamp, the username and source IP of the captured credential, the hashcat command that cracked it, and the time to crack. The infrastructure engineer reads it, opens Group Policy Management, and disables LLMNR that afternoon.
Same report. Same provider. Two findings. One had evidence. One didn't. The finding with evidence was fixed in hours. The finding without evidence took two weeks to even confirm — and the fix hasn't started yet.
Evidence isn't decoration. It's the mechanism through which the tester's observation becomes the engineer's action. Without evidence, the finding is a claim that requires trust. With evidence, it's a fact that requires a response. The difference determines whether findings get fixed or get disputed.
Evidence in a penetration test report isn't a single-purpose artefact — it serves multiple functions for multiple audiences at different points in the report's lifecycle. Each function justifies the investment in capturing and presenting evidence properly.
| Function | Who Benefits | What Happens Without It |
|---|---|---|
| Proves the finding exists | The engineering team receiving the finding. They need to know the vulnerability is real — not a scanner false positive, not a tester's misunderstanding, not an artefact of the testing environment. | The finding is treated as unverified. Engineers deprioritise it, request additional information, or challenge its validity. The dispute cycle consumes days or weeks before remediation begins. |
| Enables reproduction | The engineer who needs to see the vulnerability in action before they can fix it. Reproduction steps let them trigger the vulnerability themselves, understand its mechanics, and test their fix against the same conditions. | The engineer guesses at the vulnerability's trigger. They implement a fix based on the description, can't confirm it works, and mark the finding as "remediated" on faith rather than evidence. |
| Establishes severity | The CISO and board. Evidence of 12,847 customer records returned by a SQL injection is more compelling than a description that says "data exposure." The screenshot communicates the scale of impact in a way no severity score can. | Severity is abstract. "Critical" means different things to different readers. A screenshot showing the tester logged in as Domain Admin on the domain controller makes severity concrete and indisputable. |
| Provides a remediation baseline | The engineer verifying the fix. Evidence of the vulnerability before remediation provides a baseline: "This is what the output looked like when the vulnerability was present. After the fix, the output should look like this instead." | No before-and-after comparison is possible. The engineer implements the fix but has no reference point to confirm the vulnerability no longer exists. |
| Serves as a legal and compliance artefact | Internal audit, the regulator, the insurer, and — in the event of a breach — legal counsel. Evidence demonstrates what was found, when it was found, and what was recommended. | The report claims vulnerabilities were found but provides no proof. An auditor or regulator may question the rigour of the assessment. An insurer may challenge the quality of the testing. |
A screenshot is only as useful as the information it contains and the context that accompanies it. A cropped, unlabelled screenshot with no explanation is marginally better than no evidence at all — and sometimes worse, because it implies proof without delivering it.
| Evidence Type | Bad Example | Good Example |
|---|---|---|
| Screenshot | A cropped terminal window showing coloured text with no labels, no timestamp, no indication of which system was targeted, and no explanation of what the output means. The reader must be a penetration tester to interpret it. | A full-width screenshot with: the target hostname or IP visible in the command prompt, a timestamp visible in the terminal or annotated on the image, key output highlighted or annotated to draw attention to the significant result, and a caption that explains what the screenshot demonstrates in plain English. |
| Request/response pair | The HTTP request only — no response. Or the response only — no request. Or both, but with so much irrelevant content (headers, cookies, boilerplate) that the significant part is invisible. | Both request and response, trimmed to show the relevant parts. The malicious payload highlighted in the request. The evidence of exploitation highlighted in the response. Irrelevant headers and boilerplate removed for clarity, with a note that the full request/response is available in the appendix. |
| Command output | A wall of unformatted terminal output pasted into the report. Hundreds of lines. The significant line is somewhere in the middle, unmarked. | The relevant portion of the output, clearly formatted, with the significant line or value highlighted. If the full output is relevant, it's in the appendix — the finding body contains only the evidence that proves the point. |
| Credential evidence | The full plaintext password displayed in the report: Summer2025!. Sensitive information visible to anyone who reads the document. | Password partially redacted: Sum*****!. Or the hash shown with a note that it was cracked in 4 seconds, without displaying the plaintext. Evidence that the password was weak is demonstrated without creating a new security risk in the report itself. |
| Data exposure evidence | A screenshot showing real customer names, addresses, and account numbers. The pen test report is now a data breach in PDF form. | A screenshot showing the volume of records returned ("12,847 rows") with personally identifiable data redacted or blurred. The column headers are visible to prove the type of data exposed. The actual data is not — because the report will be read by people who shouldn't see it. |
A pen test report that contains unredacted customer PII, plaintext passwords, or production credentials is itself a security risk. The report will be emailed, stored on file servers, uploaded to compliance portals, and shared with third parties. Every piece of sensitive data in the report is a piece of sensitive data that's now in the report's threat model. Redact responsibly: prove the finding without creating a new exposure.
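This redaction discipline can be applied before evidence ever reaches the report draft. As a minimal sketch, reusing the Sum*****! masking convention shown above, a helper might mask password values and long digit runs; the patterns here are illustrative assumptions, not a complete PII scrubber:

```python
import re

def redact_evidence(text: str) -> str:
    """Mask sensitive values in captured evidence before it enters the report.

    Illustrative only: real redaction needs rules matched to the data
    actually captured (names, addresses, tokens, session cookies, ...).
    """
    # Keep the first three characters of a password-like value and its final
    # symbol, masking the middle with a fixed-length run of asterisks so the
    # pattern stays recognisable without exposing the plaintext.
    text = re.sub(
        r"(password\s*[:=]\s*)(\S{3})\S*(\S)",
        lambda m: m.group(1) + m.group(2) + "*" * 5 + m.group(3),
        text,
        flags=re.IGNORECASE,
    )
    # Collapse long digit runs (account numbers, card numbers) to their length.
    text = re.sub(
        r"\d{8,}",
        lambda m: f"[{len(m.group(0))}-digit value redacted]",
        text,
    )
    return text

print(redact_evidence("password: Summer2025!"))   # prints password: Sum*****!
print(redact_evidence("account 12345678901"))
```

A rule like this belongs in the reporting pipeline, not in the tester's memory: if the draft never contains the plaintext, no reviewer can forget to remove it.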
A screenshot proves the vulnerability existed at the moment the screenshot was taken. Reproduction steps let the engineer verify it still exists, understand how it works, and — critically — confirm that their fix resolves it. Without reproduction steps, the engineer is fixing a problem they can't see.
The first finding generates questions. The second finding generates a fix. The developer reads the reproduction steps, opens the template file, changes one line, and verifies the fix by repeating the same steps. The entire remediation — from reading the finding to confirming the fix — takes less time than the email exchange the first finding would have required.
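Precise reproduction steps have a consistent shape: a numbered sequence the engineer can follow verbatim, ending with the observation that confirms the vulnerability and the observation that will confirm the fix. A small sketch of that structure, using a hypothetical endpoint and payload for illustration:

```python
def format_repro_steps(steps, expected_result):
    """Render reproduction steps as a numbered list the engineer can follow
    verbatim, ending with the observation that proves the vulnerability."""
    lines = [f"{i}. {step}" for i, step in enumerate(steps, start=1)]
    lines.append(f"Expected result: {expected_result}")
    return "\n".join(lines)

# Hypothetical endpoint, parameter, and payload - illustration only.
print(format_repro_steps(
    steps=[
        "Log in to the customer portal as any standard user.",
        "Navigate to /orders/search.",
        "Submit the value ' OR '1'='1 in the 'reference' parameter.",
        "Observe the response body.",
    ],
    expected_result=(
        "The response returns every order row, not only those matching the "
        "reference. After the fix, the same payload should return zero rows."
    ),
))
```

The final line is the part most reports omit: stating what the output should look like after remediation turns the reproduction steps into the verification test as well.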
Insufficient evidence doesn't just slow remediation — it creates a trust deficit between the pen test provider and the client that erodes the value of the entire engagement. The dispute cycle is predictable and expensive.
| Stage | What Happens | Time Cost |
|---|---|---|
| 1. Finding delivered | The engineer receives a finding with insufficient evidence or unclear reproduction steps. They attempt to verify it and fail — either because they can't find the vulnerable endpoint, can't reproduce the condition, or don't understand the tester's description. | 1–2 hours of the engineer's time, wasted. |
| 2. Clarification requested | The engineer emails the provider asking for clarification: which endpoint, which parameter, what payload, can you reproduce it again? The provider assigns the query to the tester — who may be on a different engagement and takes 24–48 hours to respond. | 2–3 days elapsed. Remediation hasn't started. |
| 3. Tester responds | The tester provides additional detail from memory or notes. Sometimes the detail is sufficient. Sometimes the tester can't fully recall the specific conditions — they've tested three other environments since. | 3–5 days elapsed. The finding may or may not be reproducible now. |
| 4. Dispute or acceptance | The engineer either reproduces the finding (remediation begins, 5+ days late) or can't reproduce it and disputes its validity. The finding is marked "disputed" or "unable to reproduce" and enters a limbo where it may never be resolved. | 5–10 days elapsed. Trust eroded. Future findings from this provider are received with scepticism. |
The dispute cycle is entirely preventable. Every hour spent in clarification emails, reproduction attempts, and trust-rebuilding meetings could have been avoided by the tester spending an additional five minutes capturing clear evidence and writing precise reproduction steps before submitting the report.
Evidence doesn't serve a single audience: it serves everyone who reads the report, at every stage of its lifecycle. What proves a finding also varies by its type, so the evidence strategy should account for both. The matrix below sets out the minimum evidence each common finding type requires, and what the gold standard looks like.
| Finding Type | Minimum Evidence Required | Gold Standard Evidence |
|---|---|---|
| Web application vulnerability (XSS, SQLi, IDOR, auth bypass) | The HTTP request containing the payload. The HTTP response showing the result. The endpoint URL and parameter name. | Full request/response pair (trimmed). Annotated screenshot of the rendered result. Reproduction steps numbered 1–N. Root cause (code line or template). Verification step showing expected output after fix. |
| Infrastructure misconfiguration (LLMNR, SMB signing, weak GPO) | Tool output showing the misconfiguration. The target hostname or IP. The specific setting or protocol identified. | Annotated screenshot with the vulnerable configuration highlighted. The GPO or registry path where the setting lives. Nmap or equivalent output showing the current state. Command to verify the fix. |
| Credential compromise (password cracking, hash capture) | Proof of hash capture (hash value with plaintext redacted). The tool and wordlist used. Time to crack. | Screenshot of hash capture with source IP and username visible. Hashcat output showing the rule, wordlist, and crack time. The password pattern ("Season+Year+Symbol") without full plaintext. Comparison against the domain password policy showing why it was accepted. |
| Privilege escalation (Kerberoasting, token manipulation, GPO abuse) | Proof of the initial privilege level. Proof of the escalated privilege level. The technique and tool used. | Before-and-after screenshots showing privilege context (whoami /all or equivalent). The specific escalation path. Timestamps showing time to escalate. Evidence of access obtained with the escalated privileges. |
| Data access (file share access, database query, document retrieval) | Proof that data was accessible. The type and volume of data. The access path used. | Screenshot showing the data listing with content redacted but column headers/file types visible. Record count. The credential or access path used to reach the data. Timestamp. Evidence that access was not authorised for the account used. |
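The minimum-evidence column lends itself to a pre-submission check that flags findings missing required artefacts before the report ships. The finding types and field names below are illustrative assumptions loosely mirroring the matrix, not a standard schema:

```python
# Minimum evidence fields per finding type, loosely mirroring the matrix.
# Field names are illustrative assumptions, not a standard schema.
MINIMUM_EVIDENCE = {
    "web_app": {"http_request", "http_response", "endpoint", "parameter"},
    "infra_misconfig": {"tool_output", "target", "setting"},
    "credential": {"hash_capture", "tool_and_wordlist", "time_to_crack"},
    "privilege_escalation": {"privilege_before", "privilege_after", "technique"},
    "data_access": {"access_proof", "data_type_and_volume", "access_path"},
}

def missing_evidence(finding_type: str, evidence: dict) -> set:
    """Return the minimum-evidence fields absent or empty for this finding."""
    required = MINIMUM_EVIDENCE[finding_type]
    return {field for field in required if not evidence.get(field)}

draft = {
    "http_request": "GET /orders/search?reference=' OR '1'='1",
    "http_response": "HTTP/1.1 200 OK ... 12,847 rows",
    "endpoint": "/orders/search",
    # "parameter" not captured yet - the check should flag it.
}
print(missing_evidence("web_app", draft))  # prints {'parameter'}
```

A check like this costs nothing at submission time and catches exactly the gaps that would otherwise surface as clarification emails a week later.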
Evidence is the mechanism through which trust is established between the penetration tester and the team that must act on their findings. Without evidence, a finding is a claim — subject to dispute, deprioritisation, and the corrosive doubt that comes from being asked to fix something you can't see, can't reproduce, and aren't sure exists.
Good evidence is annotated, timestamped, redacted where necessary, and accompanied by reproduction steps precise enough that the engineer can trigger the vulnerability themselves, implement a fix, and verify the fix works — all without contacting the tester. It serves every audience: the engineer who needs to reproduce and fix, the CISO who needs to understand the impact and the chain, the board who needs to see the business risk made tangible, the auditor who needs to confirm rigorous testing, and the legal team who may one day need to prove what was known and when.
The five minutes the tester spends capturing a clear, annotated screenshot and writing precise reproduction steps saves days of dispute, weeks of delay, and the trust deficit that poor evidence creates between providers and clients. Evidence isn't an afterthought. It's the proof that the engagement was worth commissioning — and the foundation on which every remediation decision is built.
Our reports include annotated screenshots, request/response pairs, reproduction steps, and verification methods for every finding — because a finding without evidence is a finding waiting to be disputed.