> ls evidence/ | wc -l && echo 'every finding needs proof'
A penetration test report arrives containing a critical finding: SQL injection in the customer portal's search function, returning all 12,847 customer records. The development team lead opens the finding and sees a title, a severity rating, a two-sentence description, and a remediation recommendation. No screenshot. No request/response pair. No payload. No reproduction steps.
The development team lead responds: "Which search function? There are four. What parameter? What payload did you use? Are you sure it returned all records, or just the ones matching the query? We can't reproduce this." The finding enters a dispute cycle — three emails and a meeting to establish whether it's real, which endpoint is affected, and what the tester actually observed. Two weeks pass before anyone starts working on the fix.
Three desks away, the infrastructure team receives a different finding from the same report: LLMNR poisoning enabled network-wide. This one includes a screenshot of Responder capturing an NTLMv2 hash, a timestamp, the username and source IP of the captured credential, the hashcat command that cracked it, and the time to crack. The infrastructure engineer reads it, opens Group Policy Management, and disables LLMNR that afternoon.
Same report. Same provider. Two findings. One had evidence. One didn't. The finding with evidence was fixed in hours. The finding without evidence took two weeks to even confirm — and the fix hasn't started yet.
Evidence isn't decoration. It's the mechanism through which the tester's observation becomes the engineer's action. Without evidence, the finding is a claim that requires trust. With evidence, it's a fact that requires a response. The difference determines whether findings get fixed or get disputed.
Evidence in a penetration test report isn't a single-purpose artefact — it serves multiple functions for multiple audiences at different points in the report's lifecycle. Each function justifies the investment in capturing and presenting evidence properly.
| Function | Who Benefits | What Happens Without It |
|---|---|---|
| Proves the finding exists | The engineering team receiving the finding. They need to know the vulnerability is real — not a scanner false positive, not a tester's misunderstanding, not an artefact of the testing environment. | The finding is treated as unverified. Engineers deprioritise it, request additional information, or challenge its validity. The dispute cycle consumes days or weeks before remediation begins. |
| Enables reproduction | The engineer who needs to see the vulnerability in action before they can fix it. Reproduction steps let them trigger the vulnerability themselves, understand its mechanics, and test their fix against the same conditions. | The engineer guesses at the vulnerability's trigger. They implement a fix based on the description, can't confirm it works, and mark the finding as "remediated" on faith rather than evidence. |
| Establishes severity | The CISO and board. Evidence of 12,847 customer records returned by a SQL injection is more compelling than a description that says "data exposure." The screenshot communicates the scale of impact in a way no severity score can. | Severity is abstract. "Critical" means different things to different readers. A screenshot showing the tester logged in as Domain Admin on the domain controller makes severity concrete and indisputable. |
| Provides a remediation baseline | The engineer verifying the fix. Evidence of the vulnerability before remediation provides a baseline: "This is what the output looked like when the vulnerability was present. After the fix, the output should look like this instead." | No before-and-after comparison is possible. The engineer implements the fix but has no reference point to confirm the vulnerability no longer exists. |
| Serves as a legal and compliance artefact | Internal audit, the regulator, the insurer, and — in the event of a breach — legal counsel. Evidence demonstrates what was found, when it was found, and what was recommended. | The report claims vulnerabilities were found but provides no proof. An auditor or regulator may question the rigour of the assessment. An insurer may challenge the quality of the testing. |
A screenshot is only as useful as the information it contains and the context that accompanies it. A cropped, unlabelled screenshot with no explanation is marginally better than no evidence at all — and sometimes worse, because it implies proof without delivering it.
| Evidence Type | Bad Example | Good Example |
|---|---|---|
| Screenshot | A cropped terminal window showing coloured text with no labels, no timestamp, no indication of which system was targeted, and no explanation of what the output means. The reader must be a penetration tester to interpret it. | A full-width screenshot with: the target hostname or IP visible in the command prompt, a timestamp visible in the terminal or annotated on the image, key output highlighted or annotated to draw attention to the significant result, and a caption that explains what the screenshot demonstrates in plain English. |
| Request/response pair | The HTTP request only — no response. Or the response only — no request. Or both, but with so much irrelevant content (headers, cookies, boilerplate) that the significant part is invisible. | Both request and response, trimmed to show the relevant parts. The malicious payload highlighted in the request. The evidence of exploitation highlighted in the response. Irrelevant headers and boilerplate removed for clarity, with a note that the full request/response is available in the appendix. |
| Command output | A wall of unformatted terminal output pasted into the report. Hundreds of lines. The significant line is somewhere in the middle, unmarked. | The relevant portion of the output, clearly formatted, with the significant line or value highlighted. If the full output is relevant, it's in the appendix — the finding body contains only the evidence that proves the point. |
| Credential evidence | The full plaintext password displayed in the report: Summer2025!. Sensitive information visible to anyone who reads the document. | Password partially redacted: Sum*****!. Or the hash shown with a note that it was cracked in 4 seconds, without displaying the plaintext. Evidence that the password was weak is demonstrated without creating a new security risk in the report itself. |
| Data exposure evidence | A screenshot showing real customer names, addresses, and account numbers. The pen test report is now a data breach in PDF form. | A screenshot showing the volume of records returned ("12,847 rows") with personally identifiable data redacted or blurred. The column headers are visible to prove the type of data exposed. The actual data is not — because the report will be read by people who shouldn't see it. |
A pen test report that contains unredacted customer PII, plaintext passwords, or production credentials is itself a security risk. The report will be emailed, stored on file servers, uploaded to compliance portals, and shared with third parties. Every piece of sensitive data in the report is a piece of sensitive data that's now in the report's threat model. Redact responsibly: prove the finding without creating a new exposure.
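This redaction discipline can be applied before evidence ever reaches the report draft. As a minimal sketch, reusing the Sum*****! masking convention shown above, a helper might mask password values and long digit runs; the patterns here are illustrative assumptions, not a complete PII scrubber:

```python
import re

def redact_evidence(text: str) -> str:
    """Mask sensitive values in captured evidence before it enters the report.

    Illustrative only: real redaction needs rules matched to the data
    actually captured (names, addresses, tokens, session cookies, ...).
    """
    # Keep the first three characters of a password-like value and its final
    # symbol, masking the middle with a fixed-length run of asterisks so the
    # pattern stays recognisable without exposing the plaintext.
    text = re.sub(
        r"(password\s*[:=]\s*)(\S{3})\S*(\S)",
        lambda m: m.group(1) + m.group(2) + "*" * 5 + m.group(3),
        text,
        flags=re.IGNORECASE,
    )
    # Collapse long digit runs (account numbers, card numbers) to their length.
    text = re.sub(
        r"\d{8,}",
        lambda m: f"[{len(m.group(0))}-digit value redacted]",
        text,
    )
    return text

print(redact_evidence("password: Summer2025!"))   # prints password: Sum*****!
print(redact_evidence("account 12345678901"))
```

A rule like this belongs in the reporting pipeline, not in the tester's memory: if the draft never contains the plaintext, no reviewer can forget to remove it.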
A screenshot proves the vulnerability existed at the moment the screenshot was taken. Reproduction steps let the engineer verify it still exists, understand how it works, and — critically — confirm that their fix resolves it. Without reproduction steps, the engineer is fixing a problem they can't see.
The first finding generates questions. The second finding generates a fix. The developer reads the reproduction steps, opens the template file, changes one line, and verifies the fix by repeating the same steps. The entire remediation — from reading the finding to confirming the fix — takes less time than the email exchange the first finding would have required.
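Precise reproduction steps have a consistent shape: a numbered sequence the engineer can follow verbatim, ending with the observation that confirms the vulnerability and the observation that will confirm the fix. A small sketch of that structure, using a hypothetical endpoint and payload for illustration:

```python
def format_repro_steps(steps, expected_result):
    """Render reproduction steps as a numbered list the engineer can follow
    verbatim, ending with the observation that proves the vulnerability."""
    lines = [f"{i}. {step}" for i, step in enumerate(steps, start=1)]
    lines.append(f"Expected result: {expected_result}")
    return "\n".join(lines)

# Hypothetical endpoint, parameter, and payload - illustration only.
print(format_repro_steps(
    steps=[
        "Log in to the customer portal as any standard user.",
        "Navigate to /orders/search.",
        "Submit the value ' OR '1'='1 in the 'reference' parameter.",
        "Observe the response body.",
    ],
    expected_result=(
        "The response returns every order row, not only those matching the "
        "reference. After the fix, the same payload should return zero rows."
    ),
))
```

The final line is the part most reports omit: stating what the output should look like after remediation turns the reproduction steps into the verification test as well.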
Insufficient evidence doesn't just slow remediation — it creates a trust deficit between the pen test provider and the client that erodes the value of the entire engagement. The dispute cycle is predictable and expensive.
| Stage | What Happens | Time Cost |
|---|---|---|
| 1. Finding delivered | The engineer receives a finding with insufficient evidence or unclear reproduction steps. They attempt to verify it and fail — either because they can't find the vulnerable endpoint, can't reproduce the condition, or don't understand the tester's description. | 1–2 hours of the engineer's time, wasted. |
| 2. Clarification requested | The engineer emails the provider asking for clarification: which endpoint, which parameter, what payload, can you reproduce it again? The provider assigns the query to the tester — who may be on a different engagement and takes 24–48 hours to respond. | 2–3 days elapsed. Remediation hasn't started. |
| 3. Tester responds | The tester provides additional detail from memory or notes. Sometimes the detail is sufficient. Sometimes the tester can't fully recall the specific conditions — they've tested three other environments since. | 3–5 days elapsed. The finding may or may not be reproducible now. |
| 4. Dispute or acceptance | The engineer either reproduces the finding (remediation begins, 5+ days late) or can't reproduce it and disputes its validity. The finding is marked "disputed" or "unable to reproduce" and enters a limbo where it may never be resolved. | 5–10 days elapsed. Trust eroded. Future findings from this provider are received with scepticism. |
The dispute cycle is entirely preventable. Every hour spent in clarification emails, reproduction attempts, and trust-rebuilding meetings could have been avoided by the tester spending an additional five minutes capturing clear evidence and writing precise reproduction steps before submitting the report.
Evidence doesn't serve a single audience: it serves everyone who reads the report, at every stage of its lifecycle. What proves a finding also varies by its type, so the evidence strategy should account for both. The matrix below sets out the minimum evidence each common finding type requires, and what the gold standard looks like.
| Finding Type | Minimum Evidence Required | Gold Standard Evidence |
|---|---|---|
| Web application vulnerability (XSS, SQLi, IDOR, auth bypass) | The HTTP request containing the payload. The HTTP response showing the result. The endpoint URL and parameter name. | Full request/response pair (trimmed). Annotated screenshot of the rendered result. Reproduction steps numbered 1–N. Root cause (code line or template). Verification step showing expected output after fix. |
| Infrastructure misconfiguration (LLMNR, SMB signing, weak GPO) | Tool output showing the misconfiguration. The target hostname or IP. The specific setting or protocol identified. | Annotated screenshot with the vulnerable configuration highlighted. The GPO or registry path where the setting lives. Nmap or equivalent output showing the current state. Command to verify the fix. |
| Credential compromise (password cracking, hash capture) | Proof of hash capture (hash value with plaintext redacted). The tool and wordlist used. Time to crack. | Screenshot of hash capture with source IP and username visible. Hashcat output showing the rule, wordlist, and crack time. The password pattern ("Season+Year+Symbol") without full plaintext. Comparison against the domain password policy showing why it was accepted. |
| Privilege escalation (Kerberoasting, token manipulation, GPO abuse) | Proof of the initial privilege level. Proof of the escalated privilege level. The technique and tool used. | Before-and-after screenshots showing privilege context (whoami /all or equivalent). The specific escalation path. Timestamps showing time to escalate. Evidence of access obtained with the escalated privileges. |
| Data access (file share access, database query, document retrieval) | Proof that data was accessible. The type and volume of data. The access path used. | Screenshot showing the data listing with content redacted but column headers/file types visible. Record count. The credential or access path used to reach the data. Timestamp. Evidence that access was not authorised for the account used. |
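The minimum-evidence column lends itself to a pre-submission check that flags findings missing required artefacts before the report ships. The finding types and field names below are illustrative assumptions loosely mirroring the matrix, not a standard schema:

```python
# Minimum evidence fields per finding type, loosely mirroring the matrix.
# Field names are illustrative assumptions, not a standard schema.
MINIMUM_EVIDENCE = {
    "web_app": {"http_request", "http_response", "endpoint", "parameter"},
    "infra_misconfig": {"tool_output", "target", "setting"},
    "credential": {"hash_capture", "tool_and_wordlist", "time_to_crack"},
    "privilege_escalation": {"privilege_before", "privilege_after", "technique"},
    "data_access": {"access_proof", "data_type_and_volume", "access_path"},
}

def missing_evidence(finding_type: str, evidence: dict) -> set:
    """Return the minimum-evidence fields absent or empty for this finding."""
    required = MINIMUM_EVIDENCE[finding_type]
    return {field for field in required if not evidence.get(field)}

draft = {
    "http_request": "GET /orders/search?reference=' OR '1'='1",
    "http_response": "HTTP/1.1 200 OK ... 12,847 rows",
    "endpoint": "/orders/search",
    # "parameter" not captured yet - the check should flag it.
}
print(missing_evidence("web_app", draft))  # prints {'parameter'}
```

A check like this costs nothing at submission time and catches exactly the gaps that would otherwise surface as clarification emails a week later.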
Evidence is the mechanism through which trust is established between the penetration tester and the team that must act on their findings. Without evidence, a finding is a claim — subject to dispute, deprioritisation, and the corrosive doubt that comes from being asked to fix something you can't see, can't reproduce, and aren't sure exists.
Good evidence is annotated, timestamped, redacted where necessary, and accompanied by reproduction steps precise enough that the engineer can trigger the vulnerability themselves, implement a fix, and verify the fix works — all without contacting the tester. It serves every audience: the engineer who needs to reproduce and fix, the CISO who needs to understand the impact and the chain, the board who needs to see the business risk made tangible, the auditor who needs to confirm rigorous testing, and the legal team who may one day need to prove what was known and when.
The five minutes the tester spends capturing a clear, annotated screenshot and writing precise reproduction steps saves days of dispute, weeks of delay, and the trust deficit that poor evidence creates between providers and clients. Evidence isn't an afterthought. It's the proof that the engagement was worth commissioning — and the foundation on which every remediation decision is built.
Our reports include annotated screenshots, request/response pairs, reproduction steps, and verification methods for every finding — because a finding without evidence is a finding waiting to be disputed.