The Evidence Contract
New in 0.4.8.
A scanner verdict you can't verify is a verdict you can't automate. Since 0.4.8, doc-firewall enforces a hard contract on every finding that can drive a decision:
Every finding with severity `HIGH`/`CRITICAL` or `verdict_class == BLOCK` carries one of:

- `evidence["malicious_text"]`: the actual offending content, such as the injected prompt, the text hidden in a 0-size font, the resolved `/OpenAction` JavaScript body, the launch command, or the suspicious URI.
- `evidence["evidence_unavailable_reason"]`: why the content could not be extracted (encrypted stream, unsupported filter, compressed object region), plus `evidence["debug_steps"]`: concrete commands you can run to dig the content out of the document yourself (`pdf-parser.py -o 12 -f file.pdf`, `olevba file.doc`, `unzip -p file.docx word/document.xml`, …).

This means you can safely automate on it:
```python
report = scanner.scan("incoming.pdf")
if report.verdict == Verdict.BLOCK:
    for f in report.findings:
        if f.verdict_class == VerdictClass.BLOCK:
            print(f.title)
            print(" evidence:", f.evidence.get("malicious_text")
                  or f.evidence.get("evidence_unavailable_reason"))
            for step in f.evidence.get("debug_steps", []):
                print(" debug:", step)
```
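
A pipeline that relies on the contract may also want to assert it defensively before acting on a finding. The sketch below checks the rule over plain dictionaries. The dictionary shape (`severity`, `verdict_class`, `evidence` keys) mirrors the fields named above but is an assumption for illustration, not doc-firewall's actual finding type, and `satisfies_evidence_contract` is a hypothetical helper name.

```python
from typing import Any, Mapping

# Assumed, illustrative finding shape:
# {"severity": str, "verdict_class": str, "evidence": dict}
DECISION_SEVERITIES = {"HIGH", "CRITICAL"}


def satisfies_evidence_contract(finding: Mapping[str, Any]) -> bool:
    """Return True if the finding cannot drive a decision or carries usable evidence."""
    drives_decision = (
        finding.get("severity") in DECISION_SEVERITIES
        or finding.get("verdict_class") == "BLOCK"
    )
    if not drives_decision:
        return True  # the contract only binds decision-driving findings

    evidence = finding.get("evidence") or {}
    if evidence.get("malicious_text"):
        return True  # the offending content itself is present
    # Otherwise both the reason and at least one debug step must be present.
    return bool(evidence.get("evidence_unavailable_reason")) and bool(
        evidence.get("debug_steps")
    )
```

A caller could run this over every finding and treat a `False` result as a scanner bug rather than as a property of the document.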
What 0.4.8 changed under the hood
PDF actions are resolved, not just counted. Before 0.4.8 an /OpenAction token produced {"token": "/OpenAction", "count": 2} at HIGH severity — the reviewer couldn't see what the action did. The scanner now follows the object reference and classifies the target:

| Resolved action | Result |
|---|---|
| `/S /JavaScript` | script body extracted into `malicious_text` |
| `/S /URI` | target URL extracted; scheme-tiered |
| `/S /Launch` | command line extracted; HIGH |
| `/S /GoTo` / bare destination | benign: "open at page N" is standard in exported PDFs; reported as INFO, never flags the document |
| `/S /Named` | viewer command extracted; NextPage-class actions are benign |
| unresolvable (encrypted / unsupported filter) | `evidence_unavailable_reason` + `debug_steps` |

/AA (additional actions) and /Next action chains get the same treatment.
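
To make the classification concrete, here is a minimal sketch of the table as a lookup, plus a walk along a `/Next` chain. It is not the scanner's implementation. The dictionary representation of an action, the `"unclassified"` fallback, and the simplification that `/Next` holds a single action (the PDF format also allows an array) are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional

# Outcome per resolved action subtype, transcribed from the table above.
RESULT_BY_SUBTYPE = {
    "/JavaScript": "script body extracted into malicious_text",
    "/URI": "target URL extracted; scheme-tiered",
    "/Launch": "command line extracted; HIGH",
    "/GoTo": "benign; reported as INFO",
    "/Named": "viewer command extracted; NextPage-class actions are benign",
}


def describe_action(subtype: Optional[str], resolvable: bool = True) -> str:
    """Map a resolved action subtype to the outcome listed in the table."""
    if not resolvable:
        return "evidence_unavailable_reason + debug_steps"
    if subtype is None:  # bare destination, handled like /GoTo
        return RESULT_BY_SUBTYPE["/GoTo"]
    # "unclassified" is a placeholder of this sketch, not a documented outcome.
    return RESULT_BY_SUBTYPE.get(subtype, "unclassified")


def walk_chain(action: Dict[str, Any]) -> List[str]:
    """Describe every action along a /Next chain, stopping if the chain loops."""
    seen = set()
    outcomes = []
    node: Optional[Dict[str, Any]] = action
    while node is not None and id(node) not in seen:
        seen.add(id(node))
        outcomes.append(describe_action(node.get("S")))
        node = node.get("Next")
    return outcomes
```

The cycle guard matters because a crafted document can point a `/Next` entry back at an earlier action.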
Hidden text findings carry the hidden text. DOCX gained this in 0.4.6; 0.4.8 adds parity for PPTX (the run following the invisible styling) and ODF (the styles.xml hidden style is joined to the content.xml spans that use it). XLSX white-cell styling still reports contract-only evidence (the style→cell→sharedStrings join is not implemented); the finding explains that and tells you which part to dump.
Snippets are centered on the match. Evidence used to be the first 250 characters of the field — often containing none of the matched content. Every regex-based finding now centers malicious_text on the match and includes the exact match text separately.
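
The idea of centering can be sketched in a few lines. This is an illustration of the technique, not the scanner's code: `centered_snippet` is a hypothetical helper, and the 250-character default simply reuses the figure quoted above.

```python
import re
from typing import Optional, Tuple


def centered_snippet(
    text: str, pattern: str, width: int = 250
) -> Optional[Tuple[str, str]]:
    """Return (snippet, exact_match) with the window centered on the first match."""
    m = re.search(pattern, text)
    if m is None:
        return None
    # Split the remaining budget evenly before and after the match.
    # If the match is longer than the budget, the snippet is the match itself.
    pad = max(width - (m.end() - m.start()), 0) // 2
    start = max(m.start() - pad, 0)
    end = min(m.end() + pad, len(text))
    return text[start:end], m.group(0)
```

With a field that has 1000 characters of filler before the match, the first 250 characters contain none of the matched content, while the centered window always does.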
Misleading labels were renamed. "SQL Injection in Metadata" (HIGH, confidence 0.9) is now "SQL-like Syntax in Metadata" (MEDIUM, 0.6): it is a heuristic that needs either two distinct SQL tokens or statement punctuation to fire at all, and the evidence shows the actual SQL.
Incomplete scans are never silent. If a scan stage exceeds its budget (10 minutes per stage by default, doubled in 0.4.8), the report gains a scan_timeout finding that escalates the verdict to at least FLAG; setting on_timeout_verdict="block" makes the scan fail closed instead. The finding states explicitly that this is an operational signal, not a malice claim, and its debug_steps show how to re-run with a bigger budget.
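
"At least FLAG" means the timeout sets a floor and never lowers a verdict that is already worse. A minimal sketch of that rule follows. The `Verdict` enum here is a stand-in with an assumed ordering, `apply_timeout_policy` is a hypothetical name, and the `"flag"` default value is an assumption (the text above only names `"block"`).

```python
from enum import IntEnum


class Verdict(IntEnum):
    """Illustrative stand-in, ordered by increasing strictness."""
    ALLOW = 0
    FLAG = 1
    BLOCK = 2


def apply_timeout_policy(
    current: Verdict, timed_out: bool, on_timeout_verdict: str = "flag"
) -> Verdict:
    """Raise the verdict to the timeout floor without ever lowering it."""
    if not timed_out:
        return current
    floor = Verdict.BLOCK if on_timeout_verdict == "block" else Verdict.FLAG
    return max(current, floor)
```

A document that was already BLOCKed stays BLOCKed, and a clean document whose scan timed out can no longer come back as ALLOW.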
Coverage transparency — know what's actually running
Several of the strongest detectors are opt-in and depend on optional packages: YARA and an antivirus engine (T1 malware signatures), and sentence-transformers / BERT / OCR / perplexity (the ML layers of T4 prompt injection). With a default pip install doc-firewall and the default config they are off, so the scanner runs on baseline regex/structural checks only — and previously it did so silently.
Now every report tells you what was active:
```python
report = scanner.scan("incoming.pdf")
print(report.coverage["degraded"])          # True in reduced-coverage mode
print(report.coverage["degraded_threats"])  # e.g. ["T1", "T4"]
print(report.coverage["threat_status"])     # {"T1": "baseline-only", ...}
```
A Scanner built in reduced-coverage mode also logs one loud warning naming each inactive capability and exactly how to turn it on.
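
If you forward reports to a log or dashboard, the three coverage keys can be condensed into one line. The sketch below works on a plain dictionary with the keys documented above. `coverage_summary` is a hypothetical helper, and the `"unknown"` fallback for a threat with no status entry is an assumption of this sketch.

```python
from typing import Any, Mapping


def coverage_summary(coverage: Mapping[str, Any]) -> str:
    """Condense a coverage mapping into a single log-friendly line."""
    if not coverage.get("degraded"):
        return "full coverage"
    statuses = coverage.get("threat_status", {})
    parts = [
        f"{threat}={statuses.get(threat, 'unknown')}"
        for threat in coverage.get("degraded_threats", [])
    ]
    return "reduced coverage: " + ", ".join(parts)
```

Usage would look like `log.warning(coverage_summary(report.coverage))`, assuming `report.coverage` behaves like a mapping as the example above suggests.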
To fail closed when promised detection is inactive:
```python
cfg = ScanConfig(profile="strict")
cfg.require_full_coverage = True  # any T1/T4 with no active detector → >= FLAG
# or require specific capabilities:
cfg.required_capabilities = ["yara", "semantic_nn"]
```
A document scanned with a required capability missing can no longer return ALLOW — it carries a reduced_coverage finding and escalates to FLAG.
Note: a format-parsing enabler (e.g. olefile, which lets the scanner read legacy .doc/.xls/.ppt) is not counted as malware detection — T1 still reports baseline-only unless YARA or an AV engine is active.
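
The distinction between a parsing enabler and a detector can be sketched as two small set operations. Everything here is illustrative: `"yara"` and `"semantic_nn"` are the capability names used above, while `"antivirus"`, `"olefile"` as a capability name, and the `"active"` status string are assumptions of this sketch.

```python
from typing import Iterable, List, Set

# Capabilities that count as T1 malware detection in this sketch.
# "antivirus" is an assumed name for the AV engine capability.
T1_DETECTORS = {"yara", "antivirus"}


def t1_status(active: Set[str]) -> str:
    """A parser such as olefile adds reach, not detection, so it never upgrades T1."""
    return "active" if active & T1_DETECTORS else "baseline-only"


def missing_capabilities(required: Iterable[str], active: Set[str]) -> List[str]:
    """Return the required capabilities that are not active, sorted for stable output."""
    return sorted(set(required) - active)
```

A non-empty result from `missing_capabilities` corresponds to the case where the document carries a reduced_coverage finding and cannot return ALLOW.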
Content the scanner can't read
New in 0.4.8. Encrypted PDFs (/Encrypt) and password-protected Office files are a blind spot — the scanner cannot decrypt them. on_unscannable_verdict chooses the policy:

| value | behaviour |
|---|---|
| `warn` (default) | FLAG, with a `debug_steps` recipe (`qpdf --decrypt`, `msoffcrypto-tool`) |
| `block` | fail closed: un-inspectable content is BLOCKed |
| `allow` | recorded as INFO only |

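
The three values amount to a lookup from policy to the lowest verdict an unreadable document may receive. The sketch below encodes the table that way. `unscannable_floor` is a hypothetical helper, verdicts are shown as plain strings, and rejecting unknown values with `ValueError` is an assumption about sensible behaviour, not documented scanner behaviour.

```python
from typing import Optional

# Lowest verdict for un-inspectable content, per policy value.
# None means the event is recorded as INFO only and the verdict is untouched.
UNSCANNABLE_FLOOR = {
    "warn": "FLAG",
    "block": "BLOCK",
    "allow": None,
}


def unscannable_floor(policy: str = "warn") -> Optional[str]:
    """Return the verdict floor for a policy, rejecting unknown values."""
    if policy not in UNSCANNABLE_FLOOR:
        raise ValueError(f"unknown on_unscannable_verdict: {policy!r}")
    return UNSCANNABLE_FLOOR[policy]
```

Failing loudly on a misspelled policy avoids the case where a typo silently behaves like `allow`.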
Guarantee
make benchmark measures contract compliance across the adversarial corpus on every release; scripts/benchmark_gate.py fails the release if compliance is below 100 %.
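
For readers who want a similar gate in their own CI, the core of such a check is small. This is a sketch of the general pattern, not the contents of scripts/benchmark_gate.py. The input format (one boolean per decision-driving finding) and the choice to treat an empty corpus as a failure are assumptions.

```python
from typing import Iterable


def compliance_percent(results: Iterable[bool]) -> float:
    """Percentage of findings that satisfied the evidence contract."""
    results = list(results)
    if not results:
        # Assumption: an empty corpus fails, so a misconfigured run cannot pass.
        return 0.0
    return 100.0 * sum(results) / len(results)


def gate_exit_code(results: Iterable[bool], threshold: float = 100.0) -> int:
    """Return 0 when compliance meets the threshold, 1 otherwise."""
    return 0 if compliance_percent(results) >= threshold else 1
```

A CI step would pass the return value of `gate_exit_code` to `sys.exit`, so a single non-compliant finding fails the build.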