Performance & Latency Targets

Name: DocFirewall
Author: DocFirewall

DocFirewall scans in two stages. The fast scan runs on raw bytes without full parsing and takes from under a millisecond to about 20 ms. The deep scan runs structural analysis, ML detectors, and YARA rules; its latency depends on which detectors are enabled.

Latency Targets by Profile

Targets were measured on a four-core CPU (no GPU) with synthetic representative documents. Run scripts/benchmark_throughput.py to reproduce on your hardware.

Profile	Target P95	Detectors active
`lenient`	< 100 ms	YARA + Aho-Corasick only
`balanced` (default)	< 500 ms	YARA + Aho-Corasick + all structural checks
`strict`	< 2,000 ms	Above + BERT classifier + steganography checks
`strict` + semantic NN	< 3,000 ms	Above + sentence-transformer cosine NN

Note

BERT (enable_advanced_bert=True) adds ~50 ms per document on CPU at the default bert_max_chunks=20 setting. Reduce bert_max_chunks or run on GPU to lower this.

P50 / P95 / P99 by Format (balanced profile, no ML)

Format	P50	P95	P99	Notes
HTML	< 5 ms	< 15 ms	< 25 ms	Fastest — no ZIP overhead
RTF	< 8 ms	< 20 ms	< 35 ms
XLSX	< 12 ms	< 40 ms	< 70 ms	ZIP + XML parse
PPTX	< 12 ms	< 40 ms	< 70 ms
DOCX	< 15 ms	< 50 ms	< 90 ms	CustomXML + embeddings scan
PDF	< 20 ms	< 80 ms	< 150 ms	FlateDecode decompression adds variance

Archive scanning (enable_archive_scan=True) adds latency proportional to the number and size of archive members. A ZIP with 50 DOCX members adds roughly 50 × the single-file DOCX scan time.
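The proportionality rule can be turned into a quick back-of-the-envelope estimate. The sketch below is illustrative only: estimate_archive_latency_ms is a hypothetical helper, not part of DocFirewall, and it assumes members are scanned one after another with no fixed per-archive overhead.

```python
def estimate_archive_latency_ms(members: int, per_file_ms: float) -> float:
    """Rough added latency for an archive: members x single-file scan time.

    Hypothetical helper; assumes sequential scanning and ignores any
    fixed cost of opening the archive itself.
    """
    if members < 0 or per_file_ms < 0:
        raise ValueError("members and per_file_ms must be non-negative")
    return members * per_file_ms


# 50 DOCX members at the balanced-profile P50 and P95 bounds (15 ms and 50 ms)
typical_ms = estimate_archive_latency_ms(50, 15.0)
pessimistic_ms = estimate_archive_latency_ms(50, 50.0)
print(typical_ms, pessimistic_ms)
```

Multiplying by the per-file P95 overstates the archive's own P95, since it is unlikely that every member hits its tail latency at once; treat that figure as a conservative ceiling.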

Throughput Estimates

Profile	Target throughput	Notes
`balanced` (heuristics only)	≥ 20 docs/s per CPU core	No ML loading overhead
`strict` (BERT, CPU)	1–2 docs/s per core	BERT inference dominates
`strict` (BERT, GPU)	10–15 docs/s	Batch inference (`bert_max_chunks` tuned up)
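For capacity planning, the per-core figures can be converted into a core count. This is a minimal sketch that assumes throughput scales linearly with cores, which real deployments may not achieve; cores_needed is a hypothetical helper name.

```python
import math


def cores_needed(target_docs_per_s: float, docs_per_s_per_core: float) -> int:
    """Cores required to reach a target rate, assuming linear scaling."""
    if target_docs_per_s <= 0 or docs_per_s_per_core <= 0:
        raise ValueError("both rates must be positive")
    return math.ceil(target_docs_per_s / docs_per_s_per_core)


# Target of 100 docs/s using the per-core estimates from the table
balanced_cores = cores_needed(100, 20)   # balanced, heuristics only
strict_cpu_cores = cores_needed(100, 1)  # strict with BERT on CPU, low end
print(balanced_cores, strict_cpu_cores)
```

The gap between the two results is the practical argument for reserving BERT-enabled scanning for the traffic that needs it.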

Tuning for Throughput

Pre-compile at init time

All YARA rules and regex patterns are compiled once in Scanner.__init__, not on every scan. Instantiate one Scanner per process or thread and reuse it across all scans.

from doc_firewall import Scanner, ScanConfig  # assumes both are exported at package level

scanner = Scanner(config=ScanConfig(profile="balanced"))
for path in document_queue:
    report = scanner.scan(path)   # reuses compiled rules and models

Constructing a Scanner re-runs the expensive one-time setup (compiling the Aho-Corasick automata and loading the bundled ML classifier), so building one per call is dramatically slower. As of 0.5.1, the module-level helper (from doc_firewall import scan; scan(path)) reuses a cached default Scanner when called with the default config, so the common one-liner is no longer ~34× slower than a reused instance. With a custom config or policy_engine, still construct one Scanner and reuse it as above.
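One way to enforce reuse with custom configurations is a cached factory, so each process builds a scanner at most once per distinct profile. The sketch below uses a stand-in class so that it runs without DocFirewall installed; StubScanner and get_scanner are hypothetical names, and how the library caches its own default Scanner is not shown here.

```python
from functools import lru_cache


class StubScanner:
    """Stand-in for an expensive-to-build scanner (not DocFirewall's class)."""

    constructions = 0  # counts how many times setup has run

    def __init__(self, profile: str) -> None:
        StubScanner.constructions += 1
        self.profile = profile

    def scan(self, path: str) -> str:
        return f"{self.profile}:{path}"


@lru_cache(maxsize=None)
def get_scanner(profile: str) -> StubScanner:
    """Build one scanner per distinct profile in this process, then reuse it."""
    return StubScanner(profile)


results = [get_scanner("balanced").scan(p) for p in ["a.pdf", "b.docx", "c.html"]]
print(results)
print(StubScanner.constructions)  # setup ran once for three scans
```

The cache lives in the process that populated it, so in a multi-process worker pool each worker builds its own instance on first use.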

Limit BERT chunks

config = ScanConfig(
    enable_advanced_bert=True,
    bert_max_chunks=10,   # default is 20; lower is faster but can miss mid-document injections
)

Reduce archive depth

config = ScanConfig(
    enable_archive_scan=True,
    limits=Limits(max_archive_depth=1, max_archive_members=20),
)

Disable OCR

enable_ocr_injection_scan is False by default. Leave it off unless your pipeline processes documents from multimodal sources where injection text may be embedded in screenshots.

Use strict profile selectively

Apply the strict profile only to untrusted sources. Use balanced or lenient for documents from known-good internal pipelines where throughput is critical.

engine = PolicyEngine("policy.yaml")  # hr-intake → strict, internal → balanced
scanner = Scanner(config=ScanConfig(), policy_engine=engine)
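The routing idea behind the policy file can be illustrated in plain Python. This is only a sketch of the concept: the schema of policy.yaml is not shown here, and PROFILE_BY_SOURCE and profile_for are hypothetical names that do not belong to DocFirewall.

```python
# Hypothetical source-to-profile table mirroring the comment above
# (hr-intake -> strict, internal -> balanced). Not the policy.yaml schema.
PROFILE_BY_SOURCE = {
    "hr-intake": "strict",
    "internal": "balanced",
}


def profile_for(source: str, default: str = "strict") -> str:
    """Pick a scan profile for a source; unknown sources get the default."""
    return PROFILE_BY_SOURCE.get(source, default)


print(profile_for("hr-intake"))
print(profile_for("internal"))
print(profile_for("partner-upload"))  # not listed, so it falls back to strict
```

Defaulting unknown sources to strict treats anything unlisted as untrusted, trading throughput for safety when a new intake channel has not yet been classified.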

Running the Benchmark

# Default: 20 iterations × 6 formats × 2 profiles
python scripts/benchmark_throughput.py

# More iterations for stable P99 estimates
python scripts/benchmark_throughput.py --n 100 --output results.json

# Only benchmark the balanced profile (no ML model downloads needed)
python scripts/benchmark_throughput.py --profiles balanced

Sample output:

── Profile: balanced (n=20 iterations per format) ──
  ✅ html    P50=   3.2ms  P95=   8.1ms  P99=  11.4ms  287 docs/s
  ✅ rtf     P50=   6.8ms  P95=  17.3ms  P99=  24.6ms  135 docs/s
  ✅ xlsx    P50=  11.4ms  P95=  32.7ms  P99=  49.2ms   82 docs/s
  ✅ pptx    P50=  12.1ms  P95=  38.4ms  P99=  56.0ms   77 docs/s
  ✅ docx    P50=  14.3ms  P95=  47.2ms  P99=  71.8ms   65 docs/s
  ✅ pdf     P50=  18.9ms  P95=  62.5ms  P99= 104.3ms   49 docs/s
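To see why the default of 20 iterations gives unstable P99 figures, it helps to compute the percentiles by hand. The sketch below uses the nearest-rank method on synthetic data; the method the benchmark script actually uses is not shown here, so treat this as an assumption.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# 20 synthetic latency samples: 1.0 ms, 2.0 ms, ..., 20.0 ms
latencies_ms = [float(i) for i in range(1, 21)]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(p50, p95, p99)
```

With 20 samples the nearest-rank P99 is simply the slowest observation, so a single outlier sets the reported value. Running with --n 100 or more spreads the tail over more samples.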

Latency Breakdown (balanced profile, DOCX)

Stage	Typical time	Driver
Magic-byte detection	< 0.1 ms	File header read
Fast scan	2–8 ms	Zip member iteration, regex
Deep parse (Docling)	10–40 ms	XML parse + text extraction
Detectors (heuristics)	3–15 ms	ATS, obfuscation, embedding checks
YARA (built-in ruleset)	1–5 ms	Pre-compiled binary match
Audit log write	< 1 ms	Append-only JSONL
Total (balanced)	~20–70 ms
BERT inference (strict)	+40–100 ms	DeBERTa v3, 20 windows
Semantic NN (strict+NN)	+15–30 ms	MiniLM-L6, cosine similarity
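As a sanity check, the per-stage ranges can be summed and compared with the stated total. In this sketch, entries given as "< x ms" are treated as the range 0 to x, which is an assumption about how to read the table.

```python
# (low_ms, high_ms) per stage, copied from the balanced-profile DOCX table.
# "< x ms" entries are read as 0..x (an assumption).
STAGES = {
    "magic_byte_detection": (0.0, 0.1),
    "fast_scan": (2.0, 8.0),
    "deep_parse": (10.0, 40.0),
    "detectors_heuristics": (3.0, 15.0),
    "yara": (1.0, 5.0),
    "audit_log_write": (0.0, 1.0),
}

low_ms = sum(low for low, _ in STAGES.values())
high_ms = sum(high for _, high in STAGES.values())
print(f"{low_ms:.1f}-{high_ms:.1f} ms")
```

The upper bounds add up to about 69 ms, in line with the stated ~70 ms. The lower bounds add up to 16 ms, slightly under the stated ~20 ms, which suggests the total row is a rounded typical range, not a strict sum.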