Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[0.5.1] - 2026-07-11

Theme: harden the surfaces around detection. This release fixes the advertised-but-broken layers around the (solid) detection core (the REST API, sanitize(), archive/zip-bomb defense, and CLI output/exit behavior) so that what the docs promise matches what a caller actually gets.

Added

REST API microservice (doc_firewall.api:app). The documented FastAPI service now exists: POST /scan (multipart upload, with profile and enable_ml query params) and GET /health. It ships the advertised controls: X-API-Key auth against the api_keys_path SHA-256 key store, per-key sliding-window rate limiting (api_rate_limit_rpm), and a streamed upload cap (api_max_upload_bytes, enforced both by the Content-Length header and by counting bytes as they stream in). Server deps live in a new [api] extra (fastapi, uvicorn, python-multipart).
--fail-on {none,flag,block} CLI flag. doc-firewall scan can now gate CI / shell pipelines by exiting non-zero (code 2, distinct from the exit-1 used for usage/operational errors) when a scanned document's verdict meets the threshold. Default none preserves the historical always-0 behavior.
[test] install extra. Bundles pytest, hypothesis, pyyaml, pyahocorasick, striprtf, and html5lib, so that pip install -e ".[test]" && pytest passes from the documented command. pyyaml and pyahocorasick are also in [dev].
sanitize_verify_rescan config flag (default True) — re-scans the cleaned copy and enforces the residual-threat contract (see Fixed).
limits.max_archive_total_uncompressed_mb config (default 200) — total expansion budget across a nested-archive tree.
evidence_max_chars config (default 250) — one authoritative, configurable cap for evidence["malicious_text"], applied uniformly to every finding (see Fixed).
Optional keyed audit-log chain. Set DOC_FIREWALL_AUDIT_HMAC_KEY to make the audit chain an HMAC-SHA256 (tamper-resistant: unforgeable without the key) instead of the default unkeyed SHA-256 (tamper-evident). Each entry now also carries a monotonic seq counter, and verify-chain accepts --expected-count N to detect tail-truncation against an external anchor.
In-memory / stream scanning API. Scanner.scan_bytes(data, filename=...), Scanner.scan_stream(fileobj, filename=...), and the module-level scan_bytes(...) scan documents held in memory (RAG / web uploads) without the caller managing a temp file. The internal temp path is never exposed; content-hash caching still applies.
Effective configuration in the coverage report. report.coverage now includes profile and an effective_config summary (profile, fast_only, and the active ML layers), so a caller can confirm what actually ran.
Documented stable evidence schema. The evidence keys SIEM consumers can rely on (malicious_text, malicious_text_source, subtype, evidence_unavailable_reason, debug_steps, archive_member) are now documented as stable.
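
The keyed audit-log chain described above (HMAC-SHA256 when a key is set, plain SHA-256 otherwise, a monotonic seq counter, and an expected-count check against tail truncation) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual on-disk format: the entry layout and the names append_entry, verify_chain, and _digest are hypothetical.

```python
import hashlib
import hmac
import json
from typing import List, Optional


def _digest(payload: bytes, key: Optional[bytes]) -> str:
    """HMAC-SHA256 when a key is supplied, unkeyed SHA-256 otherwise."""
    if key:
        return hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hashlib.sha256(payload).hexdigest()


def _payload(entry: dict) -> bytes:
    # Hypothetical entry layout: each digest covers seq, the previous digest, and the event.
    body = {"seq": entry.get("seq"), "prev": entry.get("prev"), "event": entry.get("event")}
    return json.dumps(body, sort_keys=True).encode("utf-8")


def append_entry(chain: List[dict], event: dict, key: Optional[bytes] = None) -> dict:
    """Append an event, linking it to the previous entry's digest."""
    entry = {"seq": len(chain), "prev": chain[-1]["digest"] if chain else "", "event": event}
    entry["digest"] = _digest(_payload(entry), key)
    chain.append(entry)
    return entry


def verify_chain(chain: List[dict], key: Optional[bytes] = None,
                 expected_count: Optional[int] = None) -> bool:
    """Check links, seq order, and digests; optionally check the entry count."""
    prev = ""
    for index, entry in enumerate(chain):
        if entry.get("seq") != index or entry.get("prev") != prev:
            return False
        if not hmac.compare_digest(entry.get("digest", ""), _digest(_payload(entry), key)):
            return False
        prev = entry["digest"]
    # A truncated tail is still internally consistent, so only an external count catches it.
    return expected_count is None or len(chain) == expected_count
```

In this sketch an attacker without the key cannot recompute valid digests after editing an entry, while the unkeyed variant only makes edits detectable to someone holding a trusted copy of the last digest.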

Fixed

Recursive archive scanning was exponential and the depth limit never fired (DoS). Archive members were scanned through the public scan() — which reset archive depth to 0 and also recursed explicitly, so every nested archive was scanned twice per level (2^N growth, max_archive_depth bypassed). Members are now scanned as leaves with depth threaded through a single recursion path, de-duplicated by content SHA-256, and capped by the new total-expansion budget. Growth is now linear and the depth-limit finding fires; a 6-layer nest that previously timed out (>2 min) completes in ~0.01 s.
sanitize() could green-light an un-neutralized document. It reported sanitized=True for threats it can't strip (e.g. a visible-body prompt injection), returning a byte-identical "cleaned" copy that still failed the scan. Scanner.sanitize() now re-scans the output and returns sanitized=False (with a residual-threat reason) whenever the copy doesn't re-scan ALLOW, so the trojan→BLOCK / sanitized→ALLOW round-trip actually holds.
--json / --siem-format output was corrupted by logs on stdout. Logging now goes to stderr and is quiet (WARNING) by default — library-safe; verbosity is opt-in via DOC_FIREWALL_LOG_LEVEL. stdout is reserved for machine-readable output, so --json is valid JSON.
CLI scan always exited 0. See --fail-on above.
Invalid profile values were silently accepted and quietly weakened the scan. ScanConfig now raises ValueError for any profile outside {lenient, balanced, strict} instead of falling through to base defaults.
Module-level scan() was ~34× slower than a reused Scanner. It built a fresh Scanner (recompiling automata, loading the ML classifier) on every call; it now reuses a cached default Scanner for the default-config path.
Audit log stamped a hardcoded 0.4.0. library_version is now derived from installed package metadata.
Result-cache aliasing. Cache hits copied the report but shared the same findings list; the list is now copied so a caller can't corrupt the cache.
Inconsistent evidence truncation. Detectors capped malicious_text at 120/200/250/300 while the docs promised 250; a single configurable evidence_max_chars (default 250) is now applied uniformly.
Overstated "immutable" audit log. The trust model is now stated precisely (unkeyed = tamper-evident; keyed HMAC = tamper-resistant) and backed by a seq counter + optional keyed chain rather than marketing language.
REST API keys were hashed with unsalted SHA-256 (flagged by CodeQL as a weak algorithm for credential hashing). doc-firewall audit keygen now generates salted PBKDF2-HMAC-SHA256 hashes (600,000 iterations); legacy unsalted SHA-256 key-store entries are no longer accepted and must be rotated. The per-key rate-limit bucket id is now the key store's own id label for the matched key instead of any digest derived from the raw key.
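
As an illustration of the salted key hashing in the last entry above, here is a minimal sketch using only the standard library. The stored-string layout (scheme$iterations$salt$hash) and the function names hash_api_key and verify_api_key are assumptions made for this example; the real key-store format may differ.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 600_000  # iteration count quoted in the changelog entry


def hash_api_key(raw_key: str, iterations: int = ITERATIONS) -> str:
    """Derive a salted PBKDF2-HMAC-SHA256 hash and encode it as one string."""
    salt = secrets.token_bytes(16)  # fresh random salt per key
    derived = hashlib.pbkdf2_hmac("sha256", raw_key.encode("utf-8"), salt, iterations)
    return f"pbkdf2_sha256${iterations}${salt.hex()}${derived.hex()}"


def verify_api_key(raw_key: str, stored: str) -> bool:
    """Return True only for a matching salted entry; anything else is rejected."""
    try:
        scheme, iterations, salt_hex, hash_hex = stored.split("$")
        if scheme != "pbkdf2_sha256":
            return False
        expected = bytes.fromhex(hash_hex)
        derived = hashlib.pbkdf2_hmac(
            "sha256", raw_key.encode("utf-8"), bytes.fromhex(salt_hex), int(iterations)
        )
    except (ValueError, OverflowError):
        # Malformed entries, including bare legacy SHA-256 hex digests, end up here.
        return False
    return hmac.compare_digest(derived, expected)  # constant-time comparison
```

Because each entry has its own salt, hashing the same key twice yields different stored strings, and a bare unsalted SHA-256 digest never verifies.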

Removed

pip as a runtime dependency. Pinning pip as a package dependency is an install-wedging anti-pattern, so it has been dropped. The pytest version floor was also lowered to a version that has actually been released.

[0.5.0] - 2026-06-24

Theme: detect injection in any language, act on what you find, and catch what the patterns miss. Adds multilingual threat detection, transparent PDF decryption, document sanitization for safe RAG ingestion, and a default-on ML classifier — all with no extra setup.

Added

Non-English threat detection (default install, no ML). Always-on keyword layers now catch prompt injection in 15 languages (multilingual_injection) and RAG-poisoning + social-engineering lures (multilingual_threats, T11/T12) over body and metadata, plus a language-agnostic script-mixing detector for hidden non-dominant-script text. The report.coverage["languages"] axis reports exactly which languages and layers are active, so the scanner never claims coverage it lacks.
Bundled ML injection classifier (default-on, no download). A ~8.8 KB logistic-regression model over hashed char/word n-grams ships in the wheel and runs on numpy alone, generalising across languages to paraphrased/novel injections the keyword layers miss. REVIEW-class (can FLAG, never BLOCK alone); zero false positives on the benign corpus. Disable via enable_injection_classifier. The model is trained on synthetic data; retrain on a real corpus before relying on it as a primary control.
Sanitization output (Scanner.sanitize). Produces a cleaned copy safe for RAG ingestion: strips hidden text, dangerous metadata, macros, and active content while preserving visible content, with an auditable removed[] list. Non-destructive; per-format DOCX/PPTX/XLSX (stdlib), PDF (pikepdf-gated), CSV, HTML. Round-trip verified. Config enable_sanitization / sanitize_remove_categories; new docs page + examples/14_sanitize_for_rag.py.
Transparent PDF decryption (optional [crypto] extra). Encrypted PDFs are decrypted and scanned instead of flagged blind: the common empty-user-password case needs no password, and genuinely password-protected files are opened with ScanConfig.pdf_passwords. Graceful no-op without pikepdf. Flags enable_pdf_decryption / pdf_passwords.
Measured font/ToUnicode divergence (T3). Detects the "rendered ≠ extracted" PDF attack — glyphs render one string while /ToUnicode (what extraction and the LLM read) yields another. Compares per font and flags a confirmed mismatch HIGH with both strings as evidence; covers both the /Differences and the standard-base-encoding variants. Config enable_font_divergence.
Image-based-injection advisory (T3). A no-OCR heuristic flags image-heavy / low-extractable-text documents (a screenshot-of-text "résumé") for OCR review. Config enable_image_text_ratio.
Honest coverage + per-language benchmark. make benchmark reports per-language / per-surface recall over an in-tree 15-language corpus and fails the gate when default-install recall drops below 90%.
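
To illustrate the script-mixing idea from the non-English detection entry above, here is a minimal sketch that finds words containing letters outside a text's dominant script. It is not the project's detector: the names script_of and non_dominant_words are hypothetical, and because the standard library exposes no Unicode Script property, the first word of each character's Unicode name is used as a rough proxy for its script.

```python
import unicodedata
from collections import Counter
from typing import List, Tuple


def script_of(ch: str) -> str:
    """Approximate a letter's script by the first word of its Unicode name."""
    if not ch.isalpha():
        return ""  # digits, punctuation, and whitespace carry no script
    name = unicodedata.name(ch, "")
    return name.split(" ", 1)[0] if name else ""


def non_dominant_words(text: str) -> Tuple[str, List[str]]:
    """Return the dominant script and the words containing any other script."""
    counts = Counter(s for s in map(script_of, text) if s)
    if not counts:
        return "", []
    dominant = counts.most_common(1)[0][0]
    flagged = [
        word for word in text.split()
        if any(s and s != dominant for s in map(script_of, word))
    ]
    return dominant, flagged
```

The name-prefix proxy is deliberately crude: it treats Japanese kana and kanji as different scripts and mislabels fullwidth Latin letters, so a production detector would need proper script data and an allow-list for legitimately mixed writing systems.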

Changed

Hidden-surface + metadata extraction. PDF non-rendered text (annotation /Contents, form /V, outlines, compressed /ObjStm) and DOCX core/app/custom OOXML properties are now extracted and scanned by the injection layers.
Multilingual matcher robust to extraction noise. Separator canonicalisation + a despaced fallback defeat punctuation/whitespace spliced between words (Latin) or characters (CJK) by PDF/OCR extraction.
High-throughput fast_only mode. Skips the deep parse + detector loop (byte-level scan only); records metadata["fast_only"] so a shallow scan is never mistaken for a full one. Plus an opt-in content-hash result cache (enable_result_cache) and a calibrate-to-your-documents tool (scripts/calibrate_to_corpus.py).
Hardened parsers + red-team gate. A property-based suite fuzzes every new raw-bytes parser + the decryption path (~1500 inputs; no raise/hang/OOM), and make redteam asserts 100% malicious recall with zero benign T4 false positives across obfuscation/edit chains and their cross-products.
Plain-language evidence for every threat. Each finding now carries a clear, non-technical "what we found and why it matters" explanation for all 12 threat types — a per-threat fallback replaces raw detector jargon (e.g. "Score 7.0 >= 2.0"), with the original text preserved in technical_detail. Non-English evidence is made readable too: multilingual findings add evidence["plain_english"] (what the flagged foreign text actually says) and a language_name, so a reviewer who can't read the language still understands the threat.
Persistent Docling worker (much faster bulk PDF scanning). Docling conversion previously spawned a fresh subprocess per PDF, which re-imported docling+torch (~5 s) and rebuilt the converter every file — pure overhead that dominated bulk-scan time. A single long-lived worker per process now imports and builds the converter once and reuses it across all PDFs, removing ~5 s/file. Hang-isolation is preserved: a conversion that exceeds the per-file timeout, or a worker that crashes, tears the worker down and the next request transparently respawns a clean one.
Lower memory footprint under multi-process bulk scanning. The package now sets conservative thread/parallelism defaults at import (OMP_NUM_THREADS=1, TOKENIZERS_PARALLELISM=false, LOKY_MAX_CPU_COUNT=1, …, all via setdefault so callers can override), so each worker process no longer spawns an OpenMP/BLAS thread and a tokenizers fork-pool per CPU core. The Docling conversion subprocess's result Queue is now explicitly closed (close() + join_thread()), fixing a per-PDF semaphore/FD/feeder-thread leak (the "resource_tracker / semaphore might leak" warnings) on a long run.
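
The noise-robust matching described in the multilingual matcher entry above (separator canonicalisation plus a despaced fallback) can be sketched like this. It is an illustrative approximation only: the function names canonicalise, despace, and matches are hypothetical, and the real matcher may normalise text differently.

```python
import re
import unicodedata

# Any run of non-word characters or underscores counts as a separator.
_SEPARATORS = re.compile(r"[\W_]+")


def _normalise(text: str) -> str:
    return unicodedata.normalize("NFKC", text).casefold()


def canonicalise(text: str) -> str:
    """Collapse every separator run to a single space."""
    return _SEPARATORS.sub(" ", _normalise(text)).strip()


def despace(text: str) -> str:
    """Remove separators entirely, so spliced spaces or punctuation vanish."""
    return _SEPARATORS.sub("", _normalise(text))


def matches(text: str, phrase: str) -> bool:
    """Try the canonical form first, then fall back to the despaced form."""
    needle = despace(phrase)
    if not needle:
        return False  # an empty phrase would otherwise match everything
    if canonicalise(phrase) in canonicalise(text):
        return True
    return needle in despace(text)
```

The despaced fallback ignores word boundaries, so in this sketch it can match across adjacent words; short phrases would likely need a minimum-length guard to limit false positives.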

Security

Bumped torch to >=2.12.1 (memory corruption via torch.jit.script in torch <= 2.12.0). Raised the floor in pyproject.toml and properly regenerated the pinned hash set in requirements-docker.txt: the lockfile carries no Linux CUDA transitive deps (it is compiled off-Linux) and torch's platform-independent requirements are unchanged between 2.11.0 and 2.12.1, so the torch line + its 24 PyPI-2.12.1 distribution hashes were the only required change (torchvision 0.26.0 already matches). tests/fuzz-requirements.txt pins no torch and needed no change.
HTML sanitizer <script> removal hardened (CodeQL js/bad-tag-filter, HIGH). The block regex closed on </script\s*> only, so a script whose end tag carried trailing characters — <script>evil()</script foo>, </script\t\n bar> — was not stripped. The closing tag now matches `