Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
[0.4.8] - 2026-06-10
Added
- Evidence contract. Every HIGH/CRITICAL/BLOCK finding now carries either evidence["malicious_text"] (the actual offending content) or evidence["evidence_unavailable_reason"] + evidence["debug_steps"] (why it couldn't be extracted, and the commands to dig it out). Enforced and gated by a benchmark that fails the release below 100% compliance.
- Coverage transparency. Every report carries report.coverage showing which optional detectors (YARA/AV for T1; semantic-NN/BERT/OCR for T4) are actually active; a degraded scanner logs a loud warning. require_full_coverage / required_capabilities fail closed when a promised capability is missing.
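A minimal sketch of how the evidence contract above could be checked for a single finding. The helper name satisfies_evidence_contract and the plain-dict evidence shape are illustrative assumptions, not the project's API; only the three evidence keys come from this entry.

```python
# Hypothetical helper for illustration; not part of doc_firewall's API.
GATED_SEVERITIES = {"HIGH", "CRITICAL", "BLOCK"}


def satisfies_evidence_contract(severity: str, evidence: dict) -> bool:
    """Return True if a finding meets the evidence contract described above."""
    if severity not in GATED_SEVERITIES:
        return True  # the contract only applies to HIGH/CRITICAL/BLOCK findings
    if evidence.get("malicious_text"):
        return True  # the offending content itself is attached
    # Otherwise both the reason and the debug steps must be present.
    has_reason = bool(evidence.get("evidence_unavailable_reason"))
    has_steps = bool(evidence.get("debug_steps"))
    return has_reason and has_steps
```

A compliance benchmark in this spirit would run the check over every finding in a corpus and fail if any finding returns False.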
Changed
- PDF actions are resolved, not just counted. /OpenAction and /AA are followed through the object graph (including FlateDecode and /ObjStm compressed streams) and the target (JavaScript body, /Launch command, URI) is extracted into malicious_text. A benign "open at page N" action is INFO and no longer flags the document.
- Fewer false positives. T12 social-engineering no longer fires on executive/résumé language (tighter window, sentence-scoped, narrative-aware); file:// links baked in by Office→PDF export are INFO, while remote/executable file:// and UNC still BLOCK; "SQL Injection in Metadata" is now the more honest MEDIUM "SQL-like Syntax in Metadata".
- Incomplete and un-inspectable scans never pass silently. Stage timeouts (on_timeout_verdict) and encrypted/password-protected content (on_unscannable_verdict) escalate to FLAG by default and can fail closed (block).
Fixed
- Unicode-normalizer injection evasions (homoglyphs produced by NFKC; CR-separated obfuscation) — found by re-enabling the property-based tests, which previously failed to collect.
Security
- PYSEC-2026-196: pip raised to ≥ 26.1.2.
[0.4.7] - 2026-06-06
Added
- Resume-scanning example + recommended config. New examples/13_scan_resumes.py (a resume-focused companion to 12_scan_folder.py) scans a single resume or a folder recursively, loading its entire detection policy from YAML. It is restricted to resume formats (.pdf/.docx/.docm/.doc/.odt/.rtf), prints per-file verdicts plus a summary, supports an optional --json report, and returns a non-zero exit code on any BLOCK/ERROR for CI gating. The accompanying examples/resume.yaml holds recommended settings for screening attacker-controlled resumes: it turns up the content-manipulation detectors that matter (T9 ATS / hidden text, T3 obfuscation, T4 prompt injection, T5 ranking, T10 indirect injection, OCR injection, T2 active content) and disables the T8 PII detector (name/email/phone/address are expected in a resume and only generate noise), while keeping secret scanning on.
[0.4.6] - 2026-06-01
Fixed
- DOCX hidden-text findings now carry the actual hidden text. Fast-scan tiny_font / white_color / vanish / offpage techniques walk back to the enclosing <w:r> and emit the run's <w:t> content as evidence["hidden_text"] (and evidence["malicious_text"]). Previously the only value carried was the technique description ("font size 0.5pt") with no way to see what the invisible text actually said.
[0.4.5] - 2026-05-27
Changed
- Verdict model: class-based, not score-based. New VerdictClass (BLOCK / REVIEW / INFO) on every Finding. BLOCK now requires definitive evidence (YARA, EICAR, AV-infected, javascript: / data: / file: / vbscript: URIs, CSV DDE pipes, ODF macro://, RTF \javascript, embedded PE/ELF/Mach-O/ISO, JBIG2-oversized, XLM+veryHidden, etc.); heuristic findings cap at FLAG. risk_score is still computed for analytics but no longer gates the verdict.
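One way to picture a class-based verdict is sketched below. The VerdictClass member names come from the entry above; the aggregation rule (any BLOCK-class finding blocks, any REVIEW-class finding flags, otherwise allow) and the function document_verdict are assumptions made for illustration, not the project's confirmed logic.

```python
from enum import Enum


class VerdictClass(Enum):  # member names taken from the entry above
    BLOCK = "BLOCK"
    REVIEW = "REVIEW"
    INFO = "INFO"


def document_verdict(classes: list[VerdictClass]) -> str:
    """Assumed aggregation: the strongest finding class decides the verdict."""
    if VerdictClass.BLOCK in classes:
        return "BLOCK"  # definitive evidence is present
    if VerdictClass.REVIEW in classes:
        return "FLAG"  # heuristic findings cap at FLAG
    return "ALLOW"  # INFO-only findings, or no findings at all
```

Note that no numeric score appears anywhere in the decision, which is the point of the change: a pile of weak heuristic hits cannot add up to a BLOCK.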
Added
- Plain-language explanations. Finding.explain is rewritten to plain prose; the original technical text is preserved in the new Finding.technical_detail field. Driven by a central mapping in detectors/explanations.py covering the 15 most-common finding types. SIEM consumers should key on technical_detail (or title) instead of explain.
Fixed
- Real-world FP cluster (8 detector tightenings). Fast-scan /URI duplicate, T10 imperative-at-agent rule, T8 SQL-in-metadata binary-content guard, T7 JPEG/PNG file-type guard, fast-scan T4 keyword pruning (system prompt / reveal your etc.), T8 PII VIN/IBAN format validation, T5/T9 Docling artifact stripping (<!-- image -->), T12 "call us at" pruning. Drops verdict on legitimate resumes, IRS notices, and edited PDFs from BLOCK/FLAG to ALLOW/FLAG.
Documentation
- Rewrote concepts/risk-scoring.md for the class-based model; reframed risk_model.md as analytics bands; added new concepts/policies.md (four bundled policies + schema reference); updated quickstart.md Finding-fields table; updated examples/doc_firewall_config.yaml for 5-minute timeouts and docling_device.
[0.4.4] - 2026-05-25
Fixed
- Resume / real-world FP cluster: /URI and TargetMode="External" no longer flag plain http(s) / mailto / tel hyperlinks (only javascript: / data: / file: / vbscript: / jar: / IP-literal targets fire T2).
- PDF structural tokens (endobj, endstream, xref, …) added to _STOP_WORDS so they no longer count as keyword stuffing; repeated_seq now rejects pure-numeric and single-char runs (PDF coordinate matrices like 0 0 0 0 …) and emits richer evidence (repeated_token, repeat_count, context).
- "act as a" matched the partial word imp[act as a]: Aho-Corasick hits now respect word boundaries when the phrase itself starts/ends with a word char; structural markers (<tool_call>, [inst], {{system}}) still match as substrings.
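The word-boundary rule for phrase hits can be sketched as follows. This uses Python's re module as a stand-in for the Aho-Corasick automaton, and phrase_pattern is a hypothetical helper name; the behaviour it illustrates (boundaries enforced only on the side where the phrase starts or ends with a word character) is the one described above.

```python
import re


def phrase_pattern(phrase: str) -> re.Pattern:
    """Build a matcher that enforces word boundaries only where the phrase
    itself starts or ends with a word character."""
    left = r"(?<!\w)" if re.match(r"\w", phrase) else ""
    right = r"(?!\w)" if re.search(r"\w$", phrase) else ""
    return re.compile(left + re.escape(phrase) + right, re.IGNORECASE)


act_as = phrase_pattern("act as a")        # word chars on both ends: bounded
tool_call = phrase_pattern("<tool_call>")  # structural marker: substring match

# "impact as a whole" contains "act as a" only inside "impact", so no hit.
benign_hit = act_as.search("the impact as a whole")
real_hit = act_as.search("now act as a pirate")
marker_hit = tool_call.search("xx<tool_call>yy")
```

With an automaton-based matcher the same effect is usually achieved by checking the characters on either side of each reported hit before accepting it.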
Changed
- All per-stage scan timeouts raised to 5 minutes (docling_subprocess_timeout_s = 270 s) to absorb large benign documents under the strict profile.
- Docling device is platform-aware by default. New limits.docling_device config field (env: DOC_FIREWALL_LIMITS_DOCLING_DEVICE) accepts cpu | auto | cuda | cuda:N | mps | xpu. Default is cpu on macOS (Docling's auto-detection would pick MPS, whose float64 limitation crashes the layout model with "Cannot convert a MPS Tensor to float64 dtype") and auto everywhere else so Linux/Windows CUDA/XPU boxes get GPU acceleration automatically. Override per process with the env var or via ScanConfig(limits={"docling_device": "..."}).
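The platform-aware default can be sketched roughly as below. The environment variable name and the accepted values come from the entry above; the function resolve_docling_device, its parameters, and the validation regex are illustrative assumptions rather than the project's actual code.

```python
import os
import re
import sys

ENV_VAR = "DOC_FIREWALL_LIMITS_DOCLING_DEVICE"  # name taken from the entry above
VALID_DEVICE = re.compile(r"cpu|auto|cuda(?::\d+)?|mps|xpu")


def resolve_docling_device(platform: str = sys.platform, env=os.environ) -> str:
    """Pick a device: explicit override first, else a platform-aware default."""
    override = env.get(ENV_VAR)
    if override:
        if not VALID_DEVICE.fullmatch(override):
            raise ValueError(f"unsupported docling device: {override!r}")
        return override
    # macOS defaults to CPU to sidestep the MPS float64 crash; others use auto.
    return "cpu" if platform == "darwin" else "auto"
```

Passing the platform and environment as parameters keeps the sketch testable without touching the real process environment.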
[0.4.3] - 2026-05-23
Fixed
- PDF text false-negatives (~40 documents): when Docling returned truncated/partial text for a PDF, the regex-fallback extraction was discarded. The PDF parser now unions the fallback text with the Docling output (preferring the longer / non-empty result), so injection and embedded-payload content past Docling's truncation point is no longer missed.
- T7 base64-embedded payloads silently undetected: embedded_payload.py was missing import base64; every base64.b64decode call raised NameError that a bare except swallowed, making the entire decode-and-flag path dead code. Import restored.
- T9 / T3 homoglyph detection silently disabled: ats_manipulation.py raised UnboundLocalError: counter in the homoglyph branch (counter / total referenced before assignment). Hoisted above the guarding block.
- First scan bypassed all deep-scan detectors: one-time cold-start model/automaton initialization pushed the first document past the 5 s detector-stage budget, so it returned with detectors_timed_out and zero deep findings. The detector-stage timeout default is raised to absorb warm-up (see Changed).
Changed
- Short base64 segments now decoded before T4 / T3 matching: advanced_prompt_injection.py decodes embedded base64 tokens and appends the plaintext to the normalized text before matching, closing a standard-mode T3 obfuscation gap (previously only the ML / defense-in-depth path caught it). Reuses the existing tuned matchers, so no new false-positive heuristic is introduced.
- limits.detectors_timeout_ms default raised 5000 → 30000 ms.
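A minimal sketch of the decode-and-append idea, assuming a 16-character minimum token length (the project's real cut-off is not stated) and a hypothetical function name. It also catches only the specific decode errors, which avoids the bare-except pitfall that hid the missing import noted under Fixed.

```python
import base64
import binascii
import re

# Assumed minimum length; the actual threshold is not given in the changelog.
B64_TOKEN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def append_decoded_base64(text: str) -> str:
    """Append the plaintext of decodable base64 tokens to the text so the
    existing matchers see it alongside the original content."""
    decoded_parts = []
    for token in B64_TOKEN.findall(text):
        try:
            raw = base64.b64decode(token, validate=True)
            decoded_parts.append(raw.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64, or not text: skip this token only
    if not decoded_parts:
        return text
    return text + "\n" + "\n".join(decoded_parts)
```

Because the decoded text is only appended, the downstream matchers run unchanged on a slightly longer input.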
Documentation
- Corrected all bundled examples/ scripts (invalid Finding.rule_id) and rebuilt the examples index for T1–T12.
- Corrected the published docs: invalid profile="fast", JSON "file" → "file_path", non-existent T7_SENSITIVE_PII policy weight → T8_METADATA_INJECTION, wrong custom-phrase YAML key (phrases: → custom_phrases:), default flag threshold (0.35 → 0.25), black / mypy → ruff, stale CLI output sample, and YARA rule count (30+ → 53).
[0.4.2] - 2026-05-17
Fixed
- T6 false-positive on slow benign documents: detector-stage timeout no longer emits a T6_DOS finding; records report.metadata["detectors_timed_out"] and logs a warning instead. Real DoS is still caught by fast-scan / parse-stage T6 paths.
- Docling subprocess spawned unnecessarily for non-PDF formats: convert_with_docling now skips the subprocess for non-.pdf sources; DOCX is handled by the fallback parser and was never a valid Docling input.
[0.4.1] - 2026-05-16
Added
- 3 new formats (9 total): legacy OLE .doc / .xls / .ppt (VBA-stomping / vbaProject.bin), CSV/TSV (formula injection, DDE), OpenDocument .odt / .ods / .odp (macro:// CVE-2023-2255).
- PDF /JBIG2Decode (CVE-2021-30860), /RichMedia, /3D, /GoToE; Excel veryHidden + inline XLM; HTML SVG/MathML/CSS-javascript: / atob+Blob smuggling; Mach-O/WASM/ISO/RAR/7z embedded-binary signatures; PDF annotation subtypes + AcroForm /V and /DV field defaults; embedded media metadata (ID3/MP4/RIFF).
- Evasion resistance: math-script + reversed-text matching, expanded Unicode confusables, separator normalization, edit-distance-1 fuzzy matching, multilingual phrase set expanded to 22 languages.
- Broader indirect-injection URI vocabulary (data: / smb: / UNC / raw-GitHub fire HIGH); RAG chunk-boundary split detection; crypto / gift-card / tech-support social-engineering patterns; opt-in QR-code decoding (quishing) + PDF/ODF image OCR.
- Page-tree & slide-master cycle DoS detection; PDF /ActualText overlay density; per-section ATS keyword check; risk-model calibration script.
- Detector regex/automaton now pre-compiled at Scanner construction (first scan no longer slower than steady-state); 220-document benign corpus with SHA-256 manifest and CI false-positive gate (≤1% balanced, ≤3% strict). Test suite 192 → 301.
Changed
- PII detector now wired into the Scanner (was defined but unused); threat ID corrected T2 → T8; HIPAA Safe-Harbor identifier subset + XMP metadata scanning added.
- Precision hardening (benign-corpus FP rate 78.6% → 0.00%): perplexity-based GCG-suffix detection is now opt-in / default off (character statistics cannot separate adversarial suffixes from dense legal formatting); fuzzy matching restricted to longer multi-word phrases; social-engineering urgency+authority pair now also requires an action demand.
- YARA ruleset 38 → 53 rules with meta.cve / meta.mitre.
Fixed
- Built-in YARA ruleset was uncompilable on yara-python ≥ 4.5 ((?:…), /m, ($a or $b) in (range)), silently disabling YARA. Rewritten to valid syntax.
[0.4.0] - 2026-05-10
Added
- New format support: RTF (OLE objects, \bin streams, \fldinstr macros, \v hidden text) and HTML (<script>, inline event handlers, CSS hidden text) added alongside existing PDF/DOCX/PPTX/XLSX. Macro-enabled Office templates (.dotm, .xltm, .potm, .xlsm, .pptm) now accepted and flagged T2 by default.
- T10/T11/T12: new threat codes completing T1–T12 coverage. Indirect/Multi-Hop Injection (T10): URL + fetch-instruction co-occurrence + tool-call schema detection. RAG/Knowledge-Base Poisoning (T11): authority-assertion patterns, sentence-duplication flooding, false citation detection. Social Engineering (T12): tri-signal urgency/authority/action-demand co-occurrence with HIGH overrides for credential harvesting and fake legal threats.
- Detection hardening. Closed 13 concrete bypass vectors: mid-document T4 blind spot (full-doc overlapping windows), zero-width character T4 suppression bypass, FlateDecode-compressed active content evasion, hex-encoded/split PDF token evasion, compressed ToUnicode CMap obfuscation, XObject cycle + XML entity depth DoS (T6), CMYK white text, RTF \v hidden text, PDF clipping-path hidden text, homoglyph ATS stuffing, and base64 entropy / multi-level decode hardening.
- ML pipeline improvements. Four-layer T4 pipeline (normalization → Aho-Corasick → fuzzy regex → BERT sliding window). Multilingual phrase set expanded to 145+ (13 languages). BERT recall improved to ≥ 90% (removed early-exit gate; threshold 0.99999 → 0.85). Semantic NN paraphrase-stuffing detection (cosine clustering). 38+ built-in YARA rules covering malware families, CVEs, polyglots, and prompt-injection indicators.
- Policy engine. Named YAML policies with applies_to glob matching, per-policy deny_list / allow_list (SHA-256), custom_threat_weights, required_detectors, and profile overrides. Hot-reload via engine.reload(). CLI --policy-file / --policy-name flags added.
- Resilience and security. Tamper-evident append-only JSONL audit log (SHA-256 hash chain). REST API key auth with per-key rate limiting. Recursive archive scanning (ZIP/tar, depth 3). Password-protected document detection (T1 MEDIUM early return). Docling subprocess isolation with hard-kill timeout (bomb PDF DoS protection). Model integrity SHA-256 manifest. Docker seccomp/cap_drop hardening.
- False positive hardening. 113-document benign corpus (pytest -m benign); stop-word filter + minimum absolute-count gates on T5/T9 detectors eliminate FPs on resumes, SEO documents, and academic papers discussing ATS/ranking vocabulary.
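To illustrate what a tamper-evident JSONL hash chain looks like in general, here is a small sketch. The record layout (prev, event, hash), the all-zero genesis value, and both function names are assumptions for illustration; the project's actual log format is not described in this changelog.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first record


def _digest(prev: str, event: dict) -> str:
    body = json.dumps({"prev": prev, "event": event}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_record(log: list[str], event: dict) -> None:
    """Append a JSONL line whose hash covers the event and the previous hash."""
    prev = json.loads(log[-1])["hash"] if log else GENESIS
    record = {"prev": prev, "event": event, "hash": _digest(prev, event)}
    log.append(json.dumps(record, sort_keys=True))


def verify_chain(log: list[str]) -> bool:
    """Recompute every hash; any edited, removed, or reordered line breaks it."""
    prev = GENESIS
    for line in log:
        rec = json.loads(line)
        if rec["prev"] != prev or _digest(rec["prev"], rec["event"]) != rec["hash"]:
            return False
        prev = rec["hash"]
    return True
```

Because each record commits to the hash of the one before it, changing an old entry invalidates every later entry unless the whole tail is rewritten.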
[0.3.10] - 2026-05-09
Security
- Bumped python-multipart 0.0.26 → 0.0.27 (DoS, GitHub Advisory #22), lxml 6.0.2 → 6.1.0 (CVE-2026-41066), pygments 2.19.2 → 2.20.0 (CVE-2026-4539), python-dotenv 1.2.1 → 1.2.2 (CVE-2026-28684), pytest floor → ≥ 9.0.3 (CVE-2025-71176).
Added
- Four-layer prompt-injection pipeline (T4): normalization (homoglyph/BIDI stripping) → Aho-Corasick → regex fuzzy matching → sliding-window BERT (ProtectAI/deberta-v3-base-prompt-injection-v2, threshold 0.85) → optional semantic NN (enable_semantic_nn). Replaces the previous single-pass exact matcher.
- Adversarial benchmark suite: scripts/benchmark_prompt_injection.py (36 OWASP LLM01 probes, CI gate), scripts/fetch_adversarial_dataset.py, scripts/calibrate_thresholds.py (AUC = 1.0 on 1 185 records).
- 40-test adversarial suite (tests/test_adversarial.py) covering all threat categories, homoglyph/BIDI mutation bypasses, and benign-resume FP regressions.
Fixed
- NameError crash in embedded_payload.py: content variable undefined in suspicious-script evidence dict; renamed to text.
- Attacker-exploitable bypass in advanced_prompt_injection.py: hardcoded early-exit on "override all evaluations" + "score: 10" allowed suppression of the entire detector; removed.
- Obfuscation silently suppressed injection detection: detector returned immediately on any zero-width/BIDI content; now normalizes and continues scanning.
- BERT threshold was dead code: hardcoded at 0.99999; lowered to 0.85 and exposed as ScanConfig.bert_confidence_threshold.
- BERT only scanned first 2 000 chars: replaced with full-document sliding-window chunking (bert_max_chunks, default 20).
- ATS keyword list false positives: removed 20 common resume-skill words (python, java, sql, etc.) from the default list; retained only injection-style command tokens.
- Risk scores inflated: Finding.confidence default changed from 1.0 → 0.5; duplicate findings per threat_id now take max confidence instead of stacking multiplicatively.
- Docling OCR warning on every Docker scan: format_options dict was keyed by class object instead of InputFormat.PDF enum, silently ignoring do_ocr=False.
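The sliding-window chunking mentioned above can be sketched as follows. The cap of 20 chunks mirrors the bert_max_chunks default from this entry; the chunk size, the overlap, and the function name are assumed values for illustration, and the sketch works on characters rather than model tokens.

```python
def sliding_chunks(
    text: str, size: int = 2000, overlap: int = 200, max_chunks: int = 20
) -> list[str]:
    """Split text into overlapping windows so content past the first window
    is still classified. size and overlap are illustrative assumptions."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        # Stop at the chunk cap, or once this window reaches the end of text.
        if len(chunks) == max_chunks or start + size >= len(text):
            break
    return chunks
```

The overlap keeps a phrase that straddles a window edge visible in at least one chunk; the chunk cap bounds inference cost, at the price of not covering the tail of very long documents.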
Changed
- Hidden-text detection expanded across all four formats: DOCX (near-white color, tiny font, vanish, off-page), XLSX (near-white fill, ;;; format, hidden rows/cols), PPTX (near-white color, tiny font, hidden shapes, off-slide EMU), PDF (1.0 1.0 1.0 rg, 3 Tr invisible mode, sub-1pt Tf).
- FLAG/BLOCK thresholds (0.35/0.70) confirmed empirically via ROC sweep; documented in docs/risk_model.md.
- Pydantic V2 migration: all Settings classes use model_config = SettingsConfigDict(...).
- Benchmark (real-world, 500 probes): L1+L2 recall 49%, precision 100%; +BERT recall 62.5%, precision 99.1%, 51 ms avg.
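For the PDF case, the three content-stream signals named above (white fill via rg, text rendering mode 3 via Tr, and a sub-1pt font size via Tf) can be spotted with simple patterns. This is a rough sketch that assumes the content stream has already been decompressed; the pattern set and the function name are illustrative, not the project's detector.

```python
import re

# Illustrative patterns over an already-decoded PDF content stream.
HIDDEN_TEXT_OPERATORS = {
    # "1 1 1 rg" or "1.0 1.0 1.0 rg": non-stroking fill set to white
    "white_fill": re.compile(
        rb"(?<![\d.])1(?:\.0+)?\s+1(?:\.0+)?\s+1(?:\.0+)?\s+rg\b"
    ),
    # "3 Tr": text rendering mode 3 (neither fill nor stroke, i.e. invisible)
    "invisible_render_mode": re.compile(rb"(?<![\d.])3\s+Tr\b"),
    # "/F1 0.5 Tf": font selected at a size below 1pt
    "sub_1pt_font": re.compile(rb"/\S+\s+0?\.\d+\s+Tf\b"),
}


def hidden_text_hits(stream: bytes) -> list[str]:
    """Return the names of the hidden-text signals present in the stream."""
    return [
        name for name, pattern in HIDDEN_TEXT_OPERATORS.items()
        if pattern.search(stream)
    ]
```

White fill on its own is only suspicious against a white background, so a real detector would treat these hits as signals to combine rather than as proof.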
[0.3.8] - 2026-05-02
Fixed
- T1 EICAR detection (PDF/XLSX): YaraDetector EICAR signature check was gated behind enable_yara=False (the default), causing 100% miss rate on T1 malware in PDF and XLSX. Moved the EICAR check before the enable_yara guard so it always runs.
- T3/T5/T6/T9 detection in PDF: Added detection for PDF white-on-white stealth text (1 1 1 rg color operator) in fast_scan_pdf. Attackers rendered adversarial content in white text on a white background, invisible to humans but extractable by parsers.
- T7 detection in XLSX (hex-encoded payloads): EmbeddedPayloadDetector only matched large base64 blobs; added pattern matching for hex-encoded binary file magic numbers (4D5A PE, 7F454C46 ELF) in document text and metadata.
- T9 detection in XLSX/PDF (ATS manipulation): ATSManipulationDetector now includes keywords and description metadata fields in the token frequency analysis; attacks routed through metadata were not counted. Lowered minimum token threshold from 50 → 25 to handle short documents with clear keyword stuffing.
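The hex-magic check can be pictured with the short sketch below. The two magic numbers come from the entry above; requiring at least eight further hex digits after the magic is an assumption added here to avoid matching a bare "4D5A" in ordinary text, and find_hex_magic is a hypothetical name.

```python
import re

# Magic numbers from the entry above; the trailing-length requirement is an
# assumption added to reduce accidental matches on short hex fragments.
HEX_MAGIC = {
    "PE": re.compile(r"(?<![0-9A-Fa-f])4D5A[0-9A-Fa-f]{8,}", re.IGNORECASE),
    "ELF": re.compile(r"(?<![0-9A-Fa-f])7F454C46[0-9A-Fa-f]{8,}", re.IGNORECASE),
}


def find_hex_magic(text: str) -> list[str]:
    """Return the binary formats whose hex-encoded magic starts a hex run."""
    return [name for name, pattern in HEX_MAGIC.items() if pattern.search(text)]
```

The lookbehind anchors the magic to the start of a hex run, so the same digits appearing in the middle of a longer hex string are not reported.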
[0.3.0] - 2026-03-28
Added
- Advanced Local ML Scanners: Introduced powerful offline Machine Learning / NLP modules.
- Aho-Corasick Algorithm: Implemented finite-state automaton for O(n) exact string matching on known T4_PROMPT_INJECTION payloads.
- Local BERT Pipeline: Embedded zero-day deep learning text-classification (huggingface, sentence-transformers) for detecting polymorphic prompt and ATS manipulations.
- TF-IDF & Jaccard Similarity: Leveraged scikit-learn to identify keyword stuffing and statistical term deviations (T5_RANKING_MANIPULATION and T9_ATS_MANIPULATION).
- Shannon Entropy Scoring: Integrated structured mathematical calculations to detect hardcoded API Keys, Passwords, and Data Exfiltration streams.
- Dynamic Feature Flags: Added granular explicit opt-ins via ScanConfig (enable_advanced_ahocorasick, enable_advanced_bert, etc.) safely defaulting to False for backwards compatibility.
- Examples: Included isolated feature scripts (08_advanced_ml_scanners.py) and fully stacked maximum security scripts (09_recommended_advanced_scan.py).
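Shannon entropy itself is a standard formula, H = -Σ p(c) · log2 p(c) over the character frequencies of a string. The sketch below computes it per character; how the project thresholds or windows the score is not stated here, so none of that is shown.

```python
import math
from collections import Counter


def shannon_entropy(s: str) -> float:
    """Shannon entropy of a string in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    counts = Counter(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

Random-looking secrets such as API keys tend to score noticeably higher than natural-language text of the same length, which is what makes the measure useful as a screening signal.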
Changed
- Shifted project distribution state to Development Status :: 5 - Production/Stable.
- Fixed several legacy test expectations that failed under optimized false-positive bounds tuning.
- Resolved top-level GitHub Actions scorecard vulnerability by adopting strict job-level contents permissions on PyPI build matrix.
- Atheris pipeline dependencies synchronized/bumped to 3.0.0.
[0.2.0] - 2026-03-08
Added
- PPTX Support: Full layout mapping, recursive embedded object tracking, and metadata extraction for Microsoft PowerPoint presentations.
- XLSX Support: Full spreadsheet parsing, cell value extraction, and DDE link (Active Content) detection for Microsoft Excel files.
- T2 (Active Content): Refined scanning capabilities to natively track dynamic external payload queries in .pptx and .xlsx.
- T3 (Obfuscation): Added dynamic ratio thresholding for hidden zero-width unicode characters specific to nested cells/slides.
- T8 (Metadata Injection): Added deep inspection support to flag embedded SQL queries and malicious command strings hidden in format properties.
- Overlapping Threat Architecture: Allowed internal detection schemas to transparently track dual-state threat classifications (e.g., T9_ATS_MANIPULATION when utilizing T3_OBFUSCATION).
Changed
- Refactored Scanner() initialization to consistently load the complete suite of detector arrays globally (resolving missing isolated threat models).
- Enhanced exact threshold scaling across text_obfuscation.py to heavily reduce False Negatives.
[0.1.0] - 2026-02-22
Added
- Initial Open Source release of the doc_firewall scanning engine.
- Supported core structures: Microsoft Word (.docx) and Adobe Standard (.pdf).
- Configured 9 Primary Threat Models (T1 through T9).
- Incorporated ClamAV integration functionality.
- Shipped MkDocs documentation bindings.