Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[0.3.0] - 2026-03-28

Advanced Local ML Scanners: Introduced offline machine learning and NLP scanner modules.
Aho-Corasick Algorithm: Implemented a finite-state automaton for linear-time, O(n), exact string matching against known T4_PROMPT_INJECTION payloads.
Local BERT Pipeline: Embedded a local deep-learning text classifier (Hugging Face, sentence-transformers) for detecting zero-day, polymorphic prompt and ATS manipulation attempts.
TF-IDF & Jaccard Similarity: Used scikit-learn to identify keyword stuffing and statistical term deviations (T5_RANKING_MANIPULATION and T9_ATS_MANIPULATION).
Shannon Entropy Scoring: Added Shannon entropy calculations to detect hardcoded API keys, passwords, and data exfiltration streams.
Dynamic Feature Flags: Added granular, explicit opt-in flags on ScanConfig (enable_advanced_ahocorasick, enable_advanced_bert, etc.), all defaulting to False for backwards compatibility.
Examples: Added a script demonstrating each feature in isolation (08_advanced_ml_scanners.py) and a script with all advanced scanners enabled for maximum security (09_recommended_advanced_scan.py).
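
For readers unfamiliar with the Shannon entropy scoring listed above, the following is a minimal, self-contained sketch of the general technique. It is not the project's implementation: the function names, the minimum token length, and the 4.0 bits-per-character threshold are assumptions chosen only for illustration.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of `text` in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # H = -sum(p * log2(p)) over the observed character frequencies.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_like_secret(token: str, min_length: int = 20, threshold: float = 4.0) -> bool:
    """Flag long tokens whose character distribution is close to random.

    `min_length` and `threshold` are hypothetical values, not taken from
    doc_firewall; real scanners tune them against false positives.
    """
    return len(token) >= min_length and shannon_entropy(token) >= threshold


if __name__ == "__main__":
    for candidate in ("password", "a" * 24, "aB3dE6gH9jK2mN5pQ8sT1vW4"):
        print(candidate, round(shannon_entropy(candidate), 3), looks_like_secret(candidate))
```

Natural-language text tends to score lower than randomly generated keys of similar length, which is why an entropy cutoff combined with a length floor is a common first filter for embedded credentials.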

Changed the project's distribution classifier to Development Status :: 5 - Production/Stable.
Fixed several legacy test expectations that failed after the false-positive thresholds were tuned.
Resolved a top-level GitHub Actions Scorecard vulnerability by adopting strict job-level contents permissions on the PyPI build matrix.
Synchronized the Atheris pipeline dependencies and bumped them to 3.0.0.

PPTX Support: Full layout mapping, recursive embedded object tracking, and metadata extraction for Microsoft PowerPoint presentations.
XLSX Support: Full spreadsheet parsing, cell value extraction, and DDE link (Active Content) detection for Microsoft Excel files.
T2 (Active Content): Extended scanning to natively track dynamic external payload queries in .pptx and .xlsx files.
T3 (Obfuscation): Added dynamic ratio thresholding for hidden zero-width Unicode characters in nested cells and slides.
T8 (Metadata Injection): Added deep inspection to flag embedded SQL queries and malicious command strings hidden in format properties.
Overlapping Threat Architecture: Internal detection schemas can now transparently track dual threat classifications (e.g., T9_ATS_MANIPULATION carried out via T3_OBFUSCATION).
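
The T3 entry above refers to ratio thresholding for zero-width characters. As a rough illustration of the idea, the sketch below computes the share of zero-width code points in a cell or slide string and compares it to a cutoff. The character set, the function names, and the 0.05 default are assumptions for this example; the changelog does not state how the project selects or adjusts its thresholds.

```python
# Common zero-width code points: ZERO WIDTH SPACE, NON-JOINER, JOINER,
# WORD JOINER, and ZERO WIDTH NO-BREAK SPACE (BOM).
ZERO_WIDTH = frozenset("\u200b\u200c\u200d\u2060\ufeff")


def zero_width_ratio(text: str) -> float:
    """Return the fraction of characters in `text` that are zero-width."""
    if not text:
        return 0.0
    hidden = sum(1 for ch in text if ch in ZERO_WIDTH)
    return hidden / len(text)


def is_obfuscated(text: str, threshold: float = 0.05) -> bool:
    """Flag text whose zero-width ratio exceeds `threshold`.

    The default threshold is a hypothetical value for illustration only.
    """
    return zero_width_ratio(text) > threshold


if __name__ == "__main__":
    samples = ["hello world", "ig\u200bnore\u200b pre\u200bvious"]
    for sample in samples:
        print(repr(sample), round(zero_width_ratio(sample), 3), is_obfuscated(sample))
```

Using a ratio rather than an absolute count keeps a single stray joiner in a long cell from triggering a finding, while still flagging short strings that are densely interleaved with hidden characters.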

Refactored Scanner() initialization to consistently load the complete suite of detectors globally, resolving cases where individual threat models were missing.
Tuned threshold scaling across text_obfuscation.py to substantially reduce false negatives.

Initial open-source release of the doc_firewall scanning engine.
Supported core formats: Microsoft Word (.docx) and Adobe PDF (.pdf).
Defined nine primary threat models (T1 through T9).
Added ClamAV integration.
Shipped MkDocs documentation.