Architecture
DocFirewall uses a pipeline architecture to process documents efficiently.
graph TD
A["Input File"] --> B["Pre-Flight Checks"]
B --> C{"Fast Scan"}
C -->|"Critical Threat Found"| D["Block"]
C -->|"Safe"| E["Deep Scan"]
E --> F["Parsing (Docling)"]
F --> G["Detector Pipeline"]
subgraph Detectors ["Detectors"]
H["T2 Active Content"]
I["T4 Prompt Injection"]
J["T8 Metadata"]
end
G --> H
G --> I
G --> J
H --> K["Risk Scoring"]
I --> K
J --> K
K --> L["Final Verdict"] 1. Input Interface
Documents enter via Python function calls (scan()), CLI, or REST API wrappers.
2. Pre-Flight
- Structure Check: Verify PDF/DOCX/PPTX/XLSX magic bytes.
- Size Check: Enforce
max_mblimits. - Hashing: Compute SHA256 for caching/logging.
3. Fast Scan (Byte-Level)
Scans the raw binary stream without parsing the document structure. - Speed: < 20ms. - Goal: Reject obvious malware, zip bombs, or known signatures immediately.
4. Deep Scan (Parsed)
If the file passes Fast Scan, it is parsed into a standardized logical representation (text blocks, key-value metadata). - Parsers: docling (default), pypdf, python-docx. - OCR: Optionally enabled for scanned PDFs using RapidOCR.
5. Offline Intelligence (Zero-API Execution)
All processing—including advanced NLP chunking, deep learning BERT sequence classification (Zero-Day Prompt Injection), Aho-Corasick automaton generation, and mathematical metrics (Shannon Entropy) runs strictly locally on CPU/GPU architecture. There are zero external API calls, protecting PII immediately by default and maintaining strict data residency compliance without relying on third-party LLMs like OpenAI or Anthropic.