DocFirewall
DocFirewall is a zero-trust compliance layer designed to protect Large Language Model (LLM) pipelines, Retrieval-Augmented Generation (RAG) capabilities, and AI Agents from malicious payloads.
Whether you are using LangChain, LlamaIndex, Haystack, or custom agentic workflows, DocFirewall performs strict static analysis and heuristic scanning on PDF, DOCX, PPTX, and XLSX files to neutralize threats—such as Prompt Injection, Data Exfiltration, and Zip Bombs—before they reach your document parsers, vector databases, or inference engines.
Key Capabilities
Multi-Layered Defense
DocFirewall implements a defense-in-depth strategy covering 9 distinct threat vectors, including Prompt Injection, Malware, and Resource Exhaustion.
- Malware & Virus (T1): Integration with ClamAV, VirusTotal, and YARA for signature-based detection.
- Active Content (T2): Detects executable JavaScript, macros (VBA), OLE objects, and PDF actions.
- Obfuscation (T3): Identifies homoglyphs, invisible text, and encryption used to bypass filters.
- Prompt Injection (T4): Flags hidden instructions targeting LLM behavior (e.g., "Ignore previous instructions").
- Ranking Manipulation (T5): Detects keyword stuffing and statistical anomalies intended to artificially boost ranking.
- DoS Attacks (T6): Prevents resource exhaustion via Zip bombs (expansion ratios), excessive page counts, and recursion.
- Parser Security & Anti-Overfitting: Native protections against XXE / SSRF (defusedxml) and fully dynamic constraint injection via Limits, so rules adapt to your specific data and thwart statically overfitted evasion.
- Embedded Payloads (T7): Scans for embedded binaries (PE, ELF) and malicious object streams.
- Metadata Injection (T8): Sanitizes metadata fields against buffer overflows and syntax injection.
- ATS Manipulation (T9): Detects SEO poisoning and white-on-white text used to game ranking algorithms.
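As an illustration of the expansion-ratio idea behind T6, here is a minimal sketch using only the Python standard library. It is not DocFirewall's implementation: the function name `check_expansion` and both thresholds are hypothetical, chosen only to show the shape of the check. DOCX, PPTX, and XLSX files are ZIP containers, so the declared compressed and uncompressed sizes of each member can be compared without decompressing anything.

```python
import io
import zipfile

# Hypothetical thresholds for illustration; DocFirewall's real limits are not stated here.
MAX_EXPANSION_RATIO = 100.0
MAX_UNCOMPRESSED_BYTES = 200 * 1024 * 1024


def check_expansion(data: bytes,
                    max_ratio: float = MAX_EXPANSION_RATIO,
                    max_total: int = MAX_UNCOMPRESSED_BYTES) -> list[str]:
    """Return findings for a ZIP-based document (DOCX, PPTX, XLSX)."""
    findings = []
    try:
        archive = zipfile.ZipFile(io.BytesIO(data))
    except zipfile.BadZipFile:
        return ["not a valid ZIP container"]

    with archive:
        total = 0
        for info in archive.infolist():
            total += info.file_size
            # compress_size is 0 for empty members; skip them to avoid dividing by zero.
            if info.compress_size > 0:
                ratio = info.file_size / info.compress_size
                if ratio > max_ratio:
                    findings.append(f"{info.filename}: expansion ratio {ratio:.0f}x")
        if total > max_total:
            findings.append(f"total uncompressed size {total} bytes exceeds limit")
    return findings
```

The sizes read here are the ones declared in the archive's directory, which an attacker can forge. A hardened check would also cap the number of bytes actually produced while streaming each member.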
Performance
DocFirewall is optimized for high-throughput environments using a dual-stage scanning architecture:
- Fast Scan: 10ms-range byte-level analysis for known signatures and structural anomalies.
- Deep Scan: Full document parsing (powered by Docling) for semantic analysis.
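The two-stage flow can be sketched as follows. This is an assumption-laden outline, not DocFirewall's API: `ScanResult`, `fast_scan`, `scan`, and the marker list are hypothetical names, and the deep stage is passed in as a callable so the sketch does not guess at how Docling is invoked. The point is the ordering: a cheap byte-level pass runs first, and the expensive parse happens only for files that pass it.

```python
from dataclasses import dataclass, field
from typing import Callable

# Illustrative byte patterns only; a real signature set would be far larger.
PDF_ACTIVE_MARKERS = (b"/JavaScript", b"/OpenAction", b"/Launch")
MAGIC = {b"%PDF-": "pdf", b"PK\x03\x04": "ooxml"}


@dataclass
class ScanResult:
    stage: str      # "fast" or "deep"
    blocked: bool
    findings: list = field(default_factory=list)


def fast_scan(data: bytes) -> ScanResult:
    """Byte-level pass: no parsing, just magic bytes and marker search."""
    kind = next((k for magic, k in MAGIC.items() if data.startswith(magic)), None)
    if kind is None:
        return ScanResult("fast", True, ["unrecognized file signature"])
    findings = []
    if kind == "pdf":
        findings = [f"active content marker {m.decode()}"
                    for m in PDF_ACTIVE_MARKERS if m in data]
    return ScanResult("fast", bool(findings), findings)


def scan(data: bytes, deep_scan: Callable[[bytes], list]) -> ScanResult:
    """Run the cheap pass first; parse the document only if it passes."""
    first = fast_scan(data)
    if first.blocked:
        return first
    findings = deep_scan(data)  # hypothetical hook for the full-parse stage
    return ScanResult("deep", bool(findings), findings)
```

A plain byte search is only a heuristic: PDF names can be hex-escaped or stored inside compressed object streams, which is one reason a parsing stage is still needed after the fast pass.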
Benchmark Results
- Precision: 100% (Zero False Positives)
- Speed: \(O(n)\) complexity (milliseconds per document for ML heuristic exact-match)
- Dataset: Validated against over 1,000 document artifacts
(Performance assessed on the v3 Holdout Dataset, which contains 70+ adversarial samples and 100+ benign baseline files.)
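For readers unfamiliar with the metric, precision is the share of flagged files that were real threats, TP / (TP + FP), so zero false positives gives 100% by definition. The helper below is a small sketch of that arithmetic. The function name and the counts in the usage note are hypothetical and are not taken from the benchmark.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Precision = TP / (TP + FP): the share of flagged files that were real threats."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("precision is undefined when nothing was flagged")
    return true_positives / flagged
```

With made-up counts, `precision(40, 0)` evaluates to `1.0`, while `precision(40, 10)` evaluates to `0.8`.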
