Configuration
DocFirewall can be configured either from a YAML file (parsed with PyYAML) or directly through Python objects. In both cases, the settings live in the ScanConfig class.
Loading Configuration
You can load configuration from a YAML file or instantiate it in code.
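As a rough illustration of the two loading routes, the sketch below uses a stand-in dataclass rather than the real ScanConfig, whose loader API is not shown in this section. `ScanConfigSketch` and `config_from_mapping` are hypothetical names introduced only for this example, and the dictionary mimics what `yaml.safe_load` would return for a YAML file containing the same keys.

```python
from dataclasses import dataclass, fields


@dataclass
class ScanConfigSketch:
    """Hypothetical stand-in for ScanConfig (not the real class).

    Field names and defaults are taken from the Core Settings list.
    """
    profile: str = "balanced"
    enable_pdf: bool = True
    enable_docx: bool = True
    enable_pptx: bool = True
    enable_xlsx: bool = True


def config_from_mapping(data: dict) -> ScanConfigSketch:
    """Build a config from a parsed YAML mapping, rejecting unknown keys."""
    known = {f.name for f in fields(ScanConfigSketch)}
    unknown = set(data) - known
    if unknown:
        raise ValueError(f"Unknown settings: {sorted(unknown)}")
    return ScanConfigSketch(**data)


# Route 1: a mapping such as yaml.safe_load() would produce from a YAML file.
parsed_yaml = {"profile": "strict", "enable_xlsx": False}
from_file = config_from_mapping(parsed_yaml)

# Route 2: instantiate the object directly in code.
in_code = ScanConfigSketch(profile="strict", enable_xlsx=False)
```

Both routes end in the same object, so settings omitted from the YAML file fall back to the same defaults as settings omitted from the constructor call.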
Core Settings
| Setting | Type | Default | Description |
|---|---|---|---|
| profile | str | "balanced" | One of "balanced", "strict", or "lenient". Adjusts the default thresholds. |
| enable_pdf | bool | True | Enable PDF parsing/scanning. |
| enable_docx | bool | True | Enable DOCX parsing/scanning. |
| enable_pptx | bool | True | Enable PPTX parsing/scanning. |
| enable_xlsx | bool | True | Enable XLSX parsing/scanning. |
Threat Modules (T1-T9)
You can enable or disable each detection module individually.
from doc_firewall import ScanConfig

config = ScanConfig(
    # T1: Malware / Virus (Requires AV Setup)
    enable_antivirus=False,
    # T2: Active Content (Macros, JS)
    enable_active_content_checks=True,
    # T3: Obfuscation (Hidden content)
    enable_obfuscation_checks=True,
    # T4: Prompt Injection (Jailbreaks)
    enable_prompt_injection=True,
    # T5: Ranking Manipulation (Keyword stuffing)
    enable_ranking_abuse=True,
    # T6: Resource Exhaustion (DoS)
    enable_dos_checks=True,
    # T7: Embedded Payloads (Binaries in streams)
    enable_embedded_content_checks=True,
    # T8: Metadata Injection
    enable_metadata_checks=True,
    # T9: ATS Manipulation (White text)
    enable_ats_manipulation_checks=True,
)
Antivirus Configuration
To use the T1 Malware protection, you must configure a provider.
Reliable, open-source integration.
Thresholds & Limits
Adjust sensitivity and resource constraints.
thresholds:
  deep_scan_trigger: 0.20  # Risk score to trigger deep parsing (0.0-1.0)
  flag: 0.35               # Return VERDICT=FLAG
  block: 0.70              # Return VERDICT=BLOCK
limits:
  max_mb: 10                             # Max file size in MB
  max_pages: 1000                        # PDF page limit
  parse_timeout_ms: 15000                # Parsing timeout
  min_embedded_object_size_bytes: 20000  # Min size for embedded payload detection
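To make the interaction between the three thresholds concrete, here is a minimal sketch of how a risk score could map to a verdict. It is not DocFirewall's implementation: the function names, the inclusive `>=` comparisons, the `"ALLOW"` label for the passing case, and the assumption that every score lies in 0.0-1.0 are all illustrative choices. Only the threshold values and the FLAG/BLOCK verdicts come from the configuration above.

```python
def needs_deep_scan(score: float, trigger: float = 0.20) -> bool:
    """Return True when the score reaches the deep_scan_trigger threshold."""
    return score >= trigger  # inclusive comparison is an assumption


def verdict_for(score: float, flag: float = 0.35, block: float = 0.70) -> str:
    """Map a risk score in [0.0, 1.0] to a verdict label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score >= block:
        return "BLOCK"
    if score >= flag:
        return "FLAG"
    return "ALLOW"  # hypothetical label for a document that passes


for sample in (0.10, 0.25, 0.40, 0.90):
    print(sample, needs_deep_scan(sample), verdict_for(sample))
```

Under these assumptions, lowering `flag` or `block` makes the scan stricter, while lowering `deep_scan_trigger` only means more documents get the more expensive deep parse.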
Advanced Threat Configuration
Customizing ATS & Ranking Keywords
To keep the keyword heuristics from overfitting to a single domain, you can supply custom ATS target keywords that match the job domain or the ranking-manipulation patterns your organization is exposed to:
from doc_firewall import ScanConfig, Limits

# Pass a custom list to detect domain-specific keyword stuffing
custom_config = ScanConfig(
    ats_keywords=["nursing", "medical", "registered", "certified", "healthcare"],
    limits=Limits(min_embedded_object_size_bytes=50000),
)
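To show why a domain-specific keyword list matters, the sketch below computes a simple keyword density: the share of words in a text that appear in the keyword list. This is an illustrative heuristic only, not the check DocFirewall runs internally, and `keyword_density` is a hypothetical helper.

```python
import re
from typing import Iterable


def keyword_density(text: str, keywords: Iterable[str]) -> float:
    """Return the fraction of words in `text` that match a keyword."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not tokens:
        return 0.0
    wanted = {k.lower() for k in keywords}
    hits = sum(1 for token in tokens if token in wanted)
    return hits / len(tokens)


ats_keywords = ["nursing", "medical", "registered", "certified", "healthcare"]

stuffed = "nursing nursing medical certified healthcare registered nursing"
normal = "Worked five years as a registered nurse in a busy hospital ward"

print(keyword_density(stuffed, ats_keywords))
print(keyword_density(normal, ats_keywords))
```

A list tuned to the wrong domain would score both texts near zero, which is why supplying keywords for the documents you actually receive is useful.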
False Positive Management
Watermarks
Enterprise documents often contain "hidden" watermarks (e.g., "Confidential" in a hidden text layer). By default, DocFirewall employs a smart bypass.