Sanitization — safe copies for RAG ingestion
New in 0.5.0.
Detection answers "is this dangerous?". For an LLM/RAG ingestion pipeline the more useful question is often "give me a safe version I can ingest." Scanner.sanitize() produces a cleaned copy of a document — hidden text removed, dangerous metadata emptied, active content stripped, located injections neutralised — while preserving the visible content, plus an auditable record of everything it removed.
import os

from doc_firewall import Scanner

scanner = Scanner()
result = scanner.sanitize("incoming_resume.docx")
if result.sanitized:
    for r in result.removed:
        print("removed:", r.kind, "-", r.detail, r.excerpt or "")
    ingest(result.output_path)     # safe to feed your RAG/LLM pipeline
    os.remove(result.output_path)  # caller owns the temp copy
else:
    block(result.reason)           # no safe copy could be produced
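If the record of removals needs to be kept for audit, one option is to write each entry as a JSON line. The sketch below is illustrative only: `removal_audit_lines` is a hypothetical helper, not part of doc_firewall, and it assumes each entry exposes just the `kind`, `detail` and `excerpt` attributes used in the example above. The demo entries are stand-ins so the snippet runs without the library.

```python
import json
from types import SimpleNamespace


def removal_audit_lines(source_path, removed):
    """Serialise each removed item as one JSON line for an audit log."""
    lines = []
    for item in removed:
        record = {
            "source": source_path,
            "kind": item.kind,
            "detail": item.detail,
            # excerpt may be empty or None, so normalise it to a string
            "excerpt": item.excerpt or "",
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines


# Stand-in entries for demonstration; real ones would come from result.removed.
demo_removed = [
    SimpleNamespace(kind="hidden_text", detail="white-on-white run",
                    excerpt="ignore previous instructions"),
    SimpleNamespace(kind="macro", detail="vbaProject.bin", excerpt=None),
]

for line in removal_audit_lines("incoming_resume.docx", demo_removed):
    print(line)
```

In a real pipeline the second argument would be `result.removed`, and the lines would be appended to whatever log store is already in use.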
What gets removed, per format
| Format | Removed |
|---|---|
| DOCX / PPTX / XLSX | hidden runs (vanish, white-on-white, ≤2pt font, off-page); VBA macro parts (vbaProject.bin); injection-bearing metadata (keywords/description/subject/custom props) |
| PDF (needs [crypto]) | document & page /OpenAction and /AA, the JavaScript name tree, annotation actions, AcroForm /XFA, embedded files, /Info + XMP metadata |
| CSV / TSV | formula-injection neutralised — cells starting with = + - @ get a leading apostrophe so a spreadsheet treats them as text |
| HTML | <script> blocks, inline on* event handlers, javascript: URLs, hiding styles (display:none, font-size:0, …) |
Formats without a sanitizer return sanitized=False with a reason, so a caller can fall back to BLOCK.
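To make the CSV / TSV row of the table concrete, here is a minimal sketch of the apostrophe rule using only the standard library. It is not the library's implementation: `neutralise_cell` and `neutralise_csv` are hypothetical names, and details such as leading whitespace or quoting may be handled differently in doc_firewall.

```python
import csv
import io

# Leading characters that spreadsheets interpret as the start of a formula.
FORMULA_PREFIXES = ("=", "+", "-", "@")


def neutralise_cell(cell):
    """Prefix a formula-like cell with an apostrophe so it is read as text."""
    if cell.startswith(FORMULA_PREFIXES):
        return "'" + cell
    return cell


def neutralise_csv(text, delimiter=","):
    """Return a copy of CSV/TSV text with every formula-like cell neutralised."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    out = io.StringIO(newline="")
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    for row in reader:
        writer.writerow([neutralise_cell(cell) for cell in row])
    return out.getvalue()


sample = "name,note\nAlice,=1+1\nBob,-5\nCarol,hello\n"
print(neutralise_csv(sample))
```

Note that the rule as stated also prefixes plain negative numbers such as -5, which trades a little fidelity for safety.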
Configuration
Sanitization is opt-in by call (nothing is sanitized unless you call sanitize()) and non-destructive (the original file is never modified). Three controls:
from doc_firewall import Scanner, ScanConfig  # assumes ScanConfig is exported alongside Scanner

cfg = ScanConfig()
cfg.enable_sanitization = False      # master off-switch → sanitize() returns sanitized=False
cfg.sanitize_remove_categories = [   # restrict what's stripped (default: all)
    "hidden_text", "macro",          # e.g. strip these but KEEP metadata
]
result = Scanner(cfg).sanitize("doc.docx", output_path="cleaned.docx")  # choose destination
Categories: hidden_text, metadata, macro, active_content, embedded_file, formula_injection. output_path defaults to a temp file the caller owns.
Guarantees & limits
- Visible content is preserved — only invisible / unsafe constructs are removed. A sanitized résumé keeps the candidate's real text.
- The result re-scans clean: by design, feeding output_path back into scan() returns ALLOW (covered by the round-trip test).
- It is conservative, not a full document rebuild: it targets the constructs the detectors flag. For maximum assurance, re-scan the sanitized copy before ingesting (one line, and it should be ALLOW).
- PDF sanitization requires pikepdf (pip install doc-firewall[crypto]); without it PDFs return sanitized=False (fall back to BLOCK). pikepdf also decrypts permissions-encrypted PDFs in passing.