# Paperless-NGX Deployment & Ingestion Guide
This document captures the agreed architecture and operational decisions for running Paperless-NGX on Kubernetes while keeping long-term document storage on unRAID. It is intended as a living runbook / wiki page.
## Goals
- Centralized, declarative management of Paperless via Kubernetes / GitOps
- Reliable, fast ingestion and OCR performance
- Centralized, durable document storage on unRAID
- Clear separation between managed application data and bulk document storage
- A simple, repeatable workflow for scanning and importing documents
## High-Level Architecture
- Kubernetes cluster: Runs Paperless-NGX and its dependencies
- Rook Ceph (RBD): Used for low-latency, high-IOPS workloads
- unRAID (NFS via CSI): Used for bulk, long-term document storage
The key principle is:

> Chatty, performance-sensitive data lives on Ceph. Large, mostly write-once documents live on unRAID.
## Storage Layout Decisions

### Paperless Directories
Paperless-NGX uses several important directories. They are intentionally split across storage backends.
| Directory | Purpose | Storage Backend | Rationale |
|---|---|---|---|
| `data` | Index, ML models, internal state | Ceph (RBD) | Lots of small reads/writes; benefits from low latency |
| `consume` | Intake directory for new documents | Ceph (RBD) | Reliable file watching and atomic moves |
| `media` | Final, managed document storage | unRAID (NFS) | Large files, append-mostly, centralized backup |
| `export` (optional) | Bulk exports | unRAID (NFS) | Convenience + large files |
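For reference, a minimal sketch of how this split maps onto the Deployment's pod spec, assuming the default container paths used by the official paperless-ngx image and the PVC names listed under "Persistent Volume Claims" below:

```yaml
# Pod spec fragment (not a complete Deployment). Mount paths are the
# conventional defaults of the official paperless-ngx image.
containers:
  - name: paperless
    image: ghcr.io/paperless-ngx/paperless-ngx:latest  # pin a version in practice
    volumeMounts:
      - name: data
        mountPath: /usr/src/paperless/data      # Ceph RBD: index, ML models, state
      - name: consume
        mountPath: /usr/src/paperless/consume   # Ceph RBD: intake
      - name: media
        mountPath: /usr/src/paperless/media     # unRAID NFS: final documents
volumes:
  - name: data
    persistentVolumeClaim:
      claimName: paperless-data
  - name: consume
    persistentVolumeClaim:
      claimName: paperless-consume
  - name: media
    persistentVolumeClaim:
      claimName: paperless-media
```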
### Why `consume` is not on NFS

- Network filesystems can be unreliable with file event notifications (inotify events generally do not propagate over NFS)
- Paperless ingestion is more predictable when `consume` is on fast, local-ish storage
- A Ceph-backed `consume` avoids polling hacks and race conditions
## Kubernetes Storage Objects

### StorageClasses (conceptual)

- `ceph-block`
  - Default for stateful workloads
  - Used by Postgres, Redis, Paperless `data`, and `consume`
- `nfs-apollo-data`
  - Backed by unRAID NFS exports
  - Used by Paperless `media` and `export`
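As a sketch, the NFS class could be defined as follows, assuming the upstream csi-driver-nfs provisioner; the server address and export path are placeholders for the actual unRAID values:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-apollo-data
provisioner: nfs.csi.k8s.io        # kubernetes-csi/csi-driver-nfs
parameters:
  server: apollo.lan               # unRAID host (placeholder)
  share: /mnt/user/paperless       # unRAID export (placeholder)
reclaimPolicy: Retain              # never auto-delete bulk document storage
mountOptions:
  - nfsvers=4.1
```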
### Persistent Volume Claims

Expected PVCs for Paperless:

- `paperless-data` → Ceph RBD
- `paperless-consume` → Ceph RBD
- `paperless-media` → NFS (unRAID)
- `paperless-export` (optional) → NFS (unRAID)
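A minimal sketch of the two ends of the split (sizes and access modes are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: paperless-data
spec:
  accessModes:
    - ReadWriteOnce                # RBD volumes attach to a single node
  storageClassName: ceph-block
  resources:
    requests:
      storage: 20Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: paperless-media
spec:
  accessModes:
    - ReadWriteMany                # NFS allows shared access (e.g. for backups)
  storageClassName: nfs-apollo-data
  resources:
    requests:
      storage: 500Gi
```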
## Permissions & Identity
To avoid permission issues with NFS:

- Run the Paperless container with a fixed UID/GID
- Set `runAsUser`, `runAsGroup`, and `fsGroup` consistently
- Ensure the unRAID NFS export is writable by that UID/GID
Avoid relying on root inside containers to "fix" permissions.
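A minimal sketch of the pod-level identity settings; UID/GID 1000 is an assumption and should match the owner configured on the unRAID export:

```yaml
securityContext:
  runAsUser: 1000     # fixed non-root UID (assumed; match the NFS export owner)
  runAsGroup: 1000
  fsGroup: 1000       # supplemental group applied to mounted volumes (where supported)
```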
## Database & Supporting Services
- PostgreSQL
  - Runs in-cluster
  - Backed by Ceph RBD
  - Must be backed up regularly
- Redis
  - Runs in-cluster
  - Can be ephemeral or persistent
Backups must cover:
- PostgreSQL database
- Paperless `media` directory
Both are required for a full restore.
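A sketch of a nightly logical backup via a CronJob; the service name, Secret, and backup PVC are assumptions and should be adapted to the actual deployment:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: paperless-db-backup
spec:
  schedule: "0 3 * * *"            # nightly at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: pg-dump
              image: postgres:16   # match the server's major version
              command: ["/bin/sh", "-c"]
              args:
                - >
                  pg_dump -h paperless-db -U paperless -d paperless -Fc
                  -f /backup/paperless-$(date +%F).dump
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: paperless-secret     # assumed Secret name
                      key: postgres-password
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: paperless-db-backup   # e.g. an NFS-backed PVC on unRAID
```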
## Document Ingestion Workflow

### Scanning (Mac workstation)
- Use NAPS2 instead of vendor scanner software
- Scanner profile recommendations:
  - 300 DPI
  - Grayscale
  - Duplex enabled
  - Deskew + blank page removal
  - OCR disabled (Paperless handles OCR)
### Scan Destination
Two common patterns:
**Option A: Direct scan to unRAID**

- Mac scans to an SMB-mounted unRAID "incoming" folder

**Option B: Local scan + sync**

- Scan locally
- Periodically move files to unRAID incoming
### Intake into Paperless
Recommended pattern:
1. Files arrive in the unRAID `incoming` folder
2. A small Kubernetes job or script copies/moves them into the Ceph-backed `consume` directory (see the sketch below)
3. Paperless ingests and moves final documents into the NFS-backed `media` directory
This preserves convenience while keeping ingestion reliable.
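A sketch of the mover as a CronJob. The incoming PVC name and schedule are assumptions; note also that an RBD-backed `consume` volume is ReadWriteOnce, so this pod must be scheduled on the same node as Paperless (or `consume` must use a shared filesystem such as CephFS):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: paperless-intake
spec:
  schedule: "*/5 * * * *"          # sweep for new scans every 5 minutes
  concurrencyPolicy: Forbid        # never run two sweeps at once
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: mover
              image: busybox:1.36
              command: ["/bin/sh", "-c"]
              args:
                - |
                  for f in /incoming/*; do
                    [ -f "$f" ] || continue
                    # copy, then rename: a rename within one filesystem is
                    # atomic, so Paperless never sees a half-written file
                    cp "$f" "/consume/.tmp.$(basename "$f")" &&
                      mv "/consume/.tmp.$(basename "$f")" "/consume/$(basename "$f")" &&
                      rm "$f"
                  done
              volumeMounts:
                - name: incoming
                  mountPath: /incoming            # unRAID incoming share
                - name: consume
                  mountPath: /consume             # Ceph-backed consume PVC
          volumes:
            - name: incoming
              persistentVolumeClaim:
                claimName: paperless-incoming     # assumed NFS PVC name
            - name: consume
              persistentVolumeClaim:
                claimName: paperless-consume
```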
## Importing Existing Document Archives

### Important Behavior
- Paperless does not preserve directory structure
- Folder hierarchy is discarded during ingestion
- Metadata (tags, correspondents, document types) replaces folders
### Recommended Import Strategy
1. Back up the existing document archive
2. Create a temporary import mirror (do not ingest originals)
3. Optionally improve filenames to include:
   - Dates
   - Vendors / correspondents
   - Document intent
4. Import in batches by category, not all at once
5. After each batch:
   - Review documents
   - Fix metadata
   - Create rules
Rules compound quickly and reduce cleanup later.
## Getting Started Checklist

### Before First Ingest
- [ ] Kubernetes cluster stable
- [ ] Rook Ceph healthy
- [ ] NFS CSI mounts working
- [ ] Paperless deployed with split storage
- [ ] Postgres backups configured
### First Scan Session
- [ ] Scan a small batch (10–20 docs)
- [ ] Verify OCR quality
- [ ] Manually correct metadata
- [ ] Create initial rules
### Scaling Up
- [ ] Increase batch sizes gradually
- [ ] Monitor CPU during OCR-heavy runs
- [ ] Avoid manual changes to the `media` directory
## Operational Guidelines
- Treat the `media` directory as Paperless-owned
- Do not reorganize or rename files outside Paperless
- Prefer fixing metadata in the UI + rules
- Keep imports observable and reversible
## Future Enhancements (Out of Scope for Phase 1)
- LLM-assisted auto-tagging
- Semantic search / summaries
- Email ingestion
- Barcode-based document splitting
These can be added later without changing the storage architecture.
End of document