
Paperless-NGX Deployment & Ingestion Guide

This document captures the agreed architecture and operational decisions for running Paperless-NGX on Kubernetes while keeping long-term document storage on unRAID. It is intended as a living runbook / wiki page.


Goals

  • Centralized, declarative management of Paperless via Kubernetes / GitOps
  • Reliable, fast ingestion and OCR performance
  • Centralized, durable document storage on unRAID
  • Clear separation between managed application data and bulk document storage
  • A simple, repeatable workflow for scanning and importing documents

High-Level Architecture

  • Kubernetes cluster: Runs Paperless-NGX and its dependencies
  • Rook Ceph (RBD): Used for low-latency, high-IOPS workloads
  • unRAID (NFS via CSI): Used for bulk, long-term document storage

The key principle is:

Chatty, performance-sensitive data lives on Ceph. Large, mostly write-once documents live on unRAID.


Storage Layout Decisions

Paperless Directories

Paperless-NGX uses several important directories. They are intentionally split across storage backends.

| Directory | Purpose | Storage Backend | Rationale |
|-----------|---------|-----------------|-----------|
| data | Index, ML models, internal state | Ceph (RBD) | Lots of small reads/writes; benefits from low latency |
| consume | Intake directory for new documents | Ceph (RBD) | Reliable file watching and atomic moves |
| media | Final, managed document storage | unRAID (NFS) | Large files, append-mostly, centralized backup |
| export (optional) | Bulk exports | unRAID (NFS) | Convenience + large files |
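
Paperless reads these locations from environment variables. A minimal sketch of the container wiring, assuming the PVC names listed under Persistent Volume Claims below and the default paths used by the official container image:

```yaml
# Excerpt from the Paperless container spec (sketch, not a full manifest)
env:
  - name: PAPERLESS_DATA_DIR
    value: /usr/src/paperless/data
  - name: PAPERLESS_CONSUMPTION_DIR
    value: /usr/src/paperless/consume
  - name: PAPERLESS_MEDIA_ROOT
    value: /usr/src/paperless/media
volumeMounts:
  - name: data            # PVC paperless-data (Ceph RBD)
    mountPath: /usr/src/paperless/data
  - name: consume         # PVC paperless-consume (Ceph RBD)
    mountPath: /usr/src/paperless/consume
  - name: media           # PVC paperless-media (unRAID NFS)
    mountPath: /usr/src/paperless/media
```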

Why consume is not on NFS

  • Network filesystems can be unreliable with file event notifications
  • Paperless ingestion is more predictable when consume is on fast, cluster-local storage
  • Ceph-backed consume avoids polling hacks and race conditions
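
Because consume sits on a Ceph RBD volume, it behaves like a local filesystem inside the pod, so inotify-based file watching works and polling can stay disabled. A sketch of the relevant setting:

```yaml
env:
  # 0 (the Paperless-NGX default) disables periodic polling and relies
  # on inotify, which is dependable on the Ceph-backed consume volume.
  - name: PAPERLESS_CONSUMER_POLLING
    value: "0"
```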

Kubernetes Storage Objects

StorageClasses (conceptual)

  • ceph-block
    • Default for stateful workloads
    • Used by Postgres, Redis, Paperless data, and consume
  • nfs-apollo-data
    • Backed by unRAID NFS exports
    • Used by Paperless media and export
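
A minimal sketch of the NFS class, assuming the cluster runs the upstream csi-driver-nfs provisioner; the server hostname and export path are placeholders for the actual unRAID values:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-apollo-data
provisioner: nfs.csi.k8s.io       # assumes csi-driver-nfs is installed
parameters:
  server: apollo.local            # hypothetical unRAID hostname
  share: /mnt/user/paperless      # hypothetical unRAID export
reclaimPolicy: Retain             # never garbage-collect document volumes
mountOptions:
  - nfsvers=4.1
```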

Persistent Volume Claims

Expected PVCs for Paperless:

  • paperless-data → Ceph RBD
  • paperless-consume → Ceph RBD
  • paperless-media → NFS (unRAID)
  • paperless-export (optional) → NFS (unRAID)
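
The claims themselves are plain PVCs; a sketch for the two that matter most (sizes are placeholders to adjust):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: paperless-data
spec:
  accessModes: ["ReadWriteOnce"]   # RBD volumes attach to a single node
  storageClassName: ceph-block
  resources:
    requests:
      storage: 20Gi                # placeholder
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: paperless-media
spec:
  accessModes: ["ReadWriteMany"]   # NFS allows shared access if ever needed
  storageClassName: nfs-apollo-data
  resources:
    requests:
      storage: 500Gi               # placeholder: size for the full archive
```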

Permissions & Identity

To avoid permission issues with NFS:

  • Run the Paperless container with a fixed UID/GID
  • Set runAsUser, runAsGroup, and fsGroup consistently
  • Ensure the unRAID NFS export is writable by that UID/GID

Avoid relying on root inside containers to "fix" permissions.
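
In pod terms, that means a security context along these lines; the UID/GID of 1000 is an assumption, so substitute whatever actually owns the unRAID export:

```yaml
# Pod-level securityContext for the Paperless deployment (sketch)
securityContext:
  runAsUser: 1000    # assumed UID; must match the NFS export owner
  runAsGroup: 1000   # assumed GID
  fsGroup: 1000      # applies group ownership on supported mounted volumes
```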


Database & Supporting Services

  • PostgreSQL
    • Runs in-cluster
    • Backed by Ceph RBD
    • Must be backed up regularly
  • Redis
    • Runs in-cluster
    • Can be ephemeral or persistent

Backups must cover:

  1. PostgreSQL database
  2. Paperless media directory

Both are required for a full restore.
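
One way to cover the database half is a nightly dump written to the NFS-backed export volume. This is a sketch under assumptions: the service name, database name, credentials secret, and target PVC are placeholders to adapt:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: paperless-db-backup
spec:
  schedule: "0 3 * * *"              # nightly at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: pg-dump
              image: postgres:16     # match the in-cluster Postgres version
              command: ["/bin/sh", "-c"]
              args:
                - |
                  pg_dump -h paperless-postgres -U paperless -Fc paperless \
                    > /backup/paperless-$(date +%F).dump
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: paperless-postgres   # hypothetical secret
                      key: password
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: paperless-export      # lands on unRAID NFS
```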


Document Ingestion Workflow

Scanning (Mac workstation)

  • Use NAPS2 instead of vendor scanner software
  • Scanner profile recommendations:
    • 300 DPI
    • Grayscale
    • Duplex enabled
    • Deskew + blank page removal
    • OCR disabled (Paperless handles OCR)

Scan Destination

Two common patterns:

Option A: Direct scan to unRAID

  • Mac scans to an SMB-mounted unRAID "incoming" folder

Option B: Local scan + sync

  • Scan locally
  • Periodically move files to unRAID incoming

Intake into Paperless

Recommended pattern:

  • Files arrive in unRAID incoming folder
  • A small Kubernetes job or script copies/moves them into the Ceph-backed consume directory
  • Paperless ingests and moves final documents into the NFS-backed media directory

This preserves convenience while keeping ingestion reliable.
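
One way to implement the hand-off is a small CronJob that mounts both shares and moves completed files atomically. This is a sketch under assumptions: the paperless-incoming PVC (for the unRAID incoming share) is hypothetical, and the staging directory trick relies on recursive consumption being disabled (the Paperless-NGX default) so that directory is never scanned:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: paperless-intake
spec:
  schedule: "*/5 * * * *"            # sweep the incoming folder every 5 min
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: mover
              image: busybox:1.36
              command: ["/bin/sh", "-c"]
              args:
                - |
                  # Copy into a staging dir on the same filesystem as
                  # consume, then rename: the rename is atomic, so
                  # Paperless never sees a half-written file.
                  mkdir -p /consume/.staging
                  for f in /incoming/*; do
                    [ -f "$f" ] || continue
                    base=$(basename "$f")
                    cp "$f" "/consume/.staging/$base" \
                      && mv "/consume/.staging/$base" "/consume/$base" \
                      && rm "$f"
                  done
              volumeMounts:
                - name: incoming
                  mountPath: /incoming
                - name: consume
                  mountPath: /consume
          volumes:
            - name: incoming
              persistentVolumeClaim:
                claimName: paperless-incoming   # hypothetical NFS PVC
            - name: consume
              persistentVolumeClaim:
                claimName: paperless-consume
```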


Importing Existing Document Archives

Important Behavior

  • Paperless does not preserve directory structure; folder hierarchy is discarded during ingestion
  • Metadata (tags, correspondents, document types) replaces folders

Recommended Import Process

  1. Back up the existing document archive
  2. Create a temporary import mirror (do not ingest originals)
  3. Optionally improve filenames to include:
     • Dates
     • Vendors / correspondents
     • Document intent
  4. Import in batches by category, not all at once
  5. After each batch:
     • Review documents
     • Fix metadata
     • Create rules
Rules compound quickly and reduce cleanup later.


Getting Started Checklist

Before First Ingest

  • [ ] Kubernetes cluster stable
  • [ ] Rook Ceph healthy
  • [ ] NFS CSI mounts working
  • [ ] Paperless deployed with split storage
  • [ ] Postgres backups configured

First Scan Session

  • [ ] Scan a small batch (10–20 docs)
  • [ ] Verify OCR quality
  • [ ] Manually correct metadata
  • [ ] Create initial rules

Scaling Up

  • [ ] Increase batch sizes gradually
  • [ ] Monitor CPU during OCR-heavy runs
  • [ ] Avoid manual changes to the media directory

Operational Guidelines

  • Treat the media directory as Paperless-owned
  • Do not reorganize or rename files outside Paperless
  • Prefer fixing metadata in the UI + rules
  • Keep imports observable and reversible

Future Enhancements (Out of Scope for Phase 1)

  • LLM-assisted auto-tagging
  • Semantic search / summaries
  • Email ingestion
  • Barcode-based document splitting

These can be added later without changing the storage architecture.

