
Paperless-NGX Deployment & Ingestion Guide

This document captures the agreed architecture and operational decisions for running Paperless-NGX on Kubernetes while keeping long-term document storage on unRAID. It is intended as a living runbook / wiki page.


Goals

  • Centralized, declarative management of Paperless via Kubernetes / GitOps
  • Reliable, fast ingestion and OCR performance
  • Centralized, durable document storage on unRAID
  • Clear separation between managed application data and bulk document storage
  • A simple, repeatable workflow for scanning and importing documents

High-Level Architecture

  • Kubernetes cluster: Runs Paperless-NGX and its dependencies
  • Rook Ceph (RBD): Used for low-latency, high-IOPS workloads
  • unRAID (NFS via CSI): Used for bulk, long-term document storage

The key principle is:

Chatty, performance-sensitive data lives on Ceph. Large, mostly write-once documents live on unRAID.


Storage Layout Decisions

Paperless Directories

Paperless-NGX uses several important directories. They are intentionally split across storage backends.

| Directory | Purpose | Storage Backend | Rationale |
|-----------|---------|-----------------|-----------|
| data | Index, ML models, internal state | Ceph (RBD) | Lots of small reads/writes; benefits from low latency |
| consume | Intake directory for new documents | Ceph (RBD) | Reliable file watching and atomic moves |
| media | Final, managed document storage | unRAID (NFS) | Large files, append-mostly, centralized backup |
| export (optional) | Bulk exports | unRAID (NFS) | Convenience + large files |
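
Paperless reads these locations from environment variables. A minimal sketch of the container wiring, assuming the PVC names listed under Persistent Volume Claims below and the default paths used by the official container image:

```yaml
# Excerpt from the Paperless container spec (sketch, not a full manifest)
env:
  - name: PAPERLESS_DATA_DIR
    value: /usr/src/paperless/data
  - name: PAPERLESS_CONSUMPTION_DIR
    value: /usr/src/paperless/consume
  - name: PAPERLESS_MEDIA_ROOT
    value: /usr/src/paperless/media
volumeMounts:
  - name: data            # PVC paperless-data (Ceph RBD)
    mountPath: /usr/src/paperless/data
  - name: consume         # PVC paperless-consume (Ceph RBD)
    mountPath: /usr/src/paperless/consume
  - name: media           # PVC paperless-media (unRAID NFS)
    mountPath: /usr/src/paperless/media
```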

Why consume is not on NFS

  • Network filesystems can be unreliable with file event notifications
  • Paperless ingestion is more predictable when consume is on fast, cluster-local storage
  • Ceph-backed consume avoids polling hacks and race conditions
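
Because consume sits on a Ceph RBD volume, it behaves like a local filesystem inside the pod, so inotify-based file watching works and polling can stay disabled. A sketch of the relevant setting:

```yaml
env:
  # 0 (the Paperless-NGX default) disables periodic polling and relies
  # on inotify, which is dependable on the Ceph-backed consume volume.
  - name: PAPERLESS_CONSUMER_POLLING
    value: "0"
```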

Kubernetes Storage Objects

StorageClasses (conceptual)

  • ceph-block
    • Default for stateful workloads
    • Used by Postgres, Redis, Paperless data, and consume
  • nfs-apollo-data
    • Backed by unRAID NFS exports
    • Used by Paperless media and export
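
A minimal sketch of the NFS class, assuming the cluster runs the upstream csi-driver-nfs provisioner; the server hostname and export path are placeholders for the actual unRAID values:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-apollo-data
provisioner: nfs.csi.k8s.io       # assumes csi-driver-nfs is installed
parameters:
  server: apollo.local            # hypothetical unRAID hostname
  share: /mnt/user/paperless      # hypothetical unRAID export
reclaimPolicy: Retain             # never garbage-collect document volumes
mountOptions:
  - nfsvers=4.1
```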

Persistent Volume Claims

Expected PVCs for Paperless:

  • paperless-data → Ceph RBD
  • paperless-consume → Ceph RBD
  • paperless-media → NFS (unRAID)
  • paperless-export (optional) → NFS (unRAID)
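
The claims themselves are plain PVCs; a sketch for the two that matter most (sizes are placeholders to adjust):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: paperless-data
spec:
  accessModes: ["ReadWriteOnce"]   # RBD volumes attach to a single node
  storageClassName: ceph-block
  resources:
    requests:
      storage: 20Gi                # placeholder
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: paperless-media
spec:
  accessModes: ["ReadWriteMany"]   # NFS allows shared access if ever needed
  storageClassName: nfs-apollo-data
  resources:
    requests:
      storage: 500Gi               # placeholder: size for the full archive
```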

Permissions & Identity

To avoid permission issues with NFS:

  • Run the Paperless container with a fixed UID/GID
  • Set runAsUser, runAsGroup, and fsGroup consistently
  • Ensure the unRAID NFS export is writable by that UID/GID

Avoid relying on root inside containers to "fix" permissions.
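
In pod terms, that means a security context along these lines; the UID/GID of 1000 is an assumption, so substitute whatever actually owns the unRAID export:

```yaml
# Pod-level securityContext for the Paperless deployment (sketch)
securityContext:
  runAsUser: 1000    # assumed UID; must match the NFS export owner
  runAsGroup: 1000   # assumed GID
  fsGroup: 1000      # applies group ownership on supported mounted volumes
```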


Database & Supporting Services

  • PostgreSQL
    • Runs in-cluster
    • Backed by Ceph RBD
    • Must be backed up regularly
  • Redis
    • Runs in-cluster
    • Can be ephemeral or persistent

Backups must cover:

  1. PostgreSQL database
  2. Paperless media directory

Both are required for a full restore.
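
One way to cover the database half is a nightly dump written to the NFS-backed export volume. This is a sketch under assumptions: the service name, database name, credentials secret, and target PVC are placeholders to adapt:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: paperless-db-backup
spec:
  schedule: "0 3 * * *"              # nightly at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: pg-dump
              image: postgres:16     # match the in-cluster Postgres version
              command: ["/bin/sh", "-c"]
              args:
                - |
                  pg_dump -h paperless-postgres -U paperless -Fc paperless \
                    > /backup/paperless-$(date +%F).dump
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: paperless-postgres   # hypothetical secret
                      key: password
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: paperless-export      # lands on unRAID NFS
```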


Document Ingestion Workflow

Scanning (Mac workstation)

  • Use NAPS2 instead of vendor scanner software
  • Scanner profile recommendations:
    • 300 DPI
    • Grayscale
    • Duplex enabled
    • Deskew + blank page removal
    • OCR disabled (Paperless handles OCR)

Scan Destination

Two common patterns:

Option A: Direct scan to unRAID

  • Mac scans to an SMB-mounted unRAID "incoming" folder

Option B: Local scan + sync

  • Scan locally
  • Periodically move files to unRAID incoming

Intake into Paperless

Recommended pattern:

  • Files arrive in unRAID incoming folder
  • A small Kubernetes job or script copies/moves them into the Ceph-backed consume directory
  • Paperless ingests and moves final documents into the NFS-backed media directory

This preserves convenience while keeping ingestion reliable.
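
One way to implement the hand-off is a small CronJob that mounts both shares and moves completed files atomically. This is a sketch under assumptions: the paperless-incoming PVC (for the unRAID incoming share) is hypothetical, and the staging directory trick relies on recursive consumption being disabled (the Paperless-NGX default) so that directory is never scanned:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: paperless-intake
spec:
  schedule: "*/5 * * * *"            # sweep the incoming folder every 5 min
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: mover
              image: busybox:1.36
              command: ["/bin/sh", "-c"]
              args:
                - |
                  # Copy into a staging dir on the same filesystem as
                  # consume, then rename: the rename is atomic, so
                  # Paperless never sees a half-written file.
                  mkdir -p /consume/.staging
                  for f in /incoming/*; do
                    [ -f "$f" ] || continue
                    base=$(basename "$f")
                    cp "$f" "/consume/.staging/$base" \
                      && mv "/consume/.staging/$base" "/consume/$base" \
                      && rm "$f"
                  done
              volumeMounts:
                - name: incoming
                  mountPath: /incoming
                - name: consume
                  mountPath: /consume
          volumes:
            - name: incoming
              persistentVolumeClaim:
                claimName: paperless-incoming   # hypothetical NFS PVC
            - name: consume
              persistentVolumeClaim:
                claimName: paperless-consume
```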


Importing Existing Document Archives

Important Behavior

  • Paperless does not preserve directory structure; folder hierarchy is discarded during ingestion
  • Metadata (tags, correspondents, document types) replaces folders

Recommended Import Process

  1. Back up the existing document archive
  2. Create a temporary import mirror (do not ingest originals)
  3. Optionally improve filenames to include:
     • Dates
     • Vendors / correspondents
     • Document intent
  4. Import in batches by category, not all at once
  5. After each batch:
     • Review documents
     • Fix metadata
     • Create rules
Rules compound quickly and reduce cleanup later.


Getting Started Checklist

Before First Ingest

  • [ ] Kubernetes cluster stable
  • [ ] Rook Ceph healthy
  • [ ] NFS CSI mounts working
  • [ ] Paperless deployed with split storage
  • [ ] Postgres backups configured

First Scan Session

  • [ ] Scan a small batch (10–20 docs)
  • [ ] Verify OCR quality
  • [ ] Manually correct metadata
  • [ ] Create initial rules

Scaling Up

  • [ ] Increase batch sizes gradually
  • [ ] Monitor CPU during OCR-heavy runs
  • [ ] Avoid manual changes to the media directory

Operational Guidelines

  • Treat the media directory as Paperless-owned
  • Do not reorganize or rename files outside Paperless
  • Prefer fixing metadata in the UI + rules
  • Keep imports observable and reversible

Future Enhancements (Out of Scope for Phase 1)

  • LLM-assisted auto-tagging
  • Semantic search / summaries
  • Email ingestion
  • Barcode-based document splitting

These can be added later without changing the storage architecture.

