Paperless-NGX Document Management System

Deployment Date: December 31, 2025 - January 1, 2026

Current Status: ✅ Operational

Namespace: paperless-ngx

Overview

Paperless-NGX is a document management system that scans, indexes, and archives all of your documents with full-text search and automatic OCR processing. The deployment uses a hybrid storage architecture combining Ceph for performance-critical workloads and NFS for bulk document storage.

Architecture

Core Components

Component           Version        Purpose                          Storage Backend
------------------  -------------  -------------------------------  -------------------
Paperless-NGX       2.14.7         Document management application  Hybrid (Ceph + NFS)
PostgreSQL          CloudNativePG  Application database             Ceph RBD
Redis               Latest         Task queue and caching           Ephemeral
File-Mover Sidecar  Alpine 3.21    NFS → Ceph file transfer         N/A

Infrastructure Dependencies

  • Database: CloudNativePG cluster (10Gi on Ceph; see the sketch after this list)
  • Storage: Hybrid architecture (details below)
  • Ingress: Traefik IngressRoute with TLS
  • DNS: External-DNS for automatic DNS management
  • Authentication: Authelia one-factor authentication
  • Certificate: Let's Encrypt TLS certificate via cert-manager
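
A minimal sketch of the CloudNativePG Cluster behind the paperless-postgresql-rw service referenced later in the env settings; the instance count is an assumption, not a documented value:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: paperless-postgresql   # CNPG derives the -rw service name from this
  namespace: paperless-ngx
spec:
  instances: 1                 # assumed; not stated in this deployment's docs
  storage:
    size: 10Gi
    storageClass: ceph-block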

Storage Architecture

Design Principle

Performance-critical data lives on Ceph. Large, write-once documents live on NFS.

The deployment uses a split storage architecture to optimize for both performance and capacity:

Volume Layout

Volume                  Size   Storage Class    Access Mode  Purpose
----------------------  -----  ---------------  -----------  --------------------------------------------------
paperless-ngx-data      10Gi   ceph-block       RWO          ML models, search index, internal state
paperless-ngx-consume   5Gi    ceph-block       RWO          Document intake directory (fast, reliable inotify)
paperless-ngx-incoming  10Gi   Static PV (NFS)  RWX          User drop zone for new documents
paperless-ngx-media     500Gi  Static PV (NFS)  RWX          Final document archive storage
paperless-ngx-export    100Gi  Static PV (NFS)  RWX          Bulk export directory
paperless-postgresql-1  10Gi   ceph-block       RWO          PostgreSQL database

Static PersistentVolumes (NFS)

Instead of relying on dynamic provisioning, the NFS volumes use static PersistentVolumes so the backing paths stay human-readable and stable:

# Example: paperless-media PV
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-paperless-media
spec:
  capacity:
    storage: 500Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""
  nfs:
    server: 172.16.103.30  # Apollo unRAID server
    path: /mnt/user/data/paperless/media
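
A static PV only binds when the claim pins it explicitly. A sketch of the matching claim for the media volume, using the PVC name from the volume table above:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: paperless-ngx-media
  namespace: paperless-ngx
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""          # empty string opts out of dynamic provisioning
  volumeName: pv-paperless-media
  resources:
    requests:
      storage: 500Gi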

Benefits:

  • Human-readable paths on unRAID: /mnt/user/data/paperless/{export,incoming,media}/
  • Retain reclaim policy protects data from accidental deletion
  • Easy direct file access for backups and troubleshooting
  • Better alignment with GitOps principles

Document Ingestion Workflow

The NFS inotify Problem

Network file systems (NFS/SMB) don't deliver reliable file system event notifications: inotify events are generated by the local kernel, so writes made by other NFS/SMB clients never produce events on the Paperless side. This prevents Paperless from automatically detecting new files dropped on NFS shares.
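
For completeness: Paperless itself ships a polling fallback (PAPERLESS_CONSUMER_POLLING, an interval in seconds) for filesystems without inotify. This deployment keeps /consume on Ceph instead, but the env-only alternative would look like:

env:
  # Polling fallback (not used in this deployment): scan the consume
  # directory every 60 seconds instead of waiting for inotify events.
  PAPERLESS_CONSUMER_POLLING: "60"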

Solution: File-Mover Sidecar

The deployment includes a lightweight Alpine-based sidecar container that works around this limitation:

┌─────────────────────────────────────────────────────────┐
│  Paperless Pod                                          │
│  ┌──────────────────────┐  ┌────────────────────────┐   │
│  │ Paperless Container  │  │ File-Mover Sidecar     │   │
│  │                      │  │ (Alpine 3.21)          │   │
│  │ Watches /consume via │  │                        │   │
│  │ inotify (Ceph)       │  │ Every 15 minutes:      │   │
│  │                      │  │ mv /incoming/* /consume│   │
│  └──────────────────────┘  └────────────────────────┘   │
│         ▲                            │                  │
│         │ inotify works!             │                  │
│         │                            ▼                  │
│  ┌──────────────┐          ┌──────────────┐             │
│  │ /consume     │◀─────────│ /incoming    │             │
│  │ (Ceph RBD)   │   move   │ (NFS)        │             │
│  └──────────────┘          └──────────────┘             │
└─────────────────────────────────────────────────────────┘
                                     │ Users drop files
                           ┌─────────┴─────────┐
                           │ Apollo NFS Share  │
                           │ /mnt/user/data/   │
                           │ paperless/incoming│
                           └───────────────────┘

Workflow:

  1. Users drop scanned documents into /mnt/user/data/paperless/incoming/ on Apollo (via SMB/NFS)
  2. File-mover sidecar runs every 15 minutes, checking for new files
  3. Any files found are moved to /consume (Ceph)
  4. Paperless's inotify watcher detects the files immediately on Ceph
  5. Documents are processed (OCR, indexing) and moved to /media (NFS)

Implementation:

sidecars:
  file-mover:
    image: alpine:3.21
    command:
      - /bin/sh
      - -c
      - |
        echo "File mover sidecar started - checking /incoming every 15 minutes"
        while true; do
          if [ -n "$(ls -A /incoming 2>/dev/null)" ]; then
            echo "$(date): Found files in /incoming, moving to /consume"
            mv -v /incoming/* /consume/
          fi
          sleep 900  # 15 minutes
        done
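
One caveat with mv /incoming/*: it can grab a file the scanner is still writing, and the glob skips hidden files. A hypothetical hardening, assuming BusyBox find is built with -mmin support (as Alpine's typically is):

# Hypothetical variant: only move entries untouched for at least one
# minute, so documents still being written aren't moved mid-transfer.
# Also catches dotfiles that the /incoming/* glob would skip.
find /incoming -mindepth 1 -mmin +1 -exec mv -v {} /consume/ \;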

Benefits:

  • Combines ease of NFS drop zone with reliable Ceph inotify
  • Minimal resource usage (Alpine container)
  • Simple, auditable bash script
  • 15-minute interval sufficient for document workflow

Resource Allocation

Application Resources

resources:
  requests:
    cpu: 500m
    memory: 1Gi
  limits:
    cpu: 2000m
    memory: 4Gi

Rationale:

  • OCR processing is CPU-intensive
  • Machine learning models require memory
  • Document ingestion involves transcoding and analysis

PostgreSQL Resources

CloudNativePG cluster configuration:

  • 10Gi storage on Ceph RBD
  • Automated backups (configuration TBD)
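
When backups are configured, the CNPG side might look roughly like the following; the Garage endpoint, bucket, and Secret names are placeholders, not live values:

spec:
  backup:
    retentionPolicy: "30d"
    barmanObjectStore:
      destinationPath: s3://postgres-backups/paperless  # hypothetical bucket
      endpointURL: http://garage.storage.svc:3900       # hypothetical Garage S3 endpoint
      s3Credentials:
        accessKeyId:
          name: garage-credentials                      # hypothetical Secret
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: garage-credentials
          key: SECRET_ACCESS_KEY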

File-Mover Sidecar

Minimal resources (not explicitly limited):

  • Runs simple bash loop every 15 minutes
  • Negligible CPU/memory footprint
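
If explicit bounds are wanted later, a sketch of what sidecar limits might look like, assuming the chart passes resources through to the sidecar container; the values are guesses sized for a shell loop, not measured figures:

sidecars:
  file-mover:
    resources:
      requests:
        cpu: 10m      # guessed values; the loop sleeps 15 minutes at a time
        memory: 16Mi
      limits:
        cpu: 50m
        memory: 64Mi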

Network Configuration

Ingress

apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: paperless-ngx
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`paperless.skaggsfamily.us`)
      kind: Rule
      services:
        - name: paperless-ngx
          port: 8000
      middlewares:
        - name: authelia-forwardauth
          namespace: authelia
  tls:
    secretName: paperless-tls

DNS: Managed by external-dns via DNSEndpoint CRD
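
The DNSEndpoint that external-dns consumes might look like this; the record target is a placeholder for the Traefik ingress address:

apiVersion: externaldns.k8s.io/v1alpha1
kind: DNSEndpoint
metadata:
  name: paperless-ngx
  namespace: paperless-ngx
spec:
  endpoints:
    - dnsName: paperless.skaggsfamily.us
      recordType: A
      recordTTL: 300
      targets:
        - 192.0.2.10   # placeholder: Traefik ingress VIP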

Certificate: Let's Encrypt TLS via cert-manager ClusterIssuer
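
cert-manager materializes the paperless-tls Secret from a Certificate resource; a sketch in which the ClusterIssuer name is an assumption:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: paperless-tls
  namespace: paperless-ngx
spec:
  secretName: paperless-tls        # matches the IngressRoute tls.secretName
  issuerRef:
    kind: ClusterIssuer
    name: letsencrypt-production   # hypothetical issuer name
  dnsNames:
    - paperless.skaggsfamily.us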

Authentication: Authelia forwardauth middleware (one-factor)

Deployment Configuration

Helm Chart

Chart: gabe565/paperless-ngx v0.24.1

Source: Based on bjw-s common library

Environment Variables (Key Settings)

env:
  # Database
  PAPERLESS_DBHOST: paperless-postgresql-rw
  PAPERLESS_DBNAME: paperless

  # URL
  PAPERLESS_URL: https://paperless.skaggsfamily.us

  # OCR
  PAPERLESS_OCR_LANGUAGE: eng
  PAPERLESS_OCR_MODE: skip  # only OCR pages without an existing text layer

  # Consumption
  PAPERLESS_CONSUMER_RECURSIVE: "true"
  PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS: "false"

Security Context

podSecurityContext:
  fsGroup: 100  # unRAID users group for NFS permissions
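
A quick way to confirm the group takes effect and the NFS mounts are writable; the workload name and the container's media path are assumptions (the path is the paperless-ngx image default), so adjust to the deployment:

# Check that gid 100 ("users" on unRAID) shows up in the container
kubectl exec -n paperless-ngx deploy/paperless-ngx -c paperless-ngx -- id

# Verify the NFS-backed media mount is writable (path assumed)
kubectl exec -n paperless-ngx deploy/paperless-ngx -c paperless-ngx -- \
  sh -c 'touch /usr/src/paperless/media/.rw-test && rm /usr/src/paperless/media/.rw-test'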

Operational Procedures

Scanning Documents

Recommended Scanner Settings (NAPS2):

  • Resolution: 300 DPI
  • Color: Grayscale
  • Duplex: Enabled
  • Deskew: Enabled
  • Blank page removal: Enabled
  • OCR: Disabled (Paperless handles this)

Scan Destination:

  • Mac workstation → Mount Apollo SMB/NFS share
  • Scan to: /mnt/user/data/paperless/incoming/
  • Files automatically processed within 15 minutes
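
From the Mac side the drop can be done by hand or scripted; the hostname, share name, and filename below are illustrative:

# Mount the Apollo share on macOS and drop a scan into incoming
# (hostname, share, and filename are illustrative)
mkdir -p ~/apollo
mount_smbfs //user@apollo/data ~/apollo
cp ~/Scans/statement-2026-01.pdf ~/apollo/paperless/incoming/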

Monitoring File Transfer

Check file-mover sidecar logs:

kubectl logs -n paperless-ngx -l app.kubernetes.io/name=paperless-ngx -c file-mover

Expected output:

File mover sidecar started - checking /incoming every 15 minutes
Thu Jan  1 22:38:38 UTC 2026: Found files in /incoming, moving to /consume
'/incoming/document.pdf' -> '/consume/document.pdf'

Checking Consumption Status

View Paperless consumer logs:

kubectl logs -n paperless-ngx -l app.kubernetes.io/name=paperless-ngx -c paperless-ngx | grep consumer

Accessing NFS Volumes Directly

From Apollo unRAID server:

# Incoming directory
ls -la /mnt/user/data/paperless/incoming/

# Media archive
ls -la /mnt/user/data/paperless/media/

# Export directory
ls -la /mnt/user/data/paperless/export/

Data Protection

Critical Data

  1. PostgreSQL Database (10Gi Ceph)
     • Application metadata, tags, correspondents, document types
     • Search index mappings

  2. Media Directory (500Gi NFS)
     • Final archived documents
     • Represents all consumed documents

Backup Strategy

Required for Full Restore:

  • PostgreSQL database backup (via CloudNativePG or volsync)
  • NFS media directory backup (planned via volsync + Garage S3; see the sketch after this list)

Not Critical:

  • /data directory (ML models can be re-downloaded)
  • /consume directory (temporary staging, should be empty)
  • /export directory (regeneratable exports)
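
When the volsync piece lands, a ReplicationSource for the media PVC might look roughly like this; the schedule, Secret name, and retention values are placeholders:

apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: paperless-media
  namespace: paperless-ngx
spec:
  sourcePVC: paperless-ngx-media
  trigger:
    schedule: "0 3 * * *"              # nightly at 03:00
  restic:
    repository: paperless-media-restic # hypothetical Secret with repo URL + keys
    copyMethod: Direct                 # mount the PVC directly; no CSI snapshot needed on NFS
    pruneIntervalDays: 14
    retain:
      daily: 7
      weekly: 4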

Important Operational Notes

Media Directory Ownership

⚠️ The /media directory is Paperless-owned

  • Do not reorganize or rename files outside Paperless
  • Do not edit document content directly
  • Prefer fixing metadata in Paperless UI + rules

Storage Class Migration

The deployment previously used dynamic NFS provisioning with UUID-based directories. Migration to static PVs completed January 1, 2026:

Old (dynamic): /mnt/user/data/pvc-<uuid>/

New (static): /mnt/user/data/paperless/{export,incoming,media}/

Known Limitations

  1. OCR Runs in Skip Mode (PAPERLESS_OCR_MODE: skip)
     • Only pages without an existing text layer are OCRed
     • Switching to redo or force enables full OCR but increases CPU usage significantly

  2. 15-Minute File Transfer Delay
     • Files dropped in /incoming are moved every 15 minutes
     • Acceptable for the document workflow, but not instant

  3. No Email Ingestion
     • Not configured in the initial deployment
     • Can be added later

Future Enhancements

  • Enable OCR processing
  • Configure PostgreSQL automated backups
  • Implement volsync backup to Garage S3
  • Email ingestion configuration
  • LLM-assisted auto-tagging
  • Barcode-based document splitting

Last Updated: January 1, 2026