# Rook-Ceph Operations
This document covers operational procedures for managing Rook-Ceph storage in the production Kubernetes cluster.
## Overview
- Cluster: Production Kubernetes cluster on Fuji
- Storage Nodes: prod-storage-01, prod-storage-02
- Total Capacity: 8.7 TiB (16 OSDs, 8 per node)
- Storage Network: VLAN 104 (172.16.104.0/24) - 10GbE dedicated storage network
- Rook Version: v1.18.5
- Ceph Version: v18.2.0 (Reef)
## Accessing the Ceph Dashboard
The Ceph dashboard provides a web-based interface for monitoring and managing the Ceph cluster.
### Quick Access via Port Forward

To access the dashboard locally:

```bash
export KUBECONFIG=/Users/dskaggs/Projects/homelab/infra-talos/cluster-configs/production/kubeconfig
kubectl -n rook-ceph port-forward svc/rook-ceph-mgr-dashboard 8443:7000
```

Then browse to: https://localhost:8443
Login Credentials:

- Username: `admin`
- Password: Retrieve from password manager (stored as "Rook-Ceph Dashboard - Production")
Note: The dashboard uses a self-signed certificate. You'll need to accept the browser security warning.
### Future: Ingress Access (After Split DNS Setup)
Once internal split DNS is configured, an ingress resource can be created for permanent access via an internal-only hostname with proper TLS certificates from cert-manager.
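As a rough sketch of what that could look like, the manifest below assumes an ingress-nginx controller, a cert-manager ClusterIssuer named `internal-issuer`, and the hostname `ceph-dashboard.internal.example.com`; all three are placeholders rather than values from this cluster.

```yaml
# Sketch only: hostname, issuer, and ingress class are placeholders.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rook-ceph-dashboard
  namespace: rook-ceph
  annotations:
    cert-manager.io/cluster-issuer: internal-issuer        # placeholder ClusterIssuer
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"  # dashboard serves TLS with a self-signed cert
spec:
  ingressClassName: nginx
  rules:
    - host: ceph-dashboard.internal.example.com            # placeholder internal hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: rook-ceph-mgr-dashboard
                port:
                  number: 7000
  tls:
    - hosts:
        - ceph-dashboard.internal.example.com
      secretName: rook-ceph-dashboard-tls
```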
## Common Operations
### Check Cluster Health

```bash
export KUBECONFIG=/Users/dskaggs/Projects/homelab/infra-talos/cluster-configs/production/kubeconfig
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
```

Expected output for a healthy cluster:
```
  cluster:
    id:     a19f3007-4cc6-4c9f-923a-764e6ea06150
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum a
    mgr: b(active), standbys: a
    osd: 16 osds: 16 up, 16 in
```
### View OSD Topology
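The command for this step isn't recorded here; assuming the same toolbox deployment used above, the standard Ceph command is:

```bash
# Show the CRUSH hierarchy: hosts, OSDs, weights, and up/in status
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd tree
```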
### Check Storage Capacity
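Likewise, a reasonable choice here is `ceph df` from the toolbox, which reports raw and per-pool usage:

```bash
# Cluster-wide raw capacity plus per-pool usage and headroom
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph df
```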
### View OSD Performance Statistics
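No command is captured for this step either; `ceph osd perf` is the usual way to get a quick per-OSD latency view:

```bash
# Per-OSD commit and apply latency in milliseconds
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd perf
```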
### List All Pools
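The standard command for this is `ceph osd pool ls detail`:

```bash
# List pools with replication size, PG count, and other per-pool settings
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd pool ls detail
```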
## Storage Classes
The following storage classes are available for PVC provisioning:
- ceph-block: RWO (ReadWriteOnce) block storage for single-node access
- ceph-filesystem: RWX (ReadWriteMany) filesystem storage for multi-node access
### Example: Creating a PVC

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: ceph-block
```
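For shared multi-node access, the same pattern applies against `ceph-filesystem`; the claim below is a sketch with a hypothetical name, not an existing manifest:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-shared-data          # hypothetical name for illustration
spec:
  accessModes:
    - ReadWriteMany             # RWX: mountable read-write by pods on multiple nodes
  resources:
    requests:
      storage: 10Gi
  storageClassName: ceph-filesystem
```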
## Troubleshooting
### OSD Not Starting
Check OSD pod logs:
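The exact command isn't captured here; a reasonable approach, relying on the `app=rook-ceph-osd` label and per-OSD deployment names that Rook applies, is:

```bash
# Logs for a specific OSD (Rook names the deployments rook-ceph-osd-<id>)
kubectl -n rook-ceph logs deploy/rook-ceph-osd-0

# Or tail all OSD pods at once via the label selector
kubectl -n rook-ceph logs -l app=rook-ceph-osd --all-containers --prefix
```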
Check OSD prepare job logs:
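The prepare jobs create pods labelled `app=rook-ceph-osd-prepare`, so their logs can be pulled the same way; failures here usually explain a missing OSD:

```bash
# One prepare job runs per storage node
kubectl -n rook-ceph logs -l app=rook-ceph-osd-prepare --all-containers --prefix
```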
### Check OSD Status on Node
Get shell access to a storage node:
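No command is recorded for this step. Assuming the nodes run Talos (as the `infra-talos` repo path suggests) there is no SSH, so one option is an ephemeral debug pod, which mounts the node's root filesystem at `/host`:

```bash
# Ephemeral debug container on the storage node; exit the shell to clean it up
kubectl debug node/prod-storage-01 -it --image=busybox:1.36
```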
List OSD disks:
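The original command isn't captured; from the debug pod you can inspect `/host/dev`, or skip the node entirely and ask Ceph which devices back each OSD:

```bash
# Devices known to Ceph, mapped to host, /dev path, and the daemons using them
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph device ls
```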
### Verify Storage Network
Confirm OSDs are using the storage network (VLAN 104):
```bash
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd dump | grep -E "^osd\." | grep 172.16.104
```
All OSDs should show 172.16.104.x addresses for both public and cluster networks.
## Maintenance
### Safely Draining an OSD Node
Before performing maintenance on a storage node:
1. Set the `noout` flag to prevent rebalancing (see the commands after this list).
2. Perform the maintenance (reboot, updates, etc.).
3. After the node is back online, unset the `noout` flag.
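The flag is toggled from the toolbox with the standard Ceph commands:

```bash
# Before maintenance: stop CRUSH from marking down OSDs out (prevents rebalancing)
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd set noout

# After the node is back and its OSDs are up again
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd unset noout
```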
### Adding a New OSD
To add additional OSDs, update the storage node configuration in:
- File: `flux-repo/infrastructure/controllers/rook-ceph/helmrelease-rook-ceph-cluster.yaml`
- Section: `spec.values.storage.nodes`
Add new device entries and commit to trigger Flux deployment.
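As an illustration only (the device names are placeholders, and the exact nesting depends on how the HelmRelease values are structured), an entry in that section typically looks like:

```yaml
storage:
  useAllNodes: false
  useAllDevices: false
  nodes:
    - name: prod-storage-01
      devices:
        - name: sdb    # placeholder; replace with the new disk's actual device name
        - name: sdc
```

Once the change is committed, Flux reconciles the HelmRelease and the Rook operator runs a new OSD prepare job for the added device.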