Rook-Ceph Operations

This document covers operational procedures for managing Rook-Ceph storage in the production Kubernetes cluster.

Overview

  • Cluster: Production Kubernetes cluster on Fuji
  • Storage Nodes: prod-storage-01, prod-storage-02
  • Total Capacity: 8.7 TiB (16 OSDs, 8 per node)
  • Storage Network: VLAN 104 (172.16.104.0/24) - 10GbE dedicated storage network
  • Rook Version: v1.18.5
  • Ceph Version: v18.2.0 (Reef)

Accessing the Ceph Dashboard

The Ceph dashboard provides a web-based interface for monitoring and managing the Ceph cluster.

Quick Access via Port Forward

To access the dashboard locally:

export KUBECONFIG=/Users/dskaggs/Projects/homelab/infra-talos/cluster-configs/production/kubeconfig
kubectl -n rook-ceph port-forward svc/rook-ceph-mgr-dashboard 8443:8443

Then browse to: https://localhost:8443
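
If the port-forward fails or the page does not load, check which port the dashboard service actually exposes (Rook typically uses 8443 when SSL is enabled and 7000 when it is disabled) and adjust the target port accordingly:

kubectl -n rook-ceph get svc rook-ceph-mgr-dashboard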

Login Credentials:

  • Username: admin
  • Password: Retrieve from password manager (stored as "Rook-Ceph Dashboard - Production")

Note: The dashboard uses a self-signed certificate. You'll need to accept the browser security warning.

Future: Ingress Access (After Split DNS Setup)

Once internal split DNS is configured, an ingress resource can be created for permanent access via an internal-only hostname with proper TLS certificates from cert-manager.
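
As a rough sketch of what that Ingress could look like (the hostname, ingress class, and cert-manager cluster issuer below are placeholders, and the backend port must match how the dashboard service is exposed):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rook-ceph-dashboard
  namespace: rook-ceph
  annotations:
    cert-manager.io/cluster-issuer: internal-ca            # placeholder issuer name
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"  # needed if the dashboard itself serves TLS
spec:
  ingressClassName: nginx
  rules:
    - host: ceph.internal.example.com                      # placeholder internal-only hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: rook-ceph-mgr-dashboard
                port:
                  number: 8443
  tls:
    - hosts:
        - ceph.internal.example.com
      secretName: rook-ceph-dashboard-tls                  # cert-manager stores the issued certificate here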

Common Operations

Check Cluster Health

export KUBECONFIG=/Users/dskaggs/Projects/homelab/infra-talos/cluster-configs/production/kubeconfig
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status

Expected output for a healthy cluster:

  cluster:
    id:     a19f3007-4cc6-4c9f-923a-764e6ea06150
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum a
    mgr: b(active), standbys: a
    osd: 16 osds: 16 up, 16 in
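
If the cluster reports HEALTH_WARN or HEALTH_ERR instead, list the specific issues with:

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph health detail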

View OSD Topology

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd tree

Check Storage Capacity

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph df

View OSD Performance Statistics

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd perf

List All Pools

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd pool ls detail

Storage Classes

The following storage classes are available for PVC provisioning:

  • ceph-block: RWO (ReadWriteOnce) block storage for single-node access
  • ceph-filesystem: RWX (ReadWriteMany) filesystem storage for multi-node access
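
To confirm both classes are registered in the cluster:

kubectl get storageclass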

Example: Creating a PVC

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: ceph-block
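
For shared, multi-node access, the same pattern applies with the ceph-filesystem class and the ReadWriteMany access mode (name and size below are placeholders):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-shared-data
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: ceph-filesystem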

Troubleshooting

OSD Not Starting

Check OSD pod logs:

kubectl -n rook-ceph logs <rook-ceph-osd-X-pod-name>

Check OSD prepare job logs:

kubectl -n rook-ceph logs <rook-ceph-osd-prepare-pod-name>
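
If the exact pod names are unknown, they can usually be found via the labels Rook applies to OSD and prepare pods:

kubectl -n rook-ceph get pods -l app=rook-ceph-osd
kubectl -n rook-ceph get pods -l app=rook-ceph-osd-prepare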

Check OSD Status on Node

Talos Linux does not provide an interactive shell, so inspect storage nodes directly with talosctl. List the OSD disks on a storage node:

talosctl -n 172.16.103.170 list -l /dev/disk/by-path | grep scsi
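
Depending on the Talos version, a structured summary of the node's disks should also be available with:

talosctl -n 172.16.103.170 get disks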

Verify Storage Network

Confirm OSDs are using the storage network (VLAN 104):

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd dump | grep -E "^osd\." | grep 172.16.104

All OSDs should show 172.16.104.x addresses for both public and cluster networks.

Maintenance

Safely Draining an OSD Node

Before performing maintenance on a storage node:

  1. Set the noout flag so data is not rebalanced while the node's OSDs are down:

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd set noout

  2. Perform the maintenance (reboot, updates, etc.); a consolidated example sequence is shown after this list.

  3. After the node is back online, unset the noout flag:

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd unset noout
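
A consolidated sketch of a full maintenance cycle might look like the following (node name and IP are placeholders; the drain step is optional and assumes the node's workloads can be rescheduled elsewhere):

# Stop Ceph from marking OSDs out and rebalancing while the node is down
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd set noout

# Optionally drain the Kubernetes node so pods move off before the reboot
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Reboot the node through Talos
talosctl -n <node-ip> reboot

# Once the node rejoins, allow pods to schedule on it again
kubectl uncordon <node-name>

# Re-enable normal rebalancing and confirm the cluster recovers
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd unset noout
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status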

Adding a New OSD

To add additional OSDs, update the storage node configuration in:

  • File: flux-repo/infrastructure/controllers/rook-ceph/helmrelease-rook-ceph-cluster.yaml
  • Section: spec.values.storage.nodes

Add new device entries and commit to trigger Flux deployment.
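
For reference, the node and device entries follow the Rook CephCluster storage layout; a sketch of adding one device might look like the following (the node name matches the overview above, but the device paths are placeholders, and the exact nesting of these values inside the HelmRelease may differ):

storage:
  nodes:
    - name: prod-storage-01
      devices:
        - name: /dev/disk/by-path/pci-0000:00:10.0-scsi-0:0:0:0   # existing device (placeholder path)
        - name: /dev/disk/by-path/pci-0000:00:10.0-scsi-0:0:2:0   # newly added device (placeholder path)

Once the change is committed and reconciled by Flux, the operator launches an OSD prepare job for the new device; its progress can be followed with the troubleshooting commands above.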