# Rook-Ceph Operations
This document covers operational procedures for managing Rook-Ceph storage in the production Kubernetes cluster.
## Overview
- Cluster: Production Kubernetes cluster on Fuji
- Storage Nodes: prod-storage-01, prod-storage-02
- Total Capacity: 8.7 TiB (16 OSDs, 8 per node)
- Storage Network: VLAN 104 (172.16.104.0/24) - 10GbE dedicated storage network
- Rook Version: v1.18.5
- Ceph Version: v18.2.0 (Reef)
## Accessing the Ceph Dashboard
The Ceph dashboard provides a web-based interface for monitoring and managing the Ceph cluster.
### Quick Access via Port Forward

To access the dashboard locally:

```bash
export KUBECONFIG=/Users/dskaggs/Projects/homelab/infra-talos/cluster-configs/production/kubeconfig
kubectl -n rook-ceph port-forward svc/rook-ceph-mgr-dashboard 8443:7000
```

Then browse to: https://localhost:8443
Login Credentials:

- Username: `admin`
- Password: Retrieve from password manager (stored as "Rook-Ceph Dashboard - Production")
Note: The dashboard uses a self-signed certificate. You'll need to accept the browser security warning.
### Future: Ingress Access (After Split DNS Setup)
Once internal split DNS is configured, an ingress resource can be created for permanent access via an internal-only hostname with proper TLS certificates from cert-manager.
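As a rough sketch of what that could look like, the manifest below assumes an ingress-nginx controller, a cert-manager ClusterIssuer named `internal-issuer`, and the hostname `ceph-dashboard.internal.example.com`; all three are placeholders rather than values from this cluster.

```yaml
# Sketch only: hostname, issuer, and ingress class are placeholders.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rook-ceph-dashboard
  namespace: rook-ceph
  annotations:
    cert-manager.io/cluster-issuer: internal-issuer        # placeholder ClusterIssuer
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"  # dashboard serves TLS with a self-signed cert
spec:
  ingressClassName: nginx
  rules:
    - host: ceph-dashboard.internal.example.com            # placeholder internal hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: rook-ceph-mgr-dashboard
                port:
                  number: 7000
  tls:
    - hosts:
        - ceph-dashboard.internal.example.com
      secretName: rook-ceph-dashboard-tls
```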
## Common Operations
### Check Cluster Health

```bash
export KUBECONFIG=/Users/dskaggs/Projects/homelab/infra-talos/cluster-configs/production/kubeconfig
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
```

Expected output for a healthy cluster:
```
  cluster:
    id:     a19f3007-4cc6-4c9f-923a-764e6ea06150
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum a
    mgr: b(active), standbys: a
    osd: 16 osds: 16 up, 16 in
```
### View OSD Topology
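The command for this step isn't recorded here; assuming the same toolbox deployment used above, the standard Ceph command is:

```bash
# Show the CRUSH hierarchy: hosts, OSDs, weights, and up/in status
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd tree
```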
### Check Storage Capacity
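Likewise, a reasonable choice here is `ceph df` from the toolbox, which reports raw and per-pool usage:

```bash
# Cluster-wide raw capacity plus per-pool usage and headroom
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph df
```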
### View OSD Performance Statistics
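No command is captured for this step either; `ceph osd perf` is the usual way to get a quick per-OSD latency view:

```bash
# Per-OSD commit and apply latency in milliseconds
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd perf
```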
### List All Pools
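The standard command for this is `ceph osd pool ls detail`:

```bash
# List pools with replication size, PG count, and other per-pool settings
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd pool ls detail
```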
## Storage Classes
The following storage classes are available for PVC provisioning:
- ceph-block: RWO (ReadWriteOnce) block storage for single-node access
- ceph-filesystem: RWX (ReadWriteMany) filesystem storage for multi-node access
### Example: Creating a PVC

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: ceph-block
```
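For shared multi-node access, the same pattern applies against `ceph-filesystem`; the claim below is a sketch with a hypothetical name, not an existing manifest:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-shared-data          # hypothetical name for illustration
spec:
  accessModes:
    - ReadWriteMany             # RWX: mountable read-write by pods on multiple nodes
  resources:
    requests:
      storage: 10Gi
  storageClassName: ceph-filesystem
```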
## Troubleshooting
### OSD Not Starting
Check OSD pod logs:
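The exact command isn't captured here; a reasonable approach, relying on the `app=rook-ceph-osd` label and per-OSD deployment names that Rook applies, is:

```bash
# Logs for a specific OSD (Rook names the deployments rook-ceph-osd-<id>)
kubectl -n rook-ceph logs deploy/rook-ceph-osd-0

# Or tail all OSD pods at once via the label selector
kubectl -n rook-ceph logs -l app=rook-ceph-osd --all-containers --prefix
```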
Check OSD prepare job logs:
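The prepare jobs create pods labelled `app=rook-ceph-osd-prepare`, so their logs can be pulled the same way; failures here usually explain a missing OSD:

```bash
# One prepare job runs per storage node
kubectl -n rook-ceph logs -l app=rook-ceph-osd-prepare --all-containers --prefix
```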
### Check OSD Status on Node
Get shell access to a storage node:
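No command is recorded for this step. Assuming the nodes run Talos (as the `infra-talos` repo path suggests) there is no SSH, so one option is an ephemeral debug pod, which mounts the node's root filesystem at `/host`:

```bash
# Ephemeral debug container on the storage node; exit the shell to clean it up
kubectl debug node/prod-storage-01 -it --image=busybox:1.36
```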
List OSD disks:
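The original command isn't captured; from the debug pod you can inspect `/host/dev`, or skip the node entirely and ask Ceph which devices back each OSD:

```bash
# Devices known to Ceph, mapped to host, /dev path, and the daemons using them
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph device ls
```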
### Verify Storage Network
Confirm OSDs are using the storage network (VLAN 104):
```bash
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd dump | grep -E "^osd\." | grep 172.16.104
```
All OSDs should show 172.16.104.x addresses for both public and cluster networks.
## Maintenance
### Safely Draining an OSD Node
Before performing maintenance on a storage node:
1. Set the `noout` flag to prevent rebalancing (see the commands after this list).
2. Perform the maintenance (reboot, updates, etc.).
3. After the node is back online, unset the `noout` flag.
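The flag is toggled from the toolbox with the standard Ceph commands:

```bash
# Before maintenance: stop CRUSH from marking down OSDs out (prevents rebalancing)
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd set noout

# After the node is back and its OSDs are up again
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd unset noout
```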
### Adding a New OSD
To add additional OSDs, update the storage node configuration in:
- File: `flux-repo/infrastructure/controllers/rook-ceph/helmrelease-rook-ceph-cluster.yaml`
- Section: `spec.values.storage.nodes`
Add new device entries and commit to trigger Flux deployment.
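As an illustration only (the device names are placeholders, and the exact nesting depends on how the HelmRelease values are structured), an entry in that section typically looks like:

```yaml
storage:
  useAllNodes: false
  useAllDevices: false
  nodes:
    - name: prod-storage-01
      devices:
        - name: sdb    # placeholder; replace with the new disk's actual device name
        - name: sdc
```

Once the change is committed, Flux reconciles the HelmRelease and the Rook operator runs a new OSD prepare job for the added device.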