Operational Runbooks
This section contains step-by-step procedures for common operational tasks and maintenance activities.
FluxCD Operations
Upgrading Flux CRDs and HelmRelease API Version
Procedures for updating FluxCD Custom Resource Definitions and migrating HelmRelease resources to newer API versions.
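Before migrating, it can help to know which manifests still declare an older HelmRelease API version. The following is a minimal sketch of such a scan, not the procedure from the linked runbook. The list of versions treated as outdated and the function name `find_deprecated_api_versions` are assumptions made for illustration; adjust them to the Flux release you are targeting.

```python
import re

# Assumption: these older HelmRelease API versions are the ones to flag.
DEPRECATED = (
    "helm.toolkit.fluxcd.io/v2beta1",
    "helm.toolkit.fluxcd.io/v2beta2",
)

# Matches unquoted top-level "apiVersion: <value>" lines in a YAML document.
API_VERSION_RE = re.compile(r"^apiVersion:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def find_deprecated_api_versions(manifests):
    """Return {name: [deprecated apiVersion, ...]} for manifests needing migration.

    `manifests` maps a file name to its YAML text (hypothetical input shape).
    """
    hits = {}
    for name, text in manifests.items():
        found = [v for v in API_VERSION_RE.findall(text) if v in DEPRECATED]
        if found:
            hits[name] = found
    return hits
```

The scan works on plain text rather than parsed YAML, so it has no third-party dependencies, but it will miss quoted or indented `apiVersion` values.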
Git-Pinned Flux Upgrades
Instructions for upgrading FluxCD when using git-pinned component versions.
Infrastructure Maintenance
Proxmox Root Password Reset
Procedure for resetting the root password on Proxmox hosts via GRUB when the existing password is unknown.
Dell R630 BIOS & iDRAC Update
Instructions for updating BIOS, iDRAC, and firmware on Dell PowerEdge R630 servers using the Dell Support Live Image and SUU ISO.
Kubernetes Cluster Operations
- Node maintenance and updates
- Certificate rotation
- Backup and restore procedures
- Disaster recovery testing
Storage Operations
- Persistent volume management
- Ceph cluster maintenance (planned)
- VolSync backup operations
- Storage capacity planning
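For the capacity planning item, a rough projection is often enough to decide when to add storage. The sketch below assumes linear growth, which real workloads rarely follow exactly. The function name and the GiB units are illustrative choices, not something defined elsewhere in these runbooks.

```python
def days_until_full(capacity_gib, used_gib, daily_growth_gib):
    """Estimate days until a volume fills, assuming constant daily growth.

    Returns None when usage is flat or shrinking (no projected fill date)
    and 0.0 when the volume is already at or over capacity.
    """
    if daily_growth_gib <= 0:
        return None
    free_gib = capacity_gib - used_gib
    if free_gib <= 0:
        return 0.0
    return free_gib / daily_growth_gib
```

For example, a 1000 GiB volume with 600 GiB used and 4 GiB of growth per day gives `days_until_full(1000, 600, 4)`, which evaluates to `100.0`.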
Network Operations
- VLAN configuration changes
- Switch maintenance procedures
- Firewall rule updates
- DNS configuration changes
Monitoring and Alerting
Prometheus Operations
- Alerting rule updates
- Metric retention management
- Prometheus configuration updates
- Grafana dashboard management
Log Management
- Log retention policies
- Log aggregation setup
- Alert investigation procedures
- Performance troubleshooting
Security Operations
Certificate Management
- TLS certificate renewal
- Certificate authority operations
- Secrets rotation procedures
- Access control updates
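As an illustration of the TLS certificate renewal item, the sketch below checks how many days remain before a certificate's `notAfter` timestamp. It uses only the Python standard library. The 30-day threshold and the function names are assumptions chosen for the example, not values taken from these runbooks.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter timestamp.

    `not_after` uses the format returned by ssl.SSLSocket.getpeercert(),
    for example "Jan 11 00:00:00 2030 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    now = now or datetime.now(timezone.utc)
    return (expires - now).total_seconds() / 86400


def needs_renewal(not_after, threshold_days=30, now=None):
    """True when the certificate expires within `threshold_days` (assumed default)."""
    return days_until_expiry(not_after, now) < threshold_days
```

Passing `now` explicitly keeps the check deterministic in tests. In normal use it can be omitted, and the current UTC time is used instead.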
Backup Operations
- Backup verification procedures
- Disaster recovery testing
- Off-site backup management
- Data retention compliance
Emergency Procedures
Incident Response
- Service outage procedures
- Data loss recovery
- Security incident response
- Communication protocols
Disaster Recovery
- Complete cluster rebuild
- Data center failover
- Cloud failover procedures
- Business continuity planning