# Cluster Policies

**Last Updated:** October 9, 2025 · **Status:** ✅ Active
## Overview
Cluster policies provide standardized configurations for scheduling priority and resource allocation across all Kubernetes workloads. These policies ensure consistent performance, fair resource distribution, and predictable behavior during resource contention.
## Priority Classes
Priority classes determine which pods get scheduled first and which pods can be evicted when cluster resources are constrained.
### Available Priority Classes

| Priority Class | Value | Use Case | Examples |
|---|---|---|---|
| cluster-critical | 9000 | Critical infrastructure services | BGP, Traefik, CoreDNS, kube-proxy |
| cluster-important | 5000 | Important services affecting user experience | Authelia, cert-manager, sealed-secrets |
| cluster-normal | 1000 | Standard applications (default) | Most user applications |
| cluster-low | 100 | Background tasks and batch jobs | Backup jobs, cleanup tasks |
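
Each class corresponds to a PriorityClass object. Below is a minimal sketch of what the cluster-normal class could look like; the `globalDefault` and `description` fields are assumptions based on the "(default)" note in the table, and the actual manifests in the cluster may differ.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: cluster-normal
value: 1000           # matches the table above
globalDefault: true   # assumption: "(default)" means pods without an explicit class get this one
description: "Standard applications (default)"
```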
### How Priority Works

**During scheduling:**

- Higher priority pods are scheduled before lower priority pods
- If resources are insufficient, lower priority pods wait in the Pending state

**During eviction:**

- When nodes are overcommitted, lower priority pods are evicted first
- Critical pods (priority 9000) are protected from eviction
### Usage in HelmRelease

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: my-app
spec:
  values:
    priorityClassName: cluster-important
```
### Usage in Pod Specs

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  priorityClassName: cluster-normal
  containers:
    - name: app
      image: my-app:latest
```
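
Most workloads here are deployed through Helm, but the same field applies to any controller's pod template. A minimal Deployment sketch, mirroring the Pod example above (the name and image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      priorityClassName: cluster-normal   # applies to every pod created from this template
      containers:
        - name: app
          image: my-app:latest
```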
## Resource Tiers
Standardized resource request and limit configurations for consistent application performance.
### Tier Definitions
| Tier | CPU Request | CPU Limit | Memory Request | Memory Limit | Use Case |
|---|---|---|---|---|---|
| Small | 100m | 1000m | 256Mi | 1Gi | Lightweight services, configuration apps, simple APIs |
| Medium | 500m | 2000m | 1Gi | 4Gi | Standard web apps, moderate database operations |
| Large | 1000m | 4000m | 2Gi | 8Gi | Heavy processing, media operations, intensive imports |
| XLarge | 2000m | 6000m | 4Gi | 12Gi | Intensive downloads, transcoding, video processing |
| Database | 500m | 4000m | 2Gi | 8Gi | Database workloads with memory optimization |
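
To make the table concrete, a tier maps directly onto a container `resources` block. For example, the Small tier would look like this (values taken from the table; quoting style follows the other examples in this document):

```yaml
resources:
  requests:
    cpu: "100m"       # Small tier request
    memory: "256Mi"
  limits:
    cpu: "1000m"      # Small tier limit
    memory: "1Gi"
```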
### Understanding Requests vs Limits

**Requests (guaranteed):**

- The minimum resources guaranteed to the pod
- The Kubernetes scheduler uses requests to decide pod placement
- A pod will not be scheduled if no node has the requested resources available
- CPU requests determine CPU shares during contention

**Limits (ceiling):**

- The maximum resources the pod can consume
- CPU: the pod is throttled when the limit is reached
- Memory: the pod is OOM-killed if it exceeds the limit
- Pods can burst up to their limits when node resources are available
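
Because every tier sets requests lower than limits, pods end up in the Burstable QoS class (requests equal to limits would make them Guaranteed). A sketch of a single-container Burstable pod using Medium-tier values (names and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: qos-example
spec:
  containers:
    - name: app
      image: my-app:latest
      resources:
        requests:        # guaranteed share; used for scheduling decisions
          cpu: "500m"
          memory: "1Gi"
        limits:          # ceiling; requests < limits => Burstable QoS
          cpu: "2000m"
          memory: "4Gi"
```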
### Implementation

Resource tiers are defined as ConfigMaps in `flux-repo/infrastructure/policies/`:
```yaml
# resource-tier-medium.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: resource-tier-medium
  namespace: flux-system
data:
  cpu-request: "500m"
  cpu-limit: "2000m"
  memory-request: "1Gi"
  memory-limit: "4Gi"
  description: "Standard applications with moderate resource requirements"
  tier: "medium"
```
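
The other tiers presumably follow the same layout. A sketch of the Large tier, using the values from the tier table (the file name and description wording are assumptions):

```yaml
# resource-tier-large.yaml (assumed to mirror resource-tier-medium.yaml)
apiVersion: v1
kind: ConfigMap
metadata:
  name: resource-tier-large
  namespace: flux-system
data:
  cpu-request: "1000m"
  cpu-limit: "4000m"
  memory-request: "2Gi"
  memory-limit: "8Gi"
  description: "Heavy processing, media operations, intensive imports"
  tier: "large"
```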
### Usage Methods
#### Method 1: Direct Reference (Current Standard)
Apply tier values directly in HelmRelease manifests:
```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: my-app
spec:
  values:
    resources:
      requests:
        cpu: "500m"      # Medium tier
        memory: "1Gi"    # Medium tier
      limits:
        cpu: "2000m"     # Medium tier
        memory: "4Gi"    # Medium tier
```
#### Method 2: Kustomize Patches (Cluster-Specific Overrides)
Apply different tiers per cluster using Kustomize:
```yaml
# clusters/minimal/apps/my-app.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: my-app
spec:
  path: ./apps/_bases/my-app
  patches:
    - patch: |
        apiVersion: helm.toolkit.fluxcd.io/v2
        kind: HelmRelease
        metadata:
          name: my-app
        spec:
          values:
            resources:
              requests:
                cpu: "1000m"    # Large tier for prod cluster
                memory: "2Gi"
              limits:
                cpu: "4000m"
                memory: "8Gi"
      target:
        kind: HelmRelease
        name: my-app
```
#### Method 3: Multiple Containers
Apply different tiers to different containers in the same pod:
```yaml
spec:
  values:
    containers:
      app:
        resources:
          requests:
            cpu: "1000m"    # Large tier for main app
            memory: "2Gi"
          limits:
            cpu: "4000m"
            memory: "8Gi"
      sidecar:
        resources:
          requests:
            cpu: "100m"     # Small tier for sidecar
            memory: "256Mi"
          limits:
            cpu: "1000m"
            memory: "1Gi"
```
## Current Application Assignments
### Infrastructure Components
| Component | Priority | Tier | Rationale |
|---|---|---|---|
| Traefik | critical | Medium | Ingress controller, critical for all traffic |
| cert-manager | important | Small | TLS certificates, important but lightweight |
| Authelia | important | Small | Authentication, important but low resource |
| MetalLB | critical | Small | LoadBalancer IPs, critical but minimal resources |
| sealed-secrets | important | Small | Secret management, important but lightweight |
| external-dns | normal | Small | DNS automation, standard priority |
### Applications
| Application | Priority | Tier | Rationale |
|---|---|---|---|
| Vaultwarden | important | Small | Password manager, important but lightweight |
| PostgreSQL | important | Database | Database backend, important with memory needs |
| Sonarr | normal | Large | TV management, intensive media operations |
| SABnzbd | normal | XLarge | Download client, very intensive processing |
| Prowlarr | normal | Medium | Indexer manager, moderate resource needs |
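
As a combined sketch of how one of these assignments might be expressed in a HelmRelease, here is Vaultwarden with the important priority and the Small tier from the tables above. The value paths assume the chart exposes `priorityClassName` and `resources` at the top level, so the actual manifest may nest them differently:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: vaultwarden
spec:
  values:
    priorityClassName: cluster-important   # "important" priority from the table
    resources:
      requests:
        cpu: "100m"                        # Small tier
        memory: "256Mi"
      limits:
        cpu: "1000m"
        memory: "1Gi"
```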
## Capacity Planning
### Current Minimal Cluster

**Node resources:**

- Control plane: 1 node × 8 cores, 32GB RAM
- Workers: 2 nodes × 12 cores, 64GB RAM each
- Total worker capacity: 24 cores, 128GB RAM

**System overhead:**

- Kubernetes system: ~2 cores, ~8GB RAM
- Available for workloads: ~22 cores, ~120GB RAM
### Resource Request Allocation

**Current infrastructure:**

- Traefik (Medium): 500m CPU, 1Gi RAM
- cert-manager (Small): 100m CPU, 256Mi RAM
- Authelia (Small): 100m CPU, 256Mi RAM
- MetalLB (Small): 100m CPU, 256Mi RAM
- PostgreSQL (Database): 500m CPU, 2Gi RAM
- Other infra: ~500m CPU, 2Gi RAM
- Total infrastructure: ~1.8 cores, ~6Gi RAM

**Current applications:**

- Vaultwarden (Small): 100m CPU, 256Mi RAM
- Sonarr (Large): 500m CPU, 1Gi RAM
- SABnzbd (XLarge): 1000m CPU, 2Gi RAM
- Prowlarr (Medium): 200m CPU, 512Mi RAM
- Total applications: ~1.8 cores, ~3.75Gi RAM

**Total requests:** ~3.6 cores, ~10Gi RAM

**Remaining capacity:** ~18 cores, ~110Gi RAM
### Burst Capacity

When applications burst to their limits:

- Infrastructure limits: ~8 cores, ~16Gi RAM
- Application limits: ~16 cores, ~22Gi RAM
- Total potential burst: ~24 cores, ~38Gi RAM

The cluster has significant headroom for bursting and additional workloads.
## Best Practices

### Resource Allocation
- Start Conservative: Begin with a lower tier and scale up based on metrics
- Monitor Usage: Use Prometheus/Grafana to validate tier assignments (see the monitoring sketch after this list)
- Request = Guarantee: Set requests based on minimum needed resources
- Limit = Safety: Set limits to prevent resource exhaustion
- Memory for Databases: Databases benefit from higher memory allocations
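
As a sketch of the monitoring point above, assuming the cluster runs a Prometheus Operator stack with kube-state-metrics and cAdvisor metrics available, a PrometheusRule like the following could flag containers running close to their tier's memory limit. The rule name, namespace, and threshold are illustrative, not part of the current policies:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: resource-tier-alerts      # illustrative name
  namespace: monitoring           # adjust to wherever Prometheus runs
spec:
  groups:
    - name: resource-tiers
      rules:
        - alert: ContainerNearMemoryLimit
          expr: |
            max by (namespace, pod, container) (
              container_memory_working_set_bytes{container!="", image!=""}
            )
            / on (namespace, pod, container)
            max by (namespace, pod, container) (
              kube_pod_container_resource_limits{resource="memory"}
            ) > 0.9
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Container {{ $labels.container }} in {{ $labels.pod }} is above 90% of its memory limit; consider a higher tier"
```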
### Priority Assignment
- Critical: Only for services that affect cluster functionality
- Important: Services that significantly impact user experience
- Normal: Most user applications (default)
- Low: Batch jobs that can be delayed or interrupted
### Tier Selection Guide

**Choose Small when:**

- Application is primarily I/O bound
- CPU usage consistently under 0.5 cores
- Memory usage under 512Mi
- Examples: Static sites, simple APIs, configuration tools

**Choose Medium when:**

- Moderate request processing
- CPU usage 0.5-1.5 cores during normal operation
- Memory usage 512Mi-2Gi
- Examples: Web apps, REST APIs, moderate databases

**Choose Large when:**

- CPU-intensive operations (imports, processing)
- CPU usage 1-3 cores during operation
- Memory usage 1-4Gi
- Examples: Media management, batch processing

**Choose XLarge when:**

- Very intensive operations (downloads, transcoding)
- CPU usage 2+ cores regularly
- Memory usage 4Gi+
- Examples: Download clients, video processing

**Choose Database when:**

- Database workload
- Memory-intensive with moderate CPU
- Benefits from caching
- Examples: PostgreSQL, MySQL, Redis
## Troubleshooting

### Pod Pending Due to Resources

**Symptom:** Pod stuck in Pending state with the event "Insufficient cpu/memory"

**Solutions:**

1. Check current resource usage: `kubectl top nodes`
2. Lower the tier for the application
3. Scale down or remove lower-priority applications
4. Add more worker nodes
### Pod OOMKilled

**Symptom:** Pod restarting with OOMKilled status

**Solutions:**

1. Increase memory limit
2. Move to higher tier
3. Investigate memory leaks in application
4. Add memory monitoring/alerting
### CPU Throttling

**Symptom:** Application slow despite low cluster CPU usage

**Solutions:**

1. Check CPU throttling metrics
2. Increase CPU limit
3. Move to higher tier
4. Optimize application code
### Resource Contention

**Symptom:** Multiple pods competing for resources, some being evicted

**Solutions:**

1. Set appropriate requests to guarantee resources
2. Use priority classes to protect critical workloads
3. Set realistic limits to prevent resource exhaustion
4. Add more cluster capacity
## Future Enhancements

### Planned Features
- [ ] Automated Tier Recommendations: ML-based tier suggestions from metrics
- [ ] Flux Variable Substitution: Dynamic tier selection per environment
- [ ] ResourceQuota: Namespace-level resource limits based on tiers
- [ ] LimitRange: Default tier-based limits for pods without resources (see the sketch after this list)
- [ ] Vertical Pod Autoscaler (VPA): Automatic tier adjustments
- [ ] Horizontal Pod Autoscaler (HPA): Tier-specific scaling configs
- [ ] Metrics Dashboard: Grafana dashboard showing tier utilization
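
As a sketch of what the planned LimitRange item could look like, this would apply Small-tier defaults to containers that do not declare resources in a given namespace. The namespace and object name are placeholders, and the chosen default tier is an assumption:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-resource-tier
  namespace: my-app-namespace     # placeholder; LimitRange is namespace-scoped
spec:
  limits:
    - type: Container
      defaultRequest:             # applied when a container omits requests
        cpu: "100m"               # Small tier request
        memory: "256Mi"
      default:                    # applied when a container omits limits
        cpu: "1000m"              # Small tier limit
        memory: "1Gi"
```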
### Resource Optimization
- [ ] Right-sizing Analysis: Identify over/under-provisioned workloads
- [ ] Cost Attribution: Track resource costs per application
- [ ] Trend Analysis: Historical resource usage patterns
- [ ] Capacity Forecasting: Predict when cluster expansion needed
## References

- Flux Configuration: `flux-repo/infrastructure/policies/`
- Examples: `flux-repo/infrastructure/policies/examples/`
- Kubernetes Resource Management
- Priority Classes Documentation
- Flux Kustomization Patches
**Last Updated:** October 9, 2025 · **Next Review:** November 9, 2025