Kubernetes Troubleshooting Reference

This page provides a quick reference for common Kubernetes troubleshooting commands and techniques used in the homelab environment.

Pod Troubleshooting

Checking Pod Status

# Get all pods in a namespace
kubectl get pods -n <namespace>

# Get pods with labels
kubectl get pods -n <namespace> -l app.kubernetes.io/name=<app-name>

# Watch pods in real-time
kubectl get pods -n <namespace> -w

# Get detailed pod information
kubectl describe pod <pod-name> -n <namespace>

# Check pod events (they appear at the end of describe output)
kubectl describe pod <pod-name> -n <namespace> | tail -20

When to use: Start with get pods to see overall status, then use describe to investigate why a pod is stuck in Pending, ContainerCreating, or CrashLoopBackOff states.
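The first-look status check can be scripted; a minimal local sketch (the helper name and the sample output are illustrative, not live cluster data):

```shell
# filter_unhealthy: read 'kubectl get pods --no-headers' output on stdin
# and print the names of pods that are not Running or Completed.
filter_unhealthy() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 }'
}

# Illustrative sample of captured output:
printf '%s\n' \
  'app-0   1/1   Running            0   5m' \
  'db-0    0/1   CrashLoopBackOff   4   5m' \
  | filter_unhealthy
# prints: db-0
```

Against a live cluster you would pipe kubectl get pods -n <namespace> --no-headers into the helper instead.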

Viewing Pod Logs

# Get logs from a pod
kubectl logs <pod-name> -n <namespace>

# Get logs from a specific container in a multi-container pod
kubectl logs <pod-name> -n <namespace> -c <container-name>

# Tail logs in real-time
kubectl logs <pod-name> -n <namespace> --tail=50 -f

# Get logs from all pods with a label
kubectl logs -n <namespace> -l app.kubernetes.io/name=<app-name> --tail=20

# Get logs from previous crashed container
kubectl logs <pod-name> -n <namespace> --previous

When to use: Use when troubleshooting application crashes, startup failures, or unexpected behavior. The --previous flag is critical for investigating why a pod crashed.

Checking Container State

# Get container state details
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].state}'

# Get container image being used
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].image}'

# Check all container images in a pod
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].image}'

When to use: Useful when a pod is stuck in ContainerCreating or you need to verify which image version is actually running.
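For a container that has already restarted, the exit reason of the previous run (e.g. OOMKilled, Error) is kept under lastState; a sketch:

```shell
# Why did the previous container instance exit?
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```

This pairs well with kubectl logs --previous from the section above.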

Executing Commands in Pods

# Execute a command in a pod
kubectl exec -n <namespace> <pod-name> -- <command>

# Get an interactive shell
kubectl exec -it -n <namespace> <pod-name> -- /bin/bash
# or for Alpine-based images:
kubectl exec -it -n <namespace> <pod-name> -- /bin/sh

# Execute in a specific container
kubectl exec -it -n <namespace> <pod-name> -c <container-name> -- /bin/bash

# List files in a directory
kubectl exec -n <namespace> <pod-name> -- ls -lah /config/

# Check mounted volumes
kubectl exec -n <namespace> <pod-name> -- df -h

When to use: Essential for inspecting file systems, verifying mounts, checking configurations, and debugging application issues directly.

Storage & PVC Troubleshooting

Checking PVCs and PVs

# List PersistentVolumeClaims
kubectl get pvc -n <namespace>

# Get detailed PVC information
kubectl describe pvc <pvc-name> -n <namespace>

# List all PersistentVolumes
kubectl get pv

# Check which PV is bound to a PVC
kubectl get pv | grep <pvc-name>

# Get PV details
kubectl describe pv <pv-name>

When to use: Use when pods are stuck in Pending due to PVC issues, or when investigating storage capacity or binding problems.

Checking Storage Classes

# List available storage classes
kubectl get storageclass

# Get detailed storage class information
kubectl describe storageclass <storage-class-name>

# Check which storage class is default
kubectl get storageclass | grep "(default)"

When to use: Use when PVCs aren't being provisioned, or to verify which storage backend will be used for dynamic provisioning.
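If the grep above finds nothing, the default flag can also be read straight from the annotation using kubectl's JSONPath filter syntax (note the escaped dots in the annotation key):

```shell
# Print the name of the storage class annotated as cluster default
kubectl get storageclass -o jsonpath='{.items[?(@.metadata.annotations.storageclass\.kubernetes\.io/is-default-class=="true")].metadata.name}'
```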

Verifying Volume Mounts

# Check volumes mounted in a pod
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.volumes[*].name}' | tr ' ' '\n'

# Verify volume mount paths
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 "Mounts:"

# Check mounted filesystems inside pod
kubectl exec -n <namespace> <pod-name> -- df -h | grep -E "(config|media|data)"

When to use: Use when verifying data persistence, debugging mount issues, or confirming correct PVC attachments.

Network Troubleshooting

Checking Services and Endpoints

# List services in a namespace
kubectl get svc -n <namespace>

# Get service details
kubectl describe svc <service-name> -n <namespace>

# Check service endpoints
kubectl get endpoints <service-name> -n <namespace>

# Verify service selector matches pods
kubectl get pods -n <namespace> --show-labels

When to use: Use when services aren't reaching pods, or when LoadBalancer/ClusterIP services aren't responding.
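A service only gets endpoints when its selector is a subset of a pod's labels. That subset check can be sketched locally (the helper name is made up; inputs are comma-separated key=value strings like those shown by --show-labels):

```shell
# selector_matches: succeed if every key=value in the selector
# also appears in the pod's label list.
selector_matches() {
  sel="$1"; labels="$2"
  for kv in $(echo "$sel" | tr ',' ' '); do
    case ",$labels," in
      *",$kv,"*) ;;
      *) return 1 ;;
    esac
  done
  return 0
}

selector_matches "app=web,tier=frontend" "app=web,tier=frontend,pod-template-hash=abc" && echo match
# prints: match
```

If the check fails for your service, the endpoints list will be empty and traffic will go nowhere.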

Checking Ingress

# List ingresses
kubectl get ingress -n <namespace>

# Get ingress details
kubectl describe ingress <ingress-name> -n <namespace>

# Check ingress class
kubectl get ingressclass

When to use: Use when external traffic isn't reaching services, or when troubleshooting TLS/certificate issues.

Network Connectivity Testing

# Test connectivity from a debug pod
kubectl run -it --rm debug --image=nicolaka/netshoot -n <namespace> -- ping <ip-address>

# Test DNS resolution
kubectl run -it --rm debug --image=nicolaka/netshoot -n <namespace> -- nslookup <hostname>

# Test HTTP connectivity
kubectl run -it --rm debug --image=nicolaka/netshoot -n <namespace> -- curl <url>

When to use: Use when diagnosing network connectivity issues, DNS problems, or service communication failures.
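Cluster DNS names follow the pattern <service>.<namespace>.svc.cluster.local, so a quick in-cluster sanity check for service discovery is:

```shell
# Resolve and probe a service by its full cluster DNS name
kubectl run -it --rm debug --image=nicolaka/netshoot -n <namespace> -- \
  nslookup <service-name>.<namespace>.svc.cluster.local
```

If the short name fails but the full name resolves, suspect the pod's DNS search domains rather than the service itself.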

Flux/GitOps Troubleshooting

Validating Kustomize Builds

Before committing changes to the flux-repo, validate that your Kustomize configurations build correctly:

# Navigate to flux-repo
cd /Users/dskaggs/Projects/homelab/flux-repo

# Build the production overlay (includes base + patches)
kustomize build apps/overlays/prod/<app-name>

# Build just the base (without overlay patches)
kustomize build apps/_bases/<app-name>

# Dry-run apply to check for Kubernetes validation errors
kubectl apply --dry-run=client -k apps/overlays/prod/<app-name>

# Run repository-wide validation script
./scripts/validate-kustomize.sh

What these commands catch:

  • YAML syntax errors
  • Invalid patch operations
  • Missing resources referenced in patches
  • Kustomization configuration errors
  • Kubernetes resource validation errors

When to use: Always validate changes before committing to prevent deployment failures. Use kustomize build to see the final rendered YAML, with all patches applied, exactly as it will reach the cluster.
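To validate every overlay at once rather than one app at a time, a loop like this works (a sketch assuming the apps/overlays/prod/<app> layout shown above; the repo's validate-kustomize.sh may already do something similar):

```shell
# Build every prod overlay; stop at the first failure
for app in apps/overlays/prod/*/; do
  kustomize build "$app" > /dev/null || { echo "FAILED: $app" >&2; exit 1; }
done
echo "all overlays build cleanly"
```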

Checking Flux Resources

# Get all Flux kustomizations
flux get kustomizations -A

# Get Flux sources
flux get sources git -n flux-system
flux get sources helm -n flux-system

# Check HelmReleases
kubectl get helmrelease -A
flux get helmreleases -A

# Get detailed HelmRelease status
kubectl describe helmrelease <name> -n <namespace>

When to use: Use when applications aren't deploying, or to verify GitOps reconciliation status.

Force Flux Reconciliation

# Force reconcile a kustomization
flux reconcile kustomization <name> -n flux-system

# Force reconcile with source update
flux reconcile kustomization <name> -n flux-system --with-source

# Reconcile a HelmRelease
flux reconcile helmrelease <name> -n <namespace>

# Suspend and resume a kustomization
flux suspend kustomization <name> -n flux-system
flux resume kustomization <name> -n flux-system

When to use: Use after pushing changes to Git, when Flux hasn't automatically reconciled, or when troubleshooting stuck deployments.

Checking Flux Logs

# Get Flux controller logs
flux logs --kind Kustomization --name <name> -n flux-system

# Get all Flux logs
kubectl logs -n flux-system deployment/source-controller --tail=100
kubectl logs -n flux-system deployment/kustomize-controller --tail=100
kubectl logs -n flux-system deployment/helm-controller --tail=100

When to use: Use when Flux reconciliation is failing, or to debug HelmRelease/Kustomization issues.

Checking HelmCharts

# List HelmCharts
kubectl get helmchart -n flux-system

# Get HelmChart details
kubectl describe helmchart <name> -n flux-system

# Delete HelmChart to force rebuild
kubectl delete helmchart <name> -n flux-system

When to use: Use when HelmReleases are stuck with chart download errors. Deleting the HelmChart resource forces Flux to re-download and rebuild.

Resource Inspection

Checking Node Resources

# Get node status
kubectl get nodes

# Get detailed node information
kubectl describe node <node-name>

# Check node resource usage
kubectl top nodes

# Check pod resource usage on nodes
kubectl top pods -A

When to use: Use when pods are pending due to insufficient resources, or when investigating performance issues.

Checking Deployments and ReplicaSets

# List deployments
kubectl get deployments -n <namespace>

# Get deployment details
kubectl describe deployment <deployment-name> -n <namespace>

# Check ReplicaSets
kubectl get replicasets -n <namespace>

# Scale a deployment
kubectl scale deployment <deployment-name> -n <namespace> --replicas=<count>

When to use: Use when managing application replicas, investigating rollout issues, or temporarily stopping applications.

Checking Events

# Get recent events in a namespace
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Get events and filter
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | grep <keyword>

# Watch events in real-time
kubectl get events -n <namespace> -w

# Get events for all namespaces
kubectl get events -A --sort-by='.lastTimestamp' | tail -20

When to use: Critical for understanding what happened during failures. Events show pod scheduling, volume attachment, image pulls, and more.
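Most failure signals arrive as Warning events, so filtering by type cuts the noise considerably:

```shell
# Show only Warning events, oldest first
kubectl get events -n <namespace> --field-selector type=Warning --sort-by='.lastTimestamp'
```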

Data Migration & Backup

Copying Files To/From Pods

# Copy file from pod to local machine
kubectl cp <namespace>/<pod-name>:/path/to/file /local/path

# Copy file from local machine to pod
kubectl cp /local/path <namespace>/<pod-name>:/path/to/file

# Copy directory from pod
kubectl cp <namespace>/<pod-name>:/path/to/dir /local/path

# Create tar backup from pod
kubectl exec -n <namespace> <pod-name> -- tar czf /tmp/backup.tar.gz -C /config .
kubectl cp <namespace>/<pod-name>:/tmp/backup.tar.gz /tmp/backup.tar.gz

When to use: Essential for migrating application data between clusters, creating backups, or restoring configurations.
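The tar pattern above round-trips cleanly because -C makes all archived paths relative. A purely local demonstration, no cluster involved (paths are temporary directories created for the example):

```shell
# Pack a directory with relative paths, then restore it elsewhere
src=$(mktemp -d); dst=$(mktemp -d)
echo "hello" > "$src/settings.conf"
tar czf /tmp/demo-backup.tar.gz -C "$src" .
tar xzf /tmp/demo-backup.tar.gz -C "$dst"
cat "$dst/settings.conf"
# prints: hello
```

The same relative-path rule is why the in-pod backup uses tar czf ... -C /config . rather than archiving /config by absolute path.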

Creating Temporary Backup/Restore Pods

# Create a pod with PVC mounted for backup/restore
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: backup-pod
  namespace: <namespace>
spec:
  containers:
  - name: backup
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: data
      mountPath: /data
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
      runAsNonRoot: true
      runAsUser: 1000
      runAsGroup: 1000
      seccompProfile:
        type: RuntimeDefault
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: <pvc-name>
  restartPolicy: Never
EOF

# Wait for pod to be ready
kubectl wait --for=condition=Ready pod/backup-pod -n <namespace> --timeout=60s

# Cleanup when done
kubectl delete pod backup-pod -n <namespace>

When to use: Use when you need to access PVC data while the main application is scaled down, or for data migration between clusters.

Database Operations

PostgreSQL (CloudNativePG)

# List PostgreSQL clusters
kubectl get cluster -n <namespace>

# Get cluster status
kubectl describe cluster <cluster-name> -n <namespace>

# Get PostgreSQL pods
kubectl get pods -n <namespace> -l cnpg.io/cluster=<cluster-name>

# Connect to PostgreSQL
kubectl exec -it <pod-name> -n <namespace> -- psql -U postgres

# Dump a database
kubectl exec <pod-name> -n <namespace> -- pg_dump -U postgres <database-name> > backup.sql

# Restore a database via stdin
cat backup.sql | kubectl exec -i <pod-name> -n <namespace> -- psql -U postgres -d <database-name>

When to use: Use for PostgreSQL administration, backups, and data migration with CloudNativePG operator.
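By default psql keeps going after an error, which can leave a restore half-applied. These standard psql options make a failed restore abort loudly instead:

```shell
# Abort on first error and wrap the restore in a single transaction
cat backup.sql | kubectl exec -i <pod-name> -n <namespace> -- \
  psql -U postgres -d <database-name> -v ON_ERROR_STOP=1 --single-transaction
```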

Common Troubleshooting Workflows

Application Not Starting

  1. Check pod status: kubectl get pods -n <namespace>
  2. Check events: kubectl describe pod <pod-name> -n <namespace> | tail -20
  3. Check logs: kubectl logs <pod-name> -n <namespace>
  4. Check PVC binding: kubectl get pvc -n <namespace>
  5. Check image pull: kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Events:"
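The five steps above can be wrapped in one hypothetical helper for repeated use (the name and structure are illustrative; it assumes kubectl access to the cluster):

```shell
# diagnose: run the standard first-look checks for a namespace/pod pair
diagnose() {
  ns="$1"; pod="$2"
  kubectl get pods -n "$ns"
  kubectl describe pod "$pod" -n "$ns" | tail -20
  kubectl logs "$pod" -n "$ns" --tail=50
  kubectl get pvc -n "$ns"
}
# usage: diagnose <namespace> <pod-name>
```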

Storage Issues

  1. Check PVC status: kubectl get pvc -n <namespace>
  2. Check storage class: kubectl get storageclass
  3. Check CSI driver pods: kubectl get pods -n <csi-namespace>
  4. Check PV binding: kubectl describe pvc <pvc-name> -n <namespace>
  5. Verify volume mounts: kubectl exec -n <namespace> <pod-name> -- df -h

Flux Deployment Not Updating

  1. Check Flux reconciliation: flux get kustomizations -A
  2. Check Git source: flux get sources git -n flux-system
  3. Force reconcile: flux reconcile kustomization <name> --with-source
  4. Check HelmRelease: kubectl get helmrelease -n <namespace>
  5. Check Flux logs: flux logs --kind Kustomization --name <name>

Network Connectivity Issues

  1. Check service: kubectl get svc -n <namespace>
  2. Check endpoints: kubectl get endpoints <service-name> -n <namespace>
  3. Check ingress: kubectl get ingress -n <namespace>
  4. Test from debug pod: kubectl run -it --rm debug --image=nicolaka/netshoot -- curl <url>
  5. Check network policies: kubectl get networkpolicy -n <namespace>

Tips and Best Practices

  • Always check events first: Events often contain the root cause of issues
  • Use labels for filtering: Speeds up troubleshooting in namespaces with many resources
  • Watch resources in real-time: Add -w to get commands when waiting for changes
  • Use --previous for crashed containers: Essential for diagnosing crash loops
  • Tail logs with context: Use --tail=50 instead of dumping entire logs
  • Use debug pods for network testing: The nicolaka/netshoot image has all network troubleshooting tools
  • Scale down before data migration: Prevents data inconsistency during migration
  • Force Flux reconciliation after Git pushes: Don't wait for the interval when actively developing