Kubernetes Health Checks Under Load
Observability is one of the defining traits of a reliable system, and health checks are how Kubernetes builds its picture of yours. Yet under certain circumstances those same probes can become a source of cascading failures under load: a misconfigured readiness probe can take down an entire service during peak traffic.
Probe Types and Load Impact
Kubernetes offers three probe types: liveness, readiness, and startup. Each affects the pod lifecycle differently and requires specific tuning for high-load scenarios:
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: app
    readinessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 2
      successThreshold: 1
      failureThreshold: 3
Under load, these seemingly reasonable values often prove inadequate. A 2-second timeout might work in testing but fail when the service experiences GC pauses or high CPU utilization. With periodSeconds: 10 and failureThreshold: 3, three consecutive slow responses (about 30 seconds at these settings) are enough to remove the pod from the Service endpoints right when traffic is heaviest.
Component Health Verification
A production-grade health check system typically monitors multiple critical dependencies:
Database systems require connection pool monitoring and basic connectivity verification. Message brokers like RabbitMQ or NATS need confirmation of both connection status and queue accessibility. Cache services such as Redis or Memcached must verify both connectivity and basic operations. For services with storage requirements, disk space and write permission checks are crucial.
External API dependencies need verification with appropriate timeouts. A failure in any of these checks can signal a problem before it reaches end users.
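For the external dependency case, a check can be as small as a HEAD request with a bounded timeout. A rough sketch, with the upstream URL as a placeholder:

// Sketch: verifying an external API dependency with a bounded timeout.
// The URL is a placeholder for whatever upstream the service relies on.
func checkUpstream(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, 500*time.Millisecond)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodHead, "https://upstream.example.com/health", nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	// Treat server-side errors as unhealthy; redirects and client errors
	// still prove the dependency is reachable.
	if resp.StatusCode >= 500 {
		return fmt.Errorf("upstream unhealthy: %s", resp.Status)
	}
	return nil
}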
Memory Pressure Impact
Health check endpoints must handle memory carefully, especially when verifying multiple components. For HTTP readiness probes, Kubernetes only looks at the status code: anything from 200 to 399 counts as success, and any other response marks the pod as not ready:
type ComponentCheck struct {
	Timeout time.Duration
	Check   func(context.Context) error
}

func healthCheck(w http.ResponseWriter, r *http.Request) {
	components := []ComponentCheck{
		{
			// Redis: basic connectivity check
			Timeout: 200 * time.Millisecond,
			Check: func(ctx context.Context) error {
				return redisClient.Ping(ctx).Err()
			},
		},
		{
			// Database: ping reuses a pooled connection
			Timeout: 500 * time.Millisecond,
			Check: func(ctx context.Context) error {
				return db.PingContext(ctx)
			},
		},
		{
			// NATS: round trip to the server
			Timeout: 200 * time.Millisecond,
			Check: func(ctx context.Context) error {
				return nc.FlushTimeout(200 * time.Millisecond)
			},
		},
		{
			// Memcached: basic connectivity check
			Timeout: 300 * time.Millisecond,
			Check: func(ctx context.Context) error {
				return memcached.Ping(ctx)
			},
		},
		{
			// Local storage: fail if free space drops below 10%
			Timeout: 200 * time.Millisecond,
			Check: func(ctx context.Context) error {
				var stats syscall.Statfs_t
				if err := syscall.Statfs("/path/to/required/dir", &stats); err != nil {
					return err
				}
				if (float64(stats.Bavail) / float64(stats.Blocks)) < 0.1 {
					return fmt.Errorf("low disk space")
				}
				return nil
			},
		},
	}

	// Run each check with its own deadline derived from the request context.
	for _, c := range components {
		ctx, cancel := context.WithTimeout(r.Context(), c.Timeout)
		err := c.Check(ctx)
		cancel()
		if err != nil {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
	}
	w.WriteHeader(http.StatusOK)
}
Giving each component its own timeout bounds how long the handler can hold resources, and cancelling the context as soon as a check finishes releases them promptly. Because the checks run sequentially, the worst-case response time is the sum of the timeouts (about 1.4 seconds here), which must stay below the probe's timeoutSeconds.
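If that sequential worst case sits too close to the probe timeout, one variant (a sketch, not part of the implementation above) is to run the checks concurrently so the slowest component, rather than the sum, bounds the response time:

// Sketch: concurrent variant of the handler above, reusing the same
// ComponentCheck type and component slice.
// Requires: import "golang.org/x/sync/errgroup"
func healthCheckConcurrent(w http.ResponseWriter, r *http.Request, components []ComponentCheck) {
	g, ctx := errgroup.WithContext(r.Context())
	for _, c := range components {
		c := c // capture loop variable (needed before Go 1.22)
		g.Go(func() error {
			// Each check still gets its own deadline.
			checkCtx, cancel := context.WithTimeout(ctx, c.Timeout)
			defer cancel()
			return c.Check(checkCtx)
		})
	}
	if err := g.Wait(); err != nil {
		w.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

The trade-off is a handful of short-lived goroutines per probe, so this only pays off when the number of components and their timeouts make the sequential sum a real risk.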
Database Connection Management
Health checks involving database connections require careful handling: frequent checks can exhaust connection pools, creating artificial pressure on the database. A poorly configured health check might open a new connection for each probe instead of reusing existing ones from the pool. Under high probe frequency, this can lead to connection pool saturation, leaving no connections available for actual user requests. Additionally, if health check queries don't use timeouts, they might hold connections longer than necessary, especially during database slowdowns.
func dbHealthCheck() error {
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()
	// Reuses an existing connection from the pool instead of opening a new one.
	return db.PingContext(ctx)
}
Using connection pooling and tight timeouts prevents health checks from consuming database resources during high-load periods.
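As a complement, the pool itself can be bounded so probes and user traffic draw from a predictable number of connections. A minimal sketch using database/sql; the numbers are placeholders, not tuned recommendations:

// Sketch: bounding the shared *sql.DB pool so health checks and request
// traffic cannot open an unbounded number of connections.
// The concrete values below are illustrative only.
db.SetMaxOpenConns(25)                  // hard cap on simultaneous connections
db.SetMaxIdleConns(5)                   // keep a few warm connections for probes and bursts
db.SetConnMaxLifetime(30 * time.Minute) // recycle connections to avoid stale ones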
Implementation
What's worth considering:
Timeout configuration must account for potential GC pauses and CPU spikes - default 1-2 second timeouts might be insufficient under load. Connection pooling is crucial - health checks should reuse existing connections rather than creating new ones. Resource allocation in health check handlers must be minimal as these endpoints are called frequently.
Use different probe types for different purposes - readiness probes for dependency checks, liveness probes for deadlock detection. Remember that health checks run frequently - even a small memory allocation or goroutine leak in the handler will be amplified.
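To make that separation concrete, here is a minimal sketch that keeps the liveness endpoint trivial while the readiness endpoint runs the dependency checks described above; the /livez path and wiring are illustrative, not taken from the original setup:

// Sketch: separate endpoints so the liveness probe stays cheap while the
// readiness probe performs the dependency checks.
mux := http.NewServeMux()

// Liveness: only proves the process can still serve HTTP; no dependency
// calls, so a slow database never gets the pod restarted.
mux.HandleFunc("/livez", func(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
})

// Readiness: the component checks from healthCheck above decide whether
// the pod should receive traffic (matches the /health path in the Pod spec).
mux.HandleFunc("/health", healthCheck)

log.Fatal(http.ListenAndServe(":8080", mux))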