Health Check #

Kubernetes probes are the mechanism that lets the cluster know whether the application inside a container is running correctly and is ready to serve traffic. Without correctly configured probes, Kubernetes only sees whether the container process is still running; it cannot tell a healthy container from one that is running but unresponsive or not yet ready. Misconfigured probes can cause unnecessary restarts, downtime during deployments, or traffic being routed to Pods that are not ready.

Three Types of Probes #

Liveness Probe:
  "Is the application still alive?"
  On failure: Kubernetes restarts the container
  Use for: detecting deadlocks, infinite loops, state that cannot recover

Readiness Probe:
  "Is the application ready to receive traffic?"
  On failure: the Pod is removed from the Service endpoints (it stops receiving traffic)
  The container is NOT restarted
  Use for: startup initialization, a database connection that is not ready yet,
           a large queue being processed, an open circuit breaker

Startup Probe:
  "Has the application finished starting up?"
  On failure: Kubernetes restarts the container
  Active ONLY during startup; once it succeeds, the liveness and readiness probes take over
  Use for: applications with a slow startup (>30 seconds)

The Critical Difference: Liveness vs Readiness #

Scenario: the database is unreachable

  With a Liveness Probe that checks the database connection:
    Database down → liveness probe fails → Kubernetes restarts the container
    Container restarts → database still down → probe fails again → another restart
    → CrashLoopBackOff, even though the problem is in the database, not the application
    → Every Pod restarts in a loop, so none of them can serve any request at all

  With a Readiness Probe that checks the database connection:
    Database down → readiness probe fails → Pod leaves the Service endpoints
    The container is NOT restarted and stays alive
    → No traffic reaches the Pod, but it is ready to take traffic again once the database recovers
    → This is the correct behavior
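
The scenario above can be replayed as a toy simulation. This is a minimal sketch, not how the kubelet works internally: `simulate_outage`, `Outcome`, and their parameters are hypothetical names, each list entry stands for one probe period, and restart back-off and the downtime during a restart are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    restarts: int        # how many times the container was restarted
    in_endpoints: bool   # whether the Pod is receiving traffic at the end


def simulate_outage(db_check_in: str, db_up: list[bool],
                    failure_threshold: int = 3) -> Outcome:
    """Replay a database outage, one list entry per probe period.

    db_check_in is "liveness" or "readiness": the probe that queries the database.
    """
    restarts = 0
    consecutive_failures = 0
    in_endpoints = True
    for up in db_up:
        if up:
            consecutive_failures = 0
            in_endpoints = True          # successThreshold of 1: one success is enough
            continue
        consecutive_failures += 1
        if consecutive_failures < failure_threshold:
            continue
        if db_check_in == "liveness":
            restarts += 1                # container is killed and restarted
            consecutive_failures = 0     # the fresh container starts counting again
        else:
            in_endpoints = False         # only removed from traffic, never restarted
    return Outcome(restarts, in_endpoints)


# The database is down for 9 probe periods, then recovers.
timeline = [False] * 9 + [True]
print(simulate_outage("liveness", timeline))
print(simulate_outage("readiness", timeline))
```

With the database check in the liveness probe, the same outage costs three restarts; with the check in the readiness probe, the container is never restarted and simply rejoins the endpoints after recovery.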

Rule of thumb:
  Liveness: check INTERNAL conditions that cannot recover on their own
    → Thread pool exhausted
    → Deadlock
    → Severe memory leak
    → State corruption

  Readiness: check conditions that CAN change (internal or external)
    → Startup not finished
    → Database unavailable
    → Cache not warmed up yet
    → Graceful shutdown in progress

Implementing the /health Endpoints #

# Python (FastAPI): health endpoints done properly
# Note: is_deadlocked(), is_memory_exhausted(), db, and cache are
# application-specific and assumed to be defined elsewhere.
import json
import time

from fastapi import FastAPI, Response, status

app = FastAPI()
start_time = time.time()

# Liveness: check the application's internal state only
# Do NOT check connections to external services here
@app.get("/health/live")
async def liveness():
    # Only check internal conditions that cannot recover on their own
    if is_deadlocked() or is_memory_exhausted():
        return Response(status_code=status.HTTP_503_SERVICE_UNAVAILABLE)
    return {"status": "alive"}

# Readiness: check whether the app is ready to take traffic
# Checking external dependencies IS allowed here
# If db.execute is a blocking call, declare this handler with plain `def`
# so FastAPI runs it in a thread pool instead of blocking the event loop.
@app.get("/health/ready")
async def readiness():
    checks = {}
    all_ok = True

    # Check the database connection
    try:
        db.execute("SELECT 1")
        checks["database"] = "ok"
    except Exception as e:
        checks["database"] = f"error: {e}"
        all_ok = False

    # Check that warm-up has finished (example: the cache is loaded)
    if not cache.is_warmed_up():
        checks["cache"] = "warming up"
        all_ok = False
    else:
        checks["cache"] = "ok"

    status_code = status.HTTP_200_OK if all_ok else status.HTTP_503_SERVICE_UNAVAILABLE
    return Response(
        content=json.dumps({"status": "ready" if all_ok else "not ready", "checks": checks}),
        status_code=status_code,
        media_type="application/json"
    )
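
The liveness handler calls an `is_deadlocked()` helper that the listing does not define. One common way to approximate it is a heartbeat: a worker records each time it makes progress, and the check fails if the worker has been silent for too long. The sketch below is one possible implementation under that assumption; the `Heartbeat` class, `worker_heartbeat`, and the 30-second limit are hypothetical and not part of the original code.

```python
import threading
import time


class Heartbeat:
    """Tracks when a worker last reported progress."""

    def __init__(self, max_silence_seconds: float) -> None:
        self._max_silence = max_silence_seconds
        self._last_beat = time.monotonic()
        self._lock = threading.Lock()

    def beat(self) -> None:
        # Called by the worker each time it completes a unit of work.
        with self._lock:
            self._last_beat = time.monotonic()

    def is_stalled(self) -> bool:
        # True when the worker has been silent longer than allowed.
        with self._lock:
            return time.monotonic() - self._last_beat > self._max_silence


# Hypothetical shared instance; the worker loop would call worker_heartbeat.beat().
worker_heartbeat = Heartbeat(max_silence_seconds=30.0)


def is_deadlocked() -> bool:
    # Assumed implementation of the helper used by /health/live.
    return worker_heartbeat.is_stalled()
```

The check stays purely internal: it never touches the database or any other external service, so a dependency outage cannot trigger a restart.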

Configuring Probes Correctly #

spec:
  containers:
  - name: api
    image: my-api:v2

    # Startup Probe: for applications that need a long time to start
    # Kubernetes does not run liveness/readiness until the startup probe succeeds
    startupProbe:
      httpGet:
        path: /health/live
        port: 8080
      failureThreshold: 30    # retry for up to 30 × 10s = 5 minutes
      periodSeconds: 10

    # Liveness Probe: starts after the startup probe succeeds
    livenessProbe:
      httpGet:
        path: /health/live
        port: 8080
      initialDelaySeconds: 0   # right after the startup probe succeeds
      periodSeconds: 15
      failureThreshold: 3      # restart after 3 × 15s = 45 seconds of failures
      timeoutSeconds: 5        # timeout per probe
      successThreshold: 1

    # Readiness Probe: starts after the startup probe succeeds
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 8080
      initialDelaySeconds: 0
      periodSeconds: 5         # more frequent than liveness
      failureThreshold: 3      # remove from endpoints after 3 × 5s = 15 seconds
      successThreshold: 1      # add back to endpoints after 1 success
      timeoutSeconds: 3
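
The time budgets in the comments above all come from the same product, `periodSeconds × failureThreshold`. A small sketch of that arithmetic follows; `failure_window_seconds` is a hypothetical helper name, and the result is an approximation because per-probe timeouts and scheduling jitter are ignored.

```python
def failure_window_seconds(period_seconds: int, failure_threshold: int) -> int:
    """Approximate time from the first failed probe until Kubernetes acts on it."""
    return period_seconds * failure_threshold


# Values taken from the configuration above.
startup_budget = failure_window_seconds(period_seconds=10, failure_threshold=30)
liveness_window = failure_window_seconds(period_seconds=15, failure_threshold=3)
readiness_window = failure_window_seconds(period_seconds=5, failure_threshold=3)

print(f"startup budget:   {startup_budget}s")    # time allowed for a slow startup
print(f"liveness window:  {liveness_window}s")   # time until a restart
print(f"readiness window: {readiness_window}s")  # time until removal from traffic
```

Tuning either factor changes the window: a shorter period detects problems faster at the cost of more probe requests, while a higher threshold tolerates more transient failures.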

Probe Handler Types #

# 1. HTTP GET (most common)
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
    httpHeaders:
    - name: Custom-Header
      value: health-check

# 2. TCP Socket (for non-HTTP services)
livenessProbe:
  tcpSocket:
    port: 5432        # checks whether the port is open (database, message broker)

# 3. gRPC (for gRPC services)
livenessProbe:
  grpc:
    port: 50051
    service: "liveness"

# 4. Exec (runs a command; exit code 0 = success, non-zero = failure)
livenessProbe:
  exec:
    command:
    - sh
    - -c
    - "redis-cli ping | grep PONG"  # or check that a file exists
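
A tcpSocket probe succeeds when a TCP connection to the port can be opened, and nothing more. To make concrete how little that proves, roughly the same check can be sketched in Python. `tcp_port_open` is a hypothetical helper for illustration, not something Kubernetes runs.

```python
import socket


def tcp_port_open(host: str, port: int, timeout_seconds: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection raises OSError on refusal, timeout, or unreachable host.
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        return False
```

An open port only shows that something is listening. It says nothing about whether the service behind it can answer queries, which is why an HTTP or gRPC handler is usually preferable when the application supports one.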

Dangerous Health Check Anti-Patterns #

# ANTI-PATTERN 1: a liveness probe that checks the database
livenessProbe:
  httpGet:
    path: /health/live    # this endpoint queries the database
  # → Database down → every Pod restarts → cascading failure

# ANTI-PATTERN 2: an initialDelaySeconds that is too large
livenessProbe:
  initialDelaySeconds: 300   # wait 5 minutes
  # → In a crash loop, every cycle takes 5 minutes before the failure is detected
  # → Use a startupProbe instead

# ANTI-PATTERN 3: a failureThreshold that is too small
readinessProbe:
  failureThreshold: 1   # a single failure removes the Pod from the endpoints
  # → A brief spike or network hiccup → Pod is taken out of traffic
  # → Use at least 3

# ANTI-PATTERN 4: the same endpoint for liveness and readiness
livenessProbe:
  httpGet:
    path: /health         # one endpoint for both
readinessProbe:
  httpGet:
    path: /health         # cannot distinguish "dead" from "not ready"
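
Three of these anti-patterns are visible in the manifest itself and can be caught mechanically. The sketch below checks a container spec given as a plain dict; `lint_probes` and the 60-second cutoff are hypothetical choices for illustration. Anti-pattern 1 cannot be detected this way, because it depends on what the endpoint does internally.

```python
def lint_probes(container: dict) -> list[str]:
    """Return warnings for a container spec given as a plain dict."""
    warnings = []
    liveness = container.get("livenessProbe") or {}
    readiness = container.get("readinessProbe") or {}

    # Anti-pattern 2: a large initial delay used in place of a startupProbe.
    # The 60-second cutoff is an arbitrary assumption.
    if liveness.get("initialDelaySeconds", 0) >= 60 and "startupProbe" not in container:
        warnings.append("large liveness initialDelaySeconds without a startupProbe")

    # Anti-pattern 3: failureThreshold below 3 (Kubernetes defaults to 3 when unset).
    for name, probe in (("liveness", liveness), ("readiness", readiness)):
        if probe and probe.get("failureThreshold", 3) < 3:
            warnings.append(f"{name} failureThreshold below 3")

    # Anti-pattern 4: both probes hit the same HTTP path.
    live_path = liveness.get("httpGet", {}).get("path")
    ready_path = readiness.get("httpGet", {}).get("path")
    if live_path is not None and live_path == ready_path:
        warnings.append("liveness and readiness share the same path")

    return warnings


risky = {
    "livenessProbe": {"initialDelaySeconds": 300, "httpGet": {"path": "/health"}},
    "readinessProbe": {"failureThreshold": 1, "httpGet": {"path": "/health"}},
}
for warning in lint_probes(risky):
    print(warning)
```

A check like this could run in CI against rendered manifests, though a shared path is only a hint: the real question is whether the liveness endpoint depends on anything external.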

Summary #

Three probes with different purposes: liveness (restart when dead), readiness (remove from traffic when not ready), and startup (allow a longer startup without premature restarts).
Liveness must NOT check external dependencies: a database outage is no reason to restart a container. Use readiness for external dependencies and liveness only for internal conditions that cannot recover.
Separate /health/live and /health/ready endpoints: they have different purposes and behavior, and a single endpoint for both cannot express that difference.
startupProbe for slow-starting applications: do not inflate initialDelaySeconds on the liveness probe. A startupProbe gives startup the time it needs without slowing down failure detection afterwards.
Use failureThreshold: 3 as a minimum; a single probe failure may be a transient hiccup, while three consecutive failures are a more reliable sign of a real problem.
A readiness failure removes the Pod from the Service endpoints; it does not restart it. This is the most important distinction to understand, because choosing the wrong probe type causes unnecessary CrashLoopBackOff.
