Grafana Dashboard #

Metrics collected by Prometheus are only useful if they can be visualized quickly and clearly when they are needed. Grafana is the de facto standard visualization UI for the Prometheus ecosystem: it lets you build dashboards that show the state of the cluster and its applications in near real time. A well-designed dashboard can cut mean time to detect (MTTD) from hours to minutes.

Recommended Dashboard Hierarchy #

Level 1: Cluster Overview
  → Status of every node (CPU, memory, disk)
  → Number of Pods per namespace
  → Cluster resource utilization
  → For: SRE/Ops monitoring the overall health of the cluster

Level 2: Namespace / Service Overview
  → All services in one namespace
  → Four Golden Signals per service: error rate, p99 latency, throughput, saturation
  → For: the team that owns the namespace

Level 3: Service Detail
  → Breakdown per endpoint
  → Database query latency
  → External dependency health
  → For: debugging a specific incident in a single service

Level 4: Business Dashboard
  → Conversion rate, checkout success, active users
  → For: product managers and leadership

Panel Essentials: Four Golden Signals #

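Panel 1: Throughput (Requests per Second)
  Query: sum by (method, handler) (rate(http_requests_total{namespace="$namespace", job="$service"}[5m]))
  Visualization: Time series
  Legend: {{method}} {{handler}}
  → Group by the labels used in the legend; a plain sum() drops them and the legend renders empty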

Panel 2: Error Rate
  Query:
    sum(rate(http_requests_total{
      namespace="$namespace",
      job="$service",
      status_code=~"5.."
    }[5m]))
    /
    sum(rate(http_requests_total{
      namespace="$namespace",
      job="$service"
    }[5m]))
    * 100
  Visualization: Time series (or Stat if only a single value is needed)
  Unit: Percent (0-100)
  Thresholds: 0=green, 1=yellow, 5=red
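
The arithmetic behind the error-rate panel and its thresholds can be sketched in a few lines of Python. This is only an illustration, not how Grafana is implemented: the function names and the sample rates (3 errors/s out of 120 requests/s) are made up, and treating "no traffic" as 0% is an assumption (the PromQL expression would return NaN or no data instead).

```python
from bisect import bisect_right

# Threshold steps copied from the panel definition: 0=green, 1=yellow, 5=red.
ERROR_RATE_STEPS = [(0, "green"), (1, "yellow"), (5, "red")]


def error_rate_percent(errors_per_sec: float, total_per_sec: float) -> float:
    """Share of failed requests as a percentage (0-100)."""
    if total_per_sec == 0:
        # Assumption for this sketch: no traffic is reported as 0%.
        return 0.0
    return errors_per_sec / total_per_sec * 100


def threshold_color(value: float, steps) -> str:
    """Return the color of the highest step whose lower bound the value has reached."""
    bounds = [bound for bound, _ in steps]
    index = bisect_right(bounds, value) - 1
    return steps[max(index, 0)][1]


# Hypothetical sample: 3 failing requests/s out of 120 requests/s.
rate = error_rate_percent(errors_per_sec=3.0, total_per_sec=120.0)
print(f"{rate:.1f}% -> {threshold_color(rate, ERROR_RATE_STEPS)}")
```

With these sample numbers the error rate is 2.5%, which falls in the yellow band (at least 1, below 5).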

Panel 3: P99 Latency
  Query:
    histogram_quantile(0.99,
      sum by (le) (
        rate(http_request_duration_seconds_bucket{
          namespace="$namespace",
          job="$service"
        }[5m])
      )
    )
  Visualization: Time series
  Unit: seconds
  Thresholds: 0=green, 0.5=yellow, 2=red
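
To make the p99 query less of a black box, here is a simplified Python sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation, which is the general idea behind histogram_quantile. It assumes classic le-style buckets, a quantile strictly between 0 and 1, and at least one finite bucket; the bucket boundaries and counts are invented for the example, and it does not try to reproduce every edge case of the Prometheus implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q < 1) from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The quantile lies beyond the largest finite bucket: report that bound.
        return buckets[-2][0]
    lower, below = (0.0, 0.0) if i == 0 else buckets[i - 1]
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical per-second bucket rates for le = 0.1, 0.5, 2.0, +Inf.
demo = [(0.1, 60.0), (0.5, 90.0), (2.0, 98.0), (math.inf, 100.0)]
print(histogram_quantile(0.95, demo))
print(histogram_quantile(0.99, demo))
```

In this sample the p95 lands inside the 0.5 to 2.0 bucket (about 1.44 s), while the p99 falls into the +Inf bucket and is reported as 2.0, the largest finite bound. That is why bucket boundaries should extend past the latency you want to alert on.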

Panel 4: Saturation (CPU Usage as % of Limit)
  Query:
    sum(rate(container_cpu_usage_seconds_total{
      namespace="$namespace",
      pod=~"$service.*",
      container!=""
    }[5m]))
    /
    sum(kube_pod_container_resource_limits{
      namespace="$namespace",
      pod=~"$service.*",
      resource="cpu"
    })
    * 100
  Visualization: Gauge atau Time series
  Unit: Percent
  Thresholds: 0=green, 60=yellow, 85=red
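
The Query / Unit / Thresholds fields listed for each panel map fairly directly onto Grafana's panel JSON. The sketch below builds such a panel as a Python dict, using the P99 latency panel as the example. The helper name make_panel is hypothetical, and the structure shown is only a minimal subset of the panel schema (layout fields such as gridPos are omitted), so treat it as a starting point and compare it with a dashboard exported from your own Grafana version.

```python
import json

P99_QUERY = (
    'histogram_quantile(0.99, sum by (le) ('
    'rate(http_request_duration_seconds_bucket'
    '{namespace="$namespace", job="$service"}[5m])))'
)


def make_panel(title, expr, unit, thresholds, panel_type="timeseries"):
    """Build a minimal panel dict from a query, a unit id and (value, color) thresholds."""
    steps = [
        # The first step is the base color; Grafana stores its value as null.
        {"color": color, "value": None if i == 0 else value}
        for i, (value, color) in enumerate(thresholds)
    ]
    return {
        "title": title,
        "type": panel_type,
        "datasource": "${datasource}",
        "targets": [{"expr": expr, "refId": "A"}],
        "fieldConfig": {
            "defaults": {
                "unit": unit,  # "s" = seconds, "percent" = 0-100
                "thresholds": {"mode": "absolute", "steps": steps},
            }
        },
    }


latency = make_panel(
    title="P99 Latency",
    expr=P99_QUERY,
    unit="s",
    thresholds=[(0, "green"), (0.5, "yellow"), (2, "red")],
)
print(json.dumps(latency, indent=2))
```

Generating panels from one helper keeps units and thresholds consistent across services, which matters once the same four panels are repeated on many dashboards.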

Variables and Templates for Reusable Dashboards #

A dashboard built on variables lets a single template serve every service and namespace:

// Recommended variables:

"templating": {
  "list": [
    {
      "name": "datasource",
      "type": "datasource",
      "query": "prometheus",
      "label": "Data Source"
    },
    {
      "name": "namespace",
      "type": "query",
      "datasource": "${datasource}",
      "query": "label_values(kube_namespace_labels, namespace)",
      "label": "Namespace",
      "multi": false,
      "includeAll": false
    },
    {
      "name": "service",
      "type": "query",
      "datasource": "${datasource}",
      "query": "label_values(up{namespace=\"$namespace\"}, job)",
      "label": "Service",
      "refresh": 2
    }
  ]
}

With these variables you can pick a namespace and a service from the dropdowns, and every panel updates to show the matching data. In dashboard JSON, refresh is a number (1 = on dashboard load, 2 = on time range change), not the label shown in the UI.
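
Conceptually, Grafana substitutes the selected dropdown values into each panel query before sending it to Prometheus. The Python sketch below imitates that step with string.Template, which happens to use the same $name / ${name} placeholder syntax. It is only an approximation for single-value variables (Grafana's real interpolation also handles multi-value variables and formatting options), and the values "payments" and "checkout-api" are hypothetical.

```python
from string import Template

QUERY = 'sum(rate(http_requests_total{namespace="$namespace", job="$service"}[5m]))'


def interpolate(query: str, **variables: str) -> str:
    """Replace $name / ${name} placeholders with the selected variable values."""
    return Template(query).substitute(**variables)


# Hypothetical dropdown selection.
print(interpolate(QUERY, namespace="payments", service="checkout-api"))
```

The same query text therefore serves any namespace and service; only the substituted values change.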

Kubernetes Cluster Overview Dashboard #

Key panels for a cluster dashboard:

Row 1: Cluster Health Summary
  - Total Nodes (Stat): count(kube_node_info)
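  - Nodes Not Ready (Stat): sum(kube_node_status_condition{condition="Ready",status!="true"})
  - Total Pods Running (Stat): sum(kube_pod_status_phase{phase="Running"})
  - Pods Not Running (Stat): sum(kube_pod_status_phase{phase!="Running"})
  → Use sum(), not count(): kube-state-metrics exports one series per possible status/phase with value 0 or 1, so count() counts series rather than nodes or Pods
  → "Pods Not Running" also includes Succeeded Pods such as completed Jobs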

Row 2: Resource Utilization
  - Cluster CPU Usage (Time series):
    sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))
    /
    sum(kube_node_status_capacity{resource="cpu"})
    * 100

  - Cluster Memory Usage (Time series):
    sum(container_memory_working_set_bytes{container!=""})
    /
    sum(kube_node_status_capacity{resource="memory"})
    * 100

Row 3: Per-Namespace Resource
  - CPU per namespace (Bar gauge):
    sum by (namespace) (
      rate(container_cpu_usage_seconds_total{container!=""}[5m])
    )

Row 4: Problematic Pods
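  - CrashLooping Pods (Table):
    increase(kube_pod_container_status_restarts_total[1h]) > 0
    → show: pod, namespace, container, restart count

  - Pending Pods (Table):
    kube_pod_status_phase{phase="Pending"} > 0
    → this lists Pods that are currently Pending; to rank them by how long they have waited, combine it with the Pod creation time (kube_pod_created)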

Dashboard-as-Code: Grafonnet and Provisioning #

Dashboards built only through the UI live in Grafana's database, outside your Git history, so changes are hard to review or roll back. Use provisioning or Grafonnet instead:

# Provisioning via ConfigMap (Grafana loads it automatically)
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  namespace: monitoring
  labels:
    grafana_dashboard: "1"      # label watched by the Grafana sidecar
data:
  api-service-dashboard.json: |
    {
      "title": "API Service Overview",
      "uid": "api-service-v1",
      "panels": [...]
    }

Alternative: Grafonnet (a Jsonnet library for Grafana)
  → Write dashboards as Jsonnet code
  → Reusable components and functions
  → Compile to JSON and commit the result to Git
  → CI/CD pushes to Grafana automatically via the API

For teams that are serious about dashboard-as-code:
  Use grafana-operator or grizzly to manage
  the dashboard lifecycle via GitOps
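
As one possible build step for the provisioning approach, the Python sketch below wraps dashboard JSON in a ConfigMap manifest carrying the grafana_dashboard label. The function name build_dashboard_configmap is hypothetical, and the manifest is emitted as JSON (which kubectl also accepts) so that no YAML library is needed. Whether the sidecar picks it up depends on how your Grafana deployment is configured.

```python
import json


def build_dashboard_configmap(name, namespace, dashboards):
    """Wrap {uid: dashboard_dict} in a ConfigMap manifest with one JSON file per dashboard."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Label the Grafana sidecar is assumed to watch, as in the manifest above.
            "labels": {"grafana_dashboard": "1"},
        },
        "data": {
            f"{uid}.json": json.dumps(dashboard, indent=2, sort_keys=True)
            for uid, dashboard in dashboards.items()
        },
    }


# Minimal placeholder dashboard; real panels would go in the "panels" list.
dashboard = {"title": "API Service Overview", "uid": "api-service-v1", "panels": []}
manifest = build_dashboard_configmap(
    "grafana-dashboards", "monitoring", {dashboard["uid"]: dashboard}
)
print(json.dumps(manifest, indent=2))
```

In a CI job the printed manifest could be committed to Git or piped to kubectl apply -f -, so the dashboard in the cluster always matches what was reviewed.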

Alert Annotations on Dashboards #

Show when an alert was active directly on the time series panels:

// Dashboard annotations
"annotations": {
  "list": [
    {
      "datasource": "${datasource}",
      "enable": true,
      "expr": "ALERTS{alertname!=\"Watchdog\", namespace=\"$namespace\"}",
      "hide": false,
      "iconColor": "red",
      "name": "Alerts",
      "step": "60s",
      "titleFormat": "{{alertname}}"
    }
  ]
}

This draws a red marker on the chart at the moment an alert is active, which makes it easier to correlate metric anomalies with the alerts they triggered.

Summary #

Dashboard hierarchy: cluster → namespace → service → business. Do not put everything on one giant dashboard; build a hierarchy that makes it easy to drill down from overview to detail.
Four Golden Signals on every service dashboard: error rate, p99 latency, throughput, and saturation are the minimum; most problems show up in at least one of them.
Variables $namespace and $service: one dashboard template for every service; the dropdowns let you switch without duplicating dashboards.
Threshold colors on panels: green/yellow/red based on meaningful thresholds; engineers can scan the state of the cluster by color without reading numbers.
Dashboard-as-code via provisioning or Grafonnet: dashboards that exist only in the UI get lost or drift; store them as JSON in Git and load them via a ConfigMap or grafana-operator.
Annotations from alerts: show when alerts were active on the charts for quick visual correlation between anomalies and the alert response.
