Network Troubleshooting #

Network problems in Kubernetes are often confusing because many layers are involved: DNS, the CNI plugin, iptables/IPVS, NetworkPolicy, the Ingress controller, and the application code itself. Without a systematic methodology, debugging tends to go in circles. This article gives a step-by-step approach and checklists for the most common network failure scenarios.

Methodology: Narrowing Down from the Outside In #

Diagnostic steps, from broad to specific:

1. Is the Pod Running and Ready?
   └─ No → the problem is probably not networking; check the Pod lifecycle first

2. Can the Pod ping another Pod on the same node?
   └─ No → suspect the CNI plugin on that node

3. Can the Pod ping a Pod on a different node?
   └─ No → suspect cross-node routing (overlay or BGP)

4. Can the Pod resolve DNS?
   └─ No → suspect CoreDNS, or a NetworkPolicy blocking port 53

5. Can the Pod reach the Service via its ClusterIP?
   └─ No → suspect kube-proxy or the iptables rules
      (test the Service port with curl or nc; a ClusterIP often does not answer ping)

6. Can the Pod reach the Service via its DNS name?
   └─ No → DNS resolution problem (see step 4)

7. Is the Service forwarding to the right Pods?
   └─ Check Endpoints → are any Pods listed?

8. Is a NetworkPolicy blocking the traffic?
   └─ Test: save the manifest, temporarily delete the NetworkPolicy, and try again
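The connectivity part of this ladder can be scripted so that it stops at the first failing rung. The following is a minimal sketch, not a definitive tool: the function names `check` and `ladder` are hypothetical, and it assumes the source Pod's image ships `ping`, `nslookup`, and `curl` (otherwise attach a netshoot container first).

```shell
# check LABEL COMMAND [ARGS...]
# Runs COMMAND quietly, prints OK or FAIL, and returns non-zero on failure.
check() {
  label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    printf 'OK    %s\n' "$label"
  else
    printf 'FAIL  %s\n' "$label"
    return 1
  fi
}

# ladder SRC_POD NAMESPACE PEER_POD_IP SERVICE_IP SERVICE_DNS PORT
# Walks the checks from broad to specific; the && chain stops at the
# first failure, which is the layer to investigate.
ladder() {
  src=$1 ns=$2 peer_ip=$3 svc_ip=$4 svc_dns=$5 port=$6
  check "pod is Ready" \
    kubectl wait --for=condition=Ready "pod/$src" -n "$ns" --timeout=5s &&
  check "pod-to-pod ping" \
    kubectl exec "$src" -n "$ns" -- ping -c 2 "$peer_ip" &&
  check "DNS resolution" \
    kubectl exec "$src" -n "$ns" -- nslookup kubernetes.default &&
  check "Service via ClusterIP" \
    kubectl exec "$src" -n "$ns" -- curl -s -m 5 -o /dev/null "http://$svc_ip:$port" &&
  check "Service via DNS name" \
    kubectl exec "$src" -n "$ns" -- curl -s -m 5 -o /dev/null "http://$svc_dns:$port"
}
```

A call such as `ladder pod-a production 10.244.2.5 10.96.0.50 web.production 8080` (all values illustrative) prints one line per check. Run it once with a peer Pod on the same node and once with a peer on another node to cover steps 2 and 3.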

Debug Toolkit #

Debug a Pod with an Ephemeral Container #

# Attach a debug container to the failing Pod (no restart needed)
kubectl debug -it <pod-name> --image=nicolaka/netshoot --target=<container-name>

# netshoot includes curl, dig, nslookup, netstat, tcpdump, nmap, and more

Run a Temporary Debug Pod #

# Run a throwaway Pod with a full set of networking tools
kubectl run netshoot --image=nicolaka/netshoot -it --rm --restart=Never -- sh

# Or in a specific namespace
kubectl run netshoot -n production --image=nicolaka/netshoot -it --rm --restart=Never -- sh

Check Network State #

# List every Pod IP in the cluster
kubectl get pods -A -o wide

# Show the Endpoints of a Service
kubectl get endpoints <service-name> -n <namespace>

# Show Service details
kubectl describe service <service-name> -n <namespace>

# Check the iptables rules on a node (requires node access; applies when
# kube-proxy runs in iptables mode). -n prints numeric addresses so the
# grep on the ClusterIP can match.
iptables -t nat -L KUBE-SERVICES -n | grep <service-cluster-ip>

Scenario 1: Pod Cannot Ping Another Pod #

# From inside Pod A, try to reach Pod B (-c 3 stops after three packets)
kubectl exec -it pod-a -- ping -c 3 10.244.2.5

# If that fails:

# Step 1: Verify that Pod B's IP is still valid
kubectl get pod pod-b -o wide

# Step 2: Check whether both Pods are on the same node or on different nodes
# Same node? → local bridge problem, check the CNI
# Different nodes? → cross-node routing problem

# Step 3: Check the CNI plugin on the node
# (Pod names depend on the plugin, e.g. calico-node, cilium, kube-flannel)
kubectl get pods -n kube-system -o wide | grep -Ei 'cni|calico|cilium|flannel|weave'
# Is the CNI DaemonSet running on every node?

# Step 4: Check the CNI plugin logs
kubectl logs -n kube-system <cni-pod-name>

# Step 5: Check each node's Pod CIDR
# (may be empty when the CNI manages its own IP allocation)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
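Step 2 decides which branch to follow, so it can help to answer it mechanically. This is a small sketch with hypothetical helper names (`pod_node`, `compare_nodes`); it only reads `.spec.nodeName` from each Pod and assumes both Pods live in the same namespace.

```shell
# pod_node POD NAMESPACE
# Prints the name of the node the Pod is scheduled on.
pod_node() {
  kubectl get pod "$1" -n "$2" -o jsonpath='{.spec.nodeName}'
}

# compare_nodes POD_A POD_B NAMESPACE
# Reports which branch of step 2 applies to this pair of Pods.
compare_nodes() {
  node_a=$(pod_node "$1" "$3") || return 1
  node_b=$(pod_node "$2" "$3") || return 1
  if [ "$node_a" = "$node_b" ]; then
    echo "same node ($node_a): inspect the CNI plugin and local bridge on that node"
  else
    echo "different nodes ($node_a, $node_b): inspect cross-node routing"
  fi
}
```

For example, `compare_nodes pod-a pod-b default` prints one of the two messages, which tells you whether to look at the local CNI setup or at the overlay or BGP routing between nodes.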

Scenario 2: DNS Does Not Work #

# Test DNS from inside the Pod
kubectl exec -it <pod-name> -- nslookup kubernetes.default
kubectl exec -it <pod-name> -- nslookup <service-name>.<namespace>

# If you get a timeout or NXDOMAIN:

# Step 1: Check that CoreDNS is running
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Step 2: Check the CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns

# Step 3: Check whether the Pod can reach CoreDNS
kubectl exec -it <pod-name> -- cat /etc/resolv.conf
# nameserver should point to the cluster DNS ClusterIP (commonly 10.96.0.10)

kubectl exec -it <pod-name> -- nc -vz 10.96.0.10 53
# nc -z checks TCP only, while most DNS queries use UDP, so treat it as a hint
# Timeout while CoreDNS is healthy → likely a NetworkPolicy blocking port 53

# Step 4: Check NetworkPolicy
kubectl get networkpolicy -n <namespace>
# Is there a policy that could block egress to kube-system?
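If a default-deny egress policy turns out to be the cause, the usual remedy is an explicit rule that allows DNS. The sketch below prints such a manifest from a shell function so it can be reviewed before being applied. The function name `dns_egress_policy` and the policy name `allow-dns-egress` are hypothetical, and the manifest assumes cluster DNS runs in `kube-system` and that namespaces carry the automatic `kubernetes.io/metadata.name` label, as on recent Kubernetes versions.

```shell
# dns_egress_policy NAMESPACE
# Prints a NetworkPolicy that lets every Pod in NAMESPACE send DNS
# queries (UDP and TCP port 53) to Pods in kube-system.
dns_egress_policy() {
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: $1
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
EOF
}
```

Review the output of `dns_egress_policy production`, then pipe it to `kubectl apply -f -`. Clusters that use a node-local DNS cache resolve through a different address and would need a different `to` clause.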

Scenario 3: Service Cannot Be Reached #

# Step 1: Check that the Service exists and is configured correctly
kubectl describe service <service-name> -n <namespace>
# Look at: Selector, Port, Type, ClusterIP

# Step 2: Check Endpoints: are there any healthy Pods?
kubectl get endpoints <service-name> -n <namespace>
# "<none>" → no Pod matches the selector, or none of the matching Pods is Ready
# (newer clusters: kubectl get endpointslices -l kubernetes.io/service-name=<service-name> -n <namespace>)

# Step 3: Verify that the Pod labels match the Service selector
kubectl get pods -n <namespace> --show-labels
# Compare with the selector in the Service spec

# Step 4: Check the Pod's readiness probe
kubectl describe pod <pod-name> -n <namespace>
# A Pod must be Ready to be listed in Endpoints

# Step 5: Test direct access to the Pod (bypassing the Service)
kubectl exec -it debug-pod -- curl http://<pod-ip>:<container-port>
# Success → the problem is in the Service or kube-proxy
# Failure → the problem is in the application or a NetworkPolicy on the Pod

# Step 6: Test via the ClusterIP
kubectl exec -it debug-pod -- curl http://<service-cluster-ip>:<port>
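Comparing labels by eye in step 3 is error-prone, so one option is to feed the Service's own selector back into a Pod query. This sketch uses hypothetical helper names (`service_selector`, `pods_for_service`) and kubectl's `go-template` output to turn the selector map into the `key=value,key=value` form that `-l` expects.

```shell
# service_selector SERVICE NAMESPACE
# Prints the Service selector as key=value pairs joined by commas.
service_selector() {
  sel=$(kubectl get service "$1" -n "$2" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}') || return 1
  # Drop the trailing comma left by the template loop.
  printf '%s\n' "${sel%,}"
}

# pods_for_service SERVICE NAMESPACE
# Lists exactly the Pods that the Service selector matches.
pods_for_service() {
  sel=$(service_selector "$1" "$2") || return 1
  if [ -z "$sel" ]; then
    echo "Service $1 has no selector" >&2
    return 1
  fi
  kubectl get pods -n "$2" -l "$sel" -o wide
}
```

An empty result from `pods_for_service <service-name> <namespace>` points at a label mismatch. If Pods are listed but Endpoints are still empty, look at the READY column and the readiness probe instead.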

Scenario 4: Ingress Does Not Route Correctly #

# Step 1: Check the Ingress resource
kubectl describe ingress <ingress-name> -n <namespace>
# Look at: rules, backend service, TLS

# Step 2: Check that the Ingress Controller is running
kubectl get pods -n ingress-nginx   # or your Ingress Controller's namespace

# Step 3: Check the Ingress Controller logs
kubectl logs -n ingress-nginx <ingress-controller-pod>

# Step 4: Check events on the Ingress resource
kubectl get events -n <namespace> | grep -i ingress

# Step 5: Test with curl from outside the cluster
curl -v http://<external-ip>/path
# -v shows connection details and redirects
# If the rule matches on a host, add: -H "Host: <hostname>"

# Step 6: Bypass the Ingress and access the Service directly
# (port-forward keeps running; use a second terminal for curl.
#  8080:80 assumes the Service listens on port 80)
kubectl port-forward service/<service-name> 8080:80 -n <namespace>
curl http://localhost:8080
# Success → the problem is in the Ingress layer, not the Service or Pod
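Because `kubectl port-forward` blocks the terminal, the bypass test is easier to repeat when the tunnel is started in the background and cleaned up afterwards. This is a rough sketch: `probe_service` is a hypothetical name, the two-second wait is a guess rather than a real readiness check, and it assumes the Service speaks plain HTTP.

```shell
# probe_service SERVICE NAMESPACE SERVICE_PORT [LOCAL_PORT]
# Opens a temporary port-forward, prints the HTTP status code returned
# by the Service, then closes the tunnel again.
probe_service() {
  local_port=${4:-8080}
  kubectl port-forward "service/$1" "$local_port:$3" -n "$2" >/dev/null 2>&1 &
  pf_pid=$!
  sleep 2   # crude wait for the tunnel to come up (assumption)
  rc=0
  curl -s -m 5 -o /dev/null -w '%{http_code}\n' "http://localhost:$local_port" || rc=$?
  # Always tear the tunnel down, even when curl failed.
  kill "$pf_pid" 2>/dev/null || true
  wait "$pf_pid" 2>/dev/null || true
  return "$rc"
}
```

If `probe_service <service-name> <namespace> 80` prints a healthy status code while the same path through the external address fails, the Service and Pods are fine and attention can move to the Ingress resource and controller.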

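Scenario 5: NetworkPolicy Is Too Restrictive #

# List the NetworkPolicies in the Pod's namespace
kubectl get networkpolicy -n <namespace>

# Show policy details
kubectl describe networkpolicy <policy-name> -n <namespace>

# Quick test: save the policy, temporarily delete it, and try again
# (strip metadata.resourceVersion and metadata.uid from the copy before re-applying)
kubectl get networkpolicy <policy-name> -n <namespace> -o yaml > <policy-name>.backup.yaml
kubectl delete networkpolicy <policy-name> -n <namespace>
# Traffic works once the policy is gone → the policy is too restrictive

# Debug with Cilium (if you use Cilium)
# <cilium-pod> is the agent Pod on the affected node:
#   kubectl get pods -n kube-system -l k8s-app=cilium -o wide
# (newer Cilium releases name the in-agent CLI cilium-dbg)
kubectl exec -it -n kube-system <cilium-pod> -- cilium monitor --type drop
# Shows dropped packets in real time

# Simulate whether policy would allow pod-a to reach pod-b on port 8080
# (Pods are given as namespace:pod-name)
kubectl exec -it -n kube-system <cilium-pod> -- cilium policy trace \
  --src-k8s-pod production:pod-a \
  --dst-k8s-pod production:pod-b \
  --dport 8080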
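Deleting a policy by hand makes it easy to forget the restore step. One way to keep the test safe is to wrap it: back up, delete, run the connectivity test, and re-apply regardless of the result. The sketch below uses a hypothetical function name, `with_policy_removed`, and strips server-set metadata with a simple `grep`, which assumes the two-space indentation that `kubectl get -o yaml` normally produces. It is meant for test environments, since the Pods are unprotected while the test runs.

```shell
# with_policy_removed POLICY NAMESPACE COMMAND [ARGS...]
# Temporarily removes a NetworkPolicy, runs COMMAND, restores the policy,
# and returns COMMAND's exit status.
with_policy_removed() {
  policy=$1 ns=$2
  shift 2
  backup=$(mktemp) || return 1
  # Save the manifest without fields the API server sets itself;
  # a stale resourceVersion would make the re-create fail.
  kubectl get networkpolicy "$policy" -n "$ns" -o yaml |
    grep -vE '^  (resourceVersion|uid|creationTimestamp|generation):' > "$backup" ||
    { rm -f "$backup"; return 1; }
  kubectl delete networkpolicy "$policy" -n "$ns" || { rm -f "$backup"; return 1; }
  rc=0
  "$@" || rc=$?
  # Restore even when the test command failed.
  kubectl apply -f "$backup" && rm -f "$backup" ||
    echo "restore failed; manifest kept at $backup" >&2
  return "$rc"
}
```

For example, `with_policy_removed <policy-name> production kubectl exec pod-a -n production -- curl -s -m 5 http://pod-b:8080` succeeds only if the request works without the policy, which points at the policy as the cause.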

Quick Diagnostic Checklist #

# Run these checks in order when something breaks

# 1. Pod status
kubectl get pods -n <namespace> -o wide

# 2. Recent events
kubectl get events -n <namespace> --sort-by=.lastTimestamp | tail -20

# 3. Service Endpoints
kubectl get endpoints -n <namespace>

# 4. DNS test from inside a Pod
kubectl exec -it <pod> -n <namespace> -- nslookup <service-name>

# 5. Direct connection to the Pod
kubectl exec -it debug-pod -- curl -v http://<pod-ip>:<port>

# 6. Connection via the Service
kubectl exec -it debug-pod -- curl -v http://<service-name>.<namespace>:<port>

# 7. Active NetworkPolicies
kubectl get networkpolicy -n <namespace>

# 8. Ingress resources and their events
kubectl describe ingress -n <namespace>
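The read-only items of this checklist (1, 2, 3, 7, and 8) can be collected in one pass, for example to attach to an incident ticket. This is a minimal sketch with hypothetical function names (`section`, `snapshot`); the exec-based checks are left out because they depend on which Pod and port you are testing.

```shell
# section TITLE COMMAND [ARGS...]
# Prints a header followed by the command's output; a failing command is
# noted but does not stop the report.
section() {
  printf '\n== %s ==\n' "$1"
  shift
  "$@" 2>&1 || echo "(command failed: $*)"
}

# snapshot NAMESPACE
# Gathers the read-only checklist items for one namespace.
snapshot() {
  section "Pods" kubectl get pods -n "$1" -o wide
  section "Recent events" kubectl get events -n "$1" --sort-by=.lastTimestamp
  section "Endpoints" kubectl get endpoints -n "$1"
  section "NetworkPolicies" kubectl get networkpolicy -n "$1"
  section "Ingresses" kubectl describe ingress -n "$1"
}
```

Running `snapshot production > report.txt` captures the state at one moment, which is useful when the problem is intermittent and you want to compare a healthy and a failing snapshot.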

Summary #

Narrow down from broad to specific: start with Pod status, then same-node ping, then cross-node ping, then DNS, then Service, then Ingress. Do not jump straight into the details.
nicolaka/netshoot is the Swiss Army knife of network debugging: one image with the common tools (curl, dig, tcpdump, nmap). Use it as an ephemeral container or as a temporary Pod.
Empty Endpoints means no Pod is backing the Service: most often the Pod labels do not match the Service selector, or the Pod is not Ready because its readiness probe fails.
A DNS timeout is very often a NetworkPolicy blocking port 53: whenever you apply default-deny egress, make sure there is an egress rule for port 53 over both UDP and TCP.
Bypass the Ingress with port-forward: if access through the Service works but access through the Ingress fails, the problem is in the Ingress resource or its controller, not in the application.
With Cilium, cilium monitor --type drop is the quickest way to see packets dropped by a NetworkPolicy in real time. It is available only when Cilium is the CNI.
