Last year I deployed a system in Google Cloud’s managed Kubernetes. From the beginning, we faced intermittent 502 Bad Gateway errors. They were very tricky to diagnose mainly because they were caused by a variety of reasons.
After months of tweaking infrastructure settings and restructuring architecture, I finally resolved these seemingly random errors. Here’s a checklist of all the issues that lead to 502s to look out for:
Disable preemptible nodes
Google Cloud has a concept of “preemptible” virtual machines, which you can use as your Kubernetes nodes to reduce costs. These are VMs that only last 24 hours after which they get terminated and replaced.
Unfortunately you cannot use these for your Kubernetes node pool which runs web applications. This is because while a node is shutting down, if it receives a request from the ingress, it will not respond and thus the ingress will return a 502.
Disable node pool autoscaling
Similarly, turning on autoscaling of your node pool will lead to virtual machines being terminated. This saves costs but will also inevitably lead to 502 errors.
You can still enable scaling at the Kubernetes deployment level, just don’t let Google Cloud delete nodes.
You can still benefit from cost reduction by moving any non-web workloads to a separate node pool, and turning on autoscaling there.
Configure healtchecks correctly
Your readiness check should reflect when your app is up and ready to receive a
request. This is what Kubernetes uses to start routing traffic. Also, make sure
the timeout is large enough so that small increases in latency do not cause a
pod restart. As for liveness checks, you can just set them to monitor the
of the process running your web server. Here’s an example:
readinessProbe: httpGet: path: "/" port: 5000 timeoutSeconds: 5 failureThreshold: 1 livenessProbe: exec: command: - /bin/sh - -c - pidof -x gunicorn
Ensure uptime during rolling updates
During new deploys of your code, you must make sure that there is at least
one pod running at all times, otherwise you will get 502s. So make sure your
RollingUpdate strategy has
maxUnavailable set to 0.