César Ferradas

Troubleshooting Ingress 502s on Google Kubernetes Engine

01 May 2022

Last year I deployed a system in Google Cloud’s managed Kubernetes. From the beginning, we faced intermittent 502 Bad Gateway errors. They were very tricky to diagnose mainly because they were caused by a variety of reasons.

After months of tweaking infrastructure settings and restructuring architecture, I finally resolved these seemingly random errors. Here’s a checklist of all the issues that lead to 502s to look out for:

Disable preemptible nodes

Google Cloud has a concept of “preemptible” virtual machines, which you can use as your Kubernetes nodes to reduce costs. These are VMs that only last 24 hours after which they get terminated and replaced.

Unfortunately you cannot use these for your Kubernetes node pool which runs web applications. This is because while a node is shutting down, if it receives a request from the ingress, it will not respond and thus the ingress will return a 502.

Disable node pool autoscaling

Similarly, turning on autoscaling of your node pool will lead to virtual machines being terminated. This saves costs but will also inevitably lead to 502 errors.

You can still enable scaling at the Kubernetes deployment level, just don’t let Google Cloud delete nodes.

You can still benefit from cost reduction by moving any non-web workloads to a separate node pool, and turning on autoscaling there.

Configure healtchecks correctly

Your readiness check should reflect when your app is up and ready to receive a request. This is what Kubernetes uses to start routing traffic. Also, make sure the timeout is large enough so that small increases in latency do not cause a pod restart. As for liveness checks, you can just set them to monitor the pid of the process running your web server. Here’s an example:

    path: "/"
    port: 5000
  timeoutSeconds: 5
  failureThreshold: 1
      - /bin/sh
      - -c
      - pidof -x gunicorn

Ensure uptime during rolling updates

During new deploys of your code, you must make sure that there is at least one pod running at all times, otherwise you will get 502s. So make sure your RollingUpdate strategy has maxUnavailable set to 0.