Regional Persistent Disks on Google Kubernetes Engine

deesix | 106 points

This is a cool feature. However, it would help me to know how often a single GCE zone tends to fail without the whole region failing. Personally, I've been using GCP to run cocalc.com since 2014. In the last year I remember two significant outages to our site, which were 100% the fault of Google:

(1) Last week, the GCE network went down completely for over an hour -- this killed the entire region (not just a zone!) where cocalc is deployed -- see https://status.cloud.google.com/incident/cloud-networking/18010

(2) Last year, the GCE network went down completely for the entire world (!), and again this made cocalc not work.

In both cases, when the outage happened, hosting cocalc in multiple zones (but one region, or in (2) one cloud) would not have been enough. I haven't had to deal with any other significant GCE outages that I can remember that weren't at least partly my fault. For what it is worth, I used to host cocalc both on premises and on GCE, but can no longer afford to do that.

williamstein | 6 years ago

What does a failover look like in Kubernetes? From the GCP Docs: [1]

> In the unlikely event of a zonal outage, you can failover your workload running on regional persistent disks to another zone using the force-attach command. The force-attach command allows you to attach the regional persistent disk to a standby VM instance even if the disk cannot be detached from the original VM due to its unavailability.
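
As a rough illustration of the manual step the quoted docs describe, here is a minimal sketch that only builds (and does not run) the gcloud invocation. The `--disk-scope` and `--force-attach` flags are as I understand them from the gcloud reference for `attach-disk`; the helper name, VM name, disk name, and zone are hypothetical placeholders. It says nothing about whether the Kubernetes provisioner does any of this automatically.

```python
import shlex


def force_attach_command(instance, disk, zone):
    """Build, but do not execute, a gcloud command that force-attaches
    a regional persistent disk to a standby VM in the given zone.

    All argument values are caller-supplied placeholders; the flags are
    assumed from the gcloud `compute instances attach-disk` reference.
    """
    return [
        "gcloud", "compute", "instances", "attach-disk", instance,
        "--disk", disk,
        "--disk-scope", "regional",  # the disk is regional, not zonal
        "--zone", zone,              # zone of the standby VM
        "--force-attach",            # attach even if the old VM cannot detach
    ]


# Hypothetical names, for illustration only.
cmd = force_attach_command("standby-vm", "my-regional-disk", "us-central1-b")
print(shlex.join(cmd))
```

Running the printed command by hand (or via `subprocess.run(cmd, check=True)`) would be the manual failover; deciding when it is safe to run, so that two nodes never write to the disk at once, is a separate problem this sketch does not address.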

Does the kubernetes.io/gce-pd provisioner have the logic to detect a zone failure in GCP and call the "force-attach" command if a failover is needed? Or does it always try to do a "force-attach" if a normal attach call fails? How does it handle a split-brain scenario, where the disk is requested by two separate nodes in each zone?

[1] https://cloud.google.com/compute/docs/disks/#repds

caleblloyd | 6 years ago

This is going to enable some very simple multi-zone failover for k8s, and I'm excited to try it out!

robbyt | 6 years ago

I wonder what the performance hit from this is.

advisedwang | 6 years ago