I have a Nutanix cluster that needs to be upgraded from 3.1.2 to 188.8.131.52 (or 184.108.40.206 if it is out by the time I get around to upgrading it). That got me thinking about the upgrade process. When you perform a Nutanix Operating System (NOS) upgrade, it performs what Nutanix calls a “rolling upgrade”: only one Controller VM (CVM) is upgraded at a time. While that CVM is being upgraded, storage traffic for its node is directed to another CVM.
My first thought was, “How does that actually work?” Thanks to Zach Vaughn (@z_n_v), Nutanix SE Extraordinaire, my eyes were opened. When the cluster detects that a CVM is down, it SSHs to the hypervisor (I’m referring to ESXi) and adds a route to the external IP of another CVM in the cluster. The cluster performs this check every 30 seconds, so it is possible that your VM will be without storage for up to 30 seconds. How disastrous could this be? (I’m told that as of NOS version 220.127.116.11 this will be much faster than 30 seconds.) The following video shows what happens.
This test was performed on a Nutanix 1350 block running NOS 18.104.22.168. The desktop is running on Node C. I start encoding a video using HandBrake, which is writing to the user’s desktop on the local disk. When I shut down the CVM on Node C, the desktop appears to hang for 20 seconds. However, it is possible that only the PCoIP server process stops responding for those 20 seconds: when the desktop resumes, you can see that it has still received pings from the hypervisor.
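To make the timing a bit more concrete, here is a toy Python model of the failover as I understand it. It is only a sketch and not Nutanix's actual code: the `Host` class, `health_check`, and `outage_seconds` are names I made up, the choice of "first healthy CVM" is arbitrary, and I am assuming the check fires on a fixed 30-second schedule and that waiting for the next check is the only delay. Under those assumptions, a CVM that dies 10 seconds after a check leaves its host without storage for about 20 seconds, which is in the same ballpark as what I saw.

```python
from dataclasses import dataclass

CHECK_INTERVAL_S = 30  # check interval mentioned in the post


@dataclass
class Host:
    """A hypervisor host and the CVM that currently serves its storage."""
    name: str
    local_cvm: str
    cvm_up: bool = True
    storage_target: str = ""

    def __post_init__(self):
        # By default a host talks to its own local CVM.
        if not self.storage_target:
            self.storage_target = self.local_cvm


def health_check(hosts):
    """One pass of the periodic check (hypothetical model).

    Hosts whose local CVM is down are pointed at a healthy CVM elsewhere
    in the cluster. Pointing a host back at its own CVM once that CVM
    returns is my assumption about how the rolling upgrade continues.
    """
    healthy = [h.local_cvm for h in hosts if h.cvm_up]
    for h in hosts:
        if h.cvm_up:
            h.storage_target = h.local_cvm
        elif healthy:
            h.storage_target = healthy[0]  # arbitrary pick for the sketch


def outage_seconds(failure_time_s, interval_s=CHECK_INTERVAL_S):
    """Seconds from the failure until the next check.

    Assumes checks run at t = 0, interval, 2 * interval, and so on.
    """
    return interval_s - (failure_time_s % interval_s)
```

For example, with hosts A, B, and C, marking C's CVM as down and calling `health_check` moves C's `storage_target` to A's CVM, and `outage_seconds(10)` gives 20.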
I ran this test again from a different machine, and the View Client seemed to stay connected. The difference was that the second machine was an iMac connected via Ethernet, whereas I recorded the video on my MacBook Pro connected via wireless. The desktop continued to receive pings, but the HandBrake process stopped while the disk was unavailable for about 20 seconds and then resumed when the route to the CVM was changed on the hypervisor. If I can get that to work again, I’ll try to post another video.
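If I repeat the test, one way to separate a real storage stall from a display hiccup would be to run a small write probe inside the desktop instead of eyeballing HandBrake. The Python below is a sketch of that idea, not something I used for the video: `run_probe` and `find_stalls` are hypothetical helpers, and the one-second interval and half-second tolerance are arbitrary choices. The probe forces a timestamp to disk on every tick, so a stalled disk shows up as a long gap between ticks.

```python
import os
import time


def find_stalls(timestamps, expected_interval_s=1.0, tolerance_s=0.5):
    """Return (start, gap) pairs where consecutive ticks are too far apart."""
    stalls = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        if gap > expected_interval_s + tolerance_s:
            stalls.append((prev, gap))
    return stalls


def run_probe(path, duration_s=120.0, interval_s=1.0):
    """Append a timestamp to `path` every interval and force it to disk.

    Returns the monotonic time at which each write completed. If the
    disk hangs, fsync blocks and the next tick is recorded late.
    """
    ticks = []
    deadline = time.monotonic() + duration_s
    with open(path, "a") as f:
        while time.monotonic() < deadline:
            f.write(f"{time.time()}\n")
            f.flush()
            os.fsync(f.fileno())  # do not let the OS cache hide a stall
            ticks.append(time.monotonic())
            time.sleep(interval_s)
    return ticks
```

The idea would be to start `run_probe("probe.log")` in the desktop, shut down the CVM, and then pass the returned ticks to `find_stalls`. A single gap of roughly 20 seconds would point at storage rather than PCoIP.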