One of the great things about Nutanix is that you can add nodes one at a time and grow your storage cluster. One of the bad things about Nutanix is that there really isn’t a way to remove a node from a cluster (yet) without doing a cluster destroy. Cluster destroy is basically game over for that cluster: it removes all of the nodes and puts them back into factory restore mode, as in they look just like they did when they arrived from the factory.
So what happens when you buy a few more blocks from Nutanix, create a new cluster, and need to migrate your production VMs from the old cluster to the new cluster?
We ran into a situation where we had our production servers running on Nutanix 3450 blocks and needed a bit more oomph, so we purchased Nutanix 3460 blocks, which support 512GB RAM per node instead of 256GB and have 10-core CPUs instead of 8-core. We could have added these nodes to the same cluster, except that we wanted to take the old nodes and add them to our VDI cluster. (We haven’t done any performance testing on mixing VDI and server workloads in a single cluster, so we decided to play it safe and keep the clusters segregated.)
So how do we migrate 6TB of production VMs all in one night and maintain application consistency? Live Migration!? Well, we could have tried it, but upgrading to vSphere 5.5 SSO seems to have killed our vSphere webclient. Support ticket opened… Yay VMware for not including live migration in the Windows client, because it’s not like we still need that client for SRM or Update Manager or anything, since those are fully supported by the webcli… oh. Also I’m sure that as soon as they get everything working in the webclient, 95% of their enterprise customers are going to ditch Windows because it will finally be the year of the Linux deskt… oh.
Meanwhile back at the ranch, we need to get these VMs over to the new cluster. I guess we’re going to power them off and do a storage migration. Luckily our production servers support a mission that only happens during the day, so powering them off for a few hours isn’t that big of a deal. Maybe we should test this first. Test VM created, power off, right-click Migrate, start migration and… it’s moving at a whopping 33MB/s. Hmm… so 6TB at 33MB/s works out to roughly 53 hours to complete. Uh, I don’t think that’s going to work. VMware should really add storage migrations to the VAAI API and let the storage vendors figure out how to speed up transfers.
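For anyone who wants to sanity-check that estimate, here’s the back-of-the-envelope math. It assumes 6 TiB of data and a perfectly steady 33MB/s the whole way, which is probably generous:
#Rough transfer-time estimate in hours: (6 TiB in MiB) / 33 MiB/s / 3600 s per hour
echo "6 * 1024 * 1024 / 33 / 3600" | bc -l
#Roughly 53 hours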
Still, I don’t have 53 hours of downtime to migrate these VMs. How can I get them migrated in a reasonable time? Nutanix DR to the rescue!
All of the gory details about how DR works are a separate blog post. Suffice it to say that I did the following:
#Log into CVM and open firewall ports for DR
for i in `svmips`; do ssh $i "sudo iptables -t filter -A WORLDLIST -p tcp -m tcp --dport 2009 -j ACCEPT && sudo service iptables save"; done
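If you want to double-check that the rule actually landed on every CVM before kicking off replication, something like the following should do it (it just lists the chain the rule was added to and greps for the port):
#Confirm port 2009 is open on each CVM
for i in `svmips`; do ssh $i "sudo iptables -L WORLDLIST -n | grep 2009"; done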
#Create the remote site of new cluster on old cluster
remote-site create name=NEW_CLUSTER address-list="10.xxx.xxx.2" container-map="OLD_DATASTORE:NEW_DATASTORE" enable-proxy="true"
#Create the remote site of old cluster on new cluster
remote-site create name=OLD_CLUSTER address-list="10.xxx.xxx.1" container-map="NEW_DATASTORE:OLD_DATASTORE" enable-proxy="true"
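Before going any further it’s worth confirming that each cluster can see the other. If I remember the ncli syntax correctly, listing the remote sites on both clusters is enough of a sanity check:
#Verify the remote site shows up (run on both clusters)
remote-site list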
#Create the protection domain
pd create name="PRODUCTION"
#Add my production server VMs to the protection domain
pd protect name="PRODUCTION" vm-names=PROD01,PROD02,PROD03 cg-name="PRODCG"
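To make sure the right VMs actually ended up in the protection domain before pulling the trigger on the migrate, you can list it back out (I’m going from memory on the exact ncli filter syntax here, so treat this as a sketch):
#Confirm the protection domain contains the right VMs
pd list name="PRODUCTION"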
#Migrate the production VMs
pd migrate name="PRODUCTION" remote-site="NEW_CLUSTER"
This operation does the following:
1. Creates and replicates a snapshot of the protection domain.
2. Shuts down VMs on the local site.
3. Creates and replicates another snapshot of the protection domain.
4. Unregisters all VMs and removes their associated files.
5. Marks the local site protection domain as inactive.
6. Restores all VM files from the last snapshot and registers them on the remote site.
7. Marks the remote site protection domain as active.
#Check that replication started
pd list-replication-status
You will see output similar to the following on the sending cluster:
ID | 2345700 |
Protection Domain | PRODUCTION |
Replications Operation | Sending |
Start Time | 01/11/2014 20:35:00 PST |
Remote Site | NEW_CLUSTER |
Snapshot Id | 2345688 |
Aborted | false |
Paused | false |
Bytes Completed | 2.72 GB (2,916,382,112 bytes) |
Complete Percent | 91.117836 |
On the receiving cluster you will see:
ID | 4830 |
Protection Domain | PRODUCTION |
Replications Operation | Receiving |
Start Time | 01/11/2014 20:35:00 PST |
Remote Site | OLD_CLUSTER |
Snapshot Id | OLD_CLUSTER:2345688 |
Aborted | false |
Paused | false |
Bytes Completed | 2.72 GB (2,916,382,112 bytes) |
Complete Percent | 91.117836 |
If you want to keep an eye on the replication status, a helpful command to know is the Linux command watch. The command below will update the status every second.
watch -n 1 ncli pd list-replication-status
Since the migration takes two snapshots, you will see the replication status reach 100%, and then another replication will start for the snapshot of the powered-off VMs.
When the first snapshot gets to 100%, the VMs will be removed from the old cluster in vCenter. After the 2nd replication completes, they will be added to the new cluster.
For our migration the transfer seemed to reach 90% fairly quickly, then took about 1-2 hours to get from 90% to 100%. Perhaps someone from Nutanix can shed some light on what is happening during that last 10% and why it takes so long.