Category Archives: Nutanix

Migrate VMs on Nutanix from one cluster to another without Live Migration

One of the great things about Nutanix is that you can add nodes one at a time and grow your storage cluster.  One of the bad things about Nutanix is that there really isn't a way to remove a node from a cluster (yet) without doing a cluster destroy.  Cluster destroy is basically game over for that cluster: it removes all of the nodes and puts them back to factory restore mode, so they look just as they did when they arrived from the factory.

So what happens when you buy a few more blocks from Nutanix, create a new cluster, and need to migrate your production VMs from the old cluster to the new cluster?

We ran into a situation where we had our production servers running on Nutanix 3450 blocks, and needed a bit more oomph, so we purchased Nutanix 3460 blocks which support 512GB RAM per node instead of 256GB and have 10 core CPUs instead of 8.  We could have added these nodes to the same cluster except that we wanted to take the old nodes and add them to our VDI cluster.  (We haven't performed any performance testing on the solution of just having one cluster mixing VDI and server workloads, so we decided to play it safe and segregate the clusters).

So how do we migrate 6TB of production VMs all in one night and maintain application consistency?  Live Migration!?  Well, we could have tried it, but upgrading to vSphere 5.5 SSO seems to have killed our vSphere web client.  Support ticket opened... Yay VMware for not including live migration in the Windows client, because it's not like we still need that supported for SRM or Update Manager or anything, because that is fully supported by the webcli... oh.  Also I'm sure that as soon as they get everything working in the web client, 95% of their enterprise customers are going to ditch Windows because finally it will be the year of the Linux deskt... oh.

Meanwhile back at the ranch, we need to get these VMs over to the new cluster.  I guess we're going to power them off and do a storage migration.  Luckily our production servers support a mission that only happens during the day, so powering them off for a few hours isn't that big of a deal.  Maybe we should test this first.  Test VM created, power off, right click Migrate, start migration and... it's moving at a whopping 33MB/s.  Hmm... so 6TB at 33MB/s is more than 50 hours to complete (58hrs 15 minutes and 15 seconds if you round the rate down to 30MB/s).  Uh, I don't think that's going to work.  VMware should really add storage migrations to the VAAI API and let the storage vendors figure out how to speed up transfers.
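For anyone who wants to redo the back-of-the-envelope math, here is a minimal Python sketch.  It assumes binary units (1TB = 1024^4 bytes, 1MB = 1024^2 bytes), and the function name transfer_time is just something made up for illustration.

```python
# Rough transfer-time estimate for a powered-off storage migration.
# Assumption: binary units (TiB / MiB); decimal units give slightly lower numbers.

TIB = 1024 ** 4
MIB = 1024 ** 2


def transfer_time(size_bytes, rate_bytes_per_s):
    """Return (hours, minutes, seconds) needed to move size_bytes at a steady rate."""
    total_seconds = int(size_bytes / rate_bytes_per_s)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return hours, minutes, seconds


# 6TB at the observed 33MB/s
print(transfer_time(6 * TIB, 33 * MIB))  # (52, 57, 30)

# 6TB at a rounded-down 30MB/s
print(transfer_time(6 * TIB, 30 * MIB))  # (58, 15, 15)
```

Either way the answer is "more than two days", which is the point.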

Still, I don't have 58 hours of downtime to migrate these VMs.  How can I get them migrated in a reasonable time?  Nutanix DR to the rescue!

All of the gory details about how DR works are a separate blog post.  Suffice it to say that I did the following:

#Log into CVM and open firewall ports for DR
for i in `svmips`; do ssh $i "sudo iptables -t filter -A WORLDLIST -p tcp -m tcp --dport 2009 -j ACCEPT && sudo service iptables save"; done

#Create the remote site of new cluster on old cluster
remote-site create name=NEW_CLUSTER address-list="10.xxx.xxx.2" container-map="OLD_DATASTORE:NEW_DATASTORE" enable-proxy="true"

#Create the remote site of old cluster on new cluster
remote-site create name=OLD_CLUSTER address-list="10.xxx.xxx.1" container-map="NEW_DATASTORE:OLD_DATASTORE" enable-proxy="true"

#Create the protection domain
pd create name="PRODUCTION"

#Add my production server VMs to the protection domain
pd protect name="PRODUCTION" vm-names=PROD01,PROD02,PROD03 cg-name="PRODCG"

#Migrate the production VMs
pd migrate name="PRODUCTION" remote-site="NEW_CLUSTER"

This operation does the following:
1. Creates and replicates a snapshot of the protection domain.
2. Shuts down VMs on the local site.
3. Creates and replicates another snapshot of the protection domain.
4. Unregisters all VMs and removes their associated files.
5. Marks the local site protection domain as inactive.
6. Restores all VM files from the last snapshot and registers them on the remote site.
7. Marks the remote site protection domain as active.

#Check that replication started
pd list-replication-status

You will see an output similar to below on the sending cluster:

ID 2345700
Protection Domain PRODUCTION
Replications Operation Sending
Start Time 01/11/2014 20:35:00 PST
Remote Site NEW_CLUSTER
Snapshot Id 2345688
Aborted false
Paused false
Bytes Completed 2.72 GB (2,916,382,112 bytes)
Complete Percent 91.117836

On the receiving cluster you will see:

ID 4830
Protection Domain PRODUCTION
Replications Operation Receiving
Start Time 01/11/2014 20:35:00 PST
Remote Site OLD_CLUSTER
Snapshot Id OLD_CLUSTER:2345688
Aborted false
Paused false
Bytes Completed 2.72 GB (2,916,382,112 bytes)
Complete Percent 91.117836

If you want to watch the replication status, a helpful command to know is the Linux command watch.  The command below will update the status every second.

watch -n 1 ncli pd list-replication-status
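If you would rather script the check than stare at watch, the status text is easy enough to pick apart.  The Python sketch below is only an illustration: it assumes the output looks like the sample above (one "field value" pair per line, with or without a colon in between), and the names KNOWN_FIELDS and parse_status are mine, not anything that ships with ncli.

```python
# Minimal sketch: pull fields out of "pd list-replication-status" style text.
# Assumption: the field names below are exactly the ones shown in the sample output.

KNOWN_FIELDS = [
    "ID", "Protection Domain", "Replications Operation", "Start Time",
    "Remote Site", "Snapshot Id", "Aborted", "Paused",
    "Bytes Completed", "Complete Percent",
]

SAMPLE = """\
ID 2345700
Protection Domain PRODUCTION
Replications Operation Sending
Start Time 01/11/2014 20:35:00 PST
Remote Site NEW_CLUSTER
Snapshot Id 2345688
Aborted false
Paused false
Bytes Completed 2.72 GB (2,916,382,112 bytes)
Complete Percent 91.117836
"""


def parse_status(text):
    """Return a dict mapping each known field name to its value string."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        for field in KNOWN_FIELDS:
            if line.startswith(field + " ") or line.startswith(field + ":"):
                # Drop the field name plus any separating spaces or colon.
                record[field] = line[len(field):].lstrip(" :").strip()
                break
    return record


status = parse_status(SAMPLE)
percent = float(status["Complete Percent"])
print(status["Remote Site"], percent)
```

From there it is a short step to looping until the percentage hits 100 and sending yourself an email instead of refreshing a terminal all night.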

Since the migration takes two snapshots, you will see the replication status reach 100% and then another replication will start for the snapshot of the powered-off VMs.

When it gets to 100% on the first snapshot, the VMs will be removed from the old cluster in vCenter.  After the 2nd replication completes they will be added to the new cluster.

For our migration the transfer seemed to reach 90% fairly quickly, then took about 1-2 hrs to get from 90-100%.  Perhaps someone from Nutanix can shed some light on what is happening during that last 10% and why it takes so long.

Nutanix 1350

I have been using the Nutanix Virtual Computing Platform 3450 and 3460 appliances on some of my recent projects.  I have been wanting to do some testing to see what these appliances are capable of, I mean other than hosting 5000+ VMware View desktops.  But it's not like I can just go pull one out of production and fire up IOMeter, or install Hyper-V on it, or run some What-If-BadThings(TM)-happen scenarios like a hard drive accidentally getting pulled or two nodes deciding to power off at the same time.

Nutanix was kind enough to send me a Nutanix 1350 Virtual Computing Platform appliance to do exactly this.  The 1000 series is the little brother to the 3000 series.  Without having received Nutanix Official Sales Training(TM) I should clarify what the series numbers mean:

Digit                     Value for the 1350
X (Series Number)         1 (1000 Series)
X (Number of Nodes)       3 (3 Nodes)
X (Processor Type)        5 (Dual Intel Sandy Bridge E5-2620)
X (SSD Drive Capacity)    0 (1-400GB SSD Drive)
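To make the digit scheme concrete, here is a small Python sketch that splits a four-digit model number into the positions described above.  This is only my reading of the numbering, not an official Nutanix decoder, and decode_model is a made-up name; it deliberately does not try to translate the processor or SSD codes into actual hardware.

```python
# Minimal sketch: split a four-digit Nutanix model number into its digit positions.
# Assumption: the positions mean series / node count / processor type / SSD capacity.

def decode_model(model):
    digits = str(model)
    if len(digits) != 4 or not digits.isdigit():
        raise ValueError("expected a four-digit model number, got %r" % (model,))
    series, nodes, processor, ssd = (int(d) for d in digits)
    return {
        "series": series * 1000,      # 1 -> 1000 Series
        "nodes": nodes,               # number of nodes in the block
        "processor_type": processor,  # code only, not mapped to a CPU model
        "ssd_code": ssd,              # code only, not mapped to a capacity
    }


print(decode_model(1350))
# {'series': 1000, 'nodes': 3, 'processor_type': 5, 'ssd_code': 0}
```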

Nutanix had also warned me that the appliance is rated to consume 1150W at 10-12A.  With all of the other equipment that I have in the office, my 15A circuit didn't look like it was going to cut it.  Time for a power upgrade!

However, something seemed to be missing to complete this power upgrade... attic access!  5 days, 10 trips to Home Depot, a stud finder, 1 new reciprocating saw, and 4 holes in the wall later I had finally installed a new 20A circuit!

This is also probably where I should put the disclaimer:
I am a computer systems engineer and not a licensed electrician.  Any work performed on your own structures must be performed according to your local laws and building codes.  It is highly recommended to have any electrical work performed by a licensed electrician.

Found the back of the electrical panel!

Circuit breaker installed!

Time for unboxing!

Even though it came with rails, I don't feel like moving everything around in my lab rack.  I want to play!  I'll just set it on top and rack it later.

So now that I have it plugged in, let's see what this thing is going to cost me to run.  Thanks to Southern California Edison and the California Public Utilities Commission I'm in Tier 4, which costs $0.31 per kilowatt hour.  At 1.15 kW * 24 hrs per day * 30 days per month * $0.31 per kWh, I'm looking at a $256.68 increase in my bill next month.

However, I plugged in my Kill-A-Watt meter and it shows me that these 3 nodes are only consuming 367 Watts.  At 0.367 kW * 24 hrs per day * 30 days per month * $0.31 per kWh, it looks like I'm only going to be paying an additional $81.91.  I realize that these numbers are at idle, so I'll have to write another post once I get a load spun up.  Also, this load probably could have fit on my existing 15A circuit.  But at least I got to play Tim Taylor over the holiday break and get more power!
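If you want to plug in your own wattage and utility rate, the math is simple enough to sketch in a few lines of Python.  The function name monthly_cost is just for illustration, and the defaults (24 hours a day, 30 days a month, $0.31 per kWh) are the numbers from this post, not anything universal.

```python
# Minimal sketch: monthly electricity cost for a constant load.
# Assumption: the load runs 24x7 at a flat rate per kWh (no tier changes mid-month).

def monthly_cost(watts, rate_per_kwh=0.31, hours_per_day=24, days_per_month=30):
    kwh = watts / 1000 * hours_per_day * days_per_month
    return round(kwh * rate_per_kwh, 2)


print(monthly_cost(1150))  # rated draw:    256.68
print(monthly_cost(367))   # measured idle: 81.91
```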
