Change Nutanix CVM RAM with PowerCLI

*Update – story behind the script*
Finally I have a few minutes to write the story behind this script.

One of our VMware View environments was experiencing performance problems. The CPUs on our VMs would constantly spike to 100% after they were powered on. Our admins relayed back to engineering that they were having density issues. We reached out to Nutanix, who recommended that we increase the cache size to be able to absorb more IOPS. To increase the cache size on Nutanix you simply need to power off the controller virtual machine (CVM) on a host, increase its RAM, and power it back on. While this is a non-disruptive process if you power the CVMs off and on one at a time, it becomes a very disruptive process if someone makes a mistake and powers off more than one CVM at a time. It is also very time intensive because you must check that the CVM services are completely back up before you perform the procedure on the next CVM. With 120 hosts in our environment, and averaging 10 minutes per manual CVM procedure, it looked like it was going to take about 20 hours to perform this task. For us this means 3-4 days in maintenance windows!

I figured there had to be a way to automate this and eliminate the human component so we could perform this maintenance task all in one maintenance window. After a couple hours of fiddling with PowerCLI, trying to figure out which service is the last CVM service to come up, and running the script in our test environment to work out the bugs, we were ready to run it in production. In our environment the average run time per CVM was about 5 minutes, but the best part is that it really saves hours of admin time. An admin only needs to babysit the script while it is running instead of needing to perform an intensive manual process. This shows the huge benefit of Software Defined Storage. Imagine trying to update cache on a traditional SAN without any downtime… it isn’t going to happen.
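To make the one-at-a-time rule concrete, here is a minimal sketch of the loop such a script needs. It is written in Python rather than PowerCLI and is not the original script: the three callables (all_healthy, change_one, wait_until_healthy) are hypothetical hooks standing in for whatever actually checks the cluster, resizes the CVM, and waits for its services. The small estimate_hours helper just redoes the run-time arithmetic from the story.

```python
from typing import Callable, Iterable, List


def estimate_hours(host_count: int, minutes_per_cvm: float) -> float:
    """Total serial run time in hours when CVMs are handled one at a time."""
    return host_count * minutes_per_cvm / 60


def rolling_cvm_change(
    cvms: Iterable[str],
    all_healthy: Callable[[], bool],
    change_one: Callable[[str], None],
    wait_until_healthy: Callable[[str], bool],
) -> List[str]:
    """Apply change_one to each CVM in turn, never touching two at once.

    All three callables are hypothetical hooks (assumptions, not real APIs):
      all_healthy        -> True when every CVM in the cluster is up
      change_one         -> power off one CVM, change its RAM, power it on
      wait_until_healthy -> block until that CVM's services are back,
                            returning False on timeout
    """
    done: List[str] = []
    for cvm in cvms:
        # Refuse to start on a new CVM if any CVM is already down.
        if not all_healthy():
            raise RuntimeError(f"cluster not healthy before {cvm}; stopping")
        change_one(cvm)
        # Do not move on until this CVM is fully back in the cluster.
        if not wait_until_healthy(cvm):
            raise RuntimeError(f"{cvm} did not come back; stopping")
        done.append(cvm)
    return done


if __name__ == "__main__":
    print(estimate_hours(120, 10))  # manual procedure: 20.0 hours
    print(estimate_hours(120, 5))   # scripted run: 10.0 hours
```

The point of the design is that the script stops on the first sign of trouble instead of continuing to the next host, which is exactly the mistake a tired admin can make by hand.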

It later turned out that the issue in our environment was a classic VMware View admin mistake of installing updates and then shutting down immediately and recomposing the pool. The updates needed to finish installing after reboot, so they finished installing on all of the linked clones when they powered on. Combined with refresh on logoff, which occurs multiple times per day, it was a sure way to test the max performance of our equipment!

VMware View Guy Admits that Citrix XenDesktop is Just As Good

So I’ll admit it, I knew nothing about Citrix.  Well I mean other than all the FUD VMware was spewing about how much “fun” I would have if I ever implemented it for a customer.  Citrix actually showed up in the office about 4 years ago to try to explain what was going on but all I remember is that they showed me something called Dazzle and I thought, “how the hell am I supposed to explain to my customers what a Dazzle is supposed to do?” and then went back to installing VMware View.

Really, I was just too busy running around deploying View to get a couple hours to deploy XenDesktop and do my own fact checking.  And really, that is all it takes, is a couple hours. 

One of my vendors insisted that I was missing out.  They introduced me to the Federal team over at Citrix, who got me into Citrix Synergy and introduced me to Bob Mensah, Systems Engineer for Citrix.  Bob is an amazing font of Citrix knowledge!  Bob was able to walk me through the installation of XenDesktop in my lab in a couple hours while I was literally sitting at Honda waiting for my wife’s van to be serviced.

If you’ve been doing View for any significant period of time it’s not that hard to pick up.  Yeah, all the services have different names, but they have the same functionality.  Here’s a chart to help you figure it out:

Horizon View                              | Citrix XenDesktop
vCenter                                   | vCenter (but could also be XenCenter or SCVMM)
View Connection Server                    | StoreFront
View Composer                             | Machine Creation Services
View Administrator                        | Citrix Studio
Horizon Workspace                         | StoreFront
Install license key on host               | Licensing Server
Need 3rd party load balancer              | Netscaler included
ThinApp (packaged executables)            | XenApp (Streamed Applications)
Blast (run ThinApps, XenApps, or RDSApps) | StoreFront / XenApp

Bob Mensah even pointed me toward these guides that helped me set up CAC authentication in my lab:
Citrix – Create a JITC test CAC environment for XenDesktop/XenApp
Microsoft Technet – Step by Step Guide – Single Tier PKI Hierarchy Deployment

The Citrix administrative tools are Windows only, which could be seen as a drawback, but really the vSphere Web Client and View Administrator client are written in Flash and are slow, so I think Citrix actually has better functioning tools here.

Using Citrix Receiver to connect to a Windows desktop feels a lot like using the View Client.  The one thing that I did notice using my CAC was that I had to use my PIN two times.  Once to authenticate to StoreFront and then another to authenticate to the Windows VM.  With View I only have to put in my PIN once to authenticate to the View Connection Server and that gets passed to the VM.  Citrix told me that this is to overcome a security issue with having the PIN cached on the connection broker, but really I have never had an IA person tell me that was an issue with View so I am curious to understand where that requirement came from.

One thing that the Citrix Receiver has going for it is that it works with the new Tactivo iPad CAC Reader from Precise Biometrics.  CAC Authentication for iPad is nothing new, but previously it could only be accomplished on a per app basis with specialized apps designed to interact with some kind of Bluetooth CAC reader or dongle.  Neither was very convenient.  The Bluetooth reader meant that you needed to carry around an extra peripheral, charge it, and hope nothing interrupted your Bluetooth connection.  The dongle… was just cumbersome and silly.  The Tactivo is a sleek integrated case, shown below in the iPad mini model with a magnetic smart cover (not included).  It connects via the Lightning connector and has a micro USB port that supports charging only.  See my photos of the unit below.  The VMware View client does not support this unit yet, and I suspect that it will actually fuel a lot of interest in Citrix until it does.

photo 3 photo 2 photo 4

Using XenApp you can now wrap CAC authentication around any application and present it on the iPad, including presenting entire Windows desktops complete with paired bluetooth keyboard and mouse (explained below)!

photo 6         

The other innovative thing about the Citrix Receiver client for iPad is that they have cleverly overcome the iOS inability to pair with a bluetooth mouse!  You can use another iOS device with the Citrix Receiver client installed on it as a touchpad!  The only silly part about this was that I had to set up the storefront connection on the extra device before I could pair it.  I am assuming that it either communicates between the iDevices through wireless or bluetooth, so I think that having to set up the client before you can use it as a touchpad is unnecessary.  However it works really well.  While the screen is a little small on the iPad mini, I was able to open applications and even play a movie just like I could with the Windows client.  My opinion is that it would definitely be a better experience with a full size iPad.

The only other issue I had when I was using the Citrix Receiver client is that there are a lot of extra options in the settings (shown in the picture below) that weren’t intuitive.  Here is the documentation for the client, but if you look through it you will see that the settings in the picture below are not documented.  If you look at the documentation for the View Client for iOS you see that every little feature in the client has a blurb explaining what it does.

options

In all, my initial impression of Citrix XenDesktop is that it has just as much functionality as VMware View.  I just wish that some things had more effort put into documentation rather than getting the functionality ready to ship.

What #NixVblock should have been

Nutanix is running a marketing campaign, #NixVblock.  As part of the campaign they had a video that I really can’t describe better than the way Sean Massey put it:

“VBlock is supposed to be an uninteresting, high maintenance woman who hears three voices in her head and dresses like three separate people.

The “VBlock” character is supposed to represent the negatives of the competing VCE vBlock product.  Instead, it comes off as the negative stereotype of a crazy ex that has been cranked past 11 into offensive territory.”

While I was not personally offended by the video, it was inappropriate, and I was very disappointed.  It had the feeling of the inside joke that you tell someone else who isn’t involved and then you come off as an insensitive jerk.  You didn’t mean to be an insensitive jerk, you just wanted to let your new friend in on the joke too.  When you turn that joke into a marketing video for your company, comparing your competitor to a crazy date and broadcast it to the world in an official marketing campaign, that is sexist and immature.  Would VCE put out a video like that?  The immaturity of the video just makes Nutanix come out looking like the underdog that they are… nipping at the heels of VMware, Cisco and EMC.

Since Nutanix is still a startup, perhaps they still have interns running the marketing department?  I really only need to ask the marketing department one question that should illustrate why I am upset that they chose such an immature method to attempt to communicate their product’s technical superiority to vBlock (which that video doesn’t even attempt to address).  Who is the intended audience of that video?  Is it customers that haven’t purchased Nutanix before but are also considering VCE?   Consider that some of my US Federal customers have many organizations run by women.  Is that video something that I should point them to that will make them choose Nutanix over VCE?  Is that video going to help convince them that Nutanix is actually the more mature feature rich product?

I have actually experienced trying to procure VCE for a project.  VCE is actually a separate company that resells VMware, Cisco and EMC in one package.  They market that the value add is that their support is qualified in all three products and won’t redirect you to VMware, Cisco or EMC.  But in reality this only helps tier 1 sys admins.  If you forget to check a box, VCE will help you, but if you encounter anything that is a serious bug in one of the technologies, you are going to get redirected to the source.  Also, when I tried to procure VCE it came out as SIGNIFICANTLY more expensive than just buying the components separately and putting them together myself… I guess that VCE SME has to eat too?!  Imagine that… putting in a middle man costs more money rather than less… Who would have thought it?!

Another disadvantage you have with VCE is that you lose the ability to compete the internal components.  For example I lose the ability to compete VMware with Citrix, Cisco with Brocade or Arista, and EMC with NetApp, and that competition is what lowers costs for my customers.  I also had the requirement to have US citizen on US soil support, and at the time the VCE rep couldn’t tell me whether they had it or not… i.e. I was going to get redirected to the component supplier when I called support anyway.  In the end, I just bought the separate VMware, Cisco, and EMC components and bolted them together myself.

Of course that was long before Nutanix.  Which brings me to the title of this post.  All Nutanix really had to do was highlight the features that Nutanix has that vBlock doesn’t have.  Let’s compare.

Nutanix | vBlock
Built-in VM aware Disaster Recovery integrated into GUI with N:Many replication | Not Built in.  Can buy RecoverPoint for Block replication and MirrorView for file replication. Not VM aware unless you’re talking about vSphere replication, but that’s not really storage-level replication
VM aware storage snapshots | Block or File level snapshots
Simple web based GUI interface | Cluttered Java interface that I can only get to when I alter security policies to allow some version of java 5 releases old.
Storage Controller on every node | 2 Storage Controllers
Infinitely scalable | Forklift upgrade
Shared nothing architecture | Shared Everything Architecture
Built in Compression / Deduplication | Why would you compress / dedupe?  How would VCE make you buy more disks?
Shadow Clones | Nothing like shadow clones.
Built in storage analytics that detail IO by disk, VM and node | Not Built in.  You can buy the EMC Storage Analytics plug-in for vCOPS for $20K.
Prism Central management interface can span multiple clusters | You can argue that Unisphere can do this too, but is still in Java and sucks.

I could sit here for an hour adding to this list, but I think I’ve made my point.

Nutanix, please don’t fire anyone for failing with that video.  We can forgive you, and you need to allow people to make mistakes, learn and grow from them, but going forward please stick to marketing your strengths.  You don’t need to put anyone down; what you are doing stands out by itself.  Take the high road and you’ll win more friends.  I also get that it may have grown out of an inside joke, and sometimes it is hard to see any potential complications from the inside, but you have enough money to get an external PR agency for future marketing campaign analysis.

Nutanix Compression Results

Nutanix has a feature called Post Process Compression.  It’s gone through a couple marketing name changes, and it looks like the latest name for it is MapReduce Compression.  Basically what this means is that when data is written it can be compressed after a period of time (0 to X minutes later).  When the data is accessed again it is decompressed, then recompressed after the specified time period.  The compression is designed to perform the task with unused cycles, meaning that it will not compete with the production workloads. 

There are not really any other end user configurable options for the compression other than on/off and delay.

If you have a file that is constantly accessed, you will want to set a delay of at least a few minutes so it is not constantly being compressed / decompressed every time someone opens it.

Unfortunately Nutanix does not currently have an estimation tool to determine what kind of savings you may get by enabling compression or how long it will take to compress, so I decided to test this feature for myself on a test cluster, as I am looking at enabling it in production.

Compression is enabled at the container level.  You can either use the ncli command or you can enable it in the PRISM UI:

NCLI: 
container edit id=[container id] enable-compression=[true] compression-delay=[# minutes]

PRISM UI:
Click on Storage, Diagram, Update, Advanced Settings.

image

As you can see here I started out with 3.27 TB of data.  About 1TB are VMs and 2TB are documents, ISOs, photos and videos. 

image

It took a couple days for it to stop churning.  It finally ended up with 12% compression.
 image
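To put that percentage into terabytes, here is a quick back-of-the-envelope calculation. It assumes the 12% figure means 12% of the original 3.27 TB was saved; the function name is just for illustration.

```python
def space_saved_tb(logical_tb: float, savings_fraction: float) -> float:
    """Space reclaimed, assuming savings_fraction is a share of the original data."""
    return logical_tb * savings_fraction


logical_tb = 3.27   # data in the container before compression
savings = 0.12      # 12% compression, read as a fraction saved (assumption)

saved = space_saved_tb(logical_tb, savings)
print(f"saved ~{saved:.2f} TB, ~{logical_tb - saved:.2f} TB on disk")
# prints: saved ~0.39 TB, ~2.88 TB on disk
```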

 

Below is the performance chart for the CVMs in this test cluster.  All VMs were powered on (although many were doing nothing).  You can see that 25% utilization is the normal idle and that most of the compression was performed in the first few hours.

image

image

image

 

Conclusion:
Overall I see no downside to enabling the compression feature.  While it didn’t save me an amazing 50%, from what I can tell there is no noticeable performance impact, so why not save all the space that I can?  With the changes coming to the Nutanix software licensing this is now a standard feature, which makes me happy as it was previously a separately licensed feature.

Error checking file system on altbootbank

I pulled out my last SATA disk from my homelab whitebox.  The ESXi boot partition was on there.  My goal is to never use a spinning disk again, so I decided to try booting from USB.

I actually installed ESXi on the USB drive in VMware Fusion, then pulled it out and put it in my host to boot from.

I plugged the drive into a USB 3.0 port.  It booted, but it didn’t save any configuration changes when I rebooted.  I plugged it into a USB 2.0 port and the changes saved, but when I went to install drivers I received the following message:

[InstallationError]
There was an error checking file system on altbootbank, please see log for detail.
Please refer to the log file for more details.

Google pointed me to this KB article: Remediating an ESXi 5.x host with Update Manager fails with the error: There was an error checking file system on altbootbank (2033564) which did the trick with the following results:

/tmp # vmkfstools -P /altbootbank/
vfat-0.04 file system spanning 1 partitions.
File system label (if any):
Mode: private
Capacity 261853184 (63929 file blocks * 4096), 261066752 (63737 blocks) avail, max file size 0
UUID: 03a06fe6-945129df-35c7-0bb4da8a6324
Partitions spanned (on "disks"):
        mpx.vmhba32:C0:T0:L0:6
Is Native Snapshot Capable: NO

/tmp # dosfsck -a -w /dev/disks/mpx.vmhba32:C0:T0:L0:6
dosfsck 2.11, 12 Mar 2005, FAT32, LFN
/.Spotlight-V100/Store-V2/CBAD0E16-35BF-4279-81B1-45D156946301/.store.db
  File size is 102400 bytes, cluster chain length is > 102400 bytes.
  Truncating file to 102400 bytes.
Reclaimed 3 unused clusters (12288 bytes) in 2 chains.
Performing changes.
/dev/disks/mpx.vmhba32:C0:T0:L0:6: 53 files, 188/63929 clusters
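The two commands above are linked by the partition name: the device listed under "Partitions spanned" in the vmkfstools output is what gets passed to dosfsck. As a small illustration of that step only, here is a Python sketch that pulls the partition out of the captured output and builds the second command line. It only parses text and formats a string; it does not run anything on the host, and the function names are made up for this example.

```python
from typing import List

# Output of `vmkfstools -P /altbootbank/` as captured above.
SAMPLE = """\
vfat-0.04 file system spanning 1 partitions.
File system label (if any):
Mode: private
Capacity 261853184 (63929 file blocks * 4096), 261066752 (63737 blocks) avail, max file size 0
UUID: 03a06fe6-945129df-35c7-0bb4da8a6324
Partitions spanned (on "disks"):
        mpx.vmhba32:C0:T0:L0:6
Is Native Snapshot Capable: NO
"""


def spanned_partitions(vmkfstools_output: str) -> List[str]:
    """Return the indented device names listed under 'Partitions spanned'."""
    partitions: List[str] = []
    collecting = False
    for line in vmkfstools_output.splitlines():
        if line.startswith("Partitions spanned"):
            collecting = True
            continue
        if collecting:
            # Device names are indented; the first unindented line ends the list.
            if line[:1].isspace() and line.strip():
                partitions.append(line.strip())
            else:
                break
    return partitions


def dosfsck_command(partition: str) -> str:
    """Build the same dosfsck invocation used above for one partition."""
    return f"dosfsck -a -w /dev/disks/{partition}"


if __name__ == "__main__":
    for part in spanned_partitions(SAMPLE):
        print(dosfsck_command(part))
```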

No Port Groups listed when trying to add an interface to an Edge in vShield Manager

I wanted to add a new network segment to deploy a new domain in my lab.  When I went to add a new interface to the edge, no port groups were showing up. 

image

I was running version 5.5.0 so I upgraded to version 5.5.2, but still no port groups were showing up.  I saw that there was an upgrade available so I clicked Actions, Upgrade.  The Edge redeployed and then the port groups showed up.  

image

I had another Edge with the same problem, but this time instead of upgrading I just clicked Actions, Redeploy Edge and after it redeployed I was able to select a Port Group.

Nutanix and VMware vSphere Host Profiles

Host profiles seem like a great idea… Make sure that all of your hosts are configured consistently and enforce compliance. However, when it comes to actually applying a host profile the caveat is that you need to put the host in maintenance mode to apply it. This means that you have to vMotion any running VMs to another host and then enter maintenance mode… A process that could take quite a while depending on the number of VMs you have running.

On Nutanix there is the pesky issue that there is one VM that you cannot vMotion to another host… the CVM! The CVM (Controller Virtual Machine) is the storage controller that lives on the host. The physical disks are presented to the VM through VMDirectPath. Since virtual machines that are tied to physical devices on the host cannot be vMotioned, the host will fail to enter maintenance mode. It is possible to shut down a CVM on one node, then put that host into maintenance mode, apply the host profile, exit maintenance mode, power on the CVM, then SSH into the CVM to make sure it is back in the storage cluster before you rinse and repeat for all of your hosts. However, that is a very manual process! It would be bearable to perform on one block (four Nutanix hosts), but if you have hundreds of hosts it will take weeks and a small army of dedicated sys admins to complete the task.

It’s too bad that VMware couldn’t have host profiles distinguish between minor and major changes when applying them. For example, adding a port group would be a minor change, not requiring maintenance mode, while attaching a vSwitch to a vmnic would be a major change requiring maintenance mode because of its potential to disrupt traffic for all of the VMs on that host.

Do we really need host profiles? Nutanix is trying to market the idea that infrastructure should be web-scale. I don’t really like the term web-scale because I think it implies that you’re trying to build some kind of internet service, but that’s beside the point… What they are trying to say is that it should be easy to massively scale infrastructure. This includes not having to manually configure a bunch of settings. Putting all of the hosts in your environment into maintenance mode just to apply some settings definitely isn’t scalable. There is no reason to do it!

Every change that a host profile makes can be accomplished through PowerCLI without putting your host into maintenance mode. My recommendation for Nutanix hosts is to use PowerCLI to make any changes to your hosts that you want to be consistent throughout your environment, and then maintain your PowerCLI script and apply it to new hosts that you add to your environment.

You could also make a script that checks the settings on the hosts to monitor for compliance, for example to make sure that no one has added a VLAN to just one host. If you are using vCloud in your environment, VMware includes VCM (vCenter Configuration Manager), which accomplishes the same task with the added component of generating automated compliance reports.
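As a rough idea of what such a compliance check boils down to, here is a small Python sketch. It assumes you have already collected, by whatever means, the set of port groups (or VLANs) configured on each host; the host names and VLAN labels are invented for the example, and a setting counts as drift when it is present on some hosts but not all of them.

```python
from typing import Dict, List, Set


def find_drift(host_settings: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Return {setting: [hosts missing it]} for settings not present on every host."""
    all_settings: Set[str] = set()
    for settings in host_settings.values():
        all_settings |= settings

    drift: Dict[str, List[str]] = {}
    for setting in sorted(all_settings):
        missing = sorted(h for h, s in host_settings.items() if setting not in s)
        if missing:
            drift[setting] = missing
    return drift


if __name__ == "__main__":
    # Hypothetical inventory: someone added VLAN30 to esx03 only.
    inventory = {
        "esx01": {"VLAN10", "VLAN20"},
        "esx02": {"VLAN10", "VLAN20"},
        "esx03": {"VLAN10", "VLAN20", "VLAN30"},
    }
    print(find_drift(inventory))  # {'VLAN30': ['esx01', 'esx02']}
```

An empty result means every host has the same set, which is the condition you would want to alert on when it stops being true.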

Of course I’m assuming that your hosts are running VMware. Nutanix also supports running Hyper-V and KVM, where it’s almost inherently implied that you are going to need scripts to maintain consistency in the environment.

Nutanix CVM Autopathing Test

I have a Nutanix cluster that needs to be upgraded from 3.1.2 to 3.5.2.1 (or 3.5.3.1 if it is out by the time I get around to upgrading it). That got me to thinking about the upgrade process. When you perform a Nutanix Operating System (NOS) upgrade, it performs what Nutanix calls a “rolling upgrade”. This in effect only performs the upgrade on one CVM at a time. While the CVM is being upgraded, the storage on that node is directed to another CVM.

My first thought was, “How does that actually work”? Thanks to Zach Vaughn @z_n_v, Nutanix SE Extraordinaire, my eyes were opened.  When the cluster detects that a CVM is down, it SSHs to the Hypervisor (I’m referring to ESXi) and adds a route to the external IP of another CVM in the cluster. The cluster performs this check every 30 seconds, so it is possible that your VM will be without storage for up to 30 seconds. How disastrous could this be? (I’m told that as of NOS version 3.5.3.1 this will be much faster than 30 seconds). The following video shows what happens.
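The 30 second figure is a worst case, and a little arithmetic shows why. The sketch below is a simplified model, not Nutanix’s actual logic: it assumes health checks fire at fixed times (0, 30, 60, … seconds) and that the route change takes effect the moment a check notices the CVM is down.

```python
import math


def seconds_until_reroute(failure_time: float, check_interval: float = 30.0) -> float:
    """Seconds without storage if checks run at t = 0, interval, 2*interval, ...

    Simplifying assumption: rerouting is instantaneous once a check
    detects the failed CVM.
    """
    next_check = math.ceil(failure_time / check_interval) * check_interval
    return next_check - failure_time


if __name__ == "__main__":
    print(seconds_until_reroute(10))  # CVM dies 10 s after a check -> 20.0
    print(seconds_until_reroute(31))  # CVM dies 1 s after a check  -> 29.0
```

Under this model the outage is anywhere from almost nothing to just under one full interval, depending on where in the cycle the CVM goes down.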

This test was performed on a Nutanix 1350 block running NOS 3.5.2.1. The desktop is running on Node C. I start encoding a video using HandBrake, which is writing to the user’s desktop on the local disk. When I shut down the CVM on Node C the desktop appears to hang for 20 seconds. However, it is possible that it is only the PCoIP server process that stops responding for those 20 seconds, as when the desktop resumes you can see that it has still received pings from the hypervisor.

I ran this test from a different machine and the View Client seemed to stay connected. The difference was that it was an iMac connected via ethernet, whereas I recorded the video on my MacBook Pro connected via wireless. The desktop continued to receive pings, but the HandBrake process stopped while the disk was unavailable for about 20 seconds and then resumed when the route to the CVM was changed on the hypervisor. If I can get that to work again I’ll try to post another video.

Export Nutanix Configuration to CSV through Powershell and REST API

What do you do when you have over 100 Nutanix nodes scattered across multiple datacenters and need to audit the configurations, or record the configurations for documentation?

Write a PowerShell script that queries the REST API of course!

In this instance I needed a known starting point.  I didn’t have all of the IP addresses of the CVMs, hosts, etc in a format that I could query.  What I did have was all of the hosts in vCenter along with all of their CVMs.  So this script starts by connecting to all of the vCenters in the Datacenters and getting a list of all of the CVMs and their IP addresses.  It then runs REST API queries against the CVM IPs.
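The overall shape of that workflow (list of CVM IPs in, one CSV row per CVM out) can be sketched as follows. This is a Python illustration, not the PowerShell script itself: fetch_config is a hypothetical callable standing in for the real REST query, fake_fetch and its field names are invented sample data, and nested values are flattened into dotted column names so that every cell ends up as a plain value.

```python
import csv
import io
from typing import Any, Callable, Dict, Iterable, List


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Turn nested dicts into a single level with dotted keys."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def collect_rows(
    cvm_ips: Iterable[str],
    fetch_config: Callable[[str], Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Query each CVM; unreachable ones still get a row, with an error column."""
    rows: List[Dict[str, Any]] = []
    for ip in cvm_ips:
        try:
            row = flatten(fetch_config(ip))
        except Exception as exc:  # keep going so one bad node doesn't stop the audit
            row = {"error": str(exc)}
        row["cvm_ip"] = ip
        rows.append(row)
    return rows


def to_csv(rows: List[Dict[str, Any]]) -> str:
    """Write rows to CSV text, using the union of all keys as the header."""
    fieldnames = sorted({key for row in rows for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def fake_fetch(ip: str) -> Dict[str, Any]:
    """Stand-in for the REST call; the field names here are made up."""
    if ip.endswith(".2"):
        raise ConnectionError("no route to CVM")
    return {"name": "cluster-a", "version": "3.5.2.1", "ntp": {"server": "10.0.0.254"}}


if __name__ == "__main__":
    print(to_csv(collect_rows(["10.0.0.1", "10.0.0.2"], fake_fetch)))
```

Recording an explicit error column for nodes that cannot be queried makes it obvious in the spreadsheet which rows are real data and which are gaps.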


Here’s what the output looks like when opened in Excel (and scrubbed of proprietary information):

image


Any blocks that are not configured yet, or are not running a version of NOS that has the REST API, or do not have network connectivity will return System.Collections.Hashtable values as you can see below.

image
