Tuesday, January 20, 2015

Manually Deactivating a Nutanix Protection Domain

Nutanix Protection Domains are either Active or Inactive: a PD is Active on the cluster that owns the VMs and sends replication data (e.g. a Prod site) and Inactive on the cluster that receives that data (e.g. a DR site). I was poking around in my DR cluster and happened upon my inactive Protection Domain. I noticed that when I clicked on the PD name, there was an Activate link in the bottom right corner.
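
If you just want to check which side is active from the command line, the same information should be visible through ncli, which also accepts 'pd' as shorthand for 'protection-domain':

ncli> protection-domain list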


I foolishly decided to click the Activate link, which left me with an active protection domain at both my sites. Replication from Prod to DR immediately stopped working. I expected a Deactivate link to appear in Prism, but it did not. OK, I'm sure there is a way to issue a deactivate command via ncli, right? Well, not exactly.

ncli> protection-domain
add                              create                           remove                           rm                               list                             ls
activate                         rollback-pd                      migrate                          list-snapshots                   ls-snaps                         add-minutely-schedule
add-hourly-schedule              add-daily-schedule               add-weekly-schedule              add-monthly-schedule             remove-from-schedules            clear-schedules
ls-schedules                     set-retention-policy             clear-retention-policy           ls-pending-one-time-snapshots    add-one-time-snapshot            create-one-time-snapshot
rm-one-time-schedules            restore-snapshot                 protect                          unprotect                        rm-snap                          rm-snapshot
list-replication-status          ls-repl-status                   retain-snap                      retain-snapshot                  pause-repl                       pause-replication
resume-repl                      resume-replication               abort-repl                       abort-replication                ls-pending-actions               metro-avail-enable
metro-avail-disable              promote-to-active

There is an 'activate' command, but no 'deactivate' command. So how do I get rid of this thing? I searched high and low, including the Nutanix Bible, but I couldn't find a way to do it. One quick Nutanix support request later, I was given this hidden command.

ncli pd deactivate_and_destroy_vms name=MSP-PD

Think about what you are doing before you issue this command. There is a reason it's not shown in ncli. The "_and_destroy_vms" portion of that command should give you a clue as to the potential impact. USE AT YOUR OWN RISK

As soon as I issued this command, I got a message back that said the PD was marked for removal. A few seconds later, my protection domain was again shown as inactive in Prism, and the replication jobs from my Production site started working again.

I should note that even though the command seems to indicate that it's going to destroy data, all of my snapshots at the remote site were still intact, so subsequent replications were not full replications. 
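
If you want to double-check that for yourself, the snapshot and replication status subcommands from the listing above should do it. I'm assuming the usual name= parameter here, so adjust if your NOS version complains:

ncli> pd ls-snaps name=MSP-PD
ncli> pd ls-repl-status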

Thursday, January 8, 2015

Simulating WAN Latency with Nutanix

File this one under "unsupported"!

One of our clients asked us to run a bunch of different disaster recovery scenarios between Nutanix clusters. At least two of their existing clusters sit across a WAN, with average latency typical of the geographic distance. My company's WAN is nowhere near as robust as the client's, so I can't test over it without pushing too much data and impacting our remote office.

I have two separate Nutanix clusters in my lab. Both are plugged into the same 10GbE switch, so how do I go about introducing some latency? There are probably Cisco tools or commands I could use, but I'm not a network guy and I don't want to bother my network guy.

After some Googling I discovered the NetEM (Network Emulation) tool for Linux. Lo and behold, its queuing discipline has been part of the Linux kernel since 2.6.8. Conveniently enough, each and every Nutanix Controller VM (CVM) runs Linux. Perfect.

I logged into each CVM at the target site and issued the following command.

nutanix@NTNX-xxxxxxx-A-CVM:192.168.4.10:~$ sudo tc qdisc add dev eth0 root netem delay 60ms
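
The command returns silently, but you can confirm the delay actually took effect by asking tc to show the queuing discipline on that interface; the output should list a netem qdisc on eth0 with the 60ms delay.

nutanix@NTNX-xxxxxxx-A-CVM:192.168.4.10:~$ sudo tc qdisc show dev eth0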

Why the target site? Well, think about this. I don't have a separate network or a separate CVM dedicated to replication. Each and every bit of VM IO is going through my CVMs, so I don't want to kill local read and write performance by slowing down the CVMs on the source side. I only want to simulate latency on my target cluster.

It certainly had an immediate impact.


Nutanix Cluster IO Latency after starting netem on all 4 CVMs

Let's look at a simple ping test next.

You can see right away that when pinging my target cluster CVM from my source cluster CVM, I'm seeing 60ms of latency.

As I mentioned, the tc command doesn't print anything when the qdisc is added, and it doesn't give you an obvious off switch either. After some more reading on the aforementioned NetEM site, I simply logged back into each CVM and issued the following command.

nutanix@NTNX-xxxxxxxx-A-CVM:172.x.x.10:~$ sudo tc qdisc del dev eth0 root

I still had my other ssh session open and actively pinging the target CVM. Right away the latency values dropped to normal.

You'll have to take my word for it that the ping was initiated from another CVM. My Nutanix cluster IPs both end in .50. 

By the way, if you want to avoid logging into every CVM and running this command by hand, you can do it with a simple one-liner from any CVM in your cluster. Again, this is a really good way to kill performance, so don't try this in production.

nutanix@NTNX-xxxxxxx-A-CVM:172.x.x.10:~$ for i in `svmips` ; do ssh $i 'sudo tc qdisc add dev eth0 root netem delay 50ms 10ms'; done

This particular command uses a base delay of 50ms with +/- 10ms of jitter in order to better mimic a real-world WAN.
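
If you want a nastier WAN, netem can layer other impairments on top of the delay. For example, here is the same one-liner with a bit of random packet loss added (the 0.1% is just an arbitrary figure, and you'll need to delete any existing root qdisc first, since tc won't add a second one):

nutanix@NTNX-xxxxxxx-A-CVM:172.x.x.10:~$ for i in `svmips` ; do ssh $i 'sudo tc qdisc add dev eth0 root netem delay 50ms 10ms loss 0.1%'; done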

One of the products I'm evaluating for my client is Zerto. While we have native replication with Nutanix, it would be nice to have smaller RPOs. Zerto also gives us something else: orchestrated failover. In any case, I was able to watch my VPG throughput in Zerto while turning netem on and off on my CVMs to simulate WAN latency. It definitely did the trick.



You can see that I was pushing some serious IO until about 12:06, when I turned on netem. Things tanked really fast, as I expected. Now, the major flaw in this test is that I'm not just impacting network I/O when I enable netem on my CVMs; I may be impacting disk performance as well. I did specifically choose eth0 for netem, so in theory the disk I/O happening locally from VM to host wouldn't be affected, since that traffic comes through eth1. Still, I can't guarantee I'm not unintentionally harming disk I/O without further testing. Perhaps I can run Iometer on a VM that's already sitting on the target cluster and see if its performance tanks. As shown above (and below), Nutanix still sees the cluster IO latency, even though I didn't touch eth1.
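
If I wanted to be more surgical about it, netem can also hang off a classful qdisc with a filter so that only traffic headed for the remote cluster gets delayed, instead of everything leaving eth0. I haven't tried this on a CVM, and the subnet below is just a placeholder for the remote cluster's network, but the general shape (lifted from the standard netem examples) looks like this:

sudo tc qdisc add dev eth0 root handle 1: prio
sudo tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 60ms
sudo tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 match ip dst 10.0.0.0/24 flowid 1:3

The same 'tc qdisc del dev eth0 root' cleans all of that up in one shot.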




Before I forget, you should definitely remember to turn off the simulated latency!

nutanix@NTNX-xxxxxxx-A-CVM:172.x.x.10:~$ for i in `svmips` ; do ssh $i 'sudo tc qdisc del dev eth0 root'; done
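
The same svmips loop is handy for confirming that every CVM is back to its default queuing:

nutanix@NTNX-xxxxxxx-A-CVM:172.x.x.10:~$ for i in `svmips` ; do ssh $i 'sudo tc qdisc show dev eth0'; done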

Wednesday, January 7, 2015

Storage vMotion Impact on Nutanix

In my previous post, I created a separate Nutanix container to house the VMs that I want to protect with SRM using the Nutanix SRA. Since a container is a separate datastore, I had to use Storage vMotion to move the VMs from the existing container to the new one. I wasn't prepared for the impact this had on my Nutanix CVMs and SSDs.

The majority of the VMs being moved are on hosts A and B.

Since all the data is sitting on the same physical disks regardless of container, it's too bad that it all has to be funneled back through the CVMs again. I assumed VAAI would make this quick, but it doesn't appear Nutanix has implemented that offload for moves between containers yet. My fingers are crossed that a future release of NOS will address this.
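
For what it's worth, you can at least check whether the VAAI-NAS plugin is in play for the datastore from the ESXi shell; if I remember right, the NFS listing includes a Hardware Acceleration column:

~ # esxcli storage nfs list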

Fun with Nutanix SRA

I was fortunate to get my hands on a pair of Nutanix blocks so that I could test Protection Domains with VMware Site Recovery Manager. Never one to RTFM, I went headfirst into installing my SRM servers and the Nutanix SRA shortly after configuring a Nutanix remote site and Protection Domain. It wasn't until I started getting error messages in SRM that I finally decided to consult the documentation.

I was a bit annoyed that I had to go into ncli to protect the vstore when I had already created a protection domain and a remote site with all the container info. Then, as soon as I ran the ncli vstore protect command, I realized that I really didn't want every VM in that container replicating. So I can just issue an ncli vstore unprotect command, right? Wrong.

ncli> vstore unprotect name=MSPNFS
Error: Error while unprotecting vStore having id 9868120: MarkPdForRemoval command not inited.

What the heck does that mean? Command not 'inited'? Google came up way empty on that one.

Turns out you can simply whack the auto-created protection domain using the following ncli command:

ncli> protection-domain remove name=MSPNFS_1420656211839
Protection domain MSPNFS_1420656211839 has been successfully marked for removal

A few seconds later, my auto-created protection domain disappeared from Prism, and ncli vstore list showed my container was no longer protected.

Now I need to configure a separate container for this so I can move VMs in and out as needed for SRM. 
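
When I do, protecting the new container should just be a repeat of the earlier vstore command, something along these lines (the container name here is whatever the new one ends up being called, and I believe the parameter is name=):

ncli> vstore protect name=<new-container>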

More to come.