Thursday, January 8, 2015

Simulating WAN Latency with Nutanix

File this one under "unsupported"!

One of our clients asked us to run a bunch of different disaster recovery scenarios between Nutanix clusters. At least two of their existing clusters are sitting across a WAN, with average latency typical of the geographic distance. My company's WAN is not nearly as robust as the client's, so I can't just push a bunch of replication data across it without impacting our remote office.

I have two separate Nutanix clusters in my lab, both plugged into the same 10GbE switch, so how do I go about introducing some latency? There are probably Cisco tools or commands I could use, but I'm not a network guy and I don't want to bother my network guy.

After some Googling I discovered the NetEM (Network Emulation) tool for Linux. Lo and behold, the netem queuing discipline has been part of the kernel since 2.6.8. Conveniently enough, each and every Nutanix Controller VM (CVM) runs Linux. Perfect.
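
If you want to double-check that a given kernel actually has it, netem ships as the sch_netem module on most distributions, so something like this should confirm it's available (the exact path will vary by kernel version):

nutanix@NTNX-xxxxxxx-A-CVM:192.168.4.10:~$ modinfo sch_netem | grep filename
filename:       /lib/modules/&lt;kernel version&gt;/kernel/net/sched/sch_netem.ko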

I logged into each CVM at the target site and issued the following command.

nutanix@NTNX-xxxxxxx-A-CVM:192.168.4.10:~$ sudo tc qdisc add dev eth0 root netem delay 60ms

Why the target site? Well, think about it: I don't have a separate network or a separate CVM dedicated to replication. Each and every bit of VM I/O goes through my CVMs, so I don't want to kill local read and write performance by slowing down the CVMs at the source side. I only want to simulate latency on my target cluster.

It certainly had an immediate impact.


Nutanix Cluster IO Latency after starting netem on all 4 CVMs

Let's look at a simple ping test next.

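Here's the general shape of it; the addresses and exact times below are illustrative, but notice the round-trip time hovering right around the 60ms I injected:

nutanix@NTNX-xxxxxxx-A-CVM:172.x.x.10:~$ ping 192.168.4.10
PING 192.168.4.10 (192.168.4.10) 56(84) bytes of data.
64 bytes from 192.168.4.10: icmp_seq=1 ttl=64 time=60.4 ms
64 bytes from 192.168.4.10: icmp_seq=2 ttl=64 time=60.2 ms
64 bytes from 192.168.4.10: icmp_seq=3 ttl=64 time=60.3 ms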

You can see right away that when pinging my target cluster CVM from my source cluster CVM, I'm seeing 60ms of latency.

When I ran the sudo command to add the netem qdisc, I noticed it didn't give me any confirmation that it was working. It also didn't give me an obvious way to undo it. After some more reading on the aforementioned NetEM site, I simply logged back into each CVM and issued the following command.

nutanix@NTNX-xxxxxxxx-A-CVM:172.x.x.10:~$ sudo tc qdisc del dev eth0 root
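
If you want positive confirmation either way, tc can list whatever queuing discipline is currently installed on the interface; the netem line should show up while the delay is active and disappear once you delete it. Output along these lines (the handle number will vary):

nutanix@NTNX-xxxxxxxx-A-CVM:172.x.x.10:~$ sudo tc qdisc show dev eth0
qdisc netem 8001: root refcnt 2 limit 1000 delay 60.0ms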

I still had my other SSH session open and actively pinging the target CVM. Right away the latency values dropped back to normal.

Ping times to the target CVM dropping back to normal after removing the netem qdisc

You'll have to take my word for it that the ping was initiated from another CVM. My Nutanix cluster IPs both end in .50. 

By the way, if you want to avoid logging into every CVM and running the command by hand, you can hit the whole cluster with a simple one-liner from any CVM. Again, this is a really good way to kill performance, so don't try this in production.

nutanix@NTNX-xxxxxxx-A-CVM:172.x.x.10:~$ for i in `svmips` ; do ssh $i 'sudo tc qdisc add dev eth0 root netem delay 50ms 10ms'; done

This particular command uses an average latency of 50ms with ±10ms of jitter (netem's second delay value is an absolute time, not a percentage) in order to mimic a real-world WAN.
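
The netem documentation says you can take the realism further by adding a correlation to the jitter and sprinkling in packet loss. I haven't needed it for this test, but the values below are illustrative, and something like this should update the settings in place (tc qdisc change modifies the existing qdisc rather than stacking a second one):

nutanix@NTNX-xxxxxxx-A-CVM:172.x.x.10:~$ for i in `svmips` ; do ssh $i 'sudo tc qdisc change dev eth0 root netem delay 50ms 10ms 25% loss 0.1%'; done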

One of the products I'm evaluating for my client is Zerto. While we have native replication with Nutanix, it would be nice to have smaller RPOs, and Zerto also gives us something else: orchestrated failover. In any case, I was able to watch my VPG throughput in Zerto while turning netem on and off on my CVMs to simulate WAN latency. It definitely did the trick.

Zerto VPG throughput before and after enabling netem

You can see that I was pushing some serious I/O until about 12:06, when I turned on netem. Throughput tanked almost immediately, as I expected.

Now, the major flaw in this test is that I'm not just impacting network I/O when I enable netem on my CVMs; I may be impacting disk performance too. I did specifically choose eth0 for netem, so in theory local disk I/O from VM to host shouldn't be affected, since that traffic comes through eth1. Still, I can't guarantee that I'm not unintentionally harming disk I/O without further testing. Perhaps I can leverage IOmeter on a VM that's already sitting on the target cluster and see whether performance tanks. As shown above (and below), Nutanix still sees the cluster IO latency, even though I didn't touch eth1.

Nutanix cluster IO latency while netem was running


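If I wanted to be more surgical about it, the NetEM examples show how to hang netem off a classful prio qdisc and steer only filtered traffic through it. In theory, something like the sketch below on each target CVM would delay only packets headed back toward the source cluster's subnet (172.x.x.0/24 here, redacted the same way as the prompts) and leave the rest of eth0 alone. I haven't tried this on a CVM, so treat it as a starting point rather than a recipe:

nutanix@NTNX-xxxxxxx-A-CVM:192.168.4.10:~$ sudo tc qdisc add dev eth0 root handle 1: prio
nutanix@NTNX-xxxxxxx-A-CVM:192.168.4.10:~$ sudo tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 60ms
nutanix@NTNX-xxxxxxx-A-CVM:192.168.4.10:~$ sudo tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 match ip dst 172.x.x.0/24 flowid 1:3
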
Before I forget, you should definitely remember to turn off the simulated latency!

nutanix@NTNX-xxxxxxx-A-CVM:172.x.x.10:~$ for i in `svmips` ; do ssh $i 'sudo tc qdisc del dev eth0 root'; done
