Thursday, June 11, 2015

Using vRealize Operations License Groups

vRealize Operations License Groups are a very useful feature for customers with mixed license editions. For example, if you want to use both vRealize Operations Advanced and Enterprise license keys in the same vRealize Operations Manager instance, you can use License Groups to apply each key exactly where you need it. This is something you couldn't do prior to version 6 (back when the product was known as vCenter Operations Manager).

Note: As of vRealize Operations Manager 6.2, you cannot combine a Standard edition license with either an Advanced or Enterprise license in the same deployment.

So how do you do this? First, log in to vRealize Operations Manager and navigate to the Licensing section (click the little gear icon at the top of the left-hand menu).

Next, make sure your vRealize Operations license keys are applied under the License Keys tab. Make a note of the keys, as they aren't exactly easy to identify later when creating a License Group.

Next, click the License Groups tab and click the plus sign to create a new license group.

In this basic example, I'm going to apply an Enterprise key to my "VC2" vCenter instance, and an Advanced key to my "VC1" vCenter instance. You can get much more granular than this. For example, you can easily select a specific cluster, resource pool, vApp, or host if you prefer to do it that way. You can even get really complex and start using specific object properties to filter. We'll get to more of that later.

In the Add License Group wizard, give it a name and select the applicable license key.

Now we add members.

Click the down arrows next to 'Objects to always include.' The object browser isn't exactly straightforward, as the default view of Custom Groups only shows vRealize Operations groups and objects. You could certainly use these if you prefer, but in my case I want to select good old vCenter objects. Click the Custom Groups drop down and select vSphere Hosts and Clusters.

Now expand vSphere World and select the applicable vCenter server instance. This might not match your vCenter server host name. It will match the name you gave the vCenter Server instance when you configured the vCenter Server adapter in vRealize Operations. 

Once you've selected the applicable objects, click the drop down arrow on the Add button in the middle of the page and choose "Selected objects and descendants," then click Next.

Once you click Finish, you'll see your new License Group appear, along with the total number of members in the group. 

Now repeat these steps with the Advanced key for the other vCenter Server instance. 

I then deleted the default License Groups that were already there. I saw the Total Members metrics, but I didn't see anything under Licensable Usage. That number stayed at zero for both new groups. Why?

Click the License Keys tab and then click the green refresh icon next to the green plus and red X icons. You should see the license usage number update accordingly under both the License Keys and License Groups tabs. I did notice that the "Unlicensed Group" license group came back after refresh, but it's showing zero across the board, which is what I want.


This is a very basic example. One solid use case would be to use virtual machine tags and create license groups around those tags instead of statically linking licensing to an object. I see tags being used a lot in NSX as well, so I suggest taking advantage of them. Perhaps that will be a future post. 

Wednesday, May 27, 2015

vSphere 6 - Enhanced Linked Mode

In my previous posts on vSphere 6 I described how to deploy the vCenter Server Appliance and the Platform Services Controller Appliance. I also briefly outlined how to configure vCenter Single Sign-On to authenticate with Active Directory. 

One of the exciting new features of vSphere 6 is Enhanced Linked Mode for your vCenter Servers. Remember in previous versions of vSphere, the vCSA didn't support Linked Mode. In my previous life as a sysadmin, I managed three different data centers, each with its own Windows-based vCenter Server. Linked Mode was great for managing the entire environment with a single console. Unfortunately the old Linked Mode relied on an ADAM database, which I found to be a bit finicky and difficult to troubleshoot when things started to go wrong. Enhanced Linked Mode in vSphere 6 aims to make this process easier. Effectively any vCenter Server that you join to a Platform Services Controller (PSC) domain will automatically be in Enhanced Linked Mode. It's no longer an extra step since a PSC is a required component of vSphere 6.

As I stood up my lab environment with two vCSAs and two PSCs (simulating two different sites), I discovered that when I logged in with my AD account, I couldn't see the second vCSA. I assumed that since I'd joined my second site's PSC to the SSO domain and then linked my second site's vCSA to the respective PSC, I could just log in, since I'd already configured SSO to use AD. Well, I missed a step (did I mention I'm not really into reading the documentation?): the second PSC still needs to be joined to AD in order to function properly. I covered the process of doing this in the SSO post.

Don't forget that you need to reboot your PSC after joining AD. The nice thing about using the Linux appliance is that you can reboot from directly within the Web Client by simply right clicking the node under Home->System Configuration. 
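
If you'd rather skip the Web Client, you should also be able to reboot the appliance over SSH. Treat this as a hedged alternative: it assumes SSH is enabled on the PSC, and the appliance-shell commands below are from memory, so verify them against your build before relying on them.

# From the vCSA/PSC appliance shell, enable and drop into bash, then reboot
Command> shell.set --enabled True
Command> shell
# Now at a normal bash prompt on the Linux appliance
localhost:~ # shutdown -r now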

Now when I log in with my AD credentials I can see both vCSAs!


Tuesday, May 26, 2015

vSphere 6 - Configuring SSO with an Active Directory Identity Source

Configuring vSphere 6 SSO

After deployment, one thing I noticed right away was a new section under the vSphere Web Client Administration interface. You can get to this by clicking Home->Administration. The image on the left is how the Administration section looked in vSphere 5.5, while the image on the right shows how it looks in vSphere 6.

Notice a new section under Administration called Deployment. This is where you manage your PSC. I stumbled on this when I attempted to add a new SSO Identity Source. When I selected the Active Directory (Integrated Windows Authentication) radio button, the client told me that I had not yet joined vCenter Single Sign-On to a domain, and provided me with a link to get there.

Clicking this link takes you straight to the Home->Administration->System Configuration section. One thing I noticed is that even though there is a field for Organizational Unit, this is not required if you want the Active Directory computer object for your PSC to be in the default Computers container. You can leave that field blank.


You won't receive much of a confirmation message, but the task will show up in the Recent Tasks pane. You can also confirm whether or not it worked by checking AD for the PSC computer object. You'll need to completely restart the PSC after joining the domain. Rather than going over to your vSphere Client to reboot it, right-click on the PSC node name and select Reboot. A helpful warning message pops up letting you know exactly what the impact will be.


Once the reboot is done we can continue to configure vCenter Single Sign-On. Navigate to Home->Administration->Configuration (under Single Sign-On) and click the Identity Sources tab. When selecting the Active Directory (Integrated Windows Authentication) radio button, you'll see your AD domain name populated, and the "Use machine account" radio button selected. Click OK.


Simple. Now that we have an identity source, we can configure users and groups, which is also located under the Single Sign-On section of Home->Administration. You have a lot of options here, but I personally prefer to add AD admin groups or users to the built-in Administrators group. You'd do that by selecting the Groups tab, selecting Administrators, and clicking the icon in the Group Members section to edit the group membership. Use the domain drop down to select your AD domain and then search for the applicable users or groups to add. This is another great reason NOT to name the vCenter Single Sign-On domain the same as your AD domain.

vSphere 6 - Deploying the vCenter Server Appliance & the Platform Services Controller

I participated briefly in the vSphere 6 beta, but now that it's been a few months since the general availability release, it's time that I roll up my sleeves and get my hands dirty.

If you've paid any attention at all to the beta or the GA release, you'll notice that there are quite a few changes in the vSphere 6 management topology. Specifically there is a Platform Services Controller (PSC), which is now responsible for vCenter Single Sign-On, License service, Lookup Service, and VMware Certificate Authority. There is also a new deployment mechanism for the Linux-based PSC appliance as well as the Linux-based vCenter Server Appliance. This is just scratching the surface, so I hope to outline a few of the changes that I've found meaningful while deploying the latest release.

vSphere 6 Deployment
Assuming you don't want to use Windows for vCenter Server and/or the Platform Services Controller, you'll need to use the new web-based deployment tool in order to deploy these virtual appliances. This is done by downloading the VMware-VCSA-all-6.x.x-yyyyyyy.iso file from VMware's website. From there you mount the ISO (grumble grumble, Windows 7 user, grumble grumble) and install the latest VMware Client Integration Plugin. If you try to open the vcsa-setup.html file first, you'll be notified that you're missing the plugin.

It would have been really nice (and taken all of about 30 seconds and one extra line of code) for VMware to have provided a link to the plugin directly from this page so I didn't have to go hunting for it. It's inside the vcsa folder of your mounted ISO. Once the Client Integration Plugin is installed and your browser is allowed to access it, you should be all set. Oh, you'll need at least one ESXi host built already as well, which is where the appliances will be deployed. It doesn't need to be running the same version of ESXi. If you already have a management cluster or something, you can use that. Also, if you're exclusively using the vSphere Distributed Switch, you're going to need one port group with ephemeral port binding. More on this later.

Initially I found the lack of an OVA for vCSA 6 annoying. As a Windows 7 user, I don't have a way to mount an ISO natively. Sure, I can just mount the ISO on an existing VM, but then I'm pulling the ISO across the LAN, or I need to move the ISO to that VM or a datastore. It's just kind of a pain. My guess is that VMware did this so that all of the post-deployment packages could be deployed, and all the install scripts could be executed automatically. Compare this to what you'd typically have to configure manually for a Windows-based deployment, and perhaps you can appreciate the new automated appliance-based deployment. It still has some kinks, though. For example, if you fat-finger the NTP servers, the entire process will fail. It won't give you a chance to go back and correct it. It also won't clean up the half-deployed VM for you. You have to stop and delete it manually and start over again. Boo.

vCenter Server Appliance Deployment Issues
One problem I had seems to be specific to my Windows 7 64-bit laptop because it did not occur when mounting the ISO on a Windows Server 2008 R2 VM. After specifying the FQDN, username and password of my target ESXi host, then accepting the certificate warning, the process would always fail with the message "filetransfer.exe has stopped working", which was actually from Windows.

Now this could be due to a number of issues. First, I could have some security setting on my laptop that is enforced by my employer that I'm not aware of or able to override. Second, I used the portable version of WinCDEmu to mount the ISO, which may not have been the best choice. Initially I thought it was a browser issue, but it failed on IE, Firefox, and Chrome at the exact same spot. I gave up and just mounted the ISO using the vSphere Client to an existing Windows Server 2008 R2 VM and everything worked great from there.

vSphere 6 Deployment Options - External Platform Services Controller
After accepting the EULA, pay close attention to the "Before proceeding" text at the bottom of step 2. If you exclusively use a vSphere Distributed Switch (VDS), you're going to need (at least temporarily) a port group with ephemeral binding. More on this later.

Starting on step 3 of the VMware vCenter Server Appliance Deployment wizard, you'll have a chance to input your appliance name (aka VM name), along with the password. Decide now if this is going to be a vCenter Server Appliance with an embedded PSC, or if you're going to spin up an external PSC first (my preference). The reason for this is that you'll probably want your VM name to represent the role of the appliance. More on this in a minute.

Step 4 gives you a brief explanation of your Platform Services Controller (PSC) deployment type. Unless you are a very small shop with no chance of ever having or needing a second vCenter Server, I would recommend deploying an external PSC. The reason is that an embedded PSC cannot be converted to an external PSC later on, and there are a lot of potential reasons why you may deploy a second vCenter Server down the road (SRM and Horizon to name two common ones). There is a lot of good information in the vSphere 6 Installation and Setup Guide, as well as the Upgrade Guide. I like to plan for the future, and in this lab environment I really want to test PSC replication and Enhanced Linked Mode, so I'm going with an external PSC.

Keep in mind if this is a brand new environment, and you're choosing to use an external PSC, you need to deploy the PSC FIRST, before vCenter Server (hence my VM name recommendation above).

Now in my environment, I already have another external PSC up and running. This is going to be my second PSC, so I will be joining an existing SSO domain. If this is your first SSO domain, you'd create new. If you are creating new please heed the warning at the bottom of the screen that states "Before proceeding, make sure that the vCenter Single Sign-On domain name used is different than your Active Directory domain name." That should be really obvious, but I've seen it happen. It's way too confusing telling the difference between AD and vCenter SSO when it comes time to configure users, groups, and permissions. That warning doesn't apply if you're joining an existing vCenter SSO domain. Make sure you note your SSO administrator password as well. You're going to need this later.

Step 6 left me a little confused because as of late May 2015, there is very little documentation or guidance on multi-site SSO for vSphere 6. From what I can tell, there isn't much of a change to multi-site SSO guidance in vSphere 6 versus vSphere 5.5. That said, there is plenty of discussion on how to configure highly available PSCs and vCenter servers at each site. My understanding is that the biggest benefit to creating a new site is to increase SSO performance for authentication-related services at each site. In other words, the authentication data is replicated. If all you're looking to do is add availability for the PSC at the same site, you'd simply join the existing site. I want to emulate two different sites, so I'm creating a new one.

The next step allows you to select the appliance size, but from what I've seen so far, this only applies to the vCenter Server Appliance, and not the PSC. The Platform Services Controller VM needs 2 vCPUs, 2GB of memory, and 30GB of disk space.

The rest of the wizard is straightforward. You select your datastore and network settings. However, one thing I noticed is that since I'm using a distributed virtual switch with distributed virtual port groups, I was not allowed to select them in the "choose a network" drop down. What gives? If you click the help icon next to this drop down you'll notice that "non-ephemeral distributed virtual port groups are not supported, and [therefore] not shown in the dropdown list." OK, I guess I need to create a new distributed virtual port group with ephemeral port binding. I didn't run into this the first time around because my other cluster is using standard virtual switches. Unfortunately a simple addition of a distributed virtual port group opened up a huge can of worms for me. As soon as I added it, my vCSA's vpxd process began starting and crashing over and over again. After checking the logs, I stumbled on this KB article. Sadly, the fix listed in the KB only applies to vCenter Servers using SQL. Fortunately this guy had a similar issue and graciously provided the postgres DB syntax. I love the VMware Communities!

vSphere 6 Deployment Options - vCenter Server Appliance
If you're deploying an external PSC (as outlined here), your next step is to repeat the same process, except select "Install vCenter Server (Requires External Platform Services Controller)" on step 4 of the vCSA Deployment Wizard.
Step 5 will then ask you for the PSC host name and SSO password.
Unlike the PSC appliance, Step 6 for the vCenter Server allows you to choose the size, including an option for a tiny deployment. Keep in mind a "small" vCSA deployment is now 4 vCPUs and 16GB of RAM! That's 2x the size of my production vCenter server back in the 4.x days (with an external database).
Just like with the PSC deployment, Step 7 allows you to select the datastore and thin provisioning.
Step 8 allows you to configure your database. I prefer to use the embedded vPostgres database. Your only other option is Oracle. SQL is not supported with the vCSA.
Step 9 allows you to configure the vCSA's network settings.

Tuesday, January 20, 2015

Manually Deactivating a Nutanix Protection Domain

Nutanix Protection Domains are either Active or Inactive, depending on whether the cluster is the protected side (e.g. a Prod site sending data) or the recovery side (e.g. a DR site receiving data). I was poking around in my DR cluster and happened upon my inactive Protection Domain. I noticed that when I clicked on the PD name, there was an Activate link in the bottom right corner.


I foolishly decided to click the Activate link, which left me with an active protection domain at both my sites. Replication from Prod to DR immediately stopped working. I expected a Deactivate link to appear in Prism, but it did not. OK, I'm sure there is a way to issue a deactivate command via ncli, right? Well, not exactly.

ncli> protection-domain
add                              create                           remove                           rm                               list                             ls
activate                         rollback-pd                      migrate                          list-snapshots                   ls-snaps                         add-minutely-schedule
add-hourly-schedule              add-daily-schedule               add-weekly-schedule              add-monthly-schedule             remove-from-schedules            clear-schedules
ls-schedules                     set-retention-policy             clear-retention-policy           ls-pending-one-time-snapshots    add-one-time-snapshot            create-one-time-snapshot
rm-one-time-schedules            restore-snapshot                 protect                          unprotect                        rm-snap                          rm-snapshot
list-replication-status          ls-repl-status                   retain-snap                      retain-snapshot                  pause-repl                       pause-replication
resume-repl                      resume-replication               abort-repl                       abort-replication                ls-pending-actions               metro-avail-enable
metro-avail-disable              promote-to-active

There is an 'activate' command, but no 'deactivate' command. So how do I get rid of this thing? I searched high and low, including the Nutanix Bible, but I couldn't find a way to do it. A simple Nutanix support request later, and I was given this hidden command. 

ncli pd deactivate_and_destroy_vms name=MSP-PD

Think about what you are doing before you issue this command. There is a reason it's not shown in ncli. The "_and_destroy_vms" portion of that command should give you a clue as to the potential impact. USE AT YOUR OWN RISK

As soon as I issued this command, I got a message back that said the PD was marked for removal. A few seconds later, my protection domain was again shown as inactive in Prism, and the replication jobs from my Production site started working again.

I should note that even though the command seems to indicate that it's going to destroy data, all of my snapshots at the remote site were still intact, so subsequent replications were not full replications. 
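
If you want a second opinion outside of Prism, you should also be able to check the protection domain's state from ncli. A hedged sketch follows; I believe the list command accepts a name filter, but if not, a plain 'protection-domain ls' will show all PDs along with their active/inactive status.

ncli> protection-domain ls name=MSP-PD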

Thursday, January 8, 2015

Simulating WAN Latency with Nutanix

File this one under "unsupported"!

One of our clients asked us to run a bunch of different disaster recovery scenarios between Nutanix clusters. At least two of their existing clusters are sitting across a WAN, with average latency typical of the geographic distance. My company's WAN is not nearly as robust as the client's, so I can't really impact our remote office by pushing too much data.

I have two separate Nutanix clusters in my lab. Both are plugged into the same 10GbE switch, so how do I go about introducing some latency? There are probably Cisco tools or commands I could use, but I'm not a network guy and I don't want to bother my network guy.

After some Googling I discovered the NetEM (Network Emulation) tool for Linux. Lo and behold, the queuing discipline kernel module has been integrated since 2.6.8. Conveniently enough, each and every Nutanix Controller VM (CVM) runs Linux. Perfect.

I logged into each CVM at the target site and issued the following command.

nutanix@NTNX-xxxxxxx-A-CVM:192.168.4.10:~$ sudo tc qdisc add dev eth0 root netem delay 60ms

Why the target site? Well, think about this. I don't have a separate network or a separate CVM dedicated to replication. Each and every bit of VM IO is going through my CVMs, so I don't want to kill my local read and write performance by slowing down my CVMs at the source side. I only want to simulate latency on my target cluster.
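
If you want to confirm the delay is actually in place on a given CVM, you can list the queuing discipline on the interface. The output shown below is roughly what to expect; the exact format varies by kernel version.

nutanix@NTNX-xxxxxxx-A-CVM:192.168.4.10:~$ sudo tc qdisc show dev eth0
qdisc netem 8001: root refcnt 2 limit 1000 delay 60.0ms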

It certainly had an immediate impact.


Nutanix Cluster IO Latency after starting netem on all 4 CVMs

Let's look at a simple ping test next.
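
The test itself is nothing fancy: from a CVM on the source cluster, ping a CVM on the target cluster. The address below is the target-site CVM from the earlier tc command; substitute your own.

nutanix@NTNX-xxxxxxx-A-CVM:~$ ping -c 5 192.168.4.10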

You can see right away that when pinging my target cluster CVM from my source cluster CVM, I'm seeing 60ms of latency.

When I issued the sudo command to add the netem rule, I noticed it didn't give me any indication that it was working. It also didn't give me an easy way to kill it. After some more reading on the aforementioned NetEM site, I simply logged back into each CVM and issued the following command.

nutanix@NTNX-xxxxxxxx-A-CVM:172.x.x.10:~$ sudo tc qdisc del dev eth0 root

I still had my other ssh session open and actively pinging the target CVM. Right away the latency values dropped to normal.

You'll have to take my word for it that the ping was initiated from another CVM. My Nutanix cluster IPs both end in .50. 

By the way, if you want to avoid logging into every CVM and running this command, you can do it using a simple one-liner from any CVM in your cluster. Again, this is a really good way to kill performance, so don't try this in production.

nutanix@NTNX-xxxxxxx-A-CVM:172.x.x.10:~$ for i in `svmips` ; do ssh $i 'sudo tc qdisc add dev eth0 root netem delay 50ms 10ms'; done

This particular command uses an average latency of 50ms, but adds a variation of +/- 10ms in order to mimic a real-world WAN.
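
If you want to get closer to a genuinely lousy WAN, netem can also drop packets. Below is a hedged variation on the same one-liner that adds 0.1% packet loss; use 'change' if the netem qdisc from the previous command is already in place, or 'add' if you're starting fresh. Same disclaimer applies: don't do this anywhere near production.

nutanix@NTNX-xxxxxxx-A-CVM:172.x.x.10:~$ for i in `svmips` ; do ssh $i 'sudo tc qdisc change dev eth0 root netem delay 50ms 10ms loss 0.1%'; done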

One of the products I'm evaluating for my client is Zerto. While we have native replication with Nutanix, it would be nice to have smaller RPOs. Zerto also gives us something else: orchestrated failover. In any case, I was able to watch my VPG throughput in Zerto while turning netem on and off on my CVMs to simulate WAN latency. It definitely did the trick.

You can see that I was pushing some serious IO until about 12:06 when I turned on netem. Things tanked really fast, as I expected. Now the major flaw in this test is that I'm not just impacting network I/O when I enable netem on my CVMs. I'm also impacting disk performance. I did specifically choose eth0 for netem, so theoretically all disk I/O happening locally from VM to host wouldn't be impacted, as that access comes through eth1. Still, I can't necessarily guarantee that I'm not unintentionally harming disk I/O without further testing. Perhaps I can leverage Iometer on a VM that's already sitting on the target cluster and see if performance tanks. As shown above (and below), Nutanix still sees the cluster IO latency, even if I didn't touch eth1.
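
If I revisit this, one way to avoid penalizing all eth0 traffic would be to hang netem off a prio qdisc and use a filter so that only traffic destined for the remote cluster's network gets delayed. This is a rough sketch based on the standard NetEM examples, not something I've tested on a CVM, and the 172.16.20.0/24 subnet is just a placeholder for the remote site's CVM network.

# Classful prio qdisc at the root; band 3 will carry the delayed traffic
sudo tc qdisc add dev eth0 root handle 1: prio
# Attach netem only to band 3
sudo tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 50ms 10ms
# Steer packets destined for the remote cluster's subnet (placeholder) into band 3
sudo tc filter add dev eth0 parent 1:0 protocol ip prio 3 u32 match ip dst 172.16.20.0/24 flowid 1:3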

Before I forget, you should definitely remember to turn off the simulated latency!

nutanix@NTNX-xxxxxxx-A-CVM:172.x.x.10:~$ for i in `svmips` ; do ssh $i 'sudo tc qdisc del dev eth0 root'; done

Wednesday, January 7, 2015

Storage vMotion Impact on Nutanix

In my previous post, I created a separate Nutanix container to house VMs that I choose to protect with SRM using the SRA for Nutanix. Since a container is a separate datastore, I had to use Storage vMotion to get the VMs from the existing container to the new one. I wasn't prepared for the impact this had on my Nutanix CVMs and SSDs.

The majority of the VMs being moved are on hosts A and B.

Since, regardless of container, all the data is sitting on the same disks anyway, it's too bad that this all has to be funneled back through the CVMs again. I assumed that with VAAI this would happen quickly, but it doesn't appear Nutanix has implemented that capability yet. My fingers are crossed that a future release of NOS will address this issue.

Fun with Nutanix SRA

I was fortunate to get my hands on a pair of Nutanix blocks so that I can test Protection Domains with VMware Site Recovery Manager. Never one to RTFM, I went head first into installing my SRM servers and the Nutanix SRA shortly after configuring a Nutanix remote site and Protection Domain. It wasn't until I started getting error messages in SRM that I finally decided to consult the documentation.

I was a bit annoyed that I had to go into ncli in order to protect the vstore, when I had already created a protection domain and remote site with all the container info. So as soon as I initiated the ncli vstore protect command, I realized that I really didn't want to have every VM in that container replicating. So I can just issue an ncli vstore unprotect command, right? Wrong.

ncli> vstore unprotect name=MSPNFS
Error: Error while unprotecting vStore having id 9868120: MarkPdForRemoval command not inited.

What the heck does that mean? Command not 'inited'? Google came up way empty on that one.

Turns out you can simply whack the auto-created protection domain using the following ncli command:

ncli> protection-domain remove name=MSPNFS_1420656211839
Protection domain MSPNFS_1420656211839 has been successfully marked for removal

A few seconds later, my auto-created protection domain disappeared from Prism, and ncli vstore list showed my container was no longer protected.

Now I need to configure a separate container for this so I can move VMs in and out as needed for SRM. 
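
The rough shape of that from ncli would be something like the two lines below. Treat this as a hedged sketch: the container and storage pool names are placeholders, and the exact parameter names may differ by NOS version, so check the built-in ncli help before running it.

ncli> container create name=MSPNFS-SRM sp-name=default-storage-pool
ncli> vstore protect name=MSPNFS-SRM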

More to come.