VSAN 6.5 to 6.6 Upgrade Issues with CLOMD Liveness

Before attempting any upgrade in a production environment, I always try to test the process and functionality in a lab first. With this in mind, I wanted to test the upgrade from VSAN 6.5 to 6.6 in my home lab, and unfortunately I didn’t have a whole lot of success initially. I’ve now fixed all the issues, and in case anyone else hits the same problems, I’d like the resolution to be readily available. I haven’t had the time to pin down the root cause, but I have resolved the issues.

First, let me be clear: this is on UNSUPPORTED hardware. These issues may never occur in a fully supported and compliant production environment, and I have not seen them in one. However, we all tend to run our labs on unsupported hardware, so I’m sure I won’t be the only one who comes across these issues, and if you do, the resolution is pretty simple. I have seen the same issues three times in three separate (unsupported) environments.

The upgrade was from VSAN 6.5 to VSAN 6.6. VSAN isn’t a stand-alone product; it is built into vSphere, so the upgrade is as simple as upgrading ESXi. I was running ESXi 6.5.0 (Build 4887370) and upgraded to ESXi 6.5.0 (Build 5310538).

It has been a long time (and I mean a LONG time) since I have seen an ESXi purple screen, but soon after upgrading my environment to ESXi 6.5 (5310538) my hosts started purple screening. I had to take a screen shot because this is such a rare sight. It only happened once, and since the fixes below were applied it has never happened again.

The VSAN upgrade process is very straightforward to perform:

  • Upgrade vCenter Server
  • Upgrade ESXi hosts
  • Upgrade the disk format version

Straight after the upgrade I started receiving vMotion alerts and my VMs wouldn’t migrate between hosts. There didn’t appear to be any configuration issues with vMotion and it was working perfectly fine before the upgrade. I tested the connectivity using a vmkping between hosts on the vMotion vmkernel IP and it failed. There was no network connectivity between hosts on the vMotion vmkernel port!
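
As a rough illustration of the connectivity test above, here is a minimal Python sketch that builds one vmkping command per peer host. The interface name vmk1 and the IP addresses are hypothetical placeholders; the -I option tells vmkping which vmkernel interface to send from. The commands are only printed here, so you would still run them from an SSH session on the ESXi host.

```python
def vmkping_commands(interface, peer_ips):
    """Return one vmkping command line per peer vMotion IP."""
    # -I selects the outgoing vmkernel interface, so the ping exercises
    # the vMotion network rather than the management network.
    return ["vmkping -I {} {}".format(interface, ip) for ip in peer_ips]


if __name__ == "__main__":
    # Hypothetical values: substitute your own vMotion vmk and peer IPs.
    for cmd in vmkping_commands("vmk1", ["192.168.10.12", "192.168.10.13"]):
        print(cmd)
```

If every peer fails from the vMotion interface while the management network still responds, that points at the vmkernel port itself rather than the physical network.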

The vMotion fix:
I found that simply deleting the existing vMotion vmkernel port and recreating it with exactly the same configuration fixed all the issues. I had to do this on every host in the cluster, and then vMotion started working again.

CLOMD Liveness

This brings me to the next issue, which was a lot more critical: CLOMD Liveness. After I resolved the vMotion alerts, I ran a quick health check on VSAN and found that my hosts were now reporting a “CLOMD Liveness” issue. This is concerning because CLOMD (Cluster Level Object Manager Daemon) is a key component of VSAN. CLOMD runs on every ESXi host in a VSAN cluster and is responsible for creating new objects, coordinating data moves and evacuations between hosts, and repairing existing VSAN objects. To put it simply, it is a critical component for creating any new objects on VSAN.

If you want to test this out (in a test environment), SSH to your ESXi hosts, stop the CLOMD daemon by running “/etc/init.d/clomd stop”, and then try to create new objects or run the proactive VSAN VM creation test and see what happens. You will get the error “Cannot complete file creation operation”.

And the output from the proactive VSAN test is “Failed to create object. A CLOM is not attached. This could indicate that the clomd daemon is not running”.

If CLOMD isn’t running, your existing data isn’t immediately at risk; it just means that new objects can’t be created. Because CLOMD also handles repairs of existing objects, I would still treat getting it running again as critical.

The CLOMD Liveness issue can occur for a number of reasons. The VMware KB article is here: https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2109873

To check whether the CLOMD service/daemon is running, execute the following command on each host:

/etc/init.d/clomd status

The results showed that the CLOMD service was not running and even after re-starting the service, it would stop running a short time later.
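
If you have several hosts to check, the status output can be collected and summarised. The Python sketch below assumes the status command prints a line containing “is running” or “is not running”; that wording, the host names and the sample outputs are all assumptions on my part, so adjust the match to whatever your hosts actually print.

```python
def clomd_is_running(status_output):
    """Interpret the text printed by '/etc/init.d/clomd status'."""
    text = status_output.lower()
    # Check the negative form first: a plain search for "running"
    # would also match "not running".
    if "not running" in text:
        return False
    return "running" in text


def hosts_needing_attention(status_by_host):
    """Return the sorted host names whose CLOMD daemon is not running."""
    return sorted(host for host, output in status_by_host.items()
                  if not clomd_is_running(output))


if __name__ == "__main__":
    # Hypothetical sample data, pasted in from each host's SSH session.
    sample = {
        "esxi01": "clomd is running",
        "esxi02": "clomd is not running",
    }
    print(hosts_needing_attention(sample))
```

Because the daemon stopped again a short time after each restart in my case, it is worth repeating the check a few minutes later rather than trusting a single result.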

The VSAN CLOMD Liveness fix:
Learning from the vMotion vmkernel issue, I immediately tried deleting and re-creating the VSAN vmkernel on each host, and this fixed the issue. However, this is a little more difficult than the vMotion process because deleting the VSAN vmkernel instantly partitions that host from the cluster, so you need to be careful how you do it.

Place the host in Maintenance Mode first! We aren’t going to lose any data so you don’t need to evacuate the data, however I would recommend you at least select “Ensure data accessibility from other hosts”. Selecting “No Data Migration” is generally only suggested if you are shutting down all nodes in the VSAN cluster, or possibly a non-intrusive action like a quick reboot.

Once the host is in Maintenance Mode you can now delete the existing vmkernel and re-create a new one with the same settings. I would then reboot the host for good measure. Once the host is back up, you can exit Maintenance Mode and then move on to the next host.
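
I used the vSphere client for this, but if you prefer the command line, the Maintenance Mode choices map onto the --vsanmode option of esxcli. The Python sketch below only builds the command strings; the short keys in the dictionary are names I made up for illustration, and you should confirm the option values against your own ESXi build before relying on them.

```python
# Illustrative key -> value for the esxcli --vsanmode option.
VSAN_MODES = {
    "ensure_accessibility": "ensureObjectAccessibility",  # "Ensure data accessibility from other hosts"
    "full_evacuation": "evacuateAllData",                 # full data evacuation
    "no_migration": "noAction",                           # "No Data Migration"
}


def enter_maintenance_mode(choice="ensure_accessibility"):
    """Build the esxcli command that puts the host into Maintenance Mode."""
    return ("esxcli system maintenanceMode set --enable true "
            "--vsanmode " + VSAN_MODES[choice])


def exit_maintenance_mode():
    """Build the esxcli command that takes the host out of Maintenance Mode."""
    return "esxcli system maintenanceMode set --enable false"


if __name__ == "__main__":
    print(enter_maintenance_mode())
    print(exit_maintenance_mode())
```

Defaulting to the “ensure accessibility” mode mirrors the recommendation above: no full evacuation, but the remaining hosts can still serve every object while this host is partitioned.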

Again, I stress that I have only seen this issue on un-supported hardware.

My VSAN Upgrade Process

  1. Upgrade vCenter
  2. Upgrade each ESXi server
  3. Upgrade the disk format version
  4. Run a VSAN health check!
  5. If you have a CLOMD issue then for each ESXi host in the VSAN Cluster
    1. Place a host in Maintenance Mode
    2. Delete and re-create the vMotion vmkernel
    3. Delete and re-create the VSAN vmkernel
    4. Reboot the ESXi host
    5. Move on to the next host
  6. Run a VSAN health check again

SSL Certificate Tool (CertGenVVD)

Generating signed SSL certificates for all of the various VMware products is always one of the parts of a new implementation that I don’t look forward to. It’s something I’ve done many times in my years at VMware, but I still avoid doing it if possible. Not surprisingly, a lot of customers reach out to VMware for support when renewing certificates too. The process you have to go through just to generate the certificates is time consuming and prone to error.

  • First you have to write out the config files for all of the certificates.
  • Then generate a .csr file for each of those certificates.
  • Submit the .csr and get a CA signed SSL certificate back.
  • Download the root and intermediary certificates.
  • Create SSL chains with the root, intermediary and SSL certificate. This is where one of the most common mistakes occurs: mixing up the order of the certificates in the chain.
  • Using OpenSSL, create the .pem, .p7b or .pfx files that the specific product you’re implementing requires.
  • And then you can start to install the SSL Certificates for each product.
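
To make the chain ordering step concrete, here is a minimal Python sketch that joins PEM certificates in the order most products expect: the server certificate first, then any intermediates, then the root last. The function name is my own, and some products do want a different layout, so treat this as an illustration of the common case and check the product documentation.

```python
def build_chain(server_pem, intermediate_pems, root_pem):
    """Concatenate PEM certificates leaf first, root last."""
    parts = [server_pem] + list(intermediate_pems) + [root_pem]
    # Normalise surrounding whitespace so every certificate ends with
    # exactly one newline before the next BEGIN line.
    return "".join(part.strip() + "\n" for part in parts)


def fake_pem(label):
    """Build a dummy PEM-shaped block, for demonstration only."""
    return ("-----BEGIN CERTIFICATE-----\n" + label +
            "\n-----END CERTIFICATE-----\n")


if __name__ == "__main__":
    chain = build_chain(fake_pem("SERVER"), [fake_pem("ISSUING")],
                        fake_pem("ROOT"))
    print(chain)
```

Keeping the ordering in one small function means the mistake can only be made in one place, which is most of the appeal of letting a tool do this for you.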

If you haven’t done this process hundreds of times, it’s quite a time consuming task, and if you get it wrong it takes a lot of time to resolve the issues. This is just one of those things that I don’t think anyone really enjoys doing. Until now, that is. I’ve spent the last week playing with the VVD CertGen tool and I’ve actually enjoyed my time doing it. So much so that I’ve even written a PowerShell script to make the process even easier, and I’d like to share my work with the community.

First of all, I am no PowerShell expert and of course I can’t take responsibility for anything that happens with this script. The VMware CertGen tool does all the hard work, my script simply takes the input from a .csv file and then creates all of the config files which are then input into the CertGen tool. I’ve wrapped it all up into a simple process that anyone can use. The CertGen tool outputs CA Signed SSL Certificates for all of the products and automatically creates the various different certificate formats that each product requires. All that is left to do is upload the SSL certificate to the product.

CertGen Tool and Scripts

The first thing you need to do is review the VMware KB article KB2146215 on the CertGen Tool, which provides the instructions for using it. I will cover the simple steps here; the KB article details the pre-requisites and configuration of the CA servers, the supported platforms and product compatibility, and it also explains use-cases beyond what I’ll describe. This blog article covers the use of my script to automatically generate the configuration files from a .csv, plus a simplified set of instructions for using the CertGen tool.

At the bottom of the KB Article, in the attachments section, download the CertGenVVD zip file.

Extract the zip file to a location that will be easy to access via command line. This can be simply c:\Temp. The zip file contains the “ConfigFiles” folder, a “default.txt” file and the “CertGenVVD-3.0.ps1” script file.

Open the “ConfigFiles” folder and delete all of the existing config files, or simply delete the entire folder; the script will re-create it anyway. Normally you would use these files to manually update the configuration details for each of your products. We don’t need to do this because we will use a csv file and then build all of these files using the script. You can also delete the “default.txt” file as we won’t need it.

Download my Certificate Config Tool, which includes the csv configuration file “CertConfig.csv” and the “CertConfig.ps1” script. Extract this zip file to the same location as the CertGen Tool. You should now have “CertGenVVD-3.0.ps1”, “CertConfig.csv” and “CertConfig.ps1” together in the same folder.

I have offered the above instructions so that you can download the most up to date version of the CertGenVVD tool and use it in conjunction with my script. If you would rather take a simpler approach, you can download the pre-configured Cert Tool zip file, which contains my configuration scripts plus the CertGenVVD-3.0.ps1 script in a pre-configured directory. Just download the zip file and extract it to a directory like C:\Temp.

Cert Tool Package Download

Creating the SSL Certificates

I first created this spreadsheet to be used with the VMware Validated Design (VVD) Configuration Workbook and the values are linked to the configuration cells within the VVD workbook. When using the VVD Deployment Tool the certificate configuration is entirely automated from generation of the configuration files and all the way to implementing the certificates for each of the products. I have simply exported the spreadsheet as a csv file and shared it as-is so that it can be more widely used outside of the VVD process.

Update the Cert Config csv

The first step is to update the values within the csv file. I have pre-populated the configuration details that I used in a test lab so that you can see how it works.

  • Every row with a “Name” on it relates to an individual certificate that will be created
  • If the DNS1 column contains an “n/a” then the certificate for that row will be skipped. I have included certificates for a number of fake hosts in the configuration csv that you can leave as n/a or delete the row if you don’t need them.
  • Some products require additional SANs (Subject Alternative Names), so each additional DNS column adds another SAN to the certificate. If you don’t require additional names, leave the cells blank.
  • The domain name needs to be populated because the PowerShell script uses the short DNS name separately. The script will combine the short DNS and Domain Name to create the FQDN.
  • Some products require the IP address. You can populate that here or leave it blank for the products that you only want to have a DNS record and not locked to an IP address.
  • The FileName column is the name of the configuration file that gets created. The name and folder structure of the Signed Certificates is created by the CertGenVVD Tool and is based on the Common Name inside the certificate (the FQDN).
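
The rules in the list above can be sketched in a few lines of Python. This is not my PowerShell script, just an illustration of the same logic; the column headers Name, DNS1, DNS2 and Domain and the sample rows are assumptions, so match them to the headers in your copy of the csv. The sketch also skips rows with an empty DNS1, which is my own assumption rather than documented behaviour.

```python
import csv
import io


def certificates_from_csv(csv_text):
    """Yield (name, fqdn, extra_sans) for each row that gets a certificate."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = (row.get("Name") or "").strip()
        dns1 = (row.get("DNS1") or "").strip()
        # Rows without a name, or with DNS1 set to "n/a", are skipped.
        if not name or not dns1 or dns1.lower() == "n/a":
            continue
        # The short DNS name and the domain are combined into the FQDN.
        fqdn = "{}.{}".format(dns1, row["Domain"].strip())
        # Every other non-empty DNS column becomes an additional SAN.
        extra_sans = [value.strip() for key, value in row.items()
                      if key and key.startswith("DNS") and key != "DNS1"
                      and value and value.strip()]
        yield name, fqdn, extra_sans


SAMPLE = """Name,DNS1,DNS2,Domain
vCenter,vc01,,lab.local
Spare host,n/a,,lab.local
vRealize,vra01,vra-lb,lab.local
"""

if __name__ == "__main__":
    for entry in certificates_from_csv(SAMPLE):
        print(entry)
```

Running a quick pass like this over the csv before generating anything is a cheap way to spot a missing domain or a stray “n/a”.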

Once the csv file is complete save it with the same filename “CertConfig.csv” in the same directory as the “CertConfig.ps1” file. The script expects this file to be in the same folder as the script, as does the CertGenVVD script.

Prepare the Microsoft CA Server

To use a Microsoft Certificate Authority Server you must ensure that the server meets the pre-requisites that the CertGenVVD script requires. This is fairly simple to do if you have administrator rights to the CA.

As part of the Certificate Authority services, you must ensure that the following additional services are installed and configured:

  • Certificate Authority Web Enrolment
  • Certificate Authority Web Service

You will also need a Certificate Template that is used to sign the certificates. Open your CA server settings, expand the folder structure, right click on “Certificate Templates” and select “Manage”. Right click the “Web Server” template and select “Duplicate Template”. I created a VMware specific template that includes the following configuration.

  • Template Name – VMware.
  • Compatibility of Windows Server 2003 and upwards.
  • In the Subject Name tab, make sure “Supply in the request” is selected.
  • In the Extensions tab.
    • Delete all the application policies.
    • In Key Usage select “Signature is proof of origin (nonrepudiation)”.

Close the Certificate Templates Console and add the new VMware Certificate Template to the CA by right clicking on the “Certificate Templates” folder, selecting “New” and then selecting “Certificate Template to Issue”. Find the “VMware” template and click OK.

Prepare the Operating System

On the Windows Operating System that you intend to execute the scripts from, you will need to install OpenSSL and Java. Without these installed the CertGenVVD script will not work.

You should download the most up to date versions online, however for ease of use I am using the following versions that are bundled with the VVD Deployment Tool.

Win32 OpenSSL
Java 8u60

Download and install OpenSSL and Java. Once these are installed you will need to set your environment PATHs to include these products. To do this, right click on your computer, go to “Properties” and then “Advanced System Settings”. In the “Advanced” tab click on “Environment Variables”.

Create a new System Variable called JAVA_HOME and enter the path to the Java application folder.

Scroll down through the “System Variables” and find “Path”. Edit the Path variable and add the OpenSSL and Java paths to the end of the variable. Use a semicolon “;” as the separator.
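
A quick way to confirm the environment is set up before running anything is to check that both tools resolve on the PATH. This is a small Python sketch of that check (you would need Python installed to run it); the command names openssl and java are assumed to be the executable names the installers provide.

```python
import os
import shutil


def check_prerequisites(tools=("openssl", "java")):
    """Report where each required tool resolves on PATH (None if missing)."""
    report = {tool: shutil.which(tool) for tool in tools}
    # JAVA_HOME is reported as-is; None means the variable is not set.
    report["JAVA_HOME"] = os.environ.get("JAVA_HOME")
    return report


if __name__ == "__main__":
    for name, value in check_prerequisites().items():
        print("{:10} {}".format(name, value or "NOT FOUND"))
```

Remember to open a new command prompt after editing the environment variables, because windows that were already open keep the old values.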

Execute the CertConfig Script

  1. Change Directory to the location of the CertConfig.ps1 script. In my case this is C:\Temp\CertTool
  2. Execute the “CertConfig.ps1” script
  3. Answer the default configuration questions:
    1. Organisation
    2. OU
    3. Location
    4. State
    5. Country
    6. Key Size (Default is set to 2048)

That’s it! You will now see a new folder called “ConfigFiles” within the Cert Tool directory that has been fully populated with the configuration files for each of your certificates.
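
To give a feel for what such a configuration file carries, here is a Python sketch that renders a generic OpenSSL request config from the answers above plus one row of the csv. To be clear, this is a plain OpenSSL layout that I am using for illustration, not necessarily the exact format that CertConfig.ps1 writes or that CertGenVVD expects, and the host names and values in the example are made up.

```python
def render_openssl_config(fqdn, org, ou, locality, state, country,
                          key_size=2048, sans=(), ip=None):
    """Return the text of a generic OpenSSL request config for one host."""
    # The FQDN goes in first, followed by any additional SANs and the IP.
    alt_names = ["DNS:" + name for name in (fqdn,) + tuple(sans)]
    if ip:
        alt_names.append("IP:" + ip)
    lines = [
        "[ req ]",
        "default_bits = {}".format(key_size),
        "prompt = no",
        "distinguished_name = req_distinguished_name",
        "req_extensions = v3_req",
        "",
        "[ req_distinguished_name ]",
        "countryName = " + country,
        "stateOrProvinceName = " + state,
        "localityName = " + locality,
        "organizationName = " + org,
        "organizationalUnitName = " + ou,
        "commonName = " + fqdn,
        "",
        "[ v3_req ]",
        "subjectAltName = " + ", ".join(alt_names),
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_openssl_config("vc01.lab.local", "Lab", "IT", "Sydney",
                                "NSW", "AU", sans=["vc01"], ip="10.0.0.5"))
```

The point is simply that every certificate needs the same handful of organisation fields plus its own names, which is why a csv and a few prompts are enough to generate all of the files.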

Execute the CertGenVVD Script

  1. Set the execution policy to remote signed with the following command.
    Set-ExecutionPolicy RemoteSigned
  2. Do a test run of the CertGenVVD script by first running it with the -validate parameter. This checks that everything is configured correctly and ready to issue the CA signed certificates.
    ./CertGenVVD-3.0.ps1 -validate
  3. Execute the “CertGenVVD-3.0.ps1” script with the required parameters (as defined in the KB article KB2146215).
    ./CertGenVVD-3.0.ps1 -MSCASigned -attrib "CertificateTemplate:VMware" -config "labrat.local\labrat-CA" -username labrat\Administrator -password VMware1!

The -attrib parameter references the CA server’s Certificate Template that will be used to sign these certificates. You created this when preparing the CA Server.

The -config parameter is the name of your CA Server.

You will be asked to enter a password for the p12/pem certificates. This is required.

It will only take a minute and the script will do all the rest of the work. When the script is finished you will be presented with a list of the certificates that were generated, which will be located in a new directory called “SignedByMSCACerts”.

SuperMicro VSAN HCIBench

After spending a lot of time and money building up my home lab environment, the first thing I wanted to do was test it out. I wanted to know what sort of performance I would get from this little VSAN lab. In my haste to get my hardware I opted for an NVMe M.2 SSD that I expected to perform well, though it was never going to break any records. It was available at the time and at the right price, so I bought it. Now that my lab is built, I really want to know how it actually performs and whether my eagerness paid off or will come back to bite me. Regardless of the hardware, this is a home lab configuration built on the SuperMicro E200-8D platform with an all flash VSAN. How good can it be?

Lab Specs

Here are my hardware details. I have 3x SuperMicro servers in a VSAN cluster, each running the same hardware, connected via a 10Gb network.

Product          Details
SuperMicro       E200-8D SYS-E200-8D
CPU              Intel XEON-D 1528 1.9GHz (6 core)
RAM              64GB ECC UDIMM RAM (4 x 16GB)
Capacity Disk    1TB 2.5″ SanDisk X400
Cache Disk       128GB NVMe M.2 SanDisk X400
Network          10GBase-T with 9000MTU
ESXi             ESXi Version 6.5.0 (4887370)

Lab Test with HCIBench

Screen Shot 2017-04-29 at 8.16.19 AM

VMware Flings publish an awesome little appliance called HCIBench, which is a Hyper-Converged Infrastructure Benchmarking tool. You can download it from the VMware Flings website. This is a very simple tool that makes performance testing of a HCI POC or home lab an extremely simple task. Run it in your home lab and let me know what you get. I’d like to get some comparisons on other home lab environments.

I won’t go into much detail around the install process because it is very simple and the Install Instructions are very clear and well written. The gist of the install goes like this:

  1. Download and import OVA.
  2. Enter the network configuration.
  3. Log into the website at http://ipaddress:8080.
  4. Username is “root” and the password was set up in the OVA deployment.
  5. Enter all of your vCenter details.
  6. Press the button to download Vdbench and then upload it. Due to licensing constraints, you must download Vdbench yourself.
  7. Tick the “Easy Run” for automated VSAN testing.
  8. Validate and then start the Test.

Once the test has started you will get a progress screen.

The HCIBench tool will deploy the necessary VMs to your environment, configure them and wait for them to respond on the network. You will need to either provide a DHCP network or tick the box to get the HCIBench tool to allocate IPs to the worker VMs.

It will take a while for the VMs to be deployed and then they will prepare the disks before the actual test starts. This takes about 10 minutes or more. Once everything is ready the test will start.

It will take a couple of hours to do a full test. While it was running I opened esxtop and took a couple of quick screen shots of the current disk activity.

Results

After a few hours of testing I had the results. I wasn’t really surprised by the figures; they are almost exactly what I was expecting to get from the SanDisk X400 disks. According to the UserBenchmark website the expected 4K write throughput for the X400 is 63.7MB/s, and my throughput was 62.81MB/s. Now it’s time to buy a Samsung 960 EVO M.2 SSD and do the test again 🙂

Datastore SuperMicro_VSAN
VMs 6
IOPS 16078.98 IOPS
THROUGHPUT 62.81 MB/s
LATENCY 23.8660 ms
R_LATENCY 16.0298 ms
W_LATENCY 42.1727 ms
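
As a sanity check on these numbers, dividing the throughput by the IOPS gives the average I/O size, which works out to about 4 KiB and so lines up with the 4K comparison above. The Python sketch below does that arithmetic, and also estimates the read share of the workload under the assumption that the overall latency is an I/O-weighted average of the read and write latencies; that assumption, and treating 1 MB as 1024 KiB, are mine rather than anything HCIBench states.

```python
def average_io_size_kib(throughput_mb_s, iops):
    """Average I/O size in KiB, treating 1 MB as 1024 KiB."""
    return throughput_mb_s * 1024.0 / iops


def read_fraction(latency, read_latency, write_latency):
    """Read share implied if overall latency is the I/O-weighted mean.

    Solves latency = r * read_latency + (1 - r) * write_latency for r.
    """
    return (write_latency - latency) / (write_latency - read_latency)


if __name__ == "__main__":
    print(round(average_io_size_kib(62.81, 16078.98), 2))
    print(round(read_fraction(23.8660, 16.0298, 42.1727), 2))
```

Under that assumption the latencies imply a read-heavy mix of roughly 70% reads, which is worth keeping in mind when comparing against a write-only benchmark figure.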
