Before attempting any upgrade in a production environment, I always try to test the process and functionality in a lab first. With this in mind, I wanted to test the upgrade from VSAN 6.5 to 6.6 in my home lab, and unfortunately I initially didn’t have a whole lot of success. I’ve now fixed all the issues, and just in case anyone has the same problems, I’d like to make sure the resolution is readily available. I haven’t had the time to identify the root cause, but I have resolved the issues.
Firstly, let me be clear: this is on UNSUPPORTED hardware. These issues may never occur in a fully supported and compliant production environment, and I have not seen them in one. However, we all tend to run our labs on unsupported hardware, so I’m sure I won’t be the only one who comes across these issues, and the resolution is pretty simple. I have seen the same issues three times in three separate (unsupported) environments.
The upgrade was from VSAN 6.5 to VSAN 6.6. Because VSAN isn’t a stand-alone product but is built into vSphere, the upgrade is as simple as upgrading ESXi. I was running ESXi 6.5.0 (Build 4887370) and upgraded to ESXi 6.5.0 (Build 5310538).
It has been a long time (and I mean a LONG time) since I have seen an ESXi purple screen, but soon after upgrading my environment to ESXi 6.5 (Build 5310538) my hosts started purple screening. I had to take a screenshot because this is a rare sight. It only happened once, and since the fixes below were applied it has never happened again.
The VSAN upgrade process is very straightforward to perform:
- Upgrade vCenter Server
- Upgrade ESXi hosts
- Upgrade the disk format version
Straight after the upgrade I started receiving vMotion alerts and my VMs wouldn’t migrate between hosts. There didn’t appear to be any configuration issues with vMotion and it was working perfectly fine before the upgrade. I tested the connectivity using a vmkping between hosts on the vMotion vmkernel IP and it failed. There was no network connectivity between hosts on the vMotion vmkernel port!
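The connectivity test above can be sketched as a small shell helper. This is only a sketch: `vmk1` and the peer address are placeholders rather than values from this upgrade, the helper names are made up for illustration, and the optional `-S` netstack argument only applies if the vmkernel port sits on the dedicated vMotion TCP/IP stack. With `DRY_RUN` left at its default, the helper only prints the command.

```shell
#!/bin/sh
# Sketch only: vmk1 and the peer IP are placeholders, and run() and
# test_vmotion_link() are hypothetical helper names.
# DRY_RUN=1 (the default) prints the command instead of running it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# test_vmotion_link <local vmk> <peer vMotion IP> [netstack]
# Pings the peer host's vMotion address from the given vmkernel interface.
test_vmotion_link() {
    vmk="$1"; peer="$2"; stack="${3:-}"
    if [ -n "$stack" ]; then
        run vmkping -I "$vmk" -S "$stack" "$peer"
    else
        run vmkping -I "$vmk" "$peer"
    fi
}
```

Called as `test_vmotion_link vmk1 192.168.50.12`, the default dry run just prints `vmkping -I vmk1 192.168.50.12`. Setting `DRY_RUN=0` in an SSH session on the host would actually send the pings.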
The vMotion fix:
I found that simply deleting the existing vMotion vmkernel and creating a new one with the exact same configuration fixed all the issues. I had to do this on every host in the cluster, and then vMotion started working again.
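The delete-and-recreate step can also be done from the ESXi shell instead of the client. The following is a minimal sketch, assuming a standard vSwitch port group and a static IPv4 address. The vmk name, port group name, IP and netmask are placeholders and the helper names are hypothetical. It defaults to a dry run so the commands can be reviewed before anything is removed.

```shell
#!/bin/sh
# Sketch only: interface, port group, IP and netmask are placeholders.
# Assumes a standard vSwitch port group and a static IPv4 address.
# DRY_RUN=1 (the default) prints each command instead of running it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# recreate_vmotion_vmk <vmk> <portgroup> <ip> <netmask>
recreate_vmotion_vmk() {
    vmk="$1"; pg="$2"; ip="$3"; mask="$4"
    # Show the current IPv4 settings so they can be copied exactly.
    run esxcli network ip interface ipv4 get -i "$vmk"
    # Delete the existing vmkernel interface.
    run esxcli network ip interface remove --interface-name="$vmk"
    # Re-create it on the same port group with the same address.
    run esxcli network ip interface add --interface-name="$vmk" --portgroup-name="$pg"
    run esxcli network ip interface ipv4 set --interface-name="$vmk" --ipv4="$ip" --netmask="$mask" --type=static
    # Tag the new interface for vMotion traffic.
    run esxcli network ip interface tag add --interface-name="$vmk" --tagname=VMotion
}
```

For example, `recreate_vmotion_vmk vmk1 vMotion 192.168.50.11 255.255.255.0` prints the five commands in order. If the port lives on a distributed switch, the `add` step needs different options, so treat this as a starting point only.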
This brings me to the next issue, which was a lot more critical: CLOMD Liveness. After I resolved the vMotion alerts, I ran a quick health check on VSAN and found that my hosts were now reporting a “CLOMD Liveness” issue. This is concerning because CLOMD (Cluster Level Object Manager Daemon) is a key component of VSAN. CLOMD runs on every ESXi host in a VSAN cluster and is responsible for creating new objects, communication between hosts for data moves and evacuations, and the repair of existing VSAN objects. To put it simply, without it no new objects can be created on VSAN.
If you want to test this out (in a test environment), SSH to your ESXi hosts and stop the CLOMD daemon by running “/etc/init.d/clomd stop” and then try to create new objects or do a VM creation proactive VSAN test and see what happens. You will get the error “Cannot complete file creation operation”.
And the output from the proactive VSAN test is “Failed to create object. A CLOM is not attached. This could indicate that the clomd daemon is not running”.
If CLOMD isn’t running, you’re not at risk of losing any data; it just means that new data can’t be created. I would still suggest that it is critical to get it running again.
The CLOMD Liveness issue can occur for a number of reasons. The VMware KB article is here: https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2109873
To check whether the CLOMD daemon is running on the hosts, you can execute the following command on each host:
The results showed that the CLOMD service was not running, and even after restarting the service it would stop running a short time later.
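The status check and restart described above can be sketched as follows. This assumes the clomd init script accepts `status` and `restart` in the same way as the `stop` action used earlier in this post. The helper names are hypothetical, and the default dry run only prints the commands.

```shell
#!/bin/sh
# Sketch only: assumes /etc/init.d/clomd accepts "status" and "restart",
# mirroring the "stop" action used earlier. Helper names are hypothetical.
# DRY_RUN=1 (the default) prints each command instead of running it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# check_clomd: report the daemon state on the host this runs on.
check_clomd() {
    run /etc/init.d/clomd status
}

# restart_clomd: restart the daemon, then report its state again.
restart_clomd() {
    run /etc/init.d/clomd restart
    check_clomd
}
```

Running `check_clomd` again a few minutes after `restart_clomd` is the useful part, since the symptom here was the daemon stopping again shortly after a restart.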
The VSAN CLOMD Liveness fix:
Learning from the vMotion vmkernel issue, I immediately tried deleting and re-creating the VSAN vmkernel on each host, and this fixed the issue. However, this is a little more delicate than the vMotion process because deleting the VSAN vmkernel instantly partitions that host from the rest of the cluster, so you need to be careful how you do it.
Place the host in Maintenance Mode first! We aren’t going to lose any data, so you don’t need to evacuate everything; however, I would recommend you at least select “Ensure data accessibility from other hosts”. Selecting “No Data Migration” is generally only suggested if you are shutting down all nodes in the VSAN cluster, or possibly for a non-intrusive action like a quick reboot.
Once the host is in Maintenance Mode you can now delete the existing vmkernel and re-create a new one with the same settings. I would then reboot the host for good measure. Once the host is back up, you can exit Maintenance Mode and then move on to the next host.
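For anyone who prefers the ESXi shell, the per-host sequence above might look like the sketch below. It assumes a standard vSwitch port group, a static IPv4 address, and that `esxcli vsan network ip add` is the right way to tag the interface on this build. The vmk name, port group, IP and netmask are placeholders and the helper names are hypothetical. It defaults to a dry run, and one host should be finished and healthy before the next is started.

```shell
#!/bin/sh
# Sketch only: interface, port group, IP and netmask are placeholders.
# Assumes a standard vSwitch port group and a static IPv4 address.
# DRY_RUN=1 (the default) prints each command instead of running it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# recreate_vsan_vmk <vmk> <portgroup> <ip> <netmask>
recreate_vsan_vmk() {
    vmk="$1"; pg="$2"; ip="$3"; mask="$4"
    # Enter Maintenance Mode with "Ensure data accessibility".
    run esxcli system maintenanceMode set --enable=true --vsanmode=ensureObjectAccessibility
    # Record the current VSAN network configuration before touching it.
    run esxcli vsan network list
    # Delete and re-create the vmkernel interface with the same settings.
    run esxcli network ip interface remove --interface-name="$vmk"
    run esxcli network ip interface add --interface-name="$vmk" --portgroup-name="$pg"
    run esxcli network ip interface ipv4 set --interface-name="$vmk" --ipv4="$ip" --netmask="$mask" --type=static
    # Tag the new interface for VSAN traffic.
    run esxcli vsan network ip add --interface-name="$vmk"
    # Reboot for good measure.
    run reboot
}

# exit_maintenance: run once the host is back up after the reboot.
exit_maintenance() {
    run esxcli system maintenanceMode set --enable=false
}
```

A dry run such as `recreate_vsan_vmk vmk2 VSAN 192.168.60.11 255.255.255.0` prints the seven commands for review. Because removing the interface drops the host out of the VSAN network, run this from the console or over the management network, not over the interface being deleted.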
Again, I stress that I have only seen this issue on unsupported hardware.
My VSAN Upgrade Process
- Upgrade vCenter
- Upgrade each ESXi server
- Upgrade the disk format version
- Run a VSAN health check!
- If you have a CLOMD issue, then for each ESXi host in the VSAN cluster:
  - Place the host in Maintenance Mode
  - Delete and re-create the vMotion vmkernel
  - Delete and re-create the VSAN vmkernel
  - Reboot the ESXi host
  - Move on to the next host
- Run a VSAN health check again
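The health check steps in this list can also be run from the shell. The sketch below assumes the upgraded hosts expose `esxcli vsan health cluster list`. The helper name is hypothetical, and all it does is keep the CLOMD-related lines from whatever output is piped into it.

```shell
#!/bin/sh
# Sketch only: clomd_health_lines is a hypothetical helper name.
# It reads health check output on stdin and keeps the CLOMD-related lines.
clomd_health_lines() {
    grep -i 'clomd'
}

# On an upgraded host (assumes the vsan health namespace exists there):
#   esxcli vsan health cluster list | clomd_health_lines
```

Running it before and after the vmkernel fix on each host gives a quick way to confirm that the CLOMD Liveness check has cleared.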