Let’s get straight into it.
The problem I was facing: in SDDC Manager, the LCM prechecks were failing on all hosts with an error message like the following:
Checks for dry run of enter maintenance mode. The virtual machine is pinned to a host.
This error message suggests that there are virtual machines pinned to a host by way of DRS VM/Host rules. However, upon inspection there were none.
Further investigation revealed that the existing DRS VM/Host anti-affinity rules I had created to keep some VMs apart were not being enforced either. In addition, the following error kept popping up at the cluster level:
vSphere DRS functionality was impacted due to unhealthy state vSphere Cluster Services caused by the unavailability of vSphere Cluster Service VMs. vSphere Cluster Service VMs are required to maintain the health of vSphere DRS
At this point I was convinced that something wasn’t right with DRS. Since the vCLS virtual machines are responsible for the DRS/HA functionality, my first thought was to disable and re-enable DRS, thinking that would be sufficient for the vCLS VMs to get deleted and re-provisioned. I was wrong!
Let’s start by saying that disabling DRS by itself does not get rid of the vCLS virtual machines. What it does do is delete all the resource pools in the cluster. See this separate blog I wrote about Disabling DRS in vSphere Clustered Services (vCLS) enabled cluster. To make a long story short, simply turning DRS off and on did not fix my problem.
vCLS Retreat Mode
As I started researching and learning more about vCLS, I discovered KB 80472 vSphere Cluster Services (vCLS) in vSphere 7.0 Update 1 and newer versions. The only way to completely get rid of the vCLS virtual machines is by invoking retreat mode. Here’s a quick run-through:
Select the cluster in the vSphere Client and take the cluster ID (in the form domain-c<number>) from the vCenter Server URL
Add the following advanced setting to vCenter Server:
Give it a few minutes and the vCLS monitoring service will detect the change we requested, initiate the clean-up of vCLS VMs and you will start seeing the VM deletion tasks running:
Once all the vCLS VMs are gone, disable retreat mode by setting the value of the advanced setting back to true; expect all the vCLS virtual machines to be re-deployed
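To make the first two steps concrete, here is a minimal Python sketch that pulls the cluster ID out of a vSphere Client URL and builds the advanced setting that KB 80472 describes, `config.vcls.clusters.domain-c<number>.enabled`, with the value `false` to enter retreat mode and `true` to leave it. The URL, host name and function names below are made up for illustration, and the URL layout is an assumption based on what the vSphere Client typically shows, so check the result against your own environment. The sketch only prints the key and value; it does not talk to vCenter.

```python
import re


def cluster_moid(url: str) -> str:
    """Extract the cluster ID (e.g. 'domain-c8') from a vSphere Client URL."""
    # Assumption: the URL contains 'ClusterComputeResource:domain-c<number>'.
    match = re.search(r"ClusterComputeResource:(domain-c\d+)", url)
    if match is None:
        raise ValueError("no cluster ID (domain-c<number>) found in URL")
    return match.group(1)


def retreat_mode_setting(url: str, enabled: bool) -> tuple[str, str]:
    """Return the (name, value) pair for the vCenter advanced setting.

    enabled=False requests retreat mode (vCLS VMs get removed);
    enabled=True lets vCenter deploy the vCLS VMs again.
    """
    moid = cluster_moid(url)
    return f"config.vcls.clusters.{moid}.enabled", "true" if enabled else "false"


# Hypothetical URL, copied from the browser after selecting the cluster.
url = (
    "https://vcenter.example.local/ui/app/cluster;nav=h/"
    "urn:vmomi:ClusterComputeResource:domain-c8:"
    "01234567-89ab-cdef-0123-456789abcdef/summary"
)

key, value = retreat_mode_setting(url, enabled=False)
print(key, "=", value)  # config.vcls.clusters.domain-c8.enabled = false
```

Calling `retreat_mode_setting(url, enabled=True)` gives the same key with the value `true`, which is what the last step sets once the vCLS VMs are gone.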
As soon as the vCLS VMs were up and running again, I noticed that my workloads started to migrate automatically and, upon inspection, the DRS VM/Host anti-affinity rules were all enforced correctly. In addition, where before I had more than one vCLS virtual machine running on the same host (which should not happen), they were now running on different hosts. Needless to say, back in SDDC Manager the prechecks were all passing.
SDDC Manager Prechecks
So there you have it, hope you found it useful.