Recently I spent some time troubleshooting DRS issues in my lab, specifically around disabling DRS on vSphere 7.0 Update 1 and later. Since I couldn’t find much documentation out there, I decided to write this article both as a learning note to myself and as knowledge sharing with the community.
vSphere Clustering Service (vCLS) virtual machines have been part of the DRS/HA functionality since vSphere 7.0 Update 1. If you want to know more, read this article: https://blogs.vmware.com/vsphere/2020/09/vsphere-7-update-1-vsphere-clustering-service-vcls.html
Quoting the official documentation:
vCLS uses agent virtual machines to maintain cluster services health. The vCLS agent virtual machines (vCLS VMs) are created when you add hosts to clusters. Up to three vCLS VMs are required to run in each vSphere cluster, distributed within a cluster. vCLS is also enabled on clusters which contain only one or two hosts. In these clusters the number of vCLS VMs is one and two, respectively.
https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.resmgmt.doc/GUID-96BD6016-4BE7-4B1C-8269-568D1555B08C.html
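The sizing rule in that quote is easy to express in a few lines of Python. This is only a sketch of how I read the documentation (one vCLS VM per host, capped at three); the function name is my own and nothing here talks to vCenter.

```python
def expected_vcls_vm_count(host_count: int) -> int:
    """Number of vCLS VMs the quoted documentation describes for a cluster."""
    if host_count < 0:
        raise ValueError("host_count must be non-negative")
    # One VM per host for one- and two-host clusters, never more than three.
    return min(host_count, 3)


for hosts in (1, 2, 3, 8):
    print(hosts, "host(s) ->", expected_vcls_vm_count(hosts), "vCLS VM(s)")
```

If the number of vCLS VMs you actually see in a cluster differs from this, that is a hint something is off with the cluster services.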
Weird DRS cluster behaviours
In my lab I was experiencing strange issues. To name a few symptoms:
- none of the hosts were able to evacuate virtual machines automatically after a maintenance mode operation was invoked
- anti-affinity rules were not respected, and I ended up with VMs running on the same host despite an active rule stating they should not
- SDDC Manager prechecks reported that the “maintenance mode dry run” failed because I had a VM pinned to a host, when I did not
- more than one vCLS virtual machine running on the same host
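Two of these symptoms (broken anti-affinity and stacked vCLS VMs) can be spotted from a simple VM-to-host mapping. The sketch below assumes you already have that mapping from whatever inventory export you prefer; the VM names, host names, rule name and the `vCLS` name prefix are all made up for illustration.

```python
from collections import defaultdict


def hosts_with_multiple_vcls(placement, vcls_prefix="vCLS"):
    """Return {host: [vm, ...]} for hosts running more than one vCLS VM."""
    by_host = defaultdict(list)
    for vm, host in placement.items():
        if vm.startswith(vcls_prefix):  # assumption: vCLS VMs share a name prefix
            by_host[host].append(vm)
    return {host: sorted(vms) for host, vms in by_host.items() if len(vms) > 1}


def anti_affinity_violations(placement, rules):
    """Return {rule: {host: [vm, ...]}} where members of a rule share a host."""
    violations = {}
    for rule_name, members in rules.items():
        by_host = defaultdict(list)
        for vm in members:
            if vm in placement:  # skip VMs missing from the mapping
                by_host[placement[vm]].append(vm)
        shared = {h: sorted(vms) for h, vms in by_host.items() if len(vms) > 1}
        if shared:
            violations[rule_name] = shared
    return violations


# Hypothetical inventory: VM name -> host name.
placement = {
    "vCLS (1)": "esx01",
    "vCLS (2)": "esx01",
    "vCLS (3)": "esx03",
    "db-a": "esx02",
    "db-b": "esx02",
    "web-a": "esx03",
}
# Hypothetical anti-affinity rule: rule name -> VMs that must be kept apart.
rules = {"separate-dbs": ["db-a", "db-b"]}

print(hosts_with_multiple_vcls(placement))
print(anti_affinity_violations(placement, rules))
```

An empty dictionary from both functions means neither symptom is present in the mapping you fed in.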
In my cluster I have several resource pools.
Note that disabling DRS will effectively delete all these resource pools. In addition, in my setup I’m also running Workload Management (aka vSphere with Tanzu) on the cluster. When you toggle DRS off, you will be prompted with an alert, after which you will have the opportunity to save a snapshot (settings) of all the DRS resource pools.
So there you go, DRS is now disabled.
At this point you may see the following message:
“Workload Management is still being configured. Please check back later”
Don’t panic: because we saved the resource pool snapshot, this will go away (in my experience) with a VC reboot after the DRS resource pools have been restored.
Please note: disabling DRS will not unprovision the existing vCLS virtual machines.
Next, we’re going to re-enable DRS and then restore the DRS resource pool tree.
That’s where it gets a bit ugly, as I was getting the following error message:
There are virtual machines with missing “Assign virtual machine to resource pool” privilege that need to be moved into the newly created resource pools.
After a bit of internal research I discovered that there is a permission missing from the vCLSAdmin role used by the vCLS service VMs. More specifically, it is the one that allows a virtual machine to be assigned to a resource pool.
To fix it, from vCenter select Administration > Access Control > Roles > vCLSAdmin > Edit and select privilege.Resource.label (from the long list of privileges).
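If you want to double check the role before retrying, the test boils down to a set membership check. This is a minimal sketch, not a vCenter API call: it assumes you have exported the role's privilege IDs into a list, that `Resource.AssignVMToPool` is the ID behind “Assign virtual machine to resource pool”, and the sample lists are invented.

```python
# Assumed privilege ID for "Assign virtual machine to resource pool".
REQUIRED_PRIVILEGE = "Resource.AssignVMToPool"


def missing_privileges(role_privileges, required=(REQUIRED_PRIVILEGE,)):
    """Return the required privilege IDs that the role does not hold, sorted."""
    held = set(role_privileges)
    return sorted(p for p in required if p not in held)


# Hypothetical privilege lists for the vCLSAdmin role, before and after the fix.
vcls_admin_before = ["System.Anonymous", "System.Read", "System.View"]
vcls_admin_after = vcls_admin_before + [REQUIRED_PRIVILEGE]

print("before:", missing_privileges(vcls_admin_before))
print("after: ", missing_privileges(vcls_admin_after))
```

An empty list means the role holds everything you asked about, so the resource pool restore should no longer complain about that privilege.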
Now try again to restore the resource pool tree, and it should all work.
All your VMs will automatically be placed under the corresponding resource pools.
That’s it folks. Hope you find it useful.