Disabling DRS in a vSphere Cluster Services (vCLS) enabled cluster

Recently I spent some time on DRS issues in my lab, more specifically on disabling DRS on a vSphere 7.0 Update 1+ cluster. Since I couldn’t find much documentation out there, I thought I’d write this article both as a learning note to myself and as knowledge sharing with the community.

vSphere Cluster Services (vCLS) virtual machines have been part of the DRS/HA functionality since vSphere 7.0 Update 1. If you want to know more, read this article: https://blogs.vmware.com/vsphere/2020/09/vsphere-7-update-1-vsphere-clustering-service-vcls.html

Quoting the official documentation:

vCLS uses agent virtual machines to maintain cluster services health. The vCLS agent virtual machines (vCLS VMs) are created when you add hosts to clusters. Up to three vCLS VMs are required to run in each vSphere cluster, distributed within a cluster. vCLS is also enabled on clusters which contain only one or two hosts. In these clusters the number of vCLS VMs is one and two, respectively.
https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.resmgmt.doc/GUID-96BD6016-4BE7-4B1C-8269-568D1555B08C.html
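
The sizing rule in that quote boils down to "one vCLS VM per host, capped at three". Here is a minimal Python sketch of it, just to make the rule concrete. The function name expected_vcls_vm_count is my own and is not part of any VMware API.

```python
def expected_vcls_vm_count(host_count):
    """Return how many vCLS VMs the quoted documentation leads us to expect."""
    if host_count < 0:
        raise ValueError("host_count cannot be negative")
    # One agent VM per host, capped at three per cluster.
    return min(host_count, 3)


for hosts in (1, 2, 3, 8):
    print(hosts, "host(s) ->", expected_vcls_vm_count(hosts), "vCLS VM(s)")
```

Comparing this number with what you actually see in the cluster is a quick first sanity check.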

Weird DRS cluster behaviours

In my lab I was experiencing strange issues. To name a few symptoms:

all hosts were unable to evacuate virtual machines automatically after a maintenance mode operation was invoked
anti-affinity rules were not respected, and I ended up having VMs running on the same host despite an active rule in place stating they should not
SDDC Manager prechecks telling me “maintenance mode dry run” failed because I had a VM pinned to a host, when I did not
more than one vCLS virtual machine running on the same host
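
Two of these symptoms (vCLS VMs sharing a host, and anti-affinity rules being ignored) can be checked mechanically once you have a VM-to-host inventory. Below is a rough Python sketch that works on a plain dictionary of VM name to host name, however you export it. The function names, the sample inventory and the idea of spotting vCLS VMs by their "vCLS" name prefix are my assumptions for illustration.

```python
from collections import defaultdict


def hosts_with_multiple_vcls(placement):
    """Return hosts running more than one VM whose name starts with 'vCLS'."""
    by_host = defaultdict(list)
    for vm, host in placement.items():
        # Assumption: vCLS VMs can be identified by their name prefix.
        if vm.startswith("vCLS"):
            by_host[host].append(vm)
    return {host: sorted(vms) for host, vms in by_host.items() if len(vms) > 1}


def anti_affinity_violations(placement, rule_vms):
    """Return hosts where two or more VMs of one anti-affinity rule share a host."""
    by_host = defaultdict(list)
    for vm in rule_vms:
        if vm in placement:
            by_host[placement[vm]].append(vm)
    return {host: sorted(vms) for host, vms in by_host.items() if len(vms) > 1}


# Made-up inventory, for illustration only.
placement = {
    "vCLS (1)": "esx01",
    "vCLS (2)": "esx01",
    "vCLS (3)": "esx03",
    "db-a": "esx02",
    "db-b": "esx02",
}
print(hosts_with_multiple_vcls(placement))
print(anti_affinity_violations(placement, ["db-a", "db-b"]))
```

An empty dictionary from either function means that particular symptom is not present in the inventory you fed in.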

I have several resource pools in this cluster.

Note that disabling DRS will effectively delete all these resource pools. In addition, in my setup I’m also running Workload Management (aka vSphere with Tanzu) on the cluster. When you toggle DRS off you are prompted with an alert, after which you have the opportunity to save a snapshot (settings) of all the DRS resource pools.

So there you go, DRS is now disabled.

If you go back to Workload Management expect something like this:

“Workload Management is still being configured. Please check back later”

Don’t panic: we did save the resource pool snapshot, so this will go away (in my experience) with a vCenter reboot after the DRS resource pools have been restored.

Please note: disabling DRS will not unprovision the existing vCLS virtual machines.

Next, we’re going to re-enable DRS and then restore the DRS resource pool tree.

And that’s where it gets a bit ugly, as I was getting the following error message:

There are virtual machines with missing “Assign virtual machine to resource pool” privilege that need to be moved into the newly created resource pools.

After a bit of internal research I discovered that a permission is missing from the vCLSAdmin role used by the vCLS service VMs. More specifically, the one that allows a virtual machine to be assigned to a resource pool.

privilege.Resource.AssignVMToPool.label

To fix it, from vCenter select Administration > Access Control > Roles > vCLSAdmin > Edit, find privilege.Resource.label (in the long list of privileges) and select privilege.Resource.AssignVMToPool.label.
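
If you want to double-check a role before and after editing it, a small helper like the one below can do it. This is only a sketch on plain data: it takes a dictionary of role name to privilege IDs, and how you export that from vCenter is up to you. I am assuming the privilege ID behind the label is Resource.AssignVMToPool, and the sample role contents are made up.

```python
# Assumption: privilege ID derived from the label privilege.Resource.AssignVMToPool.label
REQUIRED_PRIVILEGE = "Resource.AssignVMToPool"


def role_has_privilege(roles, role_name, privilege=REQUIRED_PRIVILEGE):
    """Return True if the named role carries the given privilege ID."""
    if role_name not in roles:
        raise KeyError(f"role {role_name!r} not found")
    return privilege in roles[role_name]


# Made-up role export, for illustration only.
roles = {
    "vCLSAdmin": {
        "VirtualMachine.Interact.PowerOn",
        "VirtualMachine.Interact.PowerOff",
    },
}
print(role_has_privilege(roles, "vCLSAdmin"))  # False before the fix

# Simulate ticking the privilege in the role editor.
roles["vCLSAdmin"].add(REQUIRED_PRIVILEGE)
print(role_has_privilege(roles, "vCLSAdmin"))  # True after the fix
```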

Now try again to restore the resource pool tree and it should all work.

All your VMs will automatically be placed under the corresponding resource pools.

That’s it folks. Hope you find it useful.

blog.bertello.org
