I’ve been working extensively with SDDC Manager and VMware Cloud Foundation (VCF) 2.2 on VxRack SDDC lately, and I ran into this problem while trying to run an LCM (Life Cycle Manager) update:
Pending due to failed domain
A quick look at System Status > Workflows showed a failed workload domain, but the odd part was that it was an old, already deleted workload domain. So how can something that doesn’t exist stop me from upgrading another workload domain? I couldn’t get my head around it …
Troubleshooting SDDC Manager (VRM)
Selecting the workflow lets you see its UUID in the URL; in my case it was e08db4bf-574c-455a-84af-ee840bddac7b-0
Checking logs
Looking at the LCM log file /home/vrack/lcm/logs/lcm.log, I could find the JSON section for the failed domain.
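A quick way to locate such a section is to search the log for a known string and print some context after each match. The sketch below is only an illustration: `find_in_log` is a made-up helper name, and which string to search for (the workflow ID from the URL, or a status such as FAILED) is an assumption, since that depends on what lcm.log actually records.

```shell
# Hypothetical helper: print every line of a log file that contains a fixed
# string, with line numbers and a few lines of trailing context.
find_in_log() {
  pattern="$1"                                   # literal text to look for
  logfile="${2:-/home/vrack/lcm/logs/lcm.log}"   # log path taken from the post
  context="${3:-20}"                             # lines to print after each match
  # -F treats the pattern as plain text, -n adds line numbers,
  # -A prints the requested number of lines after each match.
  grep -n -F -A "$context" -- "$pattern" "$logfile"
}
```

For example, `find_in_log e08db4bf-574c-455a-84af-ee840bddac7b-0` would search the default log path for the workflow ID, if the ID is logged there at all.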
Removing the failed workflow
It’s not possible to remove workflows from the SDDC Manager UI. SSH into the SDDC Manager console and from there connect to ZooKeeper using the zkCli.sh command.
With the following commands I list all the workflows and then remove the failed one.
NOTE: Be careful when playing with zkCli.sh because you can mess up ZooKeeper real quick!
/opt/vmware/zookeeper/bin/zkCli.sh
ls /Workloads/Workflows
rmr /Workloads/Workflows/e08db4bf-574c-455a-84af-ee840bddac7b-0
and boom, the FAILED workflow is gone!
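Since a removal in ZooKeeper cannot be undone, it may be worth saving the znode content first. The following is a rough sketch of the same removal done non-interactively, using zkCli.sh's `-server host:port command` form. The server address `localhost:2181` (ZooKeeper's default client port) and the helper name `remove_workflow` are assumptions, not something taken from SDDC Manager documentation, and the backup file will also contain whatever client log output zkCli.sh prints.

```shell
# Path to zkCli.sh as used in the post; the server address is an assumption.
ZKCLI="${ZKCLI:-/opt/vmware/zookeeper/bin/zkCli.sh}"
ZK_SERVER="${ZK_SERVER:-localhost:2181}"

# Hypothetical helper: save a workflow znode's content to a file, then remove it.
remove_workflow() {
  id="$1"
  # Refuse to run without an ID, so the parent znode is never the target.
  [ -n "$id" ] || { echo "usage: remove_workflow <workflow-id> [backup-file]" >&2; return 2; }
  backup="${2:-workflow-$id.bak}"
  node="/Workloads/Workflows/$id"
  # Dump the znode first. This only stops if zkCli.sh returns a non-zero exit
  # status; some versions may exit 0 even on errors, so inspect the backup file.
  "$ZKCLI" -server "$ZK_SERVER" get "$node" > "$backup" || return 1
  # Recursively delete the znode, as done interactively in the post.
  "$ZKCLI" -server "$ZK_SERVER" rmr "$node"
}
```

Calling `remove_workflow e08db4bf-574c-455a-84af-ee840bddac7b-0` would leave a file named `workflow-e08db4bf-574c-455a-84af-ee840bddac7b-0.bak` in the current directory before deleting the node.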
Correlating NSX Manager ID with its FQDN
To check the NSX Manager details we need to connect to the Cassandra database used by SDDC Manager.
NOTE: again, the same principle as with ZooKeeper applies here: it’s not supported and you’re on your own! Be very careful when messing around with the Cassandra DB.
/opt/vmware/cassandra/apache-cassandra-2.2.4/bin/cqlsh
use vrmkeyspace;
expand on;
select * from nsx;
The state for rack-1-nsxmanager-8-ServerFarm (hostname rack-1-nsxmanager-8.isus.emc.com) is FAILED, and its id matches the one from lcm.log.
Discovery: this NSX Manager actually belongs to an existing workload domain, which I can go and check to see what’s wrong.
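The same read-only query can also be run in one shot with cqlsh's `-e` option, which is handy when the output needs to be piped through grep. This is a sketch under assumptions: the helper names are made up, the table's column layout is not known here, so the filter simply looks for the word FAILED and prints neighbouring lines of the expanded output, and the context size of 10 lines is an arbitrary default.

```shell
# cqlsh path as used in the post.
CQLSH="${CQLSH:-/opt/vmware/cassandra/apache-cassandra-2.2.4/bin/cqlsh}"

# Hypothetical helper: dump the nsx table in expanded (one column per line) form.
dump_nsx_table() {
  "$CQLSH" -e "EXPAND ON; SELECT * FROM vrmkeyspace.nsx;"
}

# Hypothetical helper: show lines mentioning FAILED together with the lines
# around them, so the id and hostname of the affected row are visible.
show_failed_nsx() {
  context="${1:-10}"
  dump_nsx_table | grep -B "$context" -A "$context" "FAILED"
}
```

Nothing is written to Cassandra here; both helpers only issue a SELECT.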
Solution
In my case, one of the three NSX Controllers had its filesystem mounted read-only, which was causing SDDC Manager to mark NSX Manager as failed. I covered a very similar problem in this article: SDDC Manager LCM Update status “Pending due to failed domain”
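For reference, on a generic Linux system a read-only filesystem can be spotted by looking at the mount options in /proc/mounts. The sketch below assumes a regular shell with awk is available; whether and how the NSX Controller appliance exposes such a shell is not covered here, and `list_readonly_mounts` is a made-up helper name.

```shell
# Hypothetical helper: print the mount points of block-device filesystems
# that are currently mounted read-only.
list_readonly_mounts() {
  mounts_file="${1:-/proc/mounts}"
  # Fields in /proc/mounts: device, mount point, fs type, options, dump, pass.
  # Only devices under /dev/ are considered, to skip pseudo filesystems.
  awk '$1 ~ /^\/dev\// {
         n = split($4, opts, ",")
         for (i = 1; i <= n; i++)
           if (opts[i] == "ro") print $2
       }' "$mounts_file"
}
```

An empty result means no block-device filesystem is mounted with the `ro` option at that moment.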