Troubleshooting failing LCM update on VMware Cloud Foundation 2.2

I’ve been working extensively with SDDC Manager and VMware Cloud Foundation 2.2 (VCF) on VxRack SDDC lately, and I ran into this problem while trying to execute an LCM (Lifecycle Manager) update:

Pending due to failed domain

A quick look at System Status > Workflows showed a failed workload domain. The catch: it was an old, already deleted workload domain. So how can something that no longer exists stop me from upgrading another workload domain? I couldn’t get my head around it…

Troubleshooting SDDC Manager (VRM)

Selecting the workflow lets you see its UUID in the URL; in my case it was e08db4bf-574c-455a-84af-ee840bddac7b-0
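If you prefer to script this step, the workflow UUID can be pulled straight out of the URL. Here is a minimal sketch; the URL below is a made-up example, the real one comes from your SDDC Manager session:

```python
import re

# Hypothetical example URL; the real one comes from the SDDC Manager UI.
url = "https://sddc-manager/ui/workflows/e08db4bf-574c-455a-84af-ee840bddac7b-0"

# Workflow IDs look like a UUID followed by a numeric suffix (e.g. "-0").
match = re.search(r"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}-\d+", url)
workflow_id = match.group(0) if match else None
print(workflow_id)  # e08db4bf-574c-455a-84af-ee840bddac7b-0
```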

Checking logs

Looking at the LCM log file /home/vrack/lcm/logs/lcm.log, I found the JSON section describing the failed domain:

  {
        "bundleId": "d0dd1418-0e39-4053-8e37-ccf6db93b223",
        "bundleVersion": {
          "major": 2,
          "minor": 2,
          "patch": 1,
          "build": "100772"
        },
        "releaseDate": "1507845621201"
      }
    ],
    "availableUpgrades": [
      {
        "bundle": {
          "bundleId": "36f3de72-02cb-4064-867c-4c32817e1c73",
          "bundleVersion": {
            "major": 2,
            "minor": 2,
            "patch": 3,
            "build": "100828"
          },
          "releaseDate": "1509055794423"
        },
        "item": [
          {
            "id": "ded61e01-a6e7-11e7-8333-ed25d3ada4ce",
            "type": "VCENTER",
            "domainType": "MANAGEMENT",
            "domainId": "dba417f0-a6e7-11e7-8333-ed25d3ada4ce"
          }
        ]
      }
    ],
    "failedDomains": [
      {
        "domainType": "VI",
        "domainId": "5cf02cdf-56d1-444d-8988-ec8e3412e41b",
        "vcenterId": "6249ef93-fa1d-43e4-8218-68888ef954ec",
        "failedItems": [
          {
            "id": "9dc322fb-6f32-45a3-b3d8-f3c0e1dea8f3",
            "type": "NSX_MANAGER",
            "domainType": "VI",
            "domainId": "5cf02cdf-56d1-444d-8988-ec8e3412e41b"
          }
        ]
      }
    ],
    "error": false
  }

So something was wrong with the NSX Manager with id 9dc322fb-6f32-45a3-b3d8-f3c0e1dea8f3. OK, but how do I correlate this NSX Manager ID with a human-readable name, such as an FQDN?
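Given the structure above, the failed item IDs can also be extracted programmatically rather than by eyeballing lcm.log. A sketch assuming exactly the JSON layout shown (here inlined as a string for illustration):

```python
import json

# The "failedDomains" section from lcm.log, inlined for illustration.
status = json.loads("""
{
  "failedDomains": [
    {
      "domainType": "VI",
      "domainId": "5cf02cdf-56d1-444d-8988-ec8e3412e41b",
      "vcenterId": "6249ef93-fa1d-43e4-8218-68888ef954ec",
      "failedItems": [
        {
          "id": "9dc322fb-6f32-45a3-b3d8-f3c0e1dea8f3",
          "type": "NSX_MANAGER",
          "domainType": "VI",
          "domainId": "5cf02cdf-56d1-444d-8988-ec8e3412e41b"
        }
      ]
    }
  ]
}
""")

# Collect (type, id) for every failed item across all failed domains.
failed = [(item["type"], item["id"])
          for domain in status["failedDomains"]
          for item in domain["failedItems"]]
print(failed)  # [('NSX_MANAGER', '9dc322fb-6f32-45a3-b3d8-f3c0e1dea8f3')]
```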

As I don’t like seeing FAILED “things”, I started by deleting the failed workflow.

Removing the failed workflow

It’s not possible to remove workflows from the SDDC Manager UI. SSH into the SDDC Manager console and from there connect to ZooKeeper using the command zkCli.sh.
With the following commands I list all the workflows, then remove the failed one.

NOTE: Be careful when playing with zkCli.sh, because you can easily break ZooKeeper real quick!

/opt/vmware/zookeeper/bin/zkCli.sh
ls /Workloads/Workflows/
rmr /Workloads/Workflows/e08db4bf-574c-455a-84af-ee840bddac7b-0


and boom, the FAILED workflow is gone!

Correlating NSX Manager ID with its FQDN

To check the NSX Manager details we need to connect to the Cassandra database used by SDDC Manager.

NOTE: again, the same principle as with ZooKeeper applies here: this is not supported and you’re on your own! Be very careful messing around with the Cassandra DB.

/opt/vmware/cassandra/apache-cassandra-2.2.4/bin/cqlsh
use vrmkeyspace;

expand on;
select * from nsx;

The state for rack-1-nsxmanager-8-ServerFarm (hostname rack-1-nsxmanager-8.isus.emc.com) is FAILED, and its id matches the one from lcm.log.
Discovery: this NSX Manager actually belongs to an existing workload domain, which I can now go and inspect to see what’s wrong.
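In script form, the correlation step boils down to matching the failed ID from lcm.log against the rows returned by cqlsh. A sketch with made-up rows; the column names id/hostname/status are assumptions based on the output described above, and the second row's id is purely illustrative:

```python
# Hypothetical rows as returned by "select * from nsx;" (column names assumed).
nsx_rows = [
    {"id": "9dc322fb-6f32-45a3-b3d8-f3c0e1dea8f3",
     "hostname": "rack-1-nsxmanager-8.isus.emc.com",
     "status": "FAILED"},
    {"id": "11111111-2222-3333-4444-555555555555",  # made-up healthy entry
     "hostname": "rack-1-nsxmanager-1.isus.emc.com",
     "status": "ACTIVE"},
]

failed_id = "9dc322fb-6f32-45a3-b3d8-f3c0e1dea8f3"  # from lcm.log

# Map the ID from lcm.log to a human-readable FQDN.
fqdn = next(r["hostname"] for r in nsx_rows if r["id"] == failed_id)
print(fqdn)  # rack-1-nsxmanager-8.isus.emc.com
```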

Solution

In my case, one of the three NSX Controllers had its filesystem mounted read-only, which caused SDDC Manager to mark the whole NSX Manager as failed. I covered a very similar problem in the article SDDC Manager LCM Update status “Pending due to failed domain”.
