VCF 4.0.1 to 4.1 LCM Update

What’s new in VCF 4.1?

VMware Cloud Foundation 4.1 ships new versions of the core components and the following new features:

  • vVols as Principal Storage in Workload Domains: VMware Cloud Foundation now supports vVols as principal storage, providing a common storage management framework for external storage and automation for pre-defined storage, including volume management and provisioning.
  • Remote Clusters: Extends VMware Cloud Foundation capabilities to ROBO and edge sites with VMware Cloud Foundation Remote Clusters. Now customers can enjoy the same consistent cloud operations in their core data center and edge/ROBO sites.
  • Read-only Access and Local Accounts: Administrators can create VIEWER users that have read-only access to VMware Cloud Foundation. They can also create a local account for use in break-glass scenarios where a remote identity provider is unreachable.
  • ESXi Parallel Upgrades: Enables you to update the ESXi software on multiple clusters in the management domain or a workload domain in parallel. Parallel upgrades reduce the overall time required to upgrade your environment.
  • NSX-T Data Center Parallel Upgrades: Enables you to upgrade all Edge clusters in parallel, and then all host clusters in parallel. Parallel upgrades reduce the overall time required to upgrade your environment.
  • Support for ESXi hosts with external CA-signed certificates: VMware Cloud Foundation supports APIs to perform bring-up of hosts with certificates generated by an external Certificate Authority (CA). 
  • vRealize Suite Lifecycle Manager in VMware Cloud Foundation mode: VMware Cloud Foundation 4.1 introduces an improved integration with vRealize Suite Lifecycle Manager. When vRealize Suite Lifecycle Manager in VMware Cloud Foundation mode is enabled, the behavior of vRealize Suite Lifecycle Manager is aligned with the VMware Cloud Foundation architecture.
  • vSphere Cluster Services (vCLS) Support: vCLS is a new capability introduced in the vSphere 7 Update 1 release. vCLS ensures that if vCenter Server becomes unavailable, cluster services remain available to maintain the resources and health of the workloads that run in the clusters.
  • Support for Renaming VMware Cloud Foundation Objects: You can rename workload domains, network pools, and compute clusters after you have deployed them. This allows the flexibility of naming these Cloud Foundation objects to align with company policies.
  • VMware Skyline Support for VMware Cloud Foundation: VMware Skyline brings proactive intelligence to VMware Cloud Foundation by identifying management and workload domains, and proactively surfacing VMware Cloud Foundation solution findings. 
  • Backup Enhancements: SDDC Manager backup and recovery workflows and APIs have been improved to add new capabilities including, backup management, backup scheduling, retention policy, on-demand backup, and automatic retries on failure. The enhancements also include Public APIs for 3rd party ecosystem and certified backup solutions from Dell PowerProtect and Cohesity.
  • Lifecycle Management Enhancements: VMware Cloud Foundation allows skipping versions during upgrade to minimize the number of upgrades applied and time consumed in upgrading. Skip-level upgrade is managed using SDDC Manager and the public API.
  • Improved pNIC/vDS support: VI Workload domains can have hosts with multiple pNICs and vSphere Distributed Switches (vDS) that can scale up to the vSphere maximums supported in the vSphere version included in the BOM.
  • Support for XLarge form factor for Edge nodes: You can now use SDDC Manager to create an edge cluster with the XLarge form factor for edge nodes in the Management and VI workload domains.
  • Localization: SDDC Manager includes localization support for the following languages: German, Japanese, Chinese, French, and Spanish. Customers can navigate the SDDC Manager UI in those languages.
  • Inclusive terminology: As part of a company-wide effort to remove instances of non-inclusive language in our products, the VMware Cloud Foundation team has made changes to some of the terms used in the product UI and documentation.
  • New License for vSphere with Tanzu: vSphere with Tanzu has its own license key, separate from vSphere 7.0. This is a subscription-based license with a term limit. 
  • Start up and shut down order guidance: Start up and shut down order guidance for VMware Cloud Foundation is now available, enabling you to gracefully shut down and start up the SDDC components in a prescriptive order.
  • Voluntary Product Accessibility Template (VPAT) report: The VPAT evaluates compliance with accessibility guidelines as put forward by the US government (under Section 508) and the EU government (under EN 301 549).  See https://www.vmware.com/help/accessibility.html.

As you can see, there is a ton of new functionality, so I will try to write separate articles to cover some of it. For now I’m going to focus on all the steps required to perform the LCM update.

4.1 LCM Updates

  1. SDDC Manager 4.1.0.0 (bundle ID 5851343d-a1bb-494b-b9d2-306f9382327d)
    • from version 4.0.1.1 build 16660200 to 4.1.0.0 build 16961769
  2. SDDC Manager 4.1.0.0 Configuration Drift (bundle ID 556e99b8-6863-4038-b9f2-616f6d97c8dd)
    • Configuration drift, same version as 1
  3. NSX-T 3.0.2 (bundle ID 4cc5f8fb-bc5f-4fe9-9ce9-bfd3795a44dc)
    In the following components order:
    • NSX-T Upgrade Coordinator from 3.0.1.0.0.16404476 to 3.0.2.0.0-16887200
    • EDGE transport nodes cluster from 3.0.1.0.0.16404482 to 3.0.2.0.0-16887200
    • HOST transport nodes cluster from 3.0.1.0.0.16404614 to 3.0.2.0.0-16887200
    • NSX-T Manager from 3.0.1.0.0.16404613 to 3.0.2.0.0-16887200
  4. vCenter Server 7.0 Update 1 (bundle ID ff21d28a-cb1f-447e-91ec-b2861cd43fd8)
    • from version 7.0.0.10600-16620007 to 7.0.1.00000-16860138
  5. ESXi 7.0 Update 1 (bundle ID a2a2af3a-88f4-4f85-be57-92608e4846a9)
    • from version 7.0.0-16324942 to version 7.0.1-16850804
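As a side note, the list of bundles above, including their bundle IDs, can also be retrieved from the SDDC Manager public API. Here is a quick sketch as a VS Code REST Client request; the host name is the one from my lab, and {{accessToken}} is a placeholder you fill with a valid bearer token:

```http
### List the LCM bundles known to SDDC Manager, with their bundle IDs
GET https://sddc-manager-2.vcf-s1.vlabs.local/v1/bundles
Authorization: Bearer {{accessToken}}
```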

1. SDDC Manager 4.1.0.0

As usual, the first step is to update the SDDC Manager services

Once you have updated to 4.1, you will be presented with the following banner message at the top:

Local account is not configured. Refer to Cloud Foundation documentation for more information

This is the new local account that allows users to perform API calls (note that it doesn’t allow SDDC Manager GUI logins) when vCenter Server is down. Because I’m doing a brownfield upgrade (and not a greenfield 4.1 deployment), this local account is not set, so we need to set its password, which, at the moment, is only possible using the public APIs.
Let’s do it

I’m using VS Code and the REST Client extension (shame on you if you don’t know this extension!!) so I can run test API calls directly from my favourite code IDE.

First, I’m checking whether the local admin exists, but for that I require a bearer token. Let’s leverage PowerVCF to get one:

Request-VCFToken -fqdn sddc-manager-2.vcf-s1.vlabs.local -username "administrator@vsphere.local" -password "VMware1!"

The global variable $Global:accessToken holds the bearer token string that we need for our next manual API call.

Next, I’m doing a GET v1/users/local/admin to check the status of the admin@local account.

As you can see, the account is not configured, so we need to set its password using PATCH /v1/users/local/admin. A successful call returns a 204 code, which means success with no content in the response.

Running GET v1/users/local/admin again now returns true, so we’re good and the banner is gone.
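For reference, the whole local account flow above can be captured in a single REST Client .http file. This is only a sketch: {{accessToken}} is a placeholder for the token obtained above, and the PATCH payload shape is my assumption, so double-check it against the VCF API reference:

```http
### Get a bearer token (what Request-VCFToken does under the hood)
POST https://sddc-manager-2.vcf-s1.vlabs.local/v1/tokens
Content-Type: application/json

{ "username": "administrator@vsphere.local", "password": "VMware1!" }

### Check whether the admin@local account is configured
GET https://sddc-manager-2.vcf-s1.vlabs.local/v1/users/local/admin
Authorization: Bearer {{accessToken}}

### Set the local account password (expect HTTP 204 on success)
PATCH https://sddc-manager-2.vcf-s1.vlabs.local/v1/users/local/admin
Content-Type: application/json
Authorization: Bearer {{accessToken}}

{ "password": "SuperSecret1!" }
```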

Because we just upgraded all the SDDC services again (and I’m running a nested environment), the vSAN HCL check will fail, so we need to configure application-prod.properties again so that LCM skips the vSAN HCL checks.


The file we need to edit is /opt/vmware/vcf/lcm/lcm-app/conf/application-prod.properties, and once the change is in place, restart the LCM service using
systemctl restart lcm

2. SDDC Manager 4.1.0.0 Configuration Drift

Next, as usual, comes the configuration drift bundle, which is very quick.

3. NSX-T 3.0.2

Moving on, it’s NSX-T upgrade time. This is usually the most tricky part where, based on my experience from the field, we see most of the hiccups.

Update running …

…but not for long: it failed at the edge upgrade. Try #2 kept failing too, and the UI message always points to lcm.log,

so the error is

ERROR [vcf_lcm,96cab1249e4abd7c,04d8] [c.v.evo.sddc.lcm.orch.Orchestrator,pool-6-thread-10] Found an NSX-T parallel cluster upgrade element and is in failed/timeout/invalid state

ERROR [vcf_lcm,137517cd9c9706ff,9ed0,upgradeId=732fc0fc-70bc-4264-9123-ce27ca7559fc,resourceType=NSX_T_PARALLEL_CLUSTER,resourceId=nsxt2.vcf-s1.vlabs.local:_ParallelClusterUpgradeElement,bundleElementId=5a3bff6e-0466-4364-b9bb-242d1c1bcad2] [c.v.e.s.l.p.i.n.NsxtParallelClusterPrimitiveImpl,ThreadPoolTaskExecutor-10] All upgrade elements of type NSX_T_EDGE are NOT COMPLETED_WITH_SUCCESS, thus we cannot proceed to upgrade next batch of type NSX_T_HOSTCLUSTER

ERROR [vcf_lcm,137517cd9c9706ff,9ed0,upgradeId=732fc0fc-70bc-4264-9123-ce27ca7559fc,resourceType=NSX_T_PARALLEL_CLUSTER,resourceId=nsxt2.vcf-s1.vlabs.local:_ParallelClusterUpgradeElement,bundleElementId=5a3bff6e-0466-4364-b9bb-242d1c1bcad2] [c.v.e.s.l.p.i.n.s.NsxtEdgeClusterParallelUpgradeStageRunner,ThreadPoolTaskExecutor-10] upgrade error for resource nsxt2.vcf-s1.vlabs.local:acd34513-6b6b-4d1b-8d96-de4d2bab260c : { "errorType": "RECOVERABLE", "stage": "NSX_T_UPGRADE_STAGE_EDGE_PRECHECK", "errorCode": "NSXT_EDGE_CLUSTER_UPGRADE_FAILED_PRECHECK", "errorDescription": "Check for open alarms on edge node.: [Edge node fe71778c-fb1c-4506-ae73-d33bc5979f0c has 1 open alarm(s) present. Kindly resolve the open alarm(s) before proceeding with the upgrade.]: edge-tn05", "metadata": "Check for errors in the LCM log files at {LCM_HOST_ADDRESS}:{LCM_LOG_LOCATION}, and address those errors. Please run the upgrade precheck and restart the upgrade.", "metadataCodes": [ "NSXT_EDGE_CLUSTER_UPGRADE_FAILED_PRECHECK.remedy" ], "metadataAttributes": { "LCM_LOG_LOCATION": "/var/log/vmware/vcf/lcm", "LCM_HOST_ADDRESS": "127.0.0.1" } }

So something is wrong with my edge-tn05. It turns out that LCM checks for any raised alarms that are not in a RESOLVED state; in my case I had a few in state OPEN on the edge transport nodes, the infamous “Edge NIC Out Of Receive Buffer” (see my article NSX-T Edge NIC Out of Receive Buffer), which, funny enough, is also the reason why I am upgrading to 3.0.2!

I had to forcefully suppress this alarm (for 1 hour) on both edge transport nodes in the cluster.

By the way, you can see the same message if you go to NSX-T Manager > Upgrade and manually run the same pre-checks.

That was it: on try #3 the NSX-T upgrade finally started. To better monitor the progress, head over to the NSX-T Manager GUI and select System > Upgrade > Edges.

If you SSH into an edge transport node, you can see even more detailed steps using the following command:

get upgrade progress-status

Edges done, on to the hosts.

One of the hosts got stuck with the following message:

"<truncated error message>.... line 851, in get_data MemoryError It is not safe to continue. Please reboot the host immediately to discard the unfinished update. Please refer to the log file for more details..." 

I had to manually resolve one of the hosts (don’t ask me why).

Eventually hosts completed, and lastly the NSX-T Manager was upgraded.

4. vCenter Server 7.0 Update 1

PRO TIP for labs: LCM takes a snapshot (with memory) of the vCenter Server before applying the update. In my case the management vCenter has 20 GB of vRAM, and this operation took a very long time (even though the underlying physical ESXi is running on SSDs), causing my connection to vCenter Server to stop responding during the snapshot operation. Luckily there’s an easy workaround: enable snapshot skip in the LCM configuration file and take a manual snapshot (without memory) before applying the bundle.

Apply the following change and restart the LCM service: in

/opt/vmware/vcf/lcm/lcm-app/conf/application-prod.properties

set the configuration option

lcm.vc.primitive.snapshot.skip

and then run

systemctl restart lcm
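If you prefer to script the edit, here is a minimal sketch of a helper that sets a key in a Java-style .properties file, rewriting the line if the key is present and appending it otherwise. The helper and the value true are my own assumptions, not something shipped with VCF, so verify the expected value before using it on a real SDDC Manager:

```shell
#!/bin/sh
# set_prop FILE KEY VALUE: set KEY=VALUE in a .properties file,
# rewriting the existing line if the key is present, appending otherwise.
set_prop() {
    file=$1 key=$2 value=$3
    if grep -q "^${key}=" "$file" 2>/dev/null; then
        # key already present: rewrite its value in place
        sed -i "s|^${key}=.*|${key}=${value}|" "$file"
    else
        # key absent: append it
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Demo on a scratch file; on the SDDC Manager VM the target would be
# /opt/vmware/vcf/lcm/lcm-app/conf/application-prod.properties
props=$(mktemp)
printf 'some.existing.key=1\n' > "$props"   # dummy pre-existing entry
set_prop "$props" lcm.vc.primitive.snapshot.skip true
cat "$props"
```

On the SDDC Manager VM you would point it at the real properties file and follow up with systemctl restart lcm, as above.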

Upon completion, I get the following warning on vCenter Server 7.0 Update 1

vSphere DRS functionality was impacted due to unhealthy state vSphere Cluster Services caused by the unavailability of vSphere Cluster Service VMs. vSphere Cluster Service VMs are required to maintain the health of vSphere DRS.

vSphere Cluster Services (vCLS) is a new feature introduced with vCenter Server 7.0 Update 1; see VMware KB 80472, vSphere Cluster Services (vCLS) in vSphere 7.0 Update 1.

My environment is completely missing the vCLS service virtual machines, and I don’t know why. I suppose something didn’t go smoothly during the vCenter upgrade… mmm….

According to VMware KB 79892, this seems to be a known bug/issue, quoting:

This is a known issue affecting VMware vCenter Server 7.0 Update 1. Currently, there is no resolution.

https://kb.vmware.com/s/article/79892

In addition

vCLS VMs will automatically be powered on or recreated by vCLS service. These VMs are deployed prior to any workload VMs that are deployed in a green field/fresh deployment. In an upgrade scenario, these VMs are deployed before vSphere DRS is configured to run on the clusters. When all the vCLS VMs are powered-off or deleted, the vSphere Cluster status for that cluster will turn to ‘Degraded (Yellow)‘. vSphere DRS needs one of the vCLS VMs to be running in a vSphere cluster to be healthy. If DRS runs prior to these VMs are brought back up, then the cluster service will be ‘Unhealthy (Red)‘, until the time vCLS VMs are brought back up. 

Oh well… the next LCM bundle is going to be interesting then…

5. ESXi 7.0 Update 1

I gave the last 4.1 LCM bundle, ESXi 7.0 Update 1, a try.

Without much surprise, it failed right away, complaining about being unable to dry-run maintenance mode on the hosts that had running virtual machines

I suppose the fact that vCLS isn’t running is likely the root cause. I also checked the advanced settings on vCenter: config.vcls.clusters.domain-c<number>.enabled was not present; I tried adding it manually, but it didn’t help.


I read all of the following articles with much attention, but I couldn’t figure out how to get vCLS working on my cluster. Meh 😐 Too bad I won’t be able to finish this 4.1 LCM update article in full.
It does sound to me like this vCLS “v1” release is at a very early phase and needs a bit of refinement.

If you encountered this problem and know how to fix it (besides opening a GSS ticket) leave a comment below! Cheers

