I recently ran into a weird problem with an NSX 6.2 Cross-VC setup.
Each site appeared to be correctly configured: one NSX Manager set to Primary on Site1 and one set to Secondary on Site2. Universal logical switches were syncing correctly from Site1 to Site2. Each site had one Resource cluster and one Edge cluster, and L3 routing was in place between the sites with no firewall. Everything seemed to be OK, except for the following:
None of the hosts on Site2 could establish communication with the Universal Controllers on Site1. The Communication Channel Health status was as follows:
From ESXi, checking the connection list, I could see the Distributed Firewall connection established, but there was no trace of the Controller connections (port 1234).
Moreover, the file /etc/vmware/netcpa/config-by-vsm.xml was missing all of the Controllers' SSL thumbprints.
A healthy config-by-vsm.xml would have one <connection></connection> entry for each controller. The netcpa SSL key, stored in /etc/vmware/ssl/rui-for-netcpa.key, was also empty.
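As a rough way to script that check, the sketch below counts the <connection> entries in the file contents and verifies that each one carries a thumbprint. The child element names (server, thumbprint) and the default of three controllers are assumptions for illustration; adjust them to whatever a healthy host in your environment actually shows.

```python
import xml.etree.ElementTree as ET


def controller_entries(xml_text):
    """Return (server, thumbprint) pairs, one per <connection> element.

    Assumption: each <connection> has <server> and <thumbprint> children.
    Missing children are returned as empty strings.
    """
    root = ET.fromstring(xml_text)
    entries = []
    for conn in root.iter("connection"):
        server = (conn.findtext("server") or "").strip()
        thumbprint = (conn.findtext("thumbprint") or "").strip()
        entries.append((server, thumbprint))
    return entries


def looks_healthy(xml_text, expected_controllers=3):
    """True if there is one entry per controller and none lacks a thumbprint."""
    entries = controller_entries(xml_text)
    return (len(entries) == expected_controllers
            and all(thumbprint for _, thumbprint in entries))
```

On a host you could feed it the file contents, for example looks_healthy(open("/etc/vmware/netcpa/config-by-vsm.xml").read()). In the broken state described above it would return False, because no <connection> entries are present.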
A force sync didn’t solve anything. I was pretty much stuck at this point and had run out of ideas. After consulting a friend on Slack, I was pointed in the right direction when he asked me, “Did you clone NSX Manager from a template, by any chance?” Oops, yes I did! Long story short, the cause of all my problems was a duplicated NSX Manager UUID, and shame on me for not RTFM.
This is in fact very clearly highlighted in the Cross-vCenter Installation Guide. The following excerpt is from https://pubs.vmware.com/NSX-62/topic/com.vmware.nsx-cross-vcenter-install.doc/GUID-CFB0DC96-C329-490E-B2A9-D92C5704E853.html:
In cross-vCenter NSX installations, make sure that each NSX Manager has a unique UUID. NSX Manager instances deployed from OVA files have unique UUIDs. An NSX Manager deployed from a template (as in when you convert a virtual machine to a template) will have the same UUID as the original NSX Manager used to create the template, and these two NSX Managers cannot be used in the same cross-vCenter NSX installation
But as this was my dev lab running within a vCD vApp, I did not deploy the second NSX Manager from OVA; instead, I copied the first one.
At this point there are 2 options:
- Undo all the NSX configuration at Site2 and redeploy the Secondary NSX Manager from OVA
- Change the NSX Manager UUID on Site2
Option 1 is time-consuming, and for Option 2 you would need GSS on the phone because it involves getting privileged CLI access to NSX Manager and running some SQL commands.
This can only be done by VMware Support, so purely for the sake of documentation I’m going to show you how I fixed the problem.
NOTE: please don’t post comments asking for the password, as it is not distributed to the public and you would need GSS/VMware staff to do this.
From the Primary NSX Manager, check the NSX Manager UUID, stored under the context ‘VsmUuidContext’:
[root@nsxmgrm-01a ~]
secureall=# select * from key_value_store where context='VsmUuidContext';
 id |    context     |    name    |                value
----+----------------+------------+--------------------------------------
  5 | VsmUuidContext | VsmUuidKey | 423EF175-FE21-0647-9F57-CC2D33ADC960
(1 row)
Doing the same on the Secondary NSX Manager:
[root@nsxmgrm-02b ~]
secureall=# select * from key_value_store where context='VsmUuidContext';
 id |    context     |    name    |                value
----+----------------+------------+--------------------------------------
  5 | VsmUuidContext | VsmUuidKey | 423EF175-FE21-0647-9F57-CC2D33ADC960
(1 row)
As you can see, the values are identical. Run the following query to update the VsmUuidKey on the Secondary NSX Manager:
[root@nsxmgrm-02b ~] update key_value_store set value='423EF175-FE21-0647-9F57-CC2D33ADC961' where context='VsmUuidContext';
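In my case I simply changed the last hex digit of the value. If you would rather not hand-edit it, the sketch below pulls the UUID out of the query output, generates a random replacement that differs from the ones already in use, and prints the matching update statement. The helper names are hypothetical, and the assumption that NSX Manager accepts any well-formed upper-case UUID here is mine, not something I verified.

```python
import re
import uuid

# Standard 8-4-4-4-12 hex layout, as shown in the query output.
UUID_RE = re.compile(
    r"\b[0-9A-Fa-f]{8}(?:-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}\b")


def extract_uuid(psql_output):
    """Return the first UUID found in the pasted query output, or None."""
    match = UUID_RE.search(psql_output)
    return match.group(0).upper() if match else None


def new_unique_uuid(existing):
    """Generate a random upper-case UUID that is not in `existing`."""
    taken = {value.upper() for value in existing}
    while True:
        candidate = str(uuid.uuid4()).upper()
        if candidate not in taken:
            return candidate


def update_statement(new_value):
    """Build the same update query used above, refusing malformed values."""
    if not UUID_RE.fullmatch(new_value):
        raise ValueError("not a well-formed UUID: %r" % new_value)
    return ("update key_value_store set value='%s' "
            "where context='VsmUuidContext';" % new_value)


if __name__ == "__main__":
    row = "5 | VsmUuidContext | VsmUuidKey | 423EF175-FE21-0647-9F57-CC2D33ADC960"
    primary = extract_uuid(row)
    print(update_statement(new_unique_uuid([primary])))
```

The point of generating the value is only to avoid accidentally picking one that is already used by another NSX Manager in the same cross-vCenter installation.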
Reboot NSX Manager, wait for it to become ready, and then troubleshoot the environment again. From one of the hosts, check that the controller connection list is now populated (looks better!).
And let’s check the config-by-vsm.xml file again:
As you can see, three controller thumbprints are now installed. Lastly, checking the Communication Channel Health will show the Control Plane Agent to Controller connection as UP.