Scenario: NSX-T 3.0.1, with the edge VMs running in a nested VCF environment. Full disclosure: this might not even happen in a non-nested environment, but hey, let’s troubleshoot this problem, shall we?
The number of alert occurrences is very high, and it is happening on both VM edge transport nodes, edge-tn04 and edge-tn05.
The alert is defined as follows:
Full alert description is:
Description: Edge NIC fp-eth1 receive ring buffer has overflowed by 78.019463% on Edge node fe71778c-fb1c-4506-ae73-d33bc5979f0c.
Recommended Action: Invoke the NSX CLI command `get dataplane` and check
- if pps and cpu usage is high and check rx ring size using `get dataplane ring-size rx`
- If pps and cpu is high and rx ring size is low, invoke `set dataplane ring-size rx <ring-size>` i,e set <ring-size> to a high value to accommodate incoming packets
- If the above condition is not satisfied, i.e. ring size is high and yet CPU usage is high, then this could be due to dataplane processing overhead delay.
I found this VMware KB, https://kb.vmware.com/s/article/80233, titled “NSX Manager reports alarms for Edge Node ‘transmit ring buffer has overflowed’ with an overflow percentage lower than 0.1%”.
However, that’s not really my case: the percentage on the alert is extremely high, close to 80%. Yet upon inspection the CPU stats do not look that bad, and I could only see 40 packets per second on RX.
Still, I went ahead and changed the RX ring size to 2000 (as per the alert recommendation) and restarted the dataplane:
restart service dataplane
Next, I checked the port details from ESXi using the following commands.
Get the detailed stats for the edge-tn04 eth1 port:
vsish -e get /net/portsets/DvsPortset-0/ports/67108881/clientStats
The droppedRx count is over 20 million, an insane amount! But we don’t know why the frames were dropped.
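To put a raw counter like droppedRx in perspective, it helps to relate it to the frames that were actually delivered. Below is a minimal Python sketch, assuming the vsish output is a list of `name:value` lines and that `pktsRxOK` is the matching "delivered" counter. The sample text and its numbers are made up for illustration and are not the real output from my host, and this is not necessarily how NSX computes its overflow percentage.

```python
import re


def parse_counters(text):
    """Parse 'name:value' lines (the assumed vsish layout) into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\w+):\s*(\d+)\s*$", line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def rx_drop_ratio(counters):
    """Share of incoming frames that were dropped, between 0.0 and 1.0."""
    dropped = counters.get("droppedRx", 0)
    delivered = counters.get("pktsRxOK", 0)
    total = dropped + delivered
    return dropped / total if total else 0.0


# Illustrative numbers only, NOT taken from the real edge node.
SAMPLE = """
port client stats {
   pktsTxOK:1000
   pktsRxOK:80000000
   droppedRx:20000000
}
"""

stats = parse_counters(SAMPLE)
print(f"droppedRx={stats['droppedRx']} ratio={rx_drop_ratio(stats):.1%}")
```

Keep in mind these counters are cumulative since the port was created, so a big number only tells you that drops happened at some point, not that they are still happening.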
Getting the stats for the vmxnet3 adapter behind the eth1 port on the vDS tells us that many frames are dropped because the adapter is running out of receive buffers.
I increased the RX ring buffer even further, to 4096, and the stats do look better now.
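Since the port counters are cumulative, "better" is easier to judge by taking two snapshots a known interval apart and looking only at the difference. Here is a rough Python sketch of that comparison, assuming you already have the `pktsRxOK` and `droppedRx` values from each snapshot. The helper name and the numbers are hypothetical; the 2,400 frames in 60 seconds simply mirror the roughly 40 packets per second I saw on RX.

```python
def rx_drop_delta(before, after, interval_seconds):
    """Compare two cumulative counter snapshots taken interval_seconds apart."""
    new_drops = after["droppedRx"] - before["droppedRx"]
    new_ok = after["pktsRxOK"] - before["pktsRxOK"]
    total = new_drops + new_ok
    return {
        "drops_per_second": new_drops / interval_seconds,
        "drop_ratio": new_drops / total if total else 0.0,
    }


# Hypothetical snapshots taken 60 seconds apart, after the ring size change.
before = {"pktsRxOK": 80_000_000, "droppedRx": 20_000_000}
after = {"pktsRxOK": 80_002_400, "droppedRx": 20_000_000}

result = rx_drop_delta(before, after, 60)
print(result)
```

If `drops_per_second` stays at zero across several intervals, the ring size change did stop the drops at the vDS port, even though the total droppedRx count still looks scary.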
But… the alerts kept on coming.
So what’s next to try on this issue? My guess is that it is a bug in 3.0.1, so I’m going to update to NSX-T 3.0.2, which comes with VCF 4.1.