NSX-T Edge NIC Out Of Receive Buffer

Scenario: NSX-T 3.0.1, with edge VMs running in a nested VCF environment. Full disclosure: this might not even happen in a non-nested environment, but hey, let’s troubleshoot this problem, shall we?

The number of occurrences is very high, and it is happening on both edge VM transport nodes, edge-tn04 and edge-tn05.

The alert is defined as follows. The full alert description is:

Description: Edge NIC fp-eth1 receive ring buffer has overflowed by 78.019463% on Edge node fe71778c-fb1c-4506-ae73-d33bc5979f0c.
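When the same alarm fires many times on several nodes, it can help to pull the NIC, the percentage and the node UUID out of each description and compare them. The following is a minimal sketch, not an NSX API: the regular expression is modelled only on the one description quoted above, and the `parse_overflow_alert` helper is a hypothetical name. Other NSX-T versions may word the message differently.

```python
import re

# Pattern modelled on the single alarm description quoted above.
# Assumption: "transmit" alarms use the same sentence structure.
ALERT_RE = re.compile(
    r"Edge NIC (?P<nic>\S+) (?P<direction>receive|transmit) ring buffer "
    r"has overflowed by (?P<percent>[\d.]+)% on Edge node (?P<node>[0-9a-f-]+)"
)


def parse_overflow_alert(description):
    """Return NIC, direction, overflow percentage and node UUID, or None."""
    match = ALERT_RE.search(description)
    if match is None:
        return None
    fields = match.groupdict()
    fields["percent"] = float(fields["percent"])
    return fields


alert = (
    "Edge NIC fp-eth1 receive ring buffer has overflowed by 78.019463% "
    "on Edge node fe71778c-fb1c-4506-ae73-d33bc5979f0c."
)
print(parse_overflow_alert(alert))
```

Returning `None` for text that does not match keeps the helper safe to run over a mixed list of alarm descriptions.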

Recommended Action: Invoke the NSX CLI command `get dataplane` and check

  1. if pps and cpu usage is high and check rx ring size using `get dataplane ring-size rx`
  2. If pps and cpu is high and rx ring size is low, invoke `set dataplane ring-size rx <ring-size>`, i.e. set <ring-size> to a high value to accommodate incoming packets
  3. If the above condition is not satisfied, i.e. ring size is high and yet CPU usage is high, then this could be due to dataplane processing overhead delay.
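The three steps above boil down to a small decision tree. Here is a rough sketch of it, with loud assumptions: the alarm text never says what counts as "high" or "low", so the thresholds are passed in by the caller, and the `recommend` function and the numbers in the usage line are hypothetical, for illustration only.

```python
def recommend(pps, cpu_percent, rx_ring_size, *, high_pps, high_cpu, low_ring):
    """Map the alarm's recommended action onto one decision.

    All thresholds are caller-supplied assumptions; the alarm text
    does not define them.
    """
    pps_is_high = pps >= high_pps
    cpu_is_high = cpu_percent >= high_cpu
    ring_is_low = rx_ring_size <= low_ring

    # Step 2: high pps and CPU with a small ring -> grow the ring.
    if pps_is_high and cpu_is_high and ring_is_low:
        return "increase ring: set dataplane ring-size rx <ring-size>"
    # Step 3: ring is already large and CPU is still high.
    if cpu_is_high and not ring_is_low:
        return "possible dataplane processing overhead delay"
    # Nothing in the alarm text covers the remaining combinations.
    return "no condition from the alarm text matches"


# Hypothetical thresholds and CPU figure; only the 40 pps comes from this post.
print(recommend(40, 10.0, 512, high_pps=100_000, high_cpu=80.0, low_ring=1024))
```

With a low packet rate and a CPU that is not busy, the sketch falls through to the last branch, meaning the alarm's own guidance has nothing to say about that combination.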

I found VMware KB 80233 (https://kb.vmware.com/s/article/80233), “NSX Manager reports alarms for Edge Node ‘transmit ring buffer has overflowed’ with an overflow percentage lower than 0.1%”.

However, that’s not really my case: the percentage on my alert is extremely high, close to 80%. Yet upon inspection the CPU stats do not look that bad, and I could only see 40 packets per second on RX.

Still, I went ahead and, per the alert’s recommendation, changed the RX ring size to 2000 and restarted the dataplane:

restart service dataplane

Next, I checked the port details from ESXi. To get the detailed stats for the edge-tn04 eth1 port:

vsish -e get /net/portsets/DvsPortset-0/ports/67108881/clientStats

The droppedRx count is over 20 million, an insane amount! But we don’t know why the frames were dropped.
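To put a raw drop counter in perspective, it can be compared against the frames that were actually received. The following is a rough sketch under stated assumptions: it assumes the `clientStats` output is a list of `name:value` lines, and that a `pktsRxOK` counter sits next to `droppedRx` (only `droppedRx` appears in this post). The sample text uses made-up numbers, not my real output, and this is not necessarily how NSX computes the percentage in its alarm.

```python
def parse_counters(text):
    """Collect `name:value` lines with integer values into a dict."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        value = value.strip()
        if sep and value.isdigit():
            counters[name.strip()] = int(value)
    return counters


def rx_drop_ratio(counters):
    """Fraction of inbound frames dropped; 0.0 when nothing was seen.

    Assumption: `pktsRxOK` counts the frames that were delivered.
    """
    dropped = counters.get("droppedRx", 0)
    received = counters.get("pktsRxOK", 0)
    total = dropped + received
    return dropped / total if total else 0.0


# Made-up sample in the assumed format, for illustration only.
sample = """port client stats {
   pktsRxOK:300
   droppedRx:100
}"""

print(rx_drop_ratio(parse_counters(sample)))
```

Taking two snapshots a minute apart and comparing the counters would show whether drops are still happening or whether the total is mostly historical.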

Getting the stats for the vmxnet3 adapter for the eth1 port on the vDS tells us more: many frames are being dropped because the adapter is running out of receive buffers.

I increased the RX buffer even further, this time to 4096, and the stats do look better now.

But… the alerts kept on coming.

So what’s next to try on this issue? My guess is that it is a bug in 3.0.1, so I’m going to update to NSX-T 3.0.2, which comes with VCF 4.1.
