DHCP Design thoughts with NSX VTEPs

I recently had an interesting discussion about design considerations (and implications) for NSX VXLAN Tunnel Endpoints (VTEPs) when using DHCP for IP address allocation.

Beginning with VMware Cloud Foundation 3.0, one of the required external services is a DHCP server that provides automated IP address allocation for VXLAN tunnel endpoints (VTEPs).
This is in line with the VMware Validated Design (VVD) physical network design decision SDDC-PHY-NET-005, quoted here:

Assign static IP addresses to all management components in the SDDC infrastructure except for NSX VTEPs. NSX VTEPs are assigned by using a DHCP server. Set the lease duration for the VTEP DHCP scope to at least 7 days.
NSX VTEPs do not have an administrative endpoint. As a result, they can use DHCP for automatic IP address assignment. You are also unable to assign directly a static IP address to the VMkernel port of an NSX VTEP. IP pools are an option but the NSX administrator must create them. If you must change or expand the subnet, changing the DHCP scope is simpler than creating an IP pool and assigning it to the ESXi hosts.

Pets vs Cattle

Now, I know many of you reading this don’t like using DHCP for management addressing because you want to follow a particular numbering sequence and know which address is assigned to each node. But wouldn’t it be easier if you didn’t have to worry about who gets what address? Here’s an OCD example (your servers are pets):

host-1: esxi-management=192.168.110.1, VTEP=192.168.120.1
host-2: esxi-management=192.168.110.2, VTEP=192.168.120.2
host-3: esxi-management=192.168.110.3, VTEP=192.168.120.3
….
host-10: esxi-management=192.168.110.10, VTEP=192.168.120.10

Here’s a more “cloudy” approach (your servers are cattle):

host-1: esxi-management=192.168.110.4, VTEP=192.168.120.8
host-2: esxi-management=192.168.110.6, VTEP=192.168.120.7
host-3: esxi-management=192.168.110.9, VTEP=192.168.120.2
….
host-10: esxi-management=192.168.110.23, VTEP=192.168.120.98

Yes, I agree it is a little more confusing and perhaps messy to see host-2 with a management address of 192.168.110.6 and a VTEP address of 192.168.120.7. The question I’d ask is: what’s the effort involved in managing your servers as pets vs cattle? In my opinion, a fixed numbering scheme takes time and effort to maintain, and if we can have one less component to worry about in the management stack, why not just switch to DHCP? At the end of the day, if your argument is “I can ssh directly to host-2 because I know it’s 192.168.110.2”, well, guess what? That’s what DNS and FQDNs are for, right? That said, I fully understand that companies might have legacy policies or constraints they have to adhere to, which may exclude this option even from early design discussions.

The scene is set, so let’s discuss some design considerations!

DHCP Rebinding Timers

After the client receives an IP address (time T0), it tries to renew the lease at 50% of the lease length (timer T1). At that point it enters the RENEWING state and contacts the server that granted the lease. If it cannot renew, it keeps working with its current address until the second timer (T2) fires at 87.5% (7/8ths) of the lease length. At that point it enters the REBINDING state and will accept a lease extension from any DHCP server.

Source: http://www.tcpipguide.com/free/t_DHCPLeaseLifeCycleOverviewAllocationReallocationRe.htm

If the client is still unable to extend the lease after T2, the lease eventually expires and the interface loses its address. What I mean is that the network card will, after several retries, get a link-local IP address (for example 169.254.248.42/16). Microsoft refers to this address auto-configuration method as Automatic Private IP Addressing (APIPA).
Source: https://www.ietf.org/rfc/rfc2131.txt

For example, if you set a lease of 24 hours the first renew timer (T1) will occur at 12 hours and the second re-binding timer (T2, in case T1 is unsuccessful) will occur at 21 hours.
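The same arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name `dhcp_timers` is made up for this example, and the 50% and 87.5% fractions are the defaults described above (a DHCP server may be configured to hand out different T1/T2 values).

```python
from datetime import timedelta


def dhcp_timers(lease, t1_fraction=0.5, t2_fraction=0.875):
    """Return (T1, T2) as offsets from the moment the lease was granted.

    The default fractions are the ones described in the text; they are
    assumptions here, since a server may override them.
    """
    if not 0 < t1_fraction < t2_fraction < 1:
        raise ValueError("expected 0 < T1 fraction < T2 fraction < 1")
    return lease * t1_fraction, lease * t2_fraction


# The 24-hour lease from the example above.
t1, t2 = dhcp_timers(timedelta(hours=24))
print(t1, t2)  # 12:00:00 21:00:00

# The 7-day minimum lease from the VVD design decision.
t1, t2 = dhcp_timers(timedelta(days=7))
print(t1)  # 3 days, 12:00:00
print(t2)  # 6 days, 3:00:00
```

With the 7-day lease recommended by the VVD, a VTEP first tries to renew after three and a half days and starts rebinding after six days and three hours.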

All cool but why are you telling me all this?

The topic being discussed is tunnel endpoints getting their IP addresses from a DHCP server, so it’s very important to understand how the service works because the VTEPs form the overlay network carrying the workload traffic.
If a VTEP goes down you basically lose the forwarding plane on that host, so any workload running on it will be unable to communicate (until it is either migrated or failed over to a different host).

Let’s look at a few scenarios and what can happen if the DHCP server for the VXLAN network is down.

1) Host reboot with DHCP down

The host will be unable to get an IP address for the VXLAN subnet, so the VTEP interface will get a link-local IP address and will not communicate with the other endpoints. In this scenario workloads will probably not even have had time to be migrated over to the problematic host, so chances are they will not experience any downtime.

2) Host is alive, DHCP goes down

In this scenario the host is alive and already forwarding traffic in the VXLAN overlay fabric. At some point the DHCP server becomes unavailable. So far so good: the host will continue to function until the lease expires (following the T1 – T2 flow I previously described). Once the lease has expired the VTEP will self-assign a link-local IP address and the virtual machines running on that host will start experiencing network problems.
Now, I wrote that a (single) host is alive, but usually you have a cluster of many hosts. Are you getting the picture now? All your VTEPs will get a link-local IP address, so they might still be able to talk to each other (east-west traffic) as long as they share the same L2 segment, since they are all in the 169.254.0.0/16 network, but they will be unable to egress traffic via their L3 transport network gateway.
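To make this scenario a bit more concrete, here is a small Python sketch with two helpers. Both are hypothetical illustrations, not part of any NSX or vSphere tooling: `link_local_vteps` flags hosts whose VTEP has fallen back to a 169.254.0.0/16 address, given a host-to-address mapping you would have to collect yourself, and `outage_window` estimates how long a bound client keeps its address after the DHCP server disappears, assuming it renewed successfully at T1 (50% of the lease) up to that point.

```python
import ipaddress
from datetime import timedelta


def link_local_vteps(vteps):
    """Return the hosts (sorted) whose VTEP address is link-local.

    `vteps` maps a host name to its VTEP IP address as a string. How that
    inventory is gathered is outside the scope of this sketch.
    """
    return sorted(
        host
        for host, addr in vteps.items()
        if ipaddress.ip_address(addr).is_link_local
    )


def outage_window(lease, t1_fraction=0.5):
    """Return (shortest, longest) time a client keeps its address once the
    DHCP server is gone, assuming every earlier renewal at T1 succeeded.

    Worst case: the server dies just before the client's T1 renewal.
    Best case: the server dies just after a successful renewal.
    """
    return lease * (1 - t1_fraction), lease


# Made-up inventory: host-2 has already lost its lease.
inventory = {
    "host-1": "192.168.120.8",
    "host-2": "169.254.248.42",
    "host-3": "192.168.120.2",
}
print(link_local_vteps(inventory))  # ['host-2']

shortest, longest = outage_window(timedelta(days=7))
print(shortest)  # 3 days, 12:00:00
print(longest)   # 7 days, 0:00:00
```

Under these assumptions, a 7-day lease leaves somewhere between three and a half and seven days to bring the DHCP service back before the VTEPs start losing their addresses.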

Design DHCP for High Availability

By now you should have a good understanding of how critical a DHCP server is in the context of tunnel endpoints, so we need to ensure the service is highly available. A common way to achieve this if you’re using Windows Server is to configure DHCP Failover. You could also split the scope between two DHCP servers and have them both serve client requests. However, explaining how to implement this is out of scope for this article, so if you are interested you can follow the official Microsoft documentation: https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2012-r2-and-2012/dn338974(v%3Dws.11)
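As a planning aid for the split-scope option, the sketch below divides one address range into two non-overlapping pools, one per DHCP server. It does not configure any DHCP server. The function name `split_scope`, the example range and the 50/50 default are all assumptions made for illustration; pick a ratio that leaves each server enough addresses for your VTEPs.

```python
import ipaddress


def split_scope(first, last, primary_share=0.5):
    """Split the inclusive range first..last into two non-overlapping pools.

    Returns ((primary_first, primary_last), (secondary_first, secondary_last)).
    `primary_share` is the fraction of addresses given to the first server.
    """
    start = int(ipaddress.IPv4Address(first))
    end = int(ipaddress.IPv4Address(last))
    if start >= end:
        raise ValueError("range must contain at least two addresses")
    if not 0 < primary_share < 1:
        raise ValueError("primary_share must be between 0 and 1")

    total = end - start + 1
    # Keep at least one address in each pool.
    primary_count = max(1, min(total - 1, round(total * primary_share)))
    cut = start + primary_count  # first address of the secondary pool

    def fmt(value):
        return str(ipaddress.IPv4Address(value))

    return (fmt(start), fmt(cut - 1)), (fmt(cut), fmt(end))


# Hypothetical VTEP range of 100 addresses, split evenly.
primary, secondary = split_scope("192.168.120.10", "192.168.120.109")
print(primary)    # ('192.168.120.10', '192.168.120.59')
print(secondary)  # ('192.168.120.60', '192.168.120.109')
```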

DHCP over WAN

What are you talking about? DHCP over a WAN? I have personally never come across this but, believe it or not, the idea for this article came after I spoke to a friend at VMware who was dealing with an actual customer running DHCP services across a WAN link… so nothing surprises me anymore. Unless you have a super strong design justification I would advise you to refrain from doing that; but again, customers do have strange constraints sometimes.

blog.bertello.org

Cloud, SDDC, SDN and software defined things
