VXLAN Fabric using EVPN with Cisco Nexus 9000 Switches
I deployed a VXLAN fabric using Cisco’s Nexus 9K switches recently, and started seeking out the best way to do things. I came up with a few questions that need to be answered first, and a configuration that I believe is best to use for most deployments.
The below diagram details a VXLAN fabric deployment.
As you can see, all the VLANs/subnets that are normally configured on switches are placed inside a separate routing table called a tenant VRF. This allows for address separation amongst multiple tenants within the same physical fabric. If only a single tenant uses the fabric, all the traffic processing remains within a tenant VRF.
Traffic within a tenant VRF that needs to pass between physical switches is first encapsulated by a VXLAN header, then forwarded across a Layer 3 ECMP fabric to a spine switch, then the destination leaf switch, and finally decapsulated. Because of this, VXLAN is often referred to as an overlay, and the point-to-point links between spines and leaves can be referred to as an underlay. The physical topology depicted above is known as a spine-and-leaf topology, or Clos network. It’s important to note that leaf nodes are the only devices that directly connect to spines. All other Ethernet devices, regardless of their importance, connect only to leaves.
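To make the encapsulation step a little more concrete, here is a minimal Python sketch that packs the 8-byte VXLAN header described in RFC 7348 and adds up the bytes a leaf puts in front of the original frame. The function name `vxlan_header` is my own, and the overhead figure assumes an IPv4 underlay with no 802.1Q tag on the outer Ethernet header.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # "I" flag from RFC 7348: the VNI field is valid


def vxlan_header(vni: int) -> bytes:
    """Pack the 8-byte VXLAN header for a 24-bit VNI."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) followed by 24 reserved bits.
    # Word 2: VNI (24 bits) followed by 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


# Outer headers added in front of the original frame
# (assumption: IPv4 underlay, untagged outer Ethernet header).
OVERHEAD_BYTES = 14 + 20 + 8 + 8  # Ethernet + IPv4 + UDP + VXLAN

header = vxlan_header(39010)
print(header.hex())
print(OVERHEAD_BYTES)
```

The roughly 50 bytes of overhead are the reason the fabric links need a larger MTU than the tenant-facing interfaces.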
The orange box shows that L3 gateway functionality is distributed across all the leaves. Each leaf can act as a device’s default gateway, and it can route packets to their final destination in a different subnet. Similarly, if a packet is destined to a device in the same subnet, the leaf can forward the packet to the final destination at Layer 2.
One of the advantages of a Clos network is that every leaf is only two hops away from any other leaf. Since the leaves can forward all packets to their final destination at Layer 2 or 3, all packets can traverse the fabric in two hops.
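The two-hop property is easy to check on a toy model. The sketch below builds a small spine-and-leaf graph (two spines and four leaves, names chosen only for illustration) and uses a breadth-first search to measure the longest leaf-to-leaf path.

```python
from collections import deque


def hops(graph: dict, src: str, dst: str) -> int:
    """Return the number of links on the shortest path from src to dst."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nbr in graph[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    raise ValueError("no path")


# Hypothetical fabric: every leaf connects to every spine and to nothing else.
spines = ["spine1", "spine2"]
leaves = ["leaf1", "leaf2", "leaf3", "leaf4"]
fabric = {s: list(leaves) for s in spines}
fabric.update({l: list(spines) for l in leaves})

leaf_pairs = [(a, b) for a in leaves for b in leaves if a != b]
max_hops = max(hops(fabric, a, b) for a, b in leaf_pairs)
print(max_hops)
```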
This raises the question: how do the leaves become so magical? It’s done by treating the spines and leaves the same way a provider treats its MPLS WAN. The PE devices (leaves, in our case) use VRFs to contain all the information relevant to their connected devices. They prepend route distinguishers to the BGP information (MAC addresses, in our case) learned from their customers, and they use a particular BGP address family (l2vpn evpn, in our case) to transmit that information to the core (spine layer). The core acts as a route reflector to ensure all the PE devices receive the information they need. The other PE devices then remove the route distinguishers from the BGP advertisements and populate that information into their route tables (CAM tables). The process in the VXLAN fabric is depicted below.
Keep in mind that the spine switches don’t have to act on the information received from the leaves; they simply pass it on to the other leaves.
With this foundation in place, I’m going to make a few assumptions before digging into the configuration.
- A default route out of the fabric is sufficient, even if learned from multiple sources.
- An IGP will be used within the default VRF to advertise each node’s loopback address to the others.
- The inband (default VRF) addresses of the 9Ks don’t need to be reachable from the rest of the network – the mgmt0 interfaces will be used to manage the 9Ks.
- OSPF or EIGRP will be used to share routing information with the rest of the network. This does not need to be the same as the IGP used in the underlay.
- All the leaves in this single pod will learn about all the MAC addresses from all the other leaves. No filtering is necessary.
I believe this will cover the majority of VXLAN deployments. If these assumptions don’t hold true for you, please leave a comment.
The first step is to reprogram the hardware on the leaves. If you want to use the ARP suppression feature, the TCAM needs to have space allocated for it. That space has to come from somewhere, so you’ll also need to deallocate a different region. If you have bursty traffic, you might want to change the buffer profile as well (mesh is the default). Then the switching mode has to be changed from cut-through to store-and-forward. The switches must be rebooted before the changes take effect.
hardware access-list tcam region vacl 0
hardware access-list tcam region arp-ether 256
hardware qos ns-buffer-profile mesh
switching-mode store-forward
Next, after the physical connections are made, we’ll need to address the point-to-point links between all nodes in the default VRF.
nv overlay evpn
feature ospf
feature bgp
feature pim
feature bfd
!
router ospf 1
  router-id 10.10.0.X
  passive-interface default
!
interface Ethernet1/1-2
  description to Leaf-1
  mtu 9216
  no ip redirects
  ip address 10.10.0.XY/31
  no ipv6 redirects
  ip ospf network point-to-point
  no ip ospf passive-interface
  ip router ospf 1 area 0.0.0.0
  ip ospf bfd
  ip pim sparse-mode
  ip pim bfd-instance
  no shutdown
OSPF point-to-point links are used and PIM sparse-mode is enabled. Both are registered with BFD.
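If you would rather generate the /31 addressing plan than work it out by hand, Python's standard `ipaddress` module can carve the link subnets out of a block. This is only a sketch: the block 10.10.0.16/28 and the convention that the spine takes the even address are assumptions I inferred from the BFD output later in this post, and `p2p_links` is a made-up helper name.

```python
import ipaddress


def p2p_links(block: str, count: int):
    """Yield (spine_ip, leaf_ip, prefix) for `count` /31 links carved from `block`."""
    subnets = ipaddress.ip_network(block).subnets(new_prefix=31)
    for _, net in zip(range(count), subnets):
        # Assumed convention: spine gets the even address, leaf the odd one.
        spine_ip, leaf_ip = net[0], net[1]
        yield str(spine_ip), str(leaf_ip), str(net)


links = list(p2p_links("10.10.0.16/28", 4))
for spine_ip, leaf_ip, prefix in links:
    print(f"{prefix}: spine {spine_ip}, leaf {leaf_ip}")
```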
Each node’s loopback interface is also advertised via OSPF.
interface loopback0
  ip address 10.10.0.X/32
  ip ospf network point-to-point
  ip router ospf 1 area 0.0.0.0
  ip pim sparse-mode
On the spines, we’ll add the RP information. A second loopback address will be created to serve as the RP anycast address, and the ‘anycast-rp’ command will be added so all the spines know to share their information with each other.
int loopback2
  ip address 10.10.0.0/32
  ip ospf network point-to-point
  ip router ospf 1 area 0.0.0.0
  ip pim sparse-mode
!
ip pim rp-address 10.10.0.0
ip pim anycast-rp 10.10.0.0 10.10.0.1
ip pim anycast-rp 10.10.0.0 10.10.0.2
On the leaves, we just need to define the RP address. Since loopback2 is being advertised from all the spines, there should be multiple paths to the RP address.
ip pim rp-address 10.10.0.0
Now that each node has full reachability to each other, BGP can be established. Loopback addresses are used so that BGP adjacencies stay up in the event of a link failure, and only one entry is needed per neighbor regardless of how many inter-switch links are used. The ‘l2vpn evpn’ address family is used amongst all spines and leaves. Note that I’m using iBGP here since I’m assuming all the MACs should be learned by all the leaves. If you want to do some more complex filtering of MAC address information, eBGP could be used instead.
Below is the spine configuration.
router bgp 65001
  router-id 10.10.0.X
  log-neighbor-changes
  address-family l2vpn evpn
    retain route-target all
  template peer vtep-peer
    bfd
    remote-as 65001
    update-source loopback0
    address-family l2vpn evpn
      send-community both
      route-reflector-client
  neighbor 10.10.0.X
    inherit peer vtep-peer
    description Leaf-1
  neighbor 10.10.0.X
    inherit peer vtep-peer
    description Leaf-2
Here’s the leaf configuration:
router bgp 65001
  router-id 10.10.0.X
  log-neighbor-changes
  template peer vtep-peer
    bfd
    remote-as 65001
    update-source loopback0
    address-family l2vpn evpn
      send-community both
  neighbor 10.10.0.X
    inherit peer vtep-peer
    description Spine-1
  neighbor 10.10.0.X
    inherit peer vtep-peer
    description Spine-2
You’ll notice that the spines have some additional configuration. Because the spines have no tenant VRFs or VNIs configured, they would normally discard EVPN routes whose route-targets they don’t import; ‘retain route-target all‘ tells them to keep all the advertisements anyway. The ‘route-reflector-client‘ command ensures all leaves receive all advertisements from all other leaves (needed when using iBGP only).
At this point, the spines’ configuration is complete. They don’t care what VLANs, VXLANs, or addresses are in use in the control plane – they’re simply passing the information between leaves. They don’t care what tenant VLANs, VXLANs, or addresses are in use in the data plane either – they’re simply forwarding VXLAN-encapsulated packets between leaves.
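Why route reflection matters with iBGP can be shown with a toy model. The sketch below is not how BGP is implemented; it only encodes the rule that a router does not pass routes learned from one iBGP peer on to another iBGP peer unless it is reflecting for its clients. The function name and the `mac-of-...` route labels are invented for illustration.

```python
def routes_learned(leaves, reflect):
    """Return the routes each leaf learns through a single spine."""
    # Routes the spine has received, keyed by the leaf that sent them.
    received = {leaf: {f"mac-of-{leaf}"} for leaf in leaves}
    learned = {}
    for leaf in leaves:
        learned[leaf] = set()
        if reflect:
            # A route reflector passes client routes to the other clients.
            for sender, routes in received.items():
                if sender != leaf:
                    learned[leaf] |= routes
    return learned


LEAVES = ["leaf1", "leaf2", "leaf3"]
plain_ibgp = routes_learned(LEAVES, reflect=False)
with_rr = routes_learned(LEAVES, reflect=True)
print(plain_ibgp["leaf1"])
print(sorted(with_rr["leaf1"]))
```

Without reflection every leaf would need a BGP session to every other leaf to learn the same set of routes.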
The leaves are doing a lot more work; let’s start with the tenant VRFs.
feature interface-vlan
feature vn-segment-vlan-based
feature nv overlay
!
vrf context myTenant
  rd auto
  address-family ipv4 unicast
    route-target both auto
    route-target both auto evpn
router bgp 65001
  vrf myTenant
    address-family ipv4 unicast
      advertise l2vpn evpn
A VRF is created in which to house the tenant’s traffic. You can create as many tenants as needed; only one is shown as an example. I’m letting NX-OS pick the route distinguishers automatically, which works fine in a homogeneous environment. The VRF is also referenced under the BGP process and the IPv4 unicast address family is enabled. The ‘advertise l2vpn evpn’ command ensures these IPv4 routes are advertised to the l2vpn evpn neighbors (spine switches).
Within these tenant VRFs we need to configure the associated subnets, VLANs, and VXLANs. We’ll stick with Vlan10 for now.
vlan 10
  name MY_NEW_VLAN
  vn-segment 39010
int nve1
  no shutdown
  source-interface loopback0
  host-reachability protocol bgp
  member vni 39010
    mcast-group 22.214.171.124
evpn
  vni 39010 l2
    rd auto
    route-target both auto
There’s a lot going on here. NX-OS associates each VXLAN with a VLAN, so we need to define each. I chose to build the VNI by prefixing the VLAN ID with 39 (VLAN 10, written as 010, becomes VNI 39010), but it doesn’t really matter so long as the VNI matches what the other leaves expect to receive. If you want to get really creative, you can actually configure different VLAN IDs on different leaves (with the same VNI) to perform VLAN translation.
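A numbering convention like this is easy to capture in a helper so every leaf's configuration is generated the same way. The sketch below reflects my reading of the VNIs used in this post (VLAN 10 gives 39010, VLAN 1234 gives 391234); the function name and the three-digit zero padding are assumptions, and the routing VNI is numbered separately.

```python
def vlan_to_vni(vlan_id: int, prefix: str = "39") -> int:
    """Derive a Layer 2 VNI by prefixing the zero-padded VLAN ID."""
    if not 1 <= vlan_id <= 4094:
        raise ValueError("VLAN ID must be between 1 and 4094")
    vni = int(f"{prefix}{vlan_id:03d}")
    if vni >= 2**24:
        raise ValueError("VNI does not fit in 24 bits")
    return vni


for vlan in (9, 10, 1234):
    print(vlan, vlan_to_vni(vlan))
```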
The nve1 interface (for Network Virtualization Edge) is the VXLAN termination point (aka VTEP). The loopback address is used as the source, and the VNI is defined within. A multicast group is defined to carry BUM frames, and BGP is used to advertise hosts’ location (instead of relying on multicast to flood all unknown frames).
Lastly, the ‘evpn’ section specifies the route distinguisher and route-targets for each Layer 2 VNI. I see no reason not to let these be automatically generated.
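For reference, the automatically generated values can be predicted. The sketch below assumes the Layer 2 VNI route distinguisher is the router ID followed by 32767 plus the VLAN ID, which matches the 'show bgp l2vpn evpn' output later in this post, and that the route-target is the AS number followed by the VNI, which is my understanding of NX-OS behavior with a 2-byte AS number. Treat both formulas, and the helper names, as assumptions to verify on your own switches.

```python
def auto_rd(router_id: str, vlan_id: int) -> str:
    """Predicted 'rd auto' value for a Layer 2 VNI (assumed formula)."""
    return f"{router_id}:{32767 + vlan_id}"


def auto_rt(asn: int, vni: int) -> str:
    """Predicted 'route-target both auto' value (assumed formula)."""
    return f"{asn}:{vni}"


print(auto_rd("10.10.0.3", 10))
print(auto_rt(65001, 39010))
```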
At this point, devices on different leaves in the same VLAN will have their Ethernet frames a) encapsulated with a VXLAN header, b) forwarded via a spine to another leaf, c) decapsulated at the second leaf, and d) forwarded out an Ethernet interface to its destination. We could trunk all these VLANs to an external router to route between subnets, but it would be a lot more efficient to do that within the fabric.
To that end, let’s add some more config to the leaves.
fabric forwarding anycast-gateway-mac 0200.1234.5678
!
vlan 3900
  name l3-vni-vlan-for-myTenant-VRF
  vn-segment 39000
vrf context myTenant
  vni 39000
interface Vlan3900
  description l3-vni-for-myTenant-VRF-routing
  no shutdown
  mtu 9216
  vrf member myTenant
  ip forward
interface nve1
  member vni 39000 associate-vrf
!
interface Vlan10
  description New VLAN
  mtu 9216
  vrf member myTenant
  ip address 10.10.10.1/24
  fabric forwarding mode anycast-gateway
The first line sets the MAC address to use as the gateway. It’s important that this is configured the same on all the leaves so end devices don’t get different responses to their ARP queries depending on which leaf they connect to. The next chunk defines a routing VNI and associates it with a tenant VRF. The routing VNI is what a VTEP uses to encapsulate a frame that needs to be routed into a different subnet. When the destination VTEP receives this encapsulated frame, it will see that the VNI doesn’t belong to a particular VLAN, but rather a particular tenant’s VRF. Then the destination VTEP will perform a route lookup to determine into which VLAN the decapsulated frame should be placed before forwarding it normally. A more detailed explanation can be found here.
The final chunk shows a sample SVI created within the tenant VRF. Note the ‘fabric forwarding’ command which references the statically configured gateway MAC address above.
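The choice between a VLAN's own VNI and the routing VNI can be summarized in a few lines. This is a simplified sketch of the decision described above, not of the switch's actual forwarding pipeline; VLAN 20 and its VNI are hypothetical additions so there is a second subnet to route to.

```python
L2_VNI_BY_VLAN = {10: 39010, 20: 39020}  # VLAN 20 is a hypothetical second VLAN
L3_VNI = 39000  # routing VNI associated with the myTenant VRF


def select_vni(src_vlan: int, dst_vlan: int) -> int:
    """Pick the VNI the ingress leaf uses to encapsulate a frame."""
    if src_vlan == dst_vlan:
        # Same subnet: bridge the frame using the VLAN's own VNI.
        return L2_VNI_BY_VLAN[src_vlan]
    # Different subnet: route it using the tenant's routing VNI.
    return L3_VNI


print(select_vni(10, 10))
print(select_vni(10, 20))
```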
We finally have a fabric which can forward packets between leaves at layer 2 and layer 3. You can build some massive data centers using this approach, but at some point you’ll probably need to connect this VXLAN fabric with another network. Doing so at layer 2 isn’t difficult – just configure a trunk port and extend some VLANs as usual. Making a L3 connection requires a little more work due to the VRFs and MP-BGP involved. Below is what a border leaf does, assuming OSPF is used to talk to the rest of the network.
If it makes more sense to advertise the specific VXLAN subnets to the rest of the network and simply advertise a default into the fabric, the below config will work on the border leaf. It will advertise all directly-connected interfaces (which should be the same on all leaves) via OSPF to the rest of the network. The BGP section will accept just a default route from the external network and propagate it to the rest of the leaves. All the tenant subnets should then be reachable from the rest of the network.
ip prefix-list DEFAULT_ONLY seq 5 permit 0.0.0.0/0
route-map DEFAULT_ONLY permit 10
  match ip address prefix-list DEFAULT_ONLY
route-map PERMIT_ALL permit 10
router ospf 1
  vrf myTenant
    redistribute direct route-map PERMIT_ALL
router bgp 65001
  vrf myTenant
    address-family ipv4 unicast
      redistribute ospf 1 route-map DEFAULT_ONLY
Troubleshooting / Verification
show bfd neighbors
Leaf-1# sh bfd ne
OurAddr     NeighAddr   LD/RD                  RH/RS  Holdown(mult)  State  Int      Vrf
10.10.0.17  10.10.0.16  1090519045/1090519041  Up     5584(3)        Up     Eth1/49  default
10.10.0.19  10.10.0.18  1090519046/1090519043  Up     5584(3)        Up     Eth1/50  default
10.10.0.23  10.10.0.22  1090519051/1090519041  Up     5378(3)        Up     Eth1/52  default
10.10.0.21  10.10.0.20  1090519052/1090519045  Up     5378(3)        Up     Eth1/51  default
show bfd neighbors details
Leaf-1# sh bfd ne details
OurAddr     NeighAddr   LD/RD                  RH/RS  Holdown(mult)  State  Int      Vrf
10.10.0.17  10.10.0.16  1090519045/1090519041  Up     5109(3)        Up     Eth1/49  default
Session state is Up and using echo function with 50 ms interval
Local Diag: 0, Demand mode: 0, Poll bit: 0, Authentication: None
MinTxInt: 50000 us, MinRxInt: 2000000 us, Multiplier: 3
Received MinRxInt: 2000000 us, Received Multiplier: 3
Holdown (hits): 6000 ms (0), Hello (hits): 2000 ms (1026609)
Rx Count: 1012709, Rx Interval (ms) min/max/avg: 55/1711/1700 last: 890 ms ago
Tx Count: 1026609, Tx Interval (ms) min/max/avg: 1677/1677/1677 last: 335 ms ago
Registered protocols: pim ospf
Uptime: 19 days 22 hrs 24 mins 2 secs
Last packet: Version: 1 - Diagnostic: 0
             State bit: Up - Demand bit: 0
             Poll bit: 0 - Final bit: 0
             Multiplier: 3 - Length: 24
             My Discr.: 1090519041 - Your Discr.: 1090519045
             Min tx interval: 50000 - Min rx interval: 2000000
             Min Echo interval: 50000 - Authentication bit: 0
Hosting LC: 1, Down reason: None, Reason not-hosted: None
show ip route
Leaf-1# sh ip route
IP Route Table for VRF "default"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>
10.10.0.0/32, ubest/mbest: 4/0
    *via 10.10.0.16, Eth1/49, [110/2], 2w5d, ospf-1, intra
    *via 10.10.0.18, Eth1/50, [110/2], 2w5d, ospf-1, intra
    *via 10.10.0.20, Eth1/51, [110/2], 2w5d, ospf-1, intra
    *via 10.10.0.22, Eth1/52, [110/2], 2w5d, ospf-1, intra
10.10.0.1/32, ubest/mbest: 2/0
    *via 10.10.0.16, Eth1/49, [110/2], 2w5d, ospf-1, intra
    *via 10.10.0.18, Eth1/50, [110/2], 2w5d, ospf-1, intra
show forwarding ipv4 route
Leaf-1# sh forwarding ipv4 route
slot 1
=======
IPv4 routes for table default/base
------------------+-----------------------------------------+----------------------+-----------------+-----------------
Prefix            | Next-hop                                | Interface            | Labels          | Partial Install
------------------+-----------------------------------------+----------------------+-----------------+-----------------
127.0.0.0/8         Drop                                      Null0
10.10.0.0/32        10.10.0.16                                Ethernet1/49
                    10.10.0.18                                Ethernet1/50
                    10.10.0.20                                Ethernet1/51
                    10.10.0.22                                Ethernet1/52
10.10.0.1/32        10.10.0.16                                Ethernet1/49
                    10.10.0.18                                Ethernet1/50
10.10.0.2/32        10.10.0.20                                Ethernet1/51
                    10.10.0.22                                Ethernet1/52
10.10.0.3/32        Receive                                   sup-eth1
*10.10.0.4/32       10.10.0.16                                Ethernet1/49
                    10.10.0.18                                Ethernet1/50
                    10.10.0.20                                Ethernet1/51
                    10.10.0.22                                Ethernet1/52
*10.10.0.5/32       10.10.0.16                                Ethernet1/49
                    10.10.0.18                                Ethernet1/50
                    10.10.0.20                                Ethernet1/51
                    10.10.0.22                                Ethernet1/52
show bgp l2vpn evpn
Leaf-1# sh bgp l2vpn evpn
BGP routing table information for VRF default, address family L2VPN EVPN
BGP table version is 22428, local router ID is 10.10.0.3
Status: s-suppressed, x-deleted, S-stale, d-dampened, h-history, *-valid, >-best
Path type: i-internal, e-external, c-confed, l-local, a-aggregate, r-redist, I-injected
Origin codes: i - IGP, e - EGP, ? - incomplete, | - multipath, & - backup
   Network            Next Hop            Metric     LocPrf     Weight Path
Route Distinguisher: 10.10.0.3:32776    (L2VNI 39009)
*>l[2]:[0]:[0]:[48]:[0050.b6c2.f854]:[0]:[0.0.0.0]/216
                      10.10.0.3                         100      32768 i
Route Distinguisher: 10.10.0.3:32777    (L2VNI 39010)
*>l[2]:[0]:[0]:[48]:[0006.f6cc.fac0]:[0]:[0.0.0.0]/216
                      10.10.0.3                         100      32768 i
Route Distinguisher: 10.10.0.3:32791    (L2VNI 39024)
*>l[2]:[0]:[0]:[48]:[0050.5662.40ca]:[0]:[0.0.0.0]/216
                      10.10.0.3                         100      32768 i
show bgp l2vpn evpn summary
Leaf-1# sh bgp l2vpn evpn sum
BGP summary information for VRF default, address family L2VPN EVPN
BGP router identifier 10.10.0.3, local AS number 65001
BGP table version is 22428, L2VPN EVPN config peers 2, capable peers 2
627 network entries and 715 paths using 119652 bytes of memory
BGP attribute entries [85/12240], BGP AS path entries [0/0]
BGP community entries [0/0], BGP clusterlist entries [4/16]
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
10.10.0.1       4 65001   31649   33862    22428    0    0 2w2d     88
10.10.0.2       4 65001   31638   33856    22428    0    0 2w2d     88
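If you collect this output from many leaves, a few lines of Python can flag sessions that are not established. This is a rough sketch that works on the text format shown above (a numeric last column means the session is up and prefixes were received); the function names are my own, and real tooling would be better served by the switch's structured output.

```python
import ipaddress

SUMMARY = """\
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
10.10.0.1       4 65001   31649   33862    22428    0    0 2w2d     88
10.10.0.2       4 65001   31638   33856    22428    0    0 2w2d     88
"""


def is_ip(text: str) -> bool:
    """Return True if text is a valid IP address."""
    try:
        ipaddress.ip_address(text)
        return True
    except ValueError:
        return False


def parse_neighbors(text: str) -> dict:
    """Map each neighbor IP to its State/PfxRcd column."""
    neighbors = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and is_ip(fields[0]):
            # A state such as "Shut (Admin)" spans more than one field.
            neighbors[fields[0]] = " ".join(fields[9:])
    return neighbors


def not_established(neighbors: dict) -> list:
    """Neighbors whose last column is a state name rather than a prefix count."""
    return [ip for ip, state in neighbors.items() if not state.isdigit()]


neighbors = parse_neighbors(SUMMARY)
print(neighbors)
print(not_established(neighbors))
```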
show bgp l2vpn evpn neighbors xxx.xxx.xxx.xxx advertised-routes
Leaf-1# sh bgp l2vpn evpn neighbors 10.10.0.1 advertised-routes
Peer 10.10.0.1 routes for address family L2VPN EVPN:
BGP table version is 22428, local router ID is 10.10.0.3
Status: s-suppressed, x-deleted, S-stale, d-dampened, h-history, *-valid, >-best
Path type: i-internal, e-external, c-confed, l-local, a-aggregate, r-redist, I-injected
Origin codes: i - IGP, e - EGP, ? - incomplete, | - multipath, & - backup
   Network            Next Hop            Metric     LocPrf     Weight Path
Route Distinguisher: 10.10.0.3:32776    (L2VNI 39009)
*>l[2]:[0]:[0]:[48]:[0050.b6c2.f854]:[0]:[0.0.0.0]/216
                      10.10.0.3                         100      32768 i
show mac address-table vlan 1234
Leaf-1# sh mac ad vlan 1234
Legend:
        * - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC
        age - seconds since last seen,+ - primary entry using vPC Peer-Link,
        (T) - True, (F) - False
   VLAN     MAC Address      Type      age     Secure NTFY Ports
---------+-----------------+--------+---------+------+----+------------------
*  1234     0000.4815.c9c0   dynamic  0         F      F    Eth1/48
*  1234     0000.5e00.0101   dynamic  0         F      F    Eth1/48
*  1234     0000.5e00.0102   dynamic  0         F      F    nve1(10.10.0.5)
*  1234     0004.a3f2.5eff   dynamic  0         F      F    Eth1/46
*  1234     000a.f72e.d0e2   dynamic  0         F      F    nve1(10.10.0.4)
show l2route evpn mac-ip all # This will display the MAC-to-IP table of the vn-segments for which ARP-suppression is enabled
Leaf-1# sh l2route evpn mac-ip all
Topology ID Mac Address    Prod Host IP                                 Next Hop (s)
----------- -------------- ---- --------------------------------------- ---------------
40          0002.990f.9bee BGP  10.150.40.98                            10.10.0.5
40          000c.2951.cf4d HMM  10.150.40.20                            N/A
40          000c.29b5.e978 HMM  10.150.40.50                            N/A
40          000c.29d0.03fd HMM  10.150.40.11                            N/A
40          001e.678d.dd9e BGP  10.150.40.110                           10.10.0.4
show l2route evpn mac all # This will display the MAC-to-next-hop (local or VTEP) table
Leaf-1# sh l2route evpn mac all
Topology    Mac Address    Prod   Next Hop (s)
----------- -------------- ------ ---------------
9           0050.b6c2.f854 Local  Eth1/48
10          885a.92f6.3c71 Local  Eth1/48
10          885a.92f6.47a7 BGP    10.10.0.5
show ip route vrf myTenant
Leaf-1# sh ip route vrf myTenant
IP Route Table for VRF "myTenant"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>
10.150.40.20/32, ubest/mbest: 1/0, attached
    *via 10.150.40.20, Vlan40, [190/0], 5d18h, hmm
10.150.40.39/32, ubest/mbest: 1/0
    *via 10.10.0.4%default, [200/0], 5d17h, bgp-65001, internal, tag 65001 (evpn) segid: 39000 tunnelid: 0xa970004 encap: VXLAN
10.150.40.40/32, ubest/mbest: 1/0
    *via 10.10.0.5%default, [200/0], 5d17h, bgp-65001, internal, tag 65001 (evpn) segid: 39000 tunnelid: 0xa970005 encap: VXLAN
show nve peers
Leaf-1# sh nve peers
Interface Peer-IP          State LearnType Uptime   Router-Mac
--------- ---------------  ----- --------- -------- -----------------
nve1      10.10.0.4        Up    CP        2w4d     00fe.c8ae.a0b7
nve1      10.10.0.5        Up    CP        2w4d     00fe.c8ae.a513
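The peer table lends itself to a quick scripted health check. The sketch below parses the text layout shown above into dictionaries and lists any peer that is not Up; the function names and field keys are illustrative choices, not part of any Cisco tooling.

```python
PEERS_OUTPUT = """\
Interface Peer-IP          State LearnType Uptime   Router-Mac
--------- ---------------  ----- --------- -------- -----------------
nve1      10.10.0.4        Up    CP        2w4d     00fe.c8ae.a0b7
nve1      10.10.0.5        Up    CP        2w4d     00fe.c8ae.a513
"""

FIELDS = ("interface", "peer_ip", "state", "learn_type", "uptime", "router_mac")


def parse_nve_peers(text: str) -> list:
    """Turn each 'nve...' row into a dict keyed by FIELDS."""
    peers = []
    for line in text.splitlines():
        columns = line.split()
        # Skip the header and separator rows; data rows start with "nve".
        if len(columns) == len(FIELDS) and columns[0].startswith("nve"):
            peers.append(dict(zip(FIELDS, columns)))
    return peers


def down_peers(peers: list) -> list:
    """Return the IPs of peers whose state is anything other than Up."""
    return [p["peer_ip"] for p in peers if p["state"] != "Up"]


peers = parse_nve_peers(PEERS_OUTPUT)
print([p["peer_ip"] for p in peers])
print(down_peers(peers))
```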
show l2route topology # Displays all vn-segments configured on the switch
Leaf-1# sh l2route topology
Topology ID   Topology Name   Attributes
-----------   -------------   ----------
9             Vxlan-39009     VNI
10            Vxlan-39010     VNI
3900          Vxlan-39000     VNI
4294967294    GLOBAL          N/A
4294967295    ALL             N/A
show nve interface nve1 detail
Leaf-1# sh nve interface nve1 de
Interface: nve1, State: Up, encapsulation: VXLAN
 VPC Capability: VPC-VIP-Only [not-notified]
 Local Router MAC: 5897.bdd4.8d3f
 Host Learning Mode: Control-Plane
 Source-Interface: loopback0 (primary: 10.10.0.3, secondary: 0.0.0.0)
 Source Interface State: Up
 NVE Flags:
 Interface Handle: 0x49000001
 Source Interface hold-down-time: 180
 Source Interface hold-up-time: 30
 Remaining hold-down time: 0 seconds
show nve vni # Display all configured vn-segments (VNIs) and their type
Leaf-1# sh nve vni
Codes: CP - Control Plane        DP - Data Plane
       UC - Unconfigured         SA - Suppress ARP
Interface VNI      Multicast-group   State Mode Type [BD/VRF]      Flags
--------- -------- ----------------- ----- ---- ------------------ -----
nve1      39000    n/a               Up    CP   L3 [myTenant]
nve1      39009    126.96.36.199     Up    CP   L2 [9]
nve1      39010    188.8.131.52      Up    CP   L2 [10]
nve1      39011    184.108.40.206    Up    CP   L2 [11]
show ip arp suppression topo-info # Display ARP-suppression status on different vn-segments
Leaf-1# sh ip arp suppression topo-info
ARP L2RIB Topology information
Topo-id  ARP-suppression mode
9        ARP Suppression Disabled
10       ARP Suppression Disabled
11       ARP Suppression Disabled
show nve vni xxxxx counters
Leaf-1# sh nve vni 391234 counters
VNI: 391234
TX
        1971724563 unicast packets
        2040477963558 unicast bytes
        67918428 multicast packets
        12507359647 multicast bytes
RX
        1653498175 unicast packets
        881467334920 unicast bytes
        10200869 multicast packets
        5044196745 multicast bytes