Data Center Network Scaling Issues

As infrastructure needs more resources, we install more servers, and more servers need more switches to connect them. Day by day, the data center becomes a massive physical infrastructure, and you face a lot of problems to solve. As long as you keep your data center network simple, you stay on the safe side, where you only build a layer 3 network. But running a layer 3 only data center without broadcast domains requires a lot of improvement on servers, services, and even applications. If you are a cloud company that also needs tenancy in the network, this becomes even harder. The basic option, and where everyone starts, is the legacy data center.

For any kind of network, all options should be evaluated against these criteria:

  • Scalability
  • Resiliency
  • Redundancy
  • Complexity

Before diving into the details of data center scaling, we will look at the design options for data centers and data center fabrics.

Legacy data center

Legacy data centers are simply a logical switch with a massive number of ports. Their design is to build a large broadcast domain, where a centralized device such as a firewall or a router is used for routing (router on a stick).

The challenges and the problems of legacy data centers are well known and can be easily found on the Internet. In summary:

  • Use of large broadcast domains.
  • BUM handling.
  • Centralized routing.
  • Layer 2 tenancy based on VLANs.
  • Convergence and failure problems.

Thus using a layer 2 data center should be avoided. If it is possible, broadcast domains should be eliminated. The need for broadcast domains depends on the server infrastructure and any overlays used. Generally, layer 2 is needed for VM mobility without an IP change, or for services with burned-in IP addresses. It may also depend on the services or protocols they use, such as multicast. Building a layer 3 only data center will be investigated in another document.

By the way, if it is possible, please avoid using overlays :) And build more intelligent applications :)

Single data center design with VXLAN

The problems of legacy data centers are generally solved by using VXLAN and EVPN, which introduce underlay and overlay networks. VXLAN is used for the data path, and EVPN is used as the control plane.

VXLAN is an encapsulation method.

Critical points for VXLAN encapsulation are;

  • It is not a tunneling protocol. It’s a uni-directional forwarding method for data. There is no session establishment.
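  • It scales broadcast domains beyond what VLANs allow: the 24-bit VNI gives about 16 million segments, versus 4094 usable VLAN IDs.
  • It supports entropy for the encapsulated data: the outer UDP source port is derived from a hash of the inner headers, so the underlay can load-balance flows with ECMP.
  • It only forwards Ethernet data frames, including BUM traffic. Link-local protocols like Spanning Tree and LLDP are not encapsulated and forwarded. Thus VXLAN-capable devices do not connect spanning tree domains.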
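
To make the encapsulation points above more concrete, here is a minimal sketch of how the 8-byte VXLAN header is laid out and how an entropy source port could be derived. The header layout follows RFC 7348. The hash function (CRC32 over the inner Ethernet header) and the function names are assumptions for illustration only; real hardware uses its own hash over vendor-specific fields.

```python
import struct
import zlib

VXLAN_UDP_DST_PORT = 4789      # IANA-assigned VXLAN destination port
VXLAN_FLAG_VNI_VALID = 0x08    # "I" flag: the VNI field is valid


def build_vxlan_header(vni: int) -> bytes:
    """Return the 8-byte VXLAN header for a 24-bit VNI."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) + reserved (24 bits).
    # Word 2: VNI (24 bits) + reserved (8 bits).
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def entropy_source_port(inner_frame: bytes) -> int:
    """Pick an outer UDP source port from the inner frame (hypothetical hash)."""
    # Hash the inner Ethernet header: dst MAC, src MAC, EtherType (14 bytes).
    flow_hash = zlib.crc32(inner_frame[:14])
    # Map the hash into the dynamic port range 49152-65535.
    return 49152 + flow_hash % 16384


header = build_vxlan_header(10100)
frame_a = bytes.fromhex("aabbccddee01" "aabbccddee02" "0800") + b"payload"
frame_b = bytes.fromhex("aabbccddee03" "aabbccddee02" "0800") + b"payload"

print(header.hex())
print(entropy_source_port(frame_a), entropy_source_port(frame_b))
```

There is no session state anywhere in this sketch: the sender just prepends the header and picks a source port per flow, which is why VXLAN behaves as a uni-directional forwarding method rather than a tunnel with session establishment.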

But it needs a control plane, which is where EVPN comes in. EVPN also adds support for tenancy, distributed routing, multihoming, etc.

We should consider the following subjects when using VXLAN and EVPN based data centers:

  • Underlay network protocol usage: Do not forget that both methods (IGP + IBGP, and EBGP only) work well, and the answer changes according to the engineer. But try to keep it as simple as possible. This is one of the key points, as complexity can become a problem when deploying a massive number of switches.
    • Convergence: Keep the convergence time within a reasonable range for your applications. Test your spine failures: how fast spines can establish neighborships with a large number of leafs during boot-up, and how fast they can send updates to a large number of leafs. What is the convergence time of a spine failure with a large number of prefixes and a large number of peers?
      • Leaf-Spine link failure detection: Use BFD if it is possible.
      • Local protocol changes after link failure: In a redundant environment, only the underlay path should be affected. Try to choose an underlay protocol that is less affected during a link failure. Rely on ECMP in the underlay.
      • Updates: Tuned timers. Also consider prefix updates in the BGP case. The BGP update timer is another issue. When a spine loses connectivity to a leaf, it needs to send updates (withdrawals) for the prefixes on that leaf, and these depend on the update timer. The number of prefixes and the number of leafs that need to receive those updates also affect convergence. With multiple spines, a single leaf link failure only results in the removal of that spine from the ECMP group associated with the leaf switch address.
      • Remote protocol changes: Timers to detect protocol status (spines), and also on the leafs to remove the failed path from the underlay ECMP. As most vendors use ECMP groups or next-hop groups, removal of a next-hop from the group is fast enough.
    • IGP + IBGP: The underlay relies on an IGP, while IBGP is used for the overlay EVPN. IGP protocols are long proven for fast convergence and failure scenarios. I prefer to use IS-IS because:
      • IS-IS advantages as an IGP: Generally, using Level 2 only will be enough.
        • Uses CLNS, so there is no IP dependence at the link level.
        • No SPF calculation on a link change, ONLY when the topology changes!
        • Fast re-convergence.
      • IBGP is well suited for EVPN in a CLOS design. But using multiple protocols adds complexity in troubleshooting, scaling, and bug probability.
    • EBGP: EVPN originated with IBGP. When using EBGP, you need to solve problems like fast convergence for the underlay, next-hop unchanged for the direct path between leafs, route-target issues, loop prevention mechanisms, etc. BGP path hunting is another issue. There is no layer between spine and leaf, which means unnecessary updates from spine->leaf->spine. BGP is not the best choice for fast convergence: timers should be adjusted, and the use of BFD should be considered. On the other hand, using a single protocol has many advantages, from troubleshooting, operation, and IP usage to the probability of hitting bugs. Also, the EBGP design has two options:
        • Direct Underlay Peering – Direct Overlay Peering: Connected interfaces are used for the EBGP sessions between CLOS levels, including EVPN. The overlay peering goes down if the direct link fails, triggering routing updates and a next-hop scan inside the fabric. A link failure triggers routing updates not only for the underlay but also for the overlay. Single update for all, complex troubleshooting. You can not change the NLRI next-hop for the overlay.
        • Direct Underlay Peering – Loopback Overlay Peering: Connected interfaces are used for the IPv4 sessions, and loopback interfaces are used for the EVPN EBGP sessions. The IPv4 session is used to distribute the EVPN loopback addresses and the VTEP source addresses, which can be the same. Using different loopbacks for the EVPN sessions and the VTEP source has some failure advantages to be considered. A link failure will not affect the overlay peering as long as both sides of the session are reachable through their loopbacks. The overlay is more durable during link failures, as there is no update propagation. The cost is more IP addresses, more complex configuration, and more complex automation.
  • Tenant usage for the overlay network:
    • Inter-tenant routing: Basically route leaking between L3 VRFs. Consider symmetric IRB with anycast routing, which does not require configuring the leaked tenant on the local leaf switch. This gives an advantage when leaking external services: define the tenant VRF on the border leafs (the leafs to which external services are connected). VRF leaking may not be supported for asymmetric IRB.

    However, this leads to some scale issues for host routes. With symmetric routing, remote host routes are installed in the VRF routing table. And if you are using VRF leaking, these host routes are installed in every VRF table that imports them, so resource usage is multiplied for every VRF in which they are installed. On the other hand, asymmetric routing increases the local ARP table usage. Take care when using either implementation with each vendor. Test the scale numbers. ECMP is a problem with some chipsets. You may hit the hardware maximums, especially when using EVPN ESI.
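
The multiplication effect of VRF leaking on host routes can be estimated with a few lines of arithmetic. The sketch below is a simplified model, not a vendor formula: it assumes every remote host becomes one host route in its own VRF and one more in every VRF that imports it. The tenant names and host counts are made up for illustration.

```python
def symmetric_host_routes(remote_hosts: dict, imports: dict) -> dict:
    """Estimate host routes installed per VRF with symmetric IRB and leaking.

    remote_hosts maps a VRF name to its number of remote hosts.
    imports maps a VRF name to the list of VRFs whose routes it imports.
    """
    totals = {}
    for vrf, own_routes in remote_hosts.items():
        # Every imported VRF contributes all of its host routes again.
        leaked = sum(remote_hosts[src] for src in imports.get(vrf, []))
        totals[vrf] = own_routes + leaked
    return totals


# Hypothetical tenants: two customer VRFs leaking with one shared-services VRF.
hosts = {"tenant-a": 4000, "tenant-b": 2000, "shared": 500}
leaks = {
    "tenant-a": ["shared"],
    "tenant-b": ["shared"],
    "shared": ["tenant-a", "tenant-b"],
}

per_vrf = symmetric_host_routes(hosts, leaks)
print(per_vrf)
print("without leaking:", sum(hosts.values()))
print("with leaking:   ", sum(per_vrf.values()))
```

In this made-up case, 6500 hosts consume 13500 host route entries once leaking is enabled, which is the kind of number worth comparing against the vendor's tested scale before deciding how to leak.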

  • Maximum number of ECMP or ECMP next-hop groups: This is not directly related to the design, but it may be a problem, as most chips have a scale limit for it. Consider:
      • The maximum number of L3 ECMP next-hops for your protocols, for the overlay, for the underlay, etc. Especially if you use multiple L3 paths between spine and leaf because of bandwidth requirements, this may be a problem from the scaling perspective, as spine switches have fewer resources than leafs.
      • Try to understand how next-hop groups are used when using VRFs for tenants, especially when leaking between them.
  • North-south, external service integration:
    • Dynamic routing support:
    • Redundancy and convergence Options:
    • Scaling options:
  • BUM replication: Ingress replication or multicast. Multicast is best suited for large fabrics, where ingress replication may create too much overhead with many VTEP peers.
  • Broadcast domain size scales: A large broadcast domain is a problem not only for the underlay but also for the overlay and the servers. If centralized routing is used, this hits the BUM limits of the line cards of the devices. The same problem also hits the servers. The kernel, running on the server CPU, handles BUM traffic. The default ARP cache size of most kernels is small compared to network devices. This results in many ARP requests when the table is at its limit, and even in latency during server-to-server communication when ARP records are deleted.
  • Failure scenarios:
    • Failures on all CLOS levels.
      • Convergence during and after failures.
      • Multihoming failures: ToR switch pair failures, uplink failures, host link failures, and failures of the protocol used for multihoming. Watch especially for black-holing of host traffic during a failure.
    • Hardware Failures
      • Power supply, fan, etc.
  • Hardware limitations:
    • Size of the forwarding and routing tables: MAC tables, ARP, RIB, ECMP, FIB, etc.
    • Size of a single control/data domain, EVPN domain.
      • Encapsulation limits: maximum number of VTEP peers.
      • BUM replication limits, usage of multicast for BUM.
      • ARP scaling for hardware.
      • Scale for inter-tenant usage (VRF leaking).
      • Scale for control plane operations: maximum number of ARP requests, storm control levels, etc.
  • Complexity issues: More servers and more switches in a single domain, pod, or cluster may lead to operational complexity. It also results in a single large failure domain. Splitting the resources into zones or pods is a MUST.
  • Scale of VXLAN and EVPN based data center

    The Problem

    Most switch hardware has some offloading capabilities for better latency, which requires using particular chipsets. One critical number is the maximum number of VTEPs (VXLAN destinations) supported by the chipset. This number determines how big a single EVPN domain, data center, or broadcast domain can be. As it determines the maximum number of switches, it also determines the maximum number of servers. This number gets lower with dual-homed host scenarios.
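
As a rough illustration of how the VTEP limit caps the fabric size, here is a back-of-the-envelope sketch. It is a simplified model under stated assumptions: every leaf is its own VTEP, every leaf peers with every other leaf, and a dual-homed server consumes one port on each of two leafs. The limit of 255 and the 48 server ports per leaf are hypothetical numbers, not taken from any datasheet.

```python
def max_leafs(vtep_limit: int) -> int:
    """A leaf can hold vtep_limit remote VTEPs, plus itself."""
    return vtep_limit + 1


def max_servers(vtep_limit: int, server_ports_per_leaf: int,
                dual_homed: bool = False) -> int:
    """Estimate the server count supported by a single EVPN domain."""
    leafs = max_leafs(vtep_limit)
    if dual_homed:
        # Two leafs serve the same set of servers, so only half of the
        # leafs add new server capacity.
        leafs //= 2
    return leafs * server_ports_per_leaf


# Hypothetical chipset limit and port count.
print(max_servers(255, 48))                   # single-homed hosts
print(max_servers(255, 48, dual_homed=True))  # dual-homed hosts
```

With these example numbers the domain tops out at 12288 single-homed servers, or 6144 dual-homed ones. Designs that present a leaf pair as one shared VTEP address would change this calculation.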

    What if you want to grow beyond that number? Before diving into solutions, let’s talk a bit about getting bigger. How large will your broadcast domain be? Theoretically, you can have a single broadcast domain with thousands of hosts, as long as you have an IP prefix with enough addresses. But practically, you have limited resources on your servers: limited ARP tables and limited resources for responding to broadcast packets. As the network grows, you hit a lot of BUM problems.

    Returning to our problem, suppose that we really want to add more hosts to our existing broadcast domains.

    The solution is VXLAN gateways. They can decapsulate and re-encapsulate VXLAN packets, and they connect VXLAN domains. More simply, they are VXLAN proxy devices. If we separate our broadcast domains into multiple sub-domains and connect them using these VXLAN gateways, we can increase the number of switches.

As seen in the topology, when a left-side switch needs to send a packet to one of the right-side switches, it forwards the packet to its border gateway device. So it does not need VTEP peering with all of the right-side leaf switches. This solves the limit on the maximum number of VTEP peers for a single switch. And theoretically, you can add more switches into a single broadcast domain. This method is called EVPN Multi-Site.
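
The effect on the VTEP peer count can be sketched as follows. This is a simplified model: it assumes equal-sized sites, every leaf being its own VTEP, and each border gateway appearing as a separate VTEP (a shared anycast address for the gateways would lower the numbers further). The function names and the site sizes are assumptions for illustration.

```python
def leaf_peers_flat(leafs_per_site: int, sites: int) -> int:
    """One stretched fabric: a leaf peers with every other leaf everywhere."""
    return leafs_per_site * sites - 1


def leaf_peers_multisite(leafs_per_site: int, bgws_per_site: int) -> int:
    """Multi-Site: a leaf peers with local leafs and local border gateways."""
    return (leafs_per_site - 1) + bgws_per_site


def bgw_peers_multisite(leafs_per_site: int, bgws_per_site: int,
                        sites: int) -> int:
    """A border gateway peers with local leafs and remote border gateways."""
    return leafs_per_site + bgws_per_site * (sites - 1)


# Hypothetical deployment: 4 sites, 200 leafs and 2 border gateways each.
print(leaf_peers_flat(200, 4))         # 799
print(leaf_peers_multisite(200, 2))    # 201
print(bgw_peers_multisite(200, 2, 4))  # 206
```

The leaf peer count stops depending on the number of sites, which is what lets the fabric grow past the chipset limit. The border gateways still see every local leaf, so their scale has to be checked separately.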

 

 

The problem may be solved more simply by eliminating the need to extend the broadcast domain and layer 3 while growing. It is just like dividing your fabric into large enough units. If you do not stretch layer 2 and layer 3 (that is, you use different layer 3 prefixes on each sub-fabric or pod), this process is much simpler. Then you route between the pods. The only problem is connecting tenants.

Extending with EVPN Multi-Site.

EVPN Multi-Site architectures can be used for extending a single data center fabric. There are a couple of options to be considered, and Multi-Site has the best flexibility among them. There are a couple of IETF drafts about the subject. The most important are the Sharma and BESS drafts. Sharma focuses on a single control plane and a single data plane (VXLAN), whereas BESS is more generic. Both of them use border gateways or border devices on each fabric.

Sharma requires the use of EVPN for both the inter-fabric and intra-fabric networks. However, using BESS, you can use MPLS, etc. on the inter-fabric network, and a couple of other encapsulations besides VXLAN.

Besides the IETF drafts, let’s investigate the problem.

There are two network types for EVPN Multi-Site:

  • Intra-Fabric: A single fabric network. See the considerations for a single EVPN VXLAN fabric above.
  • Inter-Fabric: The fabric interconnect network where border gateways are connected. Commonly called the DCI network.

Challenges with EVPN multi-site

  • Physical Challenges: How can we connect border gateways?

    • Full-Mesh: Direct physical connections between the border gateways of every fabric. If we use redundant border gateways, this dramatically increases the number of required ports on each border gateway. Besides, each new fabric adds ports on every existing border gateway, so the total number of inter-fabric links grows quadratically with the number of fabrics.

    • Hub and Spoke/Centralized: Border gateways can be connected over a central device.
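
A quick way to compare the two physical options is to count ports and links. The sketch below is a simplified model with assumptions of its own: every border gateway connects to every border gateway of every other fabric in the full-mesh case, and to every central device in the hub-and-spoke case. The fabric and device counts are hypothetical.

```python
def full_mesh(fabrics: int, bgws_per_fabric: int) -> tuple:
    """Return (ports per border gateway, total inter-fabric links)."""
    ports_per_bgw = (fabrics - 1) * bgws_per_fabric
    fabric_pairs = fabrics * (fabrics - 1) // 2
    total_links = fabric_pairs * bgws_per_fabric ** 2
    return ports_per_bgw, total_links


def hub_and_spoke(fabrics: int, bgws_per_fabric: int, hubs: int) -> tuple:
    """Return (ports per border gateway, total inter-fabric links)."""
    ports_per_bgw = hubs
    total_links = fabrics * bgws_per_fabric * hubs
    return ports_per_bgw, total_links


# Hypothetical growth from 4 to 8 fabrics with 2 border gateways each.
for n in (4, 8):
    print(n, "fabrics:", full_mesh(n, 2), hub_and_spoke(n, 2, hubs=2))
```

Doubling the number of fabrics from 4 to 8 takes the full mesh from 24 to 112 links in this model, while the centralized design goes from 16 to 32 and the per-gateway port count stays fixed.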

 

  • Control Plane Challenges: There are a couple of encapsulation options to use between fabrics, like MPLS, Geneve, etc. We will focus on using VXLAN and EVPN between fabrics, to simplify the logical path by using the same method as inside the fabric. For the control plane, EBGP is used between fabrics. Using different AS numbers on each fabric simplifies the control plane operations. We assume that each border gateway has a different AS number, and that the same AS number is not reused across fabrics. Otherwise, we must overcome BGP AS-path loop avoidance to receive the remote site routes. As with the physical connection options, the control plane sessions can be full-mesh or hub and spoke, like IBGP route reflectors. Using EBGP has the same options as in the intra-fabric case: direct overlay peering and loopback overlay peering.

    • Full mesh BGP sessions: Each BGW establishes an EBGP session with every other BGW. We will not focus on the details of this design, as it adds complexity and is not meaningful when using the hub and spoke/centralized physical design.
    • Centralized BGP sessions: While using a centralized physical network, full-mesh EBGP sessions can still be used between BGWs using EBGP multi-hop, etc. But then we can not get all the benefits of the central physical network. As with the intra-fabric control plane, we have some options, for example direct underlay peering or loopback overlay peering.
    • Use of route targets on different fabrics is another issue. A BGW is just like a leaking device from the control protocol point of view, and leaking is based on route targets. Using auto route targets complicates route leaking, as the same tenant network (L2 VRF or L3 VRF) will have different route targets on each fabric. Using the same route targets works better if all tenancy configuration is the same on all fabrics. But if it is different, then the manual configuration will be more complex and challenging, as the configurations of the BGWs will not be the same. You have to configure only the BGWs of the fabrics where you define the VRFs. These options are required to retain route targets while sending updates between BGWs:
      • EVPN Next-hop: This should be unchanged.
      • Route-Targets: If each site uses different route targets for each L2 and L3 VRF, there is a problem with using them on border gateways, especially with auto route targets. This problem gets more complicated if route servers are in use.
  • Border Gateway Placement: BGWs can be installed as standalone leafs, or on the spine or super spine switches. However, considering that they will have all intra-fabric information, they need to scale like a leaf in the fabric. Therefore, their scale (MAC addresses, L2 routes, L3 routes, etc.) must be at least that of the leaf switches. For most vendors, spine switches generally do not have this kind of scale, as the spines do not participate in overlays. Also, you need to configure all L3 and L2 configurations for the selected overlays that you will connect between multiple fabrics.

  • BUM Replication: They must support both the intra-fabric and the inter-fabric BUM replication methods, which may be different.

  • Use of redundant border gateways for a single fabric:

    • Anycast BGW or MLAG/VPC BGW:
    • DF election for a VNI: Do not forget that DF election is based on VNI, not VLAN.
  • Connecting hosts into border gateways: single-homed hosts, dual-homed hosts. Dual homed hosts may require the use of SVI!

  • BUM Replication: Border gateways may use different BUM replication methods for intra-fabric and inter-fabric traffic.

  • Failure Scenarios on border gateways:

    • Failure scenarios related to using redundant border gateway
    • Inter fabric link failure
  • Use of SVI/Subinterfaces on BGW

  • Filtering capabilities on BGW: EVPN Route type filtering, e.t.c.

  • North-Bound service placement: Any external service such as firewalls, load balancers, etc., which are generally shared services between tenants.

    • Connected to each border gateway: External services can be connected to each border gateway. For example, VRF-Lite (inter-AS option A) between the border gateway and the external device can leak the external service into tenant networks.
    • Dedicated border gateways: Simply using a dedicated border gateway for external services. External networks are just another tenant network that can be leaked into the appropriate tenants.

Extending with DCI

This option is much simpler and more usable. If there is no need to extend layer 2 or layer 3 between fabrics or sites, then we can simply route between them. But connecting tenant networks between pods is another case. In my opinion, using MPLS for the inter-fabric network together with VPNv4-EVPN stitching solves tenant route leaking between different fabrics. This basically means re-originating EVPN Type 5 routes as VPNv4 routes into the inter-fabric network.

Coming soon.
