Data Center Network Scaling Issues

As the infrastructure needs more resources, we install more servers, and we need more switches to connect them. Day by day the data center becomes a large, massive physical infrastructure, and you face a lot of problems to solve. As long as you keep your data center network simple, you stay on the safe side, where you only build a layer 3 network. But using a layer 3 only data center, without broadcast domains, needs a lot of improvement on the server, service, and even application side. If you are a cloud company that also needs tenancy in its network, this becomes even harder. The basic option, and where everyone starts, is the legacy data center.

For any kind of network, all options should be evaluated against these criteria:

  • Scalability
  • Resiliency
  • Redundancy
  • Complexity

Before diving into the details of data center scaling, we will look into design options for data centers, or data center fabrics.

Legacy data center

A legacy data center is simply a logical switch with a massive number of ports. The design builds a large broadcast domain, where a centralized device, a firewall or a router, is used for routing (router on a stick), etc.

The challenges and problems of legacy data centers are well known and can easily be found on the Internet. In summary:

  • Use of large broadcast domains.
  • BUM handling.
  • Centralized routing.
  • Layer 2 tenancy based on VLANs.
  • Convergence and failure problems.

Thus, using a layer 2 data center should be avoided and, if possible, broadcast domains should be eliminated. The need for broadcast domains depends on the server infrastructure and any overlays used. Generally, the need for layer 2 is either VM mobility without an IP change, or services with burned-in IP addresses, etc. It may also depend on the services or protocols they use, like multicast, etc. Building a layer 3 only data center will be investigated in another document.

By the way, if it is possible, please avoid using overlays :) and build more intelligent applications :)

Single data center design with VXLAN

The problems of legacy data centers are generally solved by using VXLAN and EVPN, which introduce underlay and overlay networks. VXLAN is used for the data path, and EVPN is used as the control plane.

VXLAN is an encapsulation method.

Critical points about VXLAN encapsulation are:

  • It is not a tunneling protocol. It is a uni-directional forwarding method for data; there is no session establishment.
  • It extends broadcast domains, mapped from VLANs, across the layer 3 underlay.
  • It supports entropy for the encapsulated data (the outer UDP source port is derived from the inner flow).
  • It forwards Ethernet traffic, including BUM. Link-local protocols like Spanning Tree and LLDP are not encapsulated and forwarded, so VXLAN-capable devices do not join spanning tree domains together.

But it needs a control plane, which is where EVPN comes into play. EVPN also adds support for tenancy, distributed routing, multihoming, etc.
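To make these terms concrete, here is a minimal, hedged sketch of how a VLAN and a tenant VRF could be mapped to VNIs on an Arista-style VTEP. The names, VLAN and VNI numbers are illustrative assumptions, not a reference design:

vrf instance TENANT-A
!
vlan 10
   name TENANT-A-WEB
!
interface Vxlan1
   vxlan source-interface Loopback1   # VTEP address used as the outer source IP
   vxlan udp-port 4789
   vxlan vlan 10 vni 10010            # L2 VNI: bridges VLAN 10 over the layer 3 underlay
   vxlan vrf TENANT-A vni 50001       # L3 VNI: used for routed tenant traffic (symmetric IRB)

The L2 VNI carries bridged traffic for the VLAN across the underlay, while the L3 VNI is what symmetric IRB uses for routed traffic between tenant segments.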

We should consider the subjects below while building VXLAN and EVPN based data centers:

  • Underlay network protocol usage: Do not forget that both approaches below (IGP plus IBGP, or EBGP everywhere) work well, and the answer changes from engineer to engineer. But try to stay as simple as possible; this is one of the key points, as complexity becomes a problem when deploying a massive number of switches.
    • Convergence: Try to stay within a convergence time that is reasonable for your applications. Test your spine failures: how fast the spines can establish neighborships with a large number of leafs during boot-up, how fast they can send updates to a large number of leafs, and what the convergence time of a spine failure is under a large number of prefixes and peers.
      • Leaf-spine link failure detection: Use BFD if possible.
      • Local protocol changes after link failure: In a redundant environment, only the underlay path should be affected. Try to choose an underlay protocol that is less affected by link failures, and rely on ECMP in the underlay.
      • Updates: Tune the timers. Prefix updates are also an issue with BGP; the BGP update timer matters. When a spine loses connectivity to a leaf, it needs to send updates and withdraws for the prefixes behind that leaf, which depend on the update timer. The number of prefixes and the number of leafs that need to receive those updates also affect convergence. With multiple spines, a single leaf-link failure only results in the removal of that spine from the ECMP group associated with the leaf switch address.
      • Remote protocol changes: Timers to detect the protocol status of the spines are also needed on the leafs to remove the failed path from the underlay ECMP. As most vendors use ECMP groups (next-hop groups), removing a next hop from the group is fast enough.
    • IGP + IBGP: The underlay relies on an IGP, and IBGP is used for the EVPN overlay. IGPs are long proven for fast convergence and failure scenarios. I prefer to use ISIS because:
      • ISIS advantages as the IGP (generally, using level 2 only will be enough):
        • It runs over CLNS, so there is no IP dependency at the link level.
        • No full SPF calculation on every link change, only when the topology changes.
        • Fast re-convergence.
      • IBGP is well suited to EVPN in a CLOS design. But using multiple protocols adds complexity in troubleshooting and scaling, and increases the probability of hitting bugs.
    • EBGP: EVPN originated with IBGP. When using EBGP, you need to solve problems like fast convergence for the underlay, keeping the next hop unchanged on the path between leafs, route-target handling, the loop-prevention mechanism, etc. BGP path hunting is another issue: there is no layer between spine and leaf, which means unnecessary updates travelling spine->leaf->spine. Fast convergence with BGP alone is not ideal; timers should be adjusted and the use of BFD should be considered. On the other hand, using a single protocol has many advantages, from troubleshooting, operations, and IP usage to the probability of hitting bugs, etc. An EBGP design has two options (a hedged configuration sketch of the second option appears later in this section):
        • Direct underlay peering, direct overlay peering: Connected interfaces are used for the EBGP sessions between CLOS levels, including EVPN. Overlay peering goes down if the direct link fails, triggering routing updates inside the fabric and next-hop scans. A link failure therefore triggers routing updates not only for the underlay but also for the overlay. A single session for everything means more complex troubleshooting, and you cannot change the NLRI next hop for the overlay.
        • Direct underlay peering, loopback overlay peering: Connected interfaces are used for the IPv4 sessions, and loopback interfaces are used for the EVPN EBGP sessions. The IPv4 session is used to distribute the EVPN loopback addresses and the VTEP source addresses, which can be the same. Using separate loopbacks for the EVPN session and the VTEP has some failure-handling advantages worth considering. A link failure will not affect the overlay peering as long as both ends of the session stay reachable through their loopbacks, so the overlay is more durable during link failures and there is no overlay update propagation. The cost is more IP addresses, more complex configuration, and more complex automation.
  • Tenant usage for the overlay network:
    • Inter-tenant routing: Basically route leaking between L3 VRFs. Consider symmetric anycast routing, which does not require configuring the leaked tenant on every local leaf switch. This is an advantage when leaking external services: you define the tenant VRF only on the border leafs (the leafs where external services are connected). VRF leaking may not be supported with asymmetric IRB.

    However, this leads to scaling issues with host routes. With symmetric routing, remote host routes are installed in the VRF routing table, and if you are using VRF leaking, these host routes are installed in every VRF table that imports them, multiplying resource usage for every VRF involved. Asymmetric IRB, on the other hand, increases local ARP table usage. Take care when using either implementation with each vendor and test the scale numbers. ECMP is a problem with some chipsets, and you may hit maximum hardware numbers, especially when using EVPN ESI.

  • Maximum number of ECMP or ECMP next-hop groups: This is not directly related to the design, but it may be a problem, as most chips have a scale limit for it. Consider:
      • The maximum number of L3 ECMP next hops for your protocol, for the overlay and for the underlay. Especially if you are using multiple L3 paths between spine and leaf because of bandwidth requirements, this may be a scaling problem, as spine switches often have fewer resources than leafs.
      • Try to understand next-hop group usage when VRFs are used for tenants, especially when leaking between them.
  • North-south, external service integration:
    • Dynamic routing support:
    • Redundancy and convergence Options:
    • Scaling options:
  • BUM replication: Ingress replication or multicast. Multicast is best suited for large fabrics, where ingress replication may become too heavy with many VTEP peers.
  • Broadcast domain size: A large broadcast domain is a problem not only for the underlay but also for the overlay and the servers. If centralized routing is used, you hit the BUM limits of the devices' line cards. The same problem also hits the servers: the kernel, i.e. the server's CPU, handles BUM traffic. Most kernels' default ARP cache sizes are small compared to network devices, which results in many ARP requests when the table is at its limit, and even latency during server-to-server communication when ARP records are deleted.
  • Failure scenarios:
    • Failures on all CLOS levels.
      • Convergence during and after failures.
      • Multihoming failures: ToR switch pair failures, uplink failures, host link failures, and failures of the protocol used for multihoming. Watch especially for black-holing of host traffic during failures.
    • Hardware failures
      • Power supply, fan, etc.
  • Hardware limitations:
    • Size of the forwarding and routing tables: MAC tables, ARP, RIB, ECMP, FIB, etc.
    • Size of a single control/data domain, the EVPN domain.
      • Encapsulation limits: maximum number of VTEP peers.
      • BUM replication limits, use of multicast for BUM.
      • ARP scaling of the hardware.
      • Scale for inter-tenant usage (VRF leaking).
      • Scale for control plane operations: maximum number of ARP requests, storm control levels, etc.
  • Complexity issues: More servers and more switches in a single domain, pod, or cluster lead to operational complexity. It also results in a single large failure domain (blast radius). Splitting the resources into zones or pods MUST be considered.
  • Scale of VXLAN and EVPN based data center

    The Problem

    Most switch hardware has offloading capabilities for better latency, which requires particular chipsets. One critical number is the maximum number of VTEPs (VXLAN destinations) supported by the chipset. This number determines how big a single EVPN domain, data center, or broadcast domain can be. As it determines the maximum number of switches, it also determines the maximum number of servers, and it gets lower in dual-homed host scenarios.

    What if you want to grow beyond that number? Before diving into solutions, let's talk a bit about getting bigger. How large will your broadcast domain be? Theoretically, you can have a single broadcast domain with thousands of hosts, as long as you have an IP prefix of sufficient length. But practically, you have limited resources on your servers: limited ARP tables, limited resources for responding to broadcast packets. As the network grows, you hit a lot of BUM problems.

    Returning to our problem, suppose that we really do want to add more hosts to our existing broadcast domains.

    The solution is VXLAN gateways. They can decapsulate and re-encapsulate VXLAN packets and connect VXLAN domains; put more simply, they are VXLAN proxy devices. If we separate our broadcast domains into multiple sub-domains and connect them using these VXLAN gateways, we can increase the number of switches.

As seen in the topology, when a left-side switch needs to send a packet to one of the right-side switches, it forwards the packet to its border gateway device, so it does not need VTEP peering with all of the right-side leaf switches. This solves the maximum number of VTEP peers for a single switch, and theoretically you can add more switches into a single broadcast domain. This method is called EVPN Multi-Site.
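Before moving on to growing beyond a single fabric, here is a minimal, hedged Arista-style sketch of the peering option referred to earlier (direct EBGP underlay peering on the fabric links, EBGP EVPN overlay peering between loopbacks), seen from a leaf. All interface names, AS numbers and addresses are illustrative assumptions, not a reference design:

interface Ethernet1
   description SPINE1
   no switchport
   ip address 10.10.1.1/31              # point-to-point underlay link
!
interface Loopback0
   ip address 10.0.0.11/32              # EVPN overlay peering address
!
interface Loopback1
   ip address 10.0.1.11/32              # VTEP source address (could also be Loopback0)
!
router bgp 65101
   router-id 10.0.0.11
   maximum-paths 4 ecmp 4
   neighbor 10.10.1.0 remote-as 65001   # underlay: EBGP on the connected link
   neighbor 10.10.1.0 bfd               # fast link/neighbor failure detection
   neighbor 10.0.0.1 remote-as 65001    # overlay: EVPN session to the spine loopback
   neighbor 10.0.0.1 update-source Loopback0
   neighbor 10.0.0.1 ebgp-multihop 3
   neighbor 10.0.0.1 send-community extended
   address-family ipv4
      neighbor 10.10.1.0 activate
      network 10.0.0.11/32
      network 10.0.1.11/32
   address-family evpn
      neighbor 10.0.0.1 activate

With this split, a single leaf-spine link failure only removes one next hop from the underlay ECMP group, while the overlay session stays up over the remaining paths.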

The problem may be solved more simply by eliminating the need to stretch the broadcast domain and layer 3 while growing; it is just like dividing your fabric into large enough units. If you do not stretch layer 2 and layer 3 (i.e., use different layer 3 prefixes in each sub-fabric or pod), things are much simpler: you just route between the pods. The only remaining problem is connecting tenants.

Extending with EVPN Multi-Site.

EVPN Multi-Site architectures can be used to extend a single data center fabric. There are a couple of options to be considered.

Multi-Site has the best flexibility among them. There are a couple of IETF drafts on the subject; the most important are the Sharma and BESS drafts. Sharma focuses on a single control plane and a single data plane (VXLAN), whereas BESS is more generic. Both of them use border gateways, or border devices, in each fabric.

Sharma requires the use of EVPN for both the inter-fabric and intra-fabric networks. With BESS, however, you can use MPLS, etc. in the inter-fabric network, and a couple of other encapsulations besides VXLAN.

Leaving the IETF drafts aside, let's investigate the problem.

There are two network types in EVPN Multi-Site:

  • Intra-fabric: A single fabric network. See the considerations for a single EVPN VXLAN fabric above.
  • Inter-fabric: The fabric interconnect network where the border gateways are connected, commonly called the DCI network.
Challenges with EVPN multi-site
  • Physical Challenges: How can we connect border gateways?

    • Full mesh: Direct physical connections between the border gateways of every pair of fabrics. If we use redundant border gateways, this dramatically increases the number of required ports on each border gateway. Moreover, each new fabric adds ports on every existing border gateway, and the total number of inter-fabric links grows quadratically with the number of fabrics.

    • Hub and spoke/centralized: Border gateways are connected over a central device.

 

  • Control plane challenges: There are a couple of encapsulation options to use between fabrics, like MPLS, Geneve, etc. We will focus on using VXLAN and EVPN between fabrics, keeping the logical path between fabrics simple by using the same method as inside each fabric. For the control plane, EBGP is used between fabrics. Using a different AS number in each fabric simplifies control plane operations. We assume that each border gateway has its own AS number, and also that we do not reuse the same AS number inside each fabric; otherwise, we must work around BGP AS-path loop avoidance to receive the remote site routes. As with the physical connection options, the control plane sessions can be full mesh or hub and spoke, like IBGP route reflectors. Using EBGP gives the same options as in the intra-fabric case: direct overlay peering or loopback overlay peering.

    • Full-mesh BGP sessions: Each BGW establishes an EBGP session with every other BGW. We will not focus on the details of this design, as it adds complexity and is not meaningful when the hub and spoke/centralized physical design is used.
    • Centralized BGP sessions: While using a centralized physical network, full-mesh EBGP sessions can still be built between BGWs using EBGP multihop, etc., but then we do not get all the benefits of the central physical network. As with the intra-fabric control plane, we have options such as direct underlay peering or loopback overlay peering.
    • Use of route targets across fabrics is another issue. From the control plane point of view a BGW is just a leaking device, and leaking is based on route targets. Using auto route targets complicates leaking routes as the same tenant network, because the L2 VRFs and L3 VRFs end up with different route targets. Using the same route targets does the job better if the tenancy configuration is identical in all fabrics; if it differs, manual configuration becomes more complex and challenging, as the BGW configurations will not be the same. You have to configure only the BGWs of the fabrics where the VRFs are defined. These options require retaining the route targets when sending updates between BGWs.
      • EVPN next hop: This should remain unchanged.
      • Route targets: If each site uses different route targets for each L2 and L3 VRF, there is a problem using them on the border gateways, especially with auto route targets. This problem gets more complicated if route servers are in use.
  • Border gateway placement: BGWs can be deployed as standalone leafs, or on the spine or super-spine switches. However, since they will hold all the intra-fabric information, they need to scale like a leaf in the fabric: MAC addresses, L2 routes, L3 routes, etc. Most vendors' spine switches generally do not have this scale, as spines do not participate in the overlay. You also need to configure all the L2 and L3 configuration for the overlays that you will extend between fabrics.

  • BUM replication: BGWs must support both the intra-fabric and the inter-fabric BUM replication methods, which may be different.

  • Use of redundant border gateways for a single fabric:

    • Anycast BGW or MLAG/VPC BGW:
    • DF election for a VNI: Do not forget that DF election is based on the VNI, not the VLAN.
  • Connecting hosts to border gateways: single-homed hosts, dual-homed hosts. Dual-homed hosts may require the use of SVIs!

  • Failure Scenarios on border gateways:

    • Failure scenarios related to using redundant border gateway
    • Inter fabric link failure
  • Use of SVI/Subinterfaces on BGW

  • Filtering capabilities on BGW: EVPN route type filtering, etc.

  • North-bound service placement: Any external service like firewalls, load balancers, etc., which are generally shared services between tenants.

    • Connected to each border gateway: External services can be connected to every border gateway. For example, with VRF-lite (inter-AS option A) between the border gateway and the external device, the external service can be leaked into the tenant networks.
    • Dedicated border gateways: Simply using a dedicated border gateway for the external services. The external networks are just another tenant network that can be leaked into the appropriate tenants (see the sketch after this list).
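As a rough illustration of that model, the hedged sketch below shows a shared external-services VRF being leaked into a tenant VRF with route targets on an Arista-style border leaf. The VRF names, route distinguishers and route targets are assumptions; a real design needs a planned route-target scheme across all fabrics:

router bgp 65101
   vrf EXTERNAL                            # shared services VRF (firewall, internet access, ...)
      rd 10.0.0.11:500
      route-target export evpn 65000:500
      route-target import evpn 65000:500
      route-target import evpn 65000:101   # pull TENANT-A prefixes into the shared VRF
   vrf TENANT-A
      rd 10.0.0.11:101
      route-target export evpn 65000:101
      route-target import evpn 65000:101
      route-target import evpn 65000:500   # leak the shared/external services into the tenant

The same import/export pattern is effectively what a BGW performs between fabrics, which is why a consistent route-target scheme makes the configuration much simpler.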

Extending with DCI

This option is much simpler and more practical. If there is no need to extend layer 2 or layer 3 between fabrics or sites, then we can simply route between them. Connecting tenant networks between pods is a different matter, though. In my opinion, using MPLS for the inter-fabric network and VPNv4/EVPN stitching solves tenant route leaking between different fabrics: EVPN Type 5 routes are basically re-originated as VPNv4 routes into the inter-fabric network.

Coming soon.

External Services for EVPN VXLAN

In this article I will investigate external services for a data center fabric where:

  • EVPN is used for the control plane and VXLAN for the data plane.
  • Multi-tenancy is configured with L3 VRFs.
  • A symmetric anycast gateway is configured for all tenant VLANs for better scalability, with distributed routing over the data center.

Installing an external service is just about installing a type-5 route in the appropriate tenants' routing tables.
I will focus on installing a firewall for the tenant networks' internet access. The firewall will announce the default route towards the fabric switches. Physical connection and topology:

  • Dual firewalls for redundancy: a single firewall cluster in an active-passive setup.
  • Dual fabric edge devices for redundancy: two border leaf switches are used. EVPN ESI or multi-chassis options can be used.

Let's first look at the active-passive firewall setup. Both firewalls must have identical configurations. During a failover, the passive firewall takes over all control-plane and data-plane traffic. If there is a BGP session or any other layer 3 routing adjacency with the remote devices, those neighborships must move to the second device. However, most vendors do not replicate network control-plane state for routing protocols such as OSPF or BGP to the passive device, and it is not even feasible with short hello timers. This results in session re-establishment for most of the network protocols.

As both firewalls have identical configurations, the remote device has to have an identical layer 3 configuration on its connected interfaces towards both firewalls.

This requirement can only be met by using logical interfaces for the layer 3 configuration, so plain layer 3 (routed) interfaces cannot be used: two routed interfaces on the remote device cannot have an identical layer 3 setup.

Both ports have to be switchports, and an SVI is configured to establish the layer 3 adjacency with the firewall.

In that case, after a failover, the routing adjacencies can be re-established between the newly active firewall and the remote device. Graceful-restart capability can be considered, though this can conflict with the use of BFD, which helps with sub-second failure detection.

Now, if we want more redundancy, we can add a second remote device.

It is obvious that we have to use SVI interfaces on the remote devices. But we have two options for connecting to the remote devices:

  • Both interfaces on the active firewall can be dedicated interfaces with different layer 3 configurations. This requires a second layer 3 routing adjacency over the second interface, even if you use only one remote device (border leaf).
  • An interface bundle can be used for the two interfaces.

Using separate links towards the remote devices

From a layer 3 perspective, you can use the second adjacency as a backup link by using AS-path prepending or other methods. But you can also use ECMP by installing both adjacencies' routes on the firewall and on the remote devices, which utilizes both links and gives better convergence during a failover, since the devices do not have to update their RIB (most firewalls do not have additional-path support). From the EVPN perspective, the tenant routing instance has to install both routes from the border leafs, which is the default behavior for most implementations, as the fabric tends to use ECMP on all paths.
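For the backup-link variant mentioned above, a hedged sketch of AS-path prepending on the border leaf that should act as backup could look like the following; the route-map name, addresses and AS numbers are hypothetical:

route-map DC_FW_BACKUP_OUT permit 10
   set as-path prepend 65101 65101 65101   # advertise fabric routes with a longer AS path on the backup adjacency
!
router bgp 65101
   vrf FIREWALL
      neighbor A.B.D.4 remote-as 65201     # firewall peering on the backup adjacency
      neighbor A.B.D.4 route-map DC_FW_BACKUP_OUT out

For the return direction, the firewall can prepend its own advertisements on the backup adjacency in the same way; with the ECMP approach, both adjacencies are simply left equal.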

Using interface bundling

It is obvious that the remote side of the firewall bundle also has to be configured for bundling, using multi-chassis link aggregation (VPC, MLAG, CLAG, etc.). This solves the design from the layer 2 perspective. But what about layer 3? How are we going to handle two routing adjacencies? What will the traffic flow look like?

As we are using a bundle interface, traffic will be forwarded over both links using ECMP. This means the firewall can send traffic destined to border leaf 1 towards border leaf 2. Border leaf 2 will then switch the traffic over to border leaf 1, because the destination layer 2 address of the packet is border leaf 1. This happens whether a single path or both paths from the border leafs are used for layer 3 ECMP or fast convergence; in either case, both border leafs will receive traffic destined to the other border leaf.

Switching traffic to the remote border leaf requires, or causes, some problems depending on the implementation.

If EVPN ESI is used, the cross traffic between border leafs is switched using VXLAN. Since the traffic from the firewall is itself destined towards a host on a leaf, it is then switched again into another VNI. This requires VXLAN decapsulation and re-encapsulation on the receiving border leaf, i.e., VXLAN border gateway functionality.

If a multi-chassis protocol is used, traffic will be switched over the peer link towards the remote border leaf, where it will be encapsulated towards a leaf. But this may be a problem depending on the vendor. For example, on Cumulus this traffic is treated as unknown unicast and may overwhelm orphan hosts connected to the border leafs (https://www.redpill-linpro.com/techblog/2018/02/26/layer3-cumulus-mlag.html).

There is another obvious drawback: traffic traverses both border leafs, adding more latency.

However, there is a solution. The problem is that the destination layer 2 address is different for the two paths received from the border leafs. Using the same virtual MAC for both paths' next hops is the solution. This can be achieved with two methods, one of which is problematic.

The first one is to use a virtual IP address, as in the anycast gateway implementation on EVPN fabrics. This feature uses the same virtual MAC for the anycast gateway IP address on both pairs, while different IP addresses are used on the two switches as the virtual IP / anycast gateway address for that SVI. This solves the traffic flow: the next hop may be different on each path, but the destination MAC of the next hops is the same. Both border leafs will consider the traffic destined to themselves and switch the packet directly to the destination leaf. However, this method is not the desired one, as the configuration causes problems with some vendors' anycast gateway implementations: the same IP address would also be used as the BGP peering address, and even BFD will fail.

The more optimal solution is to use different IP addresses on the SVIs, but set the next hop advertised to the firewall to a virtual IP address configured as the anycast gateway on the border leafs.

That way, the destination MAC of any packet from the firewall to the border leafs is handled locally, since the DMAC is the virtual MAC shared between them, identical to the anycast gateway implementation.

During my tests this method gives around 200-300 ms convergence. Using a dedicated link on the firewall, as in the first method, gives around 600 ms to 1 s failover time, because it includes updating the L3 table (installing the second route) and then updating the L2 table. On the firewall side, if it supports pre-negotiation of LACP on the passive unit, this improves convergence. Without this support, when the active firewall fails the passive one establishes LACP from scratch, which dramatically affects your convergence time; this issue does not exist with the first solution, dedicated links between the firewall cluster and the border leafs.

Also, if you enable add-path support in the appropriate tenant VRFs, failover during a border leaf failure will be much faster. Below is a sample configuration for Arista border leafs:
# ESI and layer 2 configuration, which should be the same on both devices.

vlan X
   name DC_FIREWALL_VLAN
   trunk group DC_FIREWALL
!
interface Port-ChannelX
   switchport mode trunk
   switchport trunk group DC_FIREWALL
   evpn ethernet-segment
      identifier X
      route-target import X
   lacp system-id X
!
interface Ethernet...                   # Firewall cluster interfaces, which are the same on all of them
   switchport mode trunk
   switchport trunk group DC_WAN_EDGE
   channel-group 13 mode active
!
ip virtual-router mac-address ....      # Must be the same on both border leafs; used as the MAC address of the virtual IP address
!
interface VlanX
   vrf DC_FIREWALL
   ip address A.B.C.1/29                # .2 is border-leaf two's IP address on the same SVI
   ip virtual-router address A.B.C.3    # MUST be the same on both border leafs
!
route-map DC_FW_OUT permit 10
   set ip next-hop A.B.C.3
!
router bgp X
   ..underlay configuration
   vrf FIREWALL                         # L3 VRF configuration; organize a route-target structure to reach the appropriate tenant VRFs
      ...
      neighbor A.B.C.4 ...              # firewall IP address on that SVI; enable graceful-restart and BFD for faster convergence,
                                        # and apply route-map DC_FW_OUT outbound to set the next hop of the advertised routes
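For the add-path support mentioned above, something along the following lines can be added in the tenant VRF. Treat the exact keywords as an assumption to verify against your EOS release; BFD on the firewall neighbor is included because the comments above call for it:

router bgp X
   vrf FIREWALL
      bgp additional-paths receive    # assumption: accept multiple paths for the tenant prefixes
      bgp additional-paths send any   # assumption: advertise them if the peer supports add-path
      neighbor A.B.C.4 bfd            # fast failure detection towards the firewall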

This bundling setup can also be applied to any other external service, for example a Linux machine running bird for EBGP (or you can use static routing). You can use a dedicated L3 VRF for the load balancers and then import the load balancer networks into the appropriate tenant networks. This traffic will be routed towards both border leafs. The setup is much simpler when using BGP on the load balancer. With static routing, be sure to set the next hop of the route on the load balancer to the virtual IP address on the border leafs.

With EBGP you can use multiple load balancers and load-balance incoming traffic towards the load balancers' internet service prefixes.

VXLAN and EVPN

A great blog from Toni Pasanen:

https://nwktimes.blogspot.com

He also has a book about VXLAN.

Presentations

1 – Building Data Center Networks with Overlays (VXLAN/EVPN & FabricPath) – BRKDCT-3378

2- L4-L7 Service Integration in Multi-Tenant VXLAN EVPN Data Center Fabrics – BRKDCN-2304

3- Building DataCenter networks with VXLAN BGP EVPN – FLPDCN-3378

4 -Real world EVPN-VXLAN deployment and migration – Lessons from the trenches – BRKDCN-2020

5 – VXLAN BGP EVPN based Multi-POD, Multi-Fabric and Multi-Site – BRKDCN-2035

My short notes:

I do not aim to repeat here what is already in the presentations!

Why does host communication need a layer 2 address? You need to understand the OSI layers well. It comes from the standards and from how they are used: to decode the data arriving at a host, the parties have to agree on a communication standard. Where does the data start, what is each bit for? Every communication layer has its own addressing method. The layer type used below layer 3 (Ethernet, ATM, Frame Relay, etc.) does not force you to rewrite your communication protocol or your application, and likewise, when layer 3 changes, layer 2 is not affected.

The opposite is more likely, and it depends on how you write your application or which service you use. If your environment is a small closed network, you can use only layer 2 addresses; you could even write your own communication method entirely.

So why is segmentation needed? The general answer is to control and limit the traffic between hosts and to limit the flood traffic inside the layer 2 domain in question. Splitting the network is done at L2 and L3: at L2 by splitting into VLANs, at L3 by using routing domains (VRF/IRB, etc.). It may also be necessary to block traffic between hosts within the same segment.

What is a CLOS network? In short, it can be described as building a non-blocking network between the hosts or services to be reached. More on CLOS networks?

Using a CLOS/fabric design, in other words spine/leaf, a network with very high capacity can be built, and there is almost no limit to its growth. Here, the leafs are the devices your hosts are connected to, and a leaf is already a non-blocking switch internally. The spines are used for communication between the leafs. When there is more than one spine, each leaf's connection to every spine provides the redundancy required by the design. Beyond these main roles, a super spine connects the spines to each other, a service leaf connects provided services (firewalls, etc.) to the fabric network, and a border leaf is the fabric's connection to the outside world. Ordinary leafs are identical to each other apart from their port configuration.

What are overlay and underlay? An overlay network can be thought of as a kind of virtual network. The aim is basically to build our own network on top of a network that we do not control, or that we do not consider secure even if we do. VPN services are a common example. This way you can build your own network independently of the network underneath. Connectivity within your own network can be provided by various tunneling mechanisms or by MPLS VPN services. Here, your own network is called the overlay network, and the network it is built on is called the underlay network.

Similarly, the communication protocol used between the devices of a CLOS/fabric network is called the underlay control plane, and its purpose is to learn the CLOS/fabric topology. How BUM traffic (broadcast, unknown unicast and multicast) between hosts is handled is also part of this. The overlay control plane, on the other hand, is responsible for locating hosts, distributing that information to the other devices, and taking over the role of the layer 2 ARP mechanism.

With EVPN, the layer 2 and layer 3 addresses learned by the VTEPs are distributed to the other overlay endpoints. This is carried in EVPN NLRI, and the NLRI also conveys which encapsulation should be used. With VXLAN, the VLAN ID no longer has any meaning in the overlay network; what matters is the VNID.

Gateway usage is another important topic. The classic method is to create the SVI/default gateway for a VLAN on all switches and provide redundancy with an FHRP, but in that case all switches have to use the same MAC address. If more than one SVI and VLAN are defined, then for inter-VLAN communication the VLANs must be defined on all transit paths, which usually means on all switches. As an alternative, a distributed gateway is used: an L3 VNI is set up for communication between VNIs, and the other VNIs communicate over it. An important detail is the source and destination MAC address handling: when hosts in different VNIs communicate with each other, the source MAC address must be rewritten to be the destination host's gateway!
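As a hedged, Arista-flavoured illustration of such a distributed anycast gateway (all values are assumptions), the same virtual MAC and gateway address exist on every leaf, and the tenant's L3 VNI carries the routed traffic between VNIs:

ip virtual-router mac-address 00:1c:73:aa:bb:01   # same anycast MAC on every leaf
!
interface Vlan10
   vrf TENANT-A
   ip address virtual 10.1.10.1/24                # same distributed gateway address on every leaf
!
interface Vxlan1
   vxlan vlan 10 vni 10010
   vxlan vrf TENANT-A vni 50001                   # L3 VNI used when routing between VNIs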

How the services provided in such a network, and the exit point from the fabric, are implemented is a matter of choice: transparent (in-line firewall, etc.) or routed (static or dynamic), inter-VRF or intra-VRF (filtering or services even between hosts in the same VLAN).

  • Tenant Edge Services Inter VRF :

    Filtering/policy enforcement between Tenants and external world

  • Inter Tenant Services Inter VRF :

    Filtering/policy enforcement between Tenants

  • Intra Tenant Services Intra VRF/Inter-VLAN:

    Filtering/policy enforcement between and within segments of a tenant

With VXLAN/EVPN, each host is present as a route at its ingress point, in the routing table. In addition, since a hash is computed over the inner data's L4 tuple and placed in the outer header (the UDP source port), flow-based load balancing towards a service becomes automatic.

In addition, multi-pod and multi-site deployments together with EVPN make it possible to block unnecessary BUM traffic between sites. The general design is to connect the sites to each other using BGWs. Egress from a site is easy from a redundancy point of view; for the return traffic, /32 host announcements may need to be passed to the upstream service provider.