September 2020 – dnzydn's blog

After reading the kernel documentation and capturing some packets:

“A problematic outcome of using ARP negotiation for balancing is that each time that an ARP request is broadcast [from the host] it uses the hardware address of the bond. Hence, peers learn the hardware address of the bond and the balancing of receive traffic collapses to the current slave. This is handled by sending updates (ARP Replies) to all the peers with their individually assigned hardware address such that the traffic is redistributed.”

Beware that when you use multiple sub-interfaces on a bond, the sub-interfaces use the same MAC address as the bond interface.

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens3: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP mode DEFAULT group default qlen 1000
link/ether 00:50:00:00:04:00 brd ff:ff:ff:ff:ff:ff
3: ens4: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP mode DEFAULT group default qlen 1000
link/ether 00:50:00:00:04:01 brd ff:ff:ff:ff:ff:ff
5: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 00:50:00:00:04:00 brd ff:ff:ff:ff:ff:ff
6: bond0.101@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 00:50:00:00:04:00 brd ff:ff:ff:ff:ff:ff
7: bond0.102@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 00:50:00:00:04:00 brd ff:ff:ff:ff:ff:ff
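To see at a glance which interfaces share a MAC, the output above can be grouped by address. This is only a rough sketch: the function name `group_by_mac` is my own, and it assumes the two-line `ip link` format shown above (a header line followed by a `link/ether` line).

```python
import re
from collections import defaultdict

# Trimmed copy of the `ip link` output shown above.
SAMPLE = """\
2: ens3: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 master bond0 state UP
link/ether 00:50:00:00:04:00 brd ff:ff:ff:ff:ff:ff
3: ens4: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 master bond0 state UP
link/ether 00:50:00:00:04:01 brd ff:ff:ff:ff:ff:ff
5: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 state UP
link/ether 00:50:00:00:04:00 brd ff:ff:ff:ff:ff:ff
6: bond0.101@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP
link/ether 00:50:00:00:04:00 brd ff:ff:ff:ff:ff:ff
7: bond0.102@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP
link/ether 00:50:00:00:04:00 brd ff:ff:ff:ff:ff:ff
"""


def group_by_mac(ip_link_output):
    """Map each MAC address to the list of interfaces that report it."""
    groups = defaultdict(list)
    current = None
    for line in ip_link_output.splitlines():
        line = line.strip()
        # Header line, e.g. "6: bond0.101@bond0: <BROADCAST,...>"
        header = re.match(r"^\d+:\s+([^:@\s]+)(?:@[^:\s]+)?:", line)
        if header:
            current = header.group(1)
            continue
        # Address line, e.g. "link/ether 00:50:00:00:04:00 brd ..."
        addr = re.match(r"^link/ether\s+([0-9a-fA-F:]{17})", line)
        if addr and current:
            groups[addr.group(1).lower()].append(current)
    return dict(groups)


if __name__ == "__main__":
    for mac, names in group_by_mac(SAMPLE).items():
        print(mac, ", ".join(names))
```

With the sample above, ens3, bond0 and both VLAN sub-interfaces end up under the same address, and only ens4 stands alone.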

When the host sends a broadcast, the source MAC will be, for example, 00:50:00:00:04:00 of bond0. But because of how balance-alb works, some peers may have learned 00:50:00:00:04:01, the MAC of the other bond slave, for the same host. After this broadcast, the peers that were using 00:50:00:00:04:01 will change their entry to 00:50:00:00:04:00.
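The behaviour described in the kernel quote can be pictured with a toy model. This is not how the kernel implements it; the class `AlbBondModel`, its method names and the peer IPs are all made up for illustration, and the round-robin assignment is a simplification.

```python
from itertools import cycle


class AlbBondModel:
    """Toy model of balance-alb receive balancing (illustrative only)."""

    def __init__(self, slave_macs):
        self.slave_macs = list(slave_macs)
        # As in the output above, bond0 shares the MAC of its first slave.
        self.bond_mac = self.slave_macs[0]
        self._next = cycle(self.slave_macs)
        self.assigned = {}   # peer IP -> slave MAC assigned to that peer
        self.peer_view = {}  # peer IP -> MAC the peer currently holds for us

    def reply_to(self, peer_ip):
        """Answer a peer's ARP request with an individually assigned MAC."""
        if peer_ip not in self.assigned:
            self.assigned[peer_ip] = next(self._next)
        self.peer_view[peer_ip] = self.assigned[peer_ip]
        return self.assigned[peer_ip]

    def broadcast_arp_request(self):
        """A broadcast ARP request carries the bond MAC; every peer relearns it."""
        for peer_ip in self.peer_view:
            self.peer_view[peer_ip] = self.bond_mac

    def send_updates(self):
        """Unicast ARP replies restore each peer's assigned MAC."""
        for peer_ip, mac in self.assigned.items():
            self.peer_view[peer_ip] = mac


if __name__ == "__main__":
    bond = AlbBondModel(["00:50:00:00:04:00", "00:50:00:00:04:01"])
    bond.reply_to("192.0.2.10")   # hypothetical peer
    bond.reply_to("192.0.2.11")   # hypothetical peer
    print("balanced: ", bond.peer_view)
    bond.broadcast_arp_request()
    print("collapsed:", bond.peer_view)
    bond.send_updates()
    print("restored: ", bond.peer_view)
```

Between the broadcast and the updates, every peer points at the bond MAC, which is the window where receive traffic collapses onto one link.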

From the switch's point of view:

For example, the host may send packets with source MAC 00:50:00:00:04:00, which is the MAC of bond member ens3 and also the MAC of bond0. Say ens3 is connected to switch port eth1, and ens4 is connected to eth2, or even to another switch (VPC, MLAG, CLAG, VXLAN, etc. in a multi-chassis setup). When a broadcast with source MAC 00:50:00:00:04:00 arrives on any port other than eth1, the switch will raise a MAC flap alert.

Also consider that some devices may not like seeing different MAC addresses for the same IP (even though they may be layer 2 only devices), because the host uses a different MAC for the same IP with different peers.
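The flap can be sketched as a switch MAC table that counts how often an address moves between ports. This is a simplified illustration, not any vendor's logic: `count_mac_moves` is my own name, and real switches also apply timers and thresholds before alerting.

```python
def count_mac_moves(frames):
    """Return {mac: number of port changes} for (src_mac, port) frames in order."""
    table = {}  # MAC -> port where it was last seen
    moves = {}
    for src_mac, port in frames:
        previous = table.get(src_mac)
        if previous is not None and previous != port:
            moves[src_mac] = moves.get(src_mac, 0) + 1
        table[src_mac] = port
    return moves


if __name__ == "__main__":
    frames = [
        ("00:50:00:00:04:00", "eth1"),  # normal traffic from ens3
        ("00:50:00:00:04:01", "eth2"),  # normal traffic from ens4
        ("00:50:00:00:04:00", "eth2"),  # bond MAC showing up on the other port
        ("00:50:00:00:04:00", "eth1"),  # and moving back again
    ]
    print(count_mac_moves(frames))
```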

Some tests: two hosts connected to the same switch. H1 is using bonding with ens3 and ens4 in mode 6.

2: ens3: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP mode DEFAULT group default qlen 1000
link/ether 50:01:00:02:00:00 brd ff:ff:ff:ff:ff:ff
3: ens4: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP mode DEFAULT group default qlen 1000
link/ether 50:01:00:02:00:01 brd ff:ff:ff:ff:ff:ff
4: ens5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
link/ether 50:01:00:02:00:02 brd ff:ff:ff:ff:ff:ff
6: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 50:01:00:02:00:00 brd ff:ff:ff:ff:ff:ff

root@H2:~# arp -a
? (192.168.101.101) at 50:01:00:02:00:01 [ether] on ens3

Broadcasting from H1:

root@H1:~# ping -b 192.168.101.255
WARNING: pinging broadcast address
PING 192.168.101.255 (192.168.101.255) 56(84) bytes of data.

root@H2:~# tcpdump -i ens3 -e
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes
19:30:55.058398 50:01:00:02:00:00 (oui Unknown) > Broadcast, ethertype IPv4 (0x0800), length 98: 192.168.101.101 > 192.168.101.255: ICMP echo request, id 20335, seq 33, length 64

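As seen from tcpdump on H2, H1 sends the broadcast with the MAC of one of its bond members, ens3 (50:01:00:02:00:00), while H2's ARP entry for H1 points to ens4 (50:01:00:02:00:01).

This may cause problems in an EVPN fabric, as the MAC-IP mapping will flap with any broadcast generated. Next test: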
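The mismatch between the ARP entry and the captured frame can also be checked with a few lines of Python. This is a minimal sketch that assumes the `arp -a` and `tcpdump -e` line formats shown above; the helper names `arp_mac` and `source_mac` are mine.

```python
import re

MAC = r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}"

# Lines copied from the H2 output above.
ARP_LINE = "? (192.168.101.101) at 50:01:00:02:00:01 [ether] on ens3"
TCPDUMP_LINE = (
    "19:30:55.058398 50:01:00:02:00:00 (oui Unknown) > Broadcast, "
    "ethertype IPv4 (0x0800), length 98: 192.168.101.101 > 192.168.101.255: "
    "ICMP echo request, id 20335, seq 33, length 64"
)


def arp_mac(arp_line):
    """Extract the MAC from an `arp -a` line, or None if there is none."""
    match = re.search(rf"\bat\s+({MAC})\b", arp_line.lower())
    return match.group(1) if match else None


def source_mac(tcpdump_line):
    """Extract the source MAC from a `tcpdump -e` line, or None."""
    match = re.search(rf"({MAC})\s+(?:\([^)]*\)\s+)?>", tcpdump_line.lower())
    return match.group(1) if match else None


if __name__ == "__main__":
    learned, seen = arp_mac(ARP_LINE), source_mac(TCPDUMP_LINE)
    print("ARP entry:", learned)
    print("frame src:", seen)
    print("mismatch:", learned != seen)
```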

From the EVPN VXLAN perspective: besides the problem of duplicate IP-MAC entries (there are going to be at least two MACs associated with the host's IP address, which may cause problems with some vendors), you may not be able to utilize both links if VRF leaking and an anycast gateway are used.

I have tested this with Arista, leaking the subject host's network to another VRF.

switch-1#sh ip arp vrf TEST
Address Age (sec) Hardware Addr Interface
192.168.101.101 N/A 5001.0002.0000 Vlan10, Ethernet1
192.168.101.103 N/A 5001.0004.0000 Vlan10, not learned

switch-2#sh ip route vrf TEST2

B I 192.168.101.101/32 [200/0] via VTEP 1.1.1.1 VNI 210 router-mac 50:01:00:e5:e3:6a
B L 192.168.101.0/24 is directly connected (source VRF TEST), Vlan10 (egress VRF TEST)
C 192.168.102.0/24 is directly connected, Vlan20

All traffic from the remote VRF into the host's VRF is handled by the local switch, switch-1, which is the default gateway of the host. And the host advertises the address of only one of its bond members towards its default gateway. So inter-VRF traffic will always use a single member link.

https://serverfault.com/questions/734246/does-balance-alb-and-balance-tlb-support-fault-tolerance

https://sort.veritas.com/public/documents/isa/7.3/linux/productguides/html/vcs_access_admg/ch03s02.htm

http://blog.garraux.net/2012/07/data-center-server-access-topologies-part-1-logical-interface-config-on-servers/

I used the eve-ng community version.

Find your eve-ng access interface: ip addr will show your link address. I have used 172.18.XXX for eve-ng access.

ip addr | grep 172.18
inet 172.18.1.113/16 brd 172.18.255.255 scope global pent

Find the DHCP-enabled interface and your DHCP configuration:

more /etc/default/isc-dhcp-server will show you the DHCP-enabled interface. Generally pnet9 is enabled in the eve-ng community version.

more /etc/dhcp/dhcpd.conf

You will see a single subnet used for DHCP assignment. The network of the DHCP-enabled interface should be used here.

subnet 172.16.222.0 netmask 255.255.255.0 {
range 172.16.222.101 172.16.222.199;
option domain-name-servers 8.8.8.8, 4.4.4.4;
option domain-name "lab-int";
option subnet-mask 255.255.255.0;
option routers 172.16.222.1; # dhcp enabled pnet9 address.
default-lease-time 604800;
max-lease-time -1;
}
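Before restarting the DHCP server it can help to check that the range and the router address really sit inside the subnet. A small sketch with Python's standard `ipaddress` module; the function name `check_dhcp_scope` and the wording of the messages are my own.

```python
import ipaddress


def check_dhcp_scope(subnet, range_start, range_end, router):
    """Return a list of problems; an empty list means the scope looks consistent."""
    net = ipaddress.ip_network(subnet)
    start = ipaddress.ip_address(range_start)
    end = ipaddress.ip_address(range_end)
    gateway = ipaddress.ip_address(router)

    problems = []
    if start not in net or end not in net:
        problems.append("range is outside the subnet")
    if start > end:
        problems.append("range start is after range end")
    if gateway not in net:
        problems.append("router is outside the subnet")
    elif start <= gateway <= end:
        problems.append("router address falls inside the DHCP range")
    return problems


if __name__ == "__main__":
    # Values taken from the dhcpd.conf above.
    print(check_dhcp_scope("172.16.222.0/24", "172.16.222.101",
                           "172.16.222.199", "172.16.222.1"))
```

With the values from the config above the list comes back empty. Using the 172.18.x access address as the router would be flagged as outside the subnet.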

Add a NAT rule for packets going out through the outside interface; in my case this is pnet0.

iptables -t nat -A POSTROUTING -o pnet0 -s 172.16.222.0/24 -j MASQUERADE

Then enable IP forwarding in the Linux kernel:

echo 1 > /proc/sys/net/ipv4/ip_forward
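To confirm the setting took effect, the same file can be read back. A tiny sketch; the helper name `forwarding_enabled` is mine, and the `path` parameter exists only so the check can be pointed at another file.

```python
from pathlib import Path


def forwarding_enabled(path="/proc/sys/net/ipv4/ip_forward"):
    """Return True if the file at `path` contains 1 (IPv4 forwarding is on)."""
    return Path(path).read_text().strip() == "1"


if __name__ == "__main__":
    print("IPv4 forwarding enabled:", forwarding_enabled())
```

Keep in mind that writing to /proc does not survive a reboot. The usual way to make it permanent is to set net.ipv4.ip_forward=1 in /etc/sysctl.conf.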

When you connect the interfaces of your devices to the pnet9 network, on which you enabled DHCP and provide NAT service through its router interface, they should be able to reach the outside via NAT. Linux machines with DHCP enabled by default take time to boot, so wait patiently!


linux bonding with balance-alb or bond-mode-6

eve-ng giving internet access to labs via nat