Linux kernel tuning for network performance

–  Reducing SYN+ACK retries seems sensible: right now it waits about 3 s and then 7 s (the retransmission backoff runs 1s, 3s, 7s, 15s, 31s). Could it be lower?
net.ipv4.tcp_syn_retries = 2 ## may be lower
 –  net.core.somaxconn and net.ipv4.tcp_max_syn_backlog:
  If an application calls listen() with a backlog value larger than net.core.somaxconn, then the backlog for that listener will be silently truncated to the somaxconn value.
Tune and check them both (see the sketch after this list).
 – net.ipv4.tcp_fin_timeout = 5 : related to this, check whether net.ipv4.tcp_tw_reuse is enabled.
 – net.ipv4.tcp_fack : disable? (Forward ACK, not FIN; deprecated on recent kernels.) Do we still need it?
 – net.ipv4.tcp_sack : enable
 – net.ipv4.tcp_moderate_rcvbuf and net.ipv4.tcp_window_scaling : enable
 – /sys/class/net/<device>/queues/<rx-queue>/rps_cpus : check : https://documentation.suse.com/sles/12-SP4/html/SLES-all/cha-tuning-network.html#sec-tuning-network-rps
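
A minimal sketch for inspecting the current values before changing anything (eth0 and rx-0 are example names):

# Read-only checks; sysctl accepts multiple keys at once
sysctl net.ipv4.tcp_syn_retries net.core.somaxconn net.ipv4.tcp_max_syn_backlog
sysctl net.ipv4.tcp_fin_timeout net.ipv4.tcp_tw_reuse
sysctl net.ipv4.tcp_fack net.ipv4.tcp_sack
sysctl net.ipv4.tcp_moderate_rcvbuf net.ipv4.tcp_window_scaling
cat /sys/class/net/eth0/queues/rx-0/rps_cpus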

Recommended settings

# NIC settings

  # external links:

    # https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/performance_tuning_guide/s-network-common-queue-issues

    # https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html/ovs-dpdk_end_to_end_troubleshooting_guide/

high_packet_loss_in_the_tx_queue_of_the_instance_s_tap_interface

    # https://www.coverfire.com/articles/queueing-in-the-linux-network-stack/

  # transmit queue: do not touch; change only if drops occur.

    # ip link set $NIC txqueuelen 5000

    # ip -s -s link ls dev $NIC or: ethtool -S ens2f1 | grep drop

  # receive queue: do not touch; change only if drops occur.

   # ethtool -g ethX to view ring sizes; ethtool -G ethX (--set-ring) to change them


    # ip -s -s link ls dev $NIC or: ethtool -S ens2f1 | grep drop
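
If the drop counters keep increasing, a sketch of the usual sequence (ethX and 4096 are placeholders; check the hardware maximum first):

# Show hardware maximums vs. current ring sizes
ethtool -g ethX
# Raise the RX/TX rings toward the reported maximums (example values)
ethtool -G ethX rx 4096 tx 4096
# Re-check the drop counters afterwards
ethtool -S ethX | grep -i drop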

# kernel queue settings, backlog and somaxconn:

How TCP Sockets Work

In this post I’m going to explain at a high level how the TCP/IP stack works on Linux. In particular, I’ll explore how the socket system calls interact with kernel data structures, and how the kernel interacts with the actual network. Part of the motivation for this post is to explain how listen queue overflows work, as it’s related to a problem I’ve been working on at work.

How Established Connections Work

This explanation will be from the top down, so we’ll start with how already established connections work. Later I’ll explain how newly established connections work.

For each TCP file descriptor tracked by the kernel there is a struct tracking some TCP-specific info (e.g. sequence numbers, the current window size, and so on), as well as a receive buffer (or “queue”) and a write buffer (or “queue”). I’ll use the terms buffer and queue interchangeably. If you’re curious about more details, you can see the implementation of socket structs in the Linux kernel’s net/sock.h.

When a new data packet comes in on the network interface (NIC), the kernel is notified either by being interrupted by the NIC, or by polling the NIC for data. Typically whether the kernel is interrupt driven or in polling mode depends on how much network traffic is happening; when the NIC is very busy it’s more efficient for the kernel to poll, but if the NIC is not busy CPU cycles and power can be saved by using interrupts. Linux calls this technique NAPI, literally “New API”.

When the kernel gets a packet from the NIC it decodes the packet and figures out what TCP connection the packet is associated with based on the source IP, source port, destination IP, and destination port. This information is used to look up the struct sock in memory associated with that connection. Assuming the packet is in sequence, the data payload is then copied into the socket’s receive buffer. At this point the kernel will wake up any processes doing a blocking read(2), or that are using an I/O multiplexing system call like select(2) or epoll_wait(2) to wait on the socket.

When the userspace process actually calls read(2) on the file descriptor it causes the kernel to remove the data from its receive buffer, and to copy that data into a buffer supplied to the read(2) system call.

Sending data works similarly. When the application calls write(2) it copies data from the user-supplied buffer into the kernel write queue. Subsequently the kernel will copy the data from the write queue into the NIC and actually send the data. The actual transmission of the data to the NIC could be somewhat delayed from when the user actually calls write(2) if the network is busy, if the TCP send window is full, if there are traffic shaping policies in effect, etc.

One consequence of this design is that the kernel’s receive and write queues can fill up if the application is reading too slowly, or writing too quickly. Therefore the kernel sets a maximum size for the read and write queues. This ensures that poorly behaved applications use a bounded amount of memory. For instance, the kernel might cap each of the receive and write queues at 100 KB. Then the maximum amount of kernel memory each TCP socket could use would be approximately 200 KB (as the size of the other TCP data structures is negligible compared to the size of the queues).

Read Semantics

If the receive buffer is empty and the user calls read(2), the system call will block until data is available.

If the receive buffer is nonempty and the user calls read(2), the system call will immediately return with whatever data is available. A partial read can happen if the amount of data ready in the read queue is less than the size of the user-supplied buffer. The caller can detect this by checking the return value of read(2).

If the receive buffer is full and the other end of the TCP connection tries to send additional data, the kernel will not ACK the packets and will advertise a zero receive window. This is just regular TCP flow control.

Write Semantics

If the write queue is not full and the user calls write(2), the system call will succeed. All of the data will be copied if the write queue has sufficient space. If the write queue only has space for some of the data then a partial write will happen and only some of the data will be copied to the buffer. The caller checks for this by checking the return value of write(2).

If the write queue is full and the user calls write(2), the system call will block.

How Newly Established Connections Work

In the previous section we saw how established connections use receive and write queues to limit the amount of kernel memory allocated for each connection. A similar technique is used to limit the amount of kernel memory reserved for new connections.

From a userspace point of view, newly established TCP connections are created by calling accept(2) on a listen socket. A listen socket is one that has been designated as such using the listen(2) system call.

The prototype for accept(2) takes a socket and two fields storing information about the other end of the socket. The value returned by accept(2) is an integer representing the file descriptor for a new, established connection:

int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen);

The prototype for listen(2) takes a socket file descriptor and a backlog parameter:

int listen(int sockfd, int backlog);

The backlog is a parameter that controls how much memory the kernel will reserve for new connections, when the user isn’t calling accept(2) fast enough.

For instance, suppose you have a blocking single-threaded HTTP server, and each HTTP request takes about 100 ms. In this scenario the HTTP server will spend 100 ms processing each request before it is able to call accept(2) again. This means that at up to 10 rps there will be no queuing. If more than 10 rps come in the kernel has two choices.

The first choice the kernel has is to not accept the connection at all. For instance, the kernel can just refuse to ACK an incoming SYN packet. More commonly, the kernel will complete the TCP three-way handshake and then terminate the connection with RST. Either way, the result is the same: no receive or write buffers need to be allocated if the connection is rejected. The argument for doing this is that if the userspace process isn’t accepting connections fast enough, the correct thing to do is to fail new requests. The argument against doing this is that it’s very aggressive, especially if new connections are “bursty” over time.

The second choice the kernel has is to accept the connection and allocate a socket structure for it (including receive/write buffers), and then queue the socket object for use later. The next time the user calls accept(2), instead of blocking the system call will immediately get the already-allocated socket.

The argument for the second behavior is that it’s more forgiving when the processing rate or connection rate tends to burst. For instance, in the server we just described, imagine that 10 new connections come in all at once, and then no more connections come in for the rest of the second. If the kernel queues new connections then all of the requests will be processed over the course of the second. If the kernel had been rejecting new connections then only one of the connections would have succeeded, even though the process was able to keep up with the aggregate request rate.

There are two arguments against queueing. The first is that excessive queueing can cause a lot of kernel memory to be allocated. If the kernel is allocating thousands of sockets with large receive buffers then memory usage can grow quickly, and the userspace process might not even be able to process all of those requests anyway. The other argument against queueing is that it makes the application appear slow to the other side of the connection, the client. The client will see that it can establish new TCP connections, but when it tries to use them it will appear that the server is very slow to respond. The argument is that in this situation it would be better to just fail the new connections, since that provides more obvious feedback that the server is not healthy. Additionally, if the server is aggressively failing new connections the client can know to back off; this is another form of congestion control.

Listen Queues & Overflows

As you might suspect, the kernel actually combines these two approaches. The kernel will queue new connections, but only a certain number of them. The amount of connections the kernel will queue is controlled by the backlog parameter to listen(2). Typically this is set to a relatively small value. On Linux, the socket.h header sets the value of SOMAXCONN to 128, and before kernel 2.4.25 this was the maximum value allowed. Nowadays the maximum value is specified in /proc/sys/net/core/somaxconn, but commonly you’ll find programs using SOMAXCONN (or a smaller hard-coded value) anyway.

When the listen queue fills up, new connections will be rejected. This is called a listen queue overflow. You can observe when this is happening by reading /proc/net/netstat and checking the value of ListenOverflows. This is a global counter for the whole kernel. As far as I know, you can’t get listen overflow stats per listen socket.

Monitoring for listen overflows is important when writing network servers, because listen overflows don’t trigger any user-visible behavior from the server’s perspective. The server will happily accept(2) connections all day without returning any indication that connections are being dropped. For example, suppose you are using Nginx as a proxy in front of a Python application. If the Python application is too slow then it can cause the Nginx listen socket to overflow. When this happens you won’t see any indication of this in the Nginx logs—you’ll keep seeing 200 status codes and so forth as usual. Thus if you’re just monitoring the HTTP status codes for your application you’ll fail to see that TCP errors are preventing requests from being forwarded to the application.
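
A quick way to watch for this, assuming nstat and ss from iproute2 are available:

# Global counters; steadily increasing values mean accept queues are overflowing
nstat -az TcpExtListenOverflows TcpExtListenDrops
# Per-listener view: Recv-Q = sockets waiting in the accept queue, Send-Q = backlog
ss -lnt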

  # external links:

    # https://blog.packagecloud.io/monitoring-tuning-linux-networking-stack-receiving-data/

  # monitor: /proc/net/softnet_stat : check the format, but generally the second column (packets dropped).

    # net.core.netdev_max_backlog = 65535 ? do we need such a high value?
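
A sketch for reading the drop counter, assuming GNU awk for strtonum (softnet_stat prints one hex row per CPU; column 2 counts packets dropped because netdev_max_backlog was exceeded):

# Non-zero and growing dropped counts mean the backlog is too small
awk '{ printf "cpu%d dropped=%d\n", NR - 1, strtonum("0x" $2) }' /proc/net/softnet_stat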

  # socket queue


    # external links:

      # https://bl.ocks.org/magnetikonline/2760f98f6bf654d5ad79

      # https://blog.cloudflare.com/syn-packet-handling-in-the-wild/ : for monitoring commands

    # net.core.somaxconn = 65535: It now specifies the queue length for completely established sockets waiting to be accepted. Accept queue:

       # monitor: nstat -az TcpExtListenDrops or TcpExtListenOverflows

        # ss -plnt : the Recv-Q column shows the number of sockets in the accept queue, and Send-Q shows the backlog parameter; a zero Recv-Q means no outstanding sockets waiting to be accept()ed.

    # net.ipv4.tcp_max_syn_backlog = 16384 : the maximum queue length for incomplete (half-open) connection requests; drops are counted in /proc/net/netstat under ListenDrops.

      # monitor: ss -n state syn-recv sport = :80 | wc -l

    — The backlog value is set by the application, so it is a per-listener (per listen port) value: the maximum accept-queue length. somaxconn, on the other hand, is system wide. A listener's backlog should not exceed the system-wide somaxconn value. On modern kernels the defaults are generally high, but for load balancers and the like these values may need to be higher still; under a DDoS or similar they fill up instantly.
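
A minimal sketch for comparing the system-wide cap with what a listener actually got (port 80 is an example):

# System-wide cap
sysctl net.core.somaxconn
# Effective backlog per listener: the Send-Q column of a LISTEN socket
ss -lnt '( sport = :80 )'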

#socket buffers: 

Window Size / RTT = Throughput, or equivalently (LinkSpeed * RTT) / 8 bits = window size. Example: 10000 Mbps * 0.030 sec / 8 = 37.5 MB window size.

With a 100 ms RTT and 25 Gbps: 25000 Mbps * 0.100 sec / 8 = ~312.5 MB window size.

16777216 = 16MB

268435456 = 256MB

The tcp_rmem/tcp_wmem values are measured in bytes. The first number is the minimum buffer size used by each TCP socket, the second is the default buffer allocated when an application creates a TCP socket, and the third is the maximum that can be allocated; with the settings below that is 4096 / 87380 / 268435456.
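
A tiny helper for the bandwidth-delay arithmetic above (bandwidth and RTT are the only inputs; the values here are examples):

# BDP in bytes = bandwidth (bit/s) * RTT (s) / 8
awk 'BEGIN { bw = 25e9; rtt = 0.100; bdp = bw * rtt / 8;
             printf "BDP = %.0f bytes = %.1f MiB\n", bdp, bdp / 1048576 }'
# => BDP = 312500000 bytes = 298.0 MiB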

sudo sysctl -w net.core.wmem_max=268435456

sudo sysctl -w net.core.wmem_default=262144

sudo sysctl -w net.core.rmem_max=268435456

sudo sysctl -w net.core.rmem_default=262144

sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 268435456"

sudo sysctl -w net.ipv4.tcp_wmem="4096 87380 268435456"

# Ephemeral ports

# The standards organization in charge of such things, known as IANA, recommends that the operating system pick a source port between 49152 and 65535. If you follow IANA’s recommendation for the ephemeral port range, there are only 16,384 available source ports.

# https://idea.popcount.org/2014-04-03-bind-before-connect/

net.ipv4.ip_local_reserved_ports=10000-65535 -> note: an explicit bind() can still take a reserved port (see the link above)
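
A sketch for checking the effective range and how much of it is in use (exhaustion usually surfaces as connect() failing with EADDRNOTAVAIL):

# Current ephemeral range and reserved ports
sysctl net.ipv4.ip_local_port_range net.ipv4.ip_local_reserved_ports
# Socket summary, and sockets holding source ports in TIME-WAIT
ss -s
ss -tan state time-wait | wc -l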

# Disable TCP functions

sudo sysctl -w net.ipv4.tcp_sack=0 ## conflicts with the note above to enable tcp_sack; SACK is usually worth keeping on

sudo sysctl -w net.ipv4.tcp_dsack=0

sudo sysctl -w net.ipv4.tcp_fack=0

sudo sysctl -w net.ipv4.tcp_window_scaling=1

sudo sysctl -w net.ipv4.tcp_syncookies=0

# Reuse closed sockets faster

net.ipv4.tcp_tw_reuse=1

sudo sysctl -w net.ipv4.tcp_timestamps=0 ### !!! this should stay enabled (=1); tcp_tw_reuse depends on timestamps

sudo sysctl -w net.ipv4.tcp_fin_timeout=10

sudo sysctl -w net.ipv4.tcp_syn_retries=2

sudo sysctl -w net.ipv4.tcp_synack_retries=2

sudo sysctl -w net.ipv4.tcp_retries2=6

sudo sysctl -w net.ipv4.tcp_keepalive_time=900

sudo sysctl -w net.ipv4.tcp_keepalive_probes=3

sudo sysctl -w net.ipv4.tcp_keepalive_intvl=15
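
With these values an idle connection is declared dead after roughly 900 + 3 * 15 = 945 seconds, instead of the default 7200 + 9 * 75 = 7875 seconds (about 2 hours 11 minutes).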

sudo sysctl -w net.ipv4.tcp_no_metrics_save=1

# Conntrack

  # external links:

    # https://www.robustperception.io/conntrack-metrics-from-the-node-exporter/

    

monitor: conntrack -L

net.netfilter.nf_conntrack_max=10485760

net.ipv4.netfilter.ip_conntrack_generic_timeout=120 ???

net.netfilter.nf_conntrack_tcp_timeout_established=300

net.netfilter.nf_conntrack_buckets=655360
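
A cheaper way to watch table pressure than dumping every entry (conntrack -L is expensive on large tables):

# Current entry count vs. the configured maximum
cat /proc/sys/net/netfilter/nf_conntrack_count /proc/sys/net/netfilter/nf_conntrack_max
# The kernel logs "nf_conntrack: table full, dropping packet" when the max is hit
dmesg | grep -i conntrack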

# Arp table/neighbor cache optimization

## these values should grow as the k8s cluster grows

net.ipv4.neigh.default.gc_thresh3=24456

net.ipv4.neigh.default.gc_thresh2=12228

net.ipv4.neigh.default.gc_thresh1=8192
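
A sketch for checking how close the neighbor table is to the hard limit (garbage collection becomes aggressive above gc_thresh2; above gc_thresh3 adding new entries can fail):

# Current table size vs. thresholds
ip -4 neigh show | wc -l
sysctl net.ipv4.neigh.default.gc_thresh1 net.ipv4.neigh.default.gc_thresh2 net.ipv4.neigh.default.gc_thresh3
# "neighbour table overflow" in dmesg means gc_thresh3 was hit
dmesg | grep -i neighbour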

# Security Settings

net.ipv4.conf.all.accept_source_route=0

net.ipv4.conf.default.accept_source_route=0

net.ipv6.conf.all.accept_source_route=0

net.ipv6.conf.default.accept_source_route=0

net.ipv4.conf.all.accept_redirects=0

net.ipv4.conf.default.accept_redirects=0

net.ipv6.conf.all.accept_redirects=0

net.ipv6.conf.default.accept_redirects=0

net.ipv4.conf.all.secure_redirects=0

net.ipv4.conf.default.secure_redirects=0

net.ipv4.conf.default.log_martians=1

net.ipv4.conf.all.log_martians=1 ??? If the security team is not reviewing these logs, turn this off.


net.ipv4.icmp_echo_ignore_broadcasts = 1

# Don’t forget to…

systemctl restart systemd-sysctl.service
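
sysctl -w changes are lost on reboot; a sketch for persisting them, assuming a systemd distro (the file name is an example):

# /etc/sysctl.d/90-network-tuning.conf is read by systemd-sysctl at boot
cat <<'EOF' | sudo tee /etc/sysctl.d/90-network-tuning.conf
net.core.rmem_max = 268435456
net.core.wmem_max = 268435456
net.ipv4.tcp_rmem = 4096 87380 268435456
net.ipv4.tcp_wmem = 4096 87380 268435456
EOF
# Apply now without rebooting
sudo systemctl restart systemd-sysctl.service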

Some links:

https://levelup.gitconnected.com/linux-kernel-tuning-for-high-performance-networking-high-volume-incoming-connections-196e863d458a

http://web.archive.org/web/20120201135920/http://fasterdata.es.net/fasterdata/host-tuning/linux

https://wiki.mikejung.biz/Sysctl_tweaks

IOSXR – ASR9K BGP BFD bundle-interface

A BFD session for a BGP neighbor routed over a bundle interface cannot be established without the command below:

bfd multipath include location 0/0/CPU0

Without it, the following error is seen for the BFD session in show bfd session detail:

BFD_MP_DOWNLOAD_NO_LC.

A BFD session must be tied to a specific location, i.e. a line card. With a bundle configured and without that command, BFD cannot bind an interface for the session, since IOS XR runs BFD sessions on the LC.

Below is a nice explanation of the behavior by Xander Thuijs:

This I think is maybe nicely explained with this: Implementation of various BFD flavours over bundle interfaces in IOS XR was carried out in 3 phases:

  1. IPv4 BFD session over individual bundle sub-interfaces. This feature was called “BFD over VLAN over bundle”.
  2. IOS XR releases 4.0.1 and beyond: “BFD Over Bundle (BoB)” feature was introduced.
  3. IOS XR releases 4.3.0 and beyond: full support for IPv4 and IPv6 BFD sessions over bundle interfaces and sub-interfaces. For disambiguation from the BoB feature, this implementation is called BLB, and sessions are often referred to as native BFD sessions over bundle interfaces and/or sub-interfaces. BFD multipath must be enabled for any of these BFD flavours to work.

Due to the introduction of BLB (BFD over logical bundle) in XR 4.3 this was necessary. I agree that you may have been misled by the nomenclature of that multipath location keyword, which suggests multihop, but it was meant to also include multipath as in multiple members of a bundle.

https://community.cisco.com/t5/xr-os-and-platforms/bfd-on-asr9k-cluster/td-p/2477664

Silent Host, vEOS

If you have silent hosts on your EVPN fabric, that may be a problem for some applications.

With a silent host, the first request towards the host is generally lost. If you use an application that discovers hosts on the network, like nmap, etc., it may report fewer hosts than expected, and a second discovery run just after the first may report many more hosts than the first one.

By using the arp aging timeout command under the SVI with a value lower than the default, such as 180 or 240 seconds, the switch will send ARP requests to the hosts to refresh its ARP entries. This seems to be internal behavior; it may be defined in an RFC, but I have not had time to check.

Entries are refreshed and expired at a random time that is in the range of 80%-100% of the cache expiry time.
The refresh is tried 3 times at an interval of 2% of the configured timeout.

Arista switches also refresh their timers when the control plane (CPU) receives an ARP packet from a host, or other control-plane traffic such as an ICMP packet destined for the default gateway. Check the host-side ARP timeout value and do not go below it.