There are a lot of differences between Linux version 2.4 and 2.6, so first we'll cover the tuning issues that are the same in both 2.4 and 2.6. To change TCP settings in, you add the entries below to the file /etc/sysctl.conf, and then run "sysctl -p".
Like all operating systems, the default maximum Linux TCP buffer sizes are way too small. I suggest changing them to the following settings:
# increase TCP max buffer size setable using setsockopt()
net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
# increase Linux autotuning TCP buffer limits
# min, default, and max number of bytes to use
# set max to at least 4MB for 1G, and 32M for 10G
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
You should also verify that the following are all set to the default value of 1
sysctl net.ipv4.tcp_window_scaling
sysctl net.ipv4.tcp_timestamps
sysctl net.ipv4.tcp_sack
Note: you should leave tcp_mem alone. The defaults are fine.
You can achieve increases in bandwidth of up to 10x by doing this on some long, fast paths. This is only a good idea for Gigabit Ethernet connected hosts, and may have other side effects such as uneven sharing between multiple streams.
Also, I've been told that for some network paths, using the Linux 'tc' (traffic control) system to pace traffic out of the host can help improve total throughput.
Linux 2.6
Starting in Linux 2.6.7 (and back-ported to 2.4.27), linux includes alternative congestion control algorithms beside the traditional 'reno' algorithm. These are designed to recover quickly from packet loss on high-speed WANs.
Linux 2.6 also includes and both send and receiver-side automatic buffer tuning (up to the maximum sizes specified above). There is also a setting to fix the ssthresh caching weirdness described below.
There are a couple additional sysctl settings for 2.6:
# don't cache ssthresh from previous connection
net.ipv4.tcp_no_metrics_save = 1
# recommended to increase this for 10G NICS
net.core.netdev_max_backlog = 30000
Starting with version 2.6.13, Linux supports pluggable congestion control algorithms . The congestion control algorithm used is set using the sysctl variable net.ipv4.tcp_congestion_control, which is set to bic/cubic or reno by default, depending on which version of the 2.6 kernel you are using.
To get a list of congestion control algorithms that are available in your kernel, run:
sysctl net.ipv4.tcp_available_congestion_control
The choice of congestion control options is selected when you build the kernel. The following are some of the options are available in the 2.6.23 kernel:
- reno: Traditional TCP used by almost all other OSes. (default)
- cubic: CUBIC-TCP (NOTE: There is a cubic bug in the Linux 2.6.18 kernel used by Redhat Enterprise Linux 5.3 and Scientific Linux 5.3. Use 2.6.18.2 or higher!)
- bic: BIC-TCP
- htcp: Hamilton TCP
- vegas: TCP Vegas
- westwood: optimized for lossy networks
If cubic and/or htcp are not listed when you do 'sysctl net.ipv4.tcp_available_congestion_control', try the following, as most distributions include them as loadable kernel modules:
/sbin/modprobe tcp_htcp
/sbin/modprobe tcp_cubic
For long fast paths, we highly recommend using cubic or htcp. Cubic is the default for a number of Linux distributions, but if is not the default on your system, you can do the following:
sysctl -w net.ipv4.tcp_congestion_control=cubic
On systems supporting RPMS, You can also try using the ktune RPM, which sets many of these as well.
More information on each of these algorithms and some results can be found here .
More information on tuning parameters and defaults for Linux 2.6 are available in the file ip-sysctl.txt, which is part of the 2.6 source distribution.
Warning on Large MTUs: If you have configured your Linux host to use 9K MTUs, but the connection is using 1500 byte packets, then you actually need 9/1.5 = 6 times more buffer space in order to fill the pipe. In fact some device drivers only allocate memory in power of two sizes, so you may even need 16/1.5 = 11 times more buffer space!
And finally a warning for both 2.4 and 2.6: for very large BDP paths where the TCP window is > 20 MB, you are likely to hit the Linux SACK implementation problem. If Linux has too many packets in flight when it gets a SACK event, it takes too long to located the SACKed packet, and you get a TCP timeout and CWND goes back to 1 packet. Restricting the TCP buffer size to about 12 MB seems to avoid this problem, but clearly limits your total throughput. Another solution is to disable SACK.
Linux 2.4
Starting with Linux 2.4, Linux implemented a sender-side autotuning mechanism, so that setting the optimal buffer size on the sender is not needed. This assumes you have set large buffers on the receive side, as the sending buffer will not grow beyond the size of the receive buffer.
However, Linux 2.4 has some other strange behavior that one needs to be aware of. For example: The value for ssthresh for a given path is cached in the routing table. This means that if a connection has has a retransmission and reduces its window, then all connections to that host for the next 10 minutes will use a reduced window size, and not even try to increase its window. The only way to disable this behavior is to do the following before all new connections (you must be root):
sysctl -w net.ipv4.route.flush=1
Another thing you can to help increase TCP throughput in 2.4 with 1000BT NICs is to increase the size of the interface queue. To do this, do the following:
ifconfig eth0 txqueuelen 1000
More information on various tuning parameters for Linux 2.4 are available in the Ipsysctl tutorial .
Linux 2.2
If you are still running Linux 2.2, upgrade! If this is not possible, add the following to /etc/rc.d/rc.local
echo 8388608 > /proc/sys/net/core/wmem_max
echo 8388608 > /proc/sys/net/core/rmem_max
echo 65536 > /proc/sys/net/core/rmem_default
echo 65536 > /proc/sys/net/core/wmem_default