Made possible by PowerDNS |
The kernel has lots of parameters which can be tuned for different circumstances. While, as usual, the default parameters serve 99% of installations very well, we don't call this the Advanced HOWTO for the fun of it!
The interesting bits are in /proc/sys/net, take a look there. Not everything will be documented here initially, but we're working on it.
In the meantime you may want to have a look at the Linux-Kernel sources; read the file Documentation/filesystems/proc.txt. Most of the features are explained there.
(FIXME)
By default, routers route everything, even packets which 'obviously' don't belong on your network. A common example is private IP space escaping onto the Internet. If you have an interface with a route of 195.96.96.0/24 to it, you do not expect packets from 212.64.94.1 to arrive there.
Lots of people will want to turn this feature off, so the kernel hackers have made it easy. There are files in /proc where you can tell the kernel to do this for you. The method is called "Reverse Path Filtering". Basically, if the reply to this packet wouldn't go out the interface this packet came in, then this is a bogus packet and should be ignored.
The following fragment will turn this on for all current and future interfaces.
# for i in /proc/sys/net/ipv4/conf/*/rp_filter ; do
> echo 2 > $i
> done
Going by the example above, if a packet arrived on the Linux router on eth1 claiming to come from the Office+ISP subnet, it would be dropped. Similarly, if a packet came from the Office subnet, claiming to be from somewhere outside your firewall, it would be dropped also.
The above is full reverse path filtering. The default is to only filter based on IPs that are on directly connected networks. This is because the full filtering breaks in the case of asymmetric routing (where packets come in one way and go out another, like satellite traffic, or if you have dynamic (bgp, ospf, rip) routes in your network. The data comes down through the satellite dish and replies go back through normal land-lines).
If this exception applies to you (and you'll probably know if it does) you can simply turn off the rp_filter on the interface where the satellite data comes in. If you want to see if any packets are being dropped, the log_martians file in the same directory will tell the kernel to log them to your syslog.
# echo 1 >/proc/sys/net/ipv4/conf/<interfacename>/log_martians
FIXME: is setting the conf/{default,all}/* files enough? - martijn
Ok, there are a lot of parameters which can be modified. We try to list them all. Also documented (partly) in Documentation/ip-sysctl.txt.
Some of these settings have different defaults based on whether you answered 'Yes' to 'Configure as router and not host' while compiling your kernel.
As a generic note, most rate limiting features don't work on loopback, so don't test them locally. The limits are supplied in 'jiffies', and are enforced using the earlier mentioned token bucket filter.
The kernel has an internal clock which runs at 'HZ' ticks (or 'jiffies') per second. On intel, 'HZ' is mostly 100. So setting a *_rate file to, say 50, would allow for 2 packets per second. The token bucket filter is also configured to allow for a burst of at most 6 packets, if enough tokens have been earned.
Several entries in the following list have been copied from /usr/src/linux/Documentation/networking/ip-sysctl.txt, written by Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> and Andi Kleen <ak@muc.de>
If the kernel decides that it can't deliver a packet, it will drop it, and send the source of the packet an ICMP notice to this effect.
Don't act on echo packets at all. Please don't set this by default, but if you are used as a relay in a DoS attack, it may be useful.
If you ping the broadcast address of a network, all hosts are supposed to respond. This makes for a dandy denial-of-service tool. Set this to 1 to ignore these broadcast messages.
The rate at which echo replies are sent to any one destination.
Set this to ignore ICMP errors caused by hosts in the network reacting badly to frames sent to what they perceive to be the broadcast address.
A relatively unknown ICMP message, which is sent in response to incorrect packets with broken IP or TCP headers. With this file you can control the rate at which it is sent.
This the famous cause of the 'Solaris middle star' in traceroutes. Limits number of ICMP Time Exceeded messages sent.
Maximum number of listening igmp (multicast) sockets on the host. FIXME: Is this true?
FIXME: Add a little explanation about the inet peer storage?
Minimum interval between garbage collection passes. This interval is in
effect under low (or absent) memory pressure on the pool. Measured in
jiffies.
Minimum interval between garbage collection passes. This interval is in effect under high memory pressure on the pool. Measured in jiffies.
Maximum time-to-live of entries. Unused entries will expire after this period of time if there is no memory pressure on the pool (i.e. when the number of entries in the pool is very small). Measured in jiffies.
Minimum time-to-live of entries. Should be enough to cover fragment time-to-live on the reassembling side. This minimum time-to-live is guaranteed if the pool size is less than inet_peer_threshold. Measured in jiffies.
The approximate size of the INET peer storage. Starting from this threshold entries will be thrown aggressively. This threshold also determines entries' time-to-live and time intervals between garbage collection passes. More entries, less time-to-live, less GC interval.
This file contains the number one if the host received its IP configuration by RARP, BOOTP, DHCP or a similar mechanism. Otherwise it is zero.
Time To Live of packets. Set to a safe 64. Raise it if you have a huge network. Don't do so for fun - routing loops cause much more damage that way. You might even consider lowering it in some circumstances.
You need to set this if you use dial-on-demand with a dynamic interface address. Once your demand interface comes up, any local TCP sockets which haven't seen replies will be rebound to have the right address. This solves the problem that the connection that brings up your interface itself does not work, but the second try does.
If the kernel should attempt to forward packets. Off by default.
Range of local ports for outgoing connections. Actually quite small by default, 1024 to 4999.
Set this if you want to disable Path MTU discovery - a technique to determine the largest Maximum Transfer Unit possible on your path. See also the section on Path MTU discovery in the cookbook chapter.
Maximum memory used to reassemble IP fragments. When ipfrag_high_thresh bytes of memory is allocated for this purpose, the fragment handler will toss packets until ipfrag_low_thresh is reached.
Set this if you want your applications to be able to bind to an address which doesn't belong to a device on your system. This can be useful when your machine is on a non-permanent (or even dynamic) link, so your services are able to start up and bind to a specific address when your link is down.
Minimum memory used to reassemble IP fragments.
Time in seconds to keep an IP fragment in memory.
A boolean flag controlling the behaviour under lots of incoming connections. When enabled, this causes the kernel to actively send RST packets when a service is overloaded.
Time to hold socket in state FIN-WAIT-2, if it was closed by our side. Peer can be broken and never close its side, or even died unexpectedly. Default value is 60sec. Usual value used in 2.2 was 180 seconds, you may restore it, but remember that if your machine is even underloaded WEB server, you risk to overflow memory with kilotons of dead sockets, FIN-WAIT-2 sockets are less dangerous than FIN-WAIT-1, because they eat maximum 1.5K of memory, but they tend to live longer. Cf. tcp_max_orphans.
How often TCP sends out keepalive messages when keepalive is enabled.
Default: 2hours.
How frequent probes are retransmitted, when a probe isn't acknowledged.
Default: 75 seconds.
How many keepalive probes TCP will send, until it decides that the
connection is broken.
Default value: 9.
Multiplied with tcp_keepalive_intvl, this gives the time a link can be
nonresponsive after a keepalive has been sent.
Maximal number of TCP sockets not attached to any user file handle, held by system. If this number is exceeded orphaned connections are reset immediately and warning is printed. This limit exists only to prevent simple DoS attacks, you _must_ not rely on this or lower the limit artificially, but rather increase it (probably, after increasing installed memory), if network conditions require more than default value, and tune network services to linger and kill such states more aggressively. Let me remind you again: each orphan eats up to 64K of unswappable memory.
How may times to retry before killing TCP connection, closed by our side. Default value 7 corresponds to 50sec-16min depending on RTO. If your machine is a loaded WEB server, you should think about lowering this value, such sockets may consume significant resources. Cf. tcp_max_orphans.
Maximal number of remembered connection requests, which still did not receive an acknowledgement from connecting client. Default value is 1024 for systems with more than 128Mb of memory, and 128 for low memory machines. If server suffers of overload, try to increase this number. Warning! If you make it greater than 1024, it would be better to change TCP_SYNQ_HSIZE in include/net/tcp.h to keep TCP_SYNQ_HSIZE*16<=tcp_max_syn_backlog and to recompile kernel.
Maximal number of timewait sockets held by system simultaneously. If this number is exceeded time-wait socket is immediately destroyed and warning is printed. This limit exists only to prevent simple DoS attacks, you _must_ not lower the limit artificially, but rather increase it (probably, after increasing installed memory), if network conditions require more than default value.
Bug-to-bug compatibility with some broken printers. On retransmit try to send bigger packets to work around bugs in certain TCP stacks.
How many times to retry before deciding that something is wrong and it is necessary to report this suspection to network layer. Minimal RFC value is 3, it is default, which corresponds to 3sec-8min depending on RTO.
How may times to retry before killing alive TCP connection. RFC 1122 says that the limit should be longer than 100 sec. It is too small number. Default value 15 corresponds to 13-30min depending on RTO.
This boolean enables a fix for 'time-wait assassination hazards in tcp', described
in RFC 1337. If enabled, this causes the kernel to drop RST packets for
sockets in the time-wait state.
Default: 0
Use Selective ACK which can be used to signify that specific packets are missing - therefore helping fast recovery.
Use the Host requirements interpretation of the TCP urg pointer
field.
Most hosts use the older BSD interpretation, so if you turn this on
Linux might not communicate correctly with them.
Default: FALSE
Number of SYN packets the kernel will send before giving up on the new connection.
To open the other side of the connection, the kernel sends a SYN with a piggybacked ACK on it, to acknowledge the earlier received SYN. This is part 2 of the threeway handshake. This setting determines the number of SYN+ACK packets sent before the kernel gives up on the connection.
Timestamps are used, amongst other things, to protect against wrapping sequence numbers. A 1 gigabit link might conceivably re-encounter a previous sequence number with an out-of-line value, because it was of a previous generation. The timestamp will let it recognise this 'ancient packet'.
Enable fast recycling TIME-WAIT sockets. Default value is 1. It should not be changed without advice/request of technical experts.
TCP/IP normally allows windows up to 65535 bytes big. For really fast networks, this may not be enough. The window scaling options allows for almost gigabyte windows, which is good for high bandwidth*delay products.
DEV can either stand for a real interface, or for 'all' or 'default'. Default also changes settings for interfaces yet to be created.
If a router decides that you are using it for a wrong purpose (ie, it needs to resend your packet on the same interface), it will send us a ICMP Redirect. This is a slight security risk however, so you may want to turn it off, or use secure redirects.
Not used very much anymore. You used to be able to give a packet a list of IP addresses it should visit on its way. Linux can be made to honor this IP option.
Accept packets with source address 0.b.c.d with destinations not to this host as local ones. It is supposed that a BOOTP relay daemon will catch and forward such packets.
The default is 0, since this feature is not implemented yet (kernel version 2.2.12).
Enable or disable IP forwarding on this interface.
See the section on reverse path filters.
If we do multicast forwarding on this interface
If you set this to 1, this interface will respond to ARP requests for addresses the kernel has routes to. Can be very useful when building 'ip pseudo bridges'. Do take care that your netmasks are very correct before enabling this! Also be aware that the rp_filter, mentioned elsewhere, also operates on ARP queries!
See the section on reverse path filters.
Accept ICMP redirect messages only for gateways, listed in default gateway list. Enabled by default.
If we send the above mentioned redirects.
If it is not set the kernel does not assume that different subnets on this device can communicate directly. Default setting is 'yes'.
FIXME: fill this in
Dev can either stand for a real interface, or for 'all' or 'default'. Default also changes settings for interfaces yet to be created.
Maximum for random delay of answers to neighbor solicitation messages in jiffies (1/100 sec). Not yet implemented (Linux does not have anycast support yet).
Determines the number of requests to send to the user level ARP daemon. Use 0 to turn off.
A base value used for computing the random reachable time value as specified in RFC2461.
Delay for the first time probe if the neighbor is reachable. (see gc_stale_time)
Determines how often to check for stale ARP entries. After an ARP entry is stale it will be resolved again (which is useful when an IP address migrates to another machine). When ucast_solicit is greater than 0 it first tries to send an ARP packet directly to the known host When that fails and mcast_solicit is greater than 0, an ARP request is broadcasted.
An ARP/neighbor entry is only replaced with a new one if the old is at least locktime old. This prevents ARP cache thrashing.
Maximum number of retries for multicast solicitation.
Maximum time (real time is random [0..proxytime]) before answering to an ARP request for which we have an proxy ARP entry. In some cases, this is used to prevent network flooding.
Maximum queue length of the delayed proxy arp timer. (see proxy_delay).
The time, expressed in jiffies (1/100 sec), between retransmitted Neighbor Solicitation messages. Used for address resolution and to determine if a neighbor is unreachable.
Maximum number of retries for unicast solicitation.
Maximum queue length for a pending arp request - the number of packets which are accepted from other layers while the ARP address is still resolved.
Hardcover textbook covering topics related to Quality of Service. Good for understanding basic concepts.
These parameters are used to limit the warning messages written to the kernel log from the routing code. The higher the error_cost factor is, the fewer messages will be written. Error_burst controls when messages will be dropped. The default settings limit warning messages to one every five seconds.
These parameters are used to limit the warning messages written to the kernel log from the routing code. The higher the error_cost factor is, the fewer messages will be written. Error_burst controls when messages will be dropped. The default settings limit warning messages to one every five seconds.
Writing to this file results in a flush of the routing cache.
Values to control the frequency and behavior of the garbage collection algorithm for the routing cache.
See /proc/sys/net/ipv4/route/gc_elasticity.
See /proc/sys/net/ipv4/route/gc_elasticity.
See /proc/sys/net/ipv4/route/gc_elasticity.
See /proc/sys/net/ipv4/route/gc_elasticity.
Delays for flushing the routing cache.
Maximum size of the routing cache. Old entries will be purged once the cache reached has this size.
FIXME: fill this in
Delays for flushing the routing cache.
FIXME: fill this in
FIXME: fill this in
Factors which determine if more ICPM redirects should be sent to a specific host. No redirects will be sent once the load limit or the maximum number of redirects has been reached.
See /proc/sys/net/ipv4/route/redirect_load.
Timeout for redirects. After this period redirects will be sent again, even if this has been stopped, because the load or number limit has been reached.