Operating a Linux server involves careful configuration and tuning to achieve peak performance. One of the crucial aspects is managing the kernel parameters via sysctl. This article explores essential sysctl parameters for optimizing a high-performance Linux server, discussing their functions, recommended settings, and verification methods.
We strongly recommend conducting thorough testing in a controlled, non-production environment before applying any changes to your production systems. Regular monitoring and performance tracking should also be maintained after modifications are applied, to ensure the system operates as expected and to facilitate necessary adjustments.
Remember, changes to system parameters can have a significant impact on system behavior and performance. Therefore, such changes should always be implemented with caution and careful consideration.
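Throughout this article, parameters are applied with sysctl. As a quick refresher, the sketch below shows the usual workflow: read a value, change it at runtime, and make it persistent via a drop-in file under /etc/sysctl.d/. The file name and the example parameter are only placeholders.
```
# Read the current value of a parameter
sysctl vm.swappiness

# Apply a change immediately (lost on reboot)
sudo sysctl -w vm.swappiness=10

# Persist the change: files in /etc/sysctl.d/ are read at boot;
# the file name 99-tuning.conf is just an example
echo "vm.swappiness = 10" | sudo tee /etc/sysctl.d/99-tuning.conf

# Reload all sysctl configuration files without rebooting
sudo sysctl --system
```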
Kernel Tuning
fs.file-max
fs.file-max is a system parameter in Linux that defines the maximum number of file descriptors that can be allocated by the kernel. File descriptors are unique identifiers used by the operating system to access files, network sockets, and other I/O resources. Setting an appropriate value for fs.file-max ensures that the system can handle a sufficient number of open files and network connections.
Here are some key points to understand about fs.file-max and recommendations for setting it:
- The default value of fs.file-max is typically set by the Linux distribution and may vary. Check the current value before making any changes.
- Insufficient file descriptor limits can lead to "Too many open files" errors, causing applications to fail or behave unexpectedly.
- The appropriate value for fs.file-max depends on the specific requirements and usage patterns of your system.
- When determining the optimal value, consider factors such as the number of users, concurrent processes, network connections, and anticipated file access patterns.
- It is generally recommended to set fs.file-max higher than the estimated maximum number of file descriptors your system will require.
- Be mindful of the system's available resources (RAM and CPU) when setting a high value; allocating an excessively large number of file descriptors can consume significant system resources.
- Monitoring tools like lsof and ulimit can help identify current file descriptor usage and determine whether the allocated limit is sufficient.
Recommendation: For a high-performance server, a value such as 2097152 may be adequate. The optimal value depends on the specific workload and available resources, so calculate it from the number of files your system will handle concurrently.
Verification: Monitor file descriptor usage (cat /proc/sys/fs/file-nr). If the first value approaches the third, consider increasing fs.file-max.
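A quick way to put this verification into practice; the raised limit at the end simply reuses the figure suggested above and is not a universal value.
```
# file-nr reports: allocated descriptors, allocated-but-unused, system-wide maximum
cat /proc/sys/fs/file-nr

# The current system-wide limit
sysctl fs.file-max

# Raise the limit for this boot only (illustrative value)
sudo sysctl -w fs.file-max=2097152
```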
Bonus Tip:
- If you already know about /etc/security/limits.conf, you may be confused, because that file also lets you set resource limits for user sessions and processes, including the maximum number of open files (file descriptors).
- By defining a nofile limit for a user or group in limits.conf, you restrict the maximum number of open files that user or group can have per session or process (see the example after this list).
- In contrast, fs.file-max is a system-wide limit set in the kernel. It defines the absolute maximum number of file descriptors that can be allocated across the entire system, regardless of user or session limits.
- The fs.file-max value should be set equal to or higher than the highest nofile limit in limits.conf. If the system-wide limit is lower than what the user limits allow, it takes precedence and prevents users from reaching their defined limits.
- limits.conf and fs.file-max work together to manage and control the allocation of file descriptors: the user-specific limits in limits.conf define per-user or per-group restrictions, while fs.file-max sets the upper boundary for the entire system.
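For completeness, here is a minimal sketch of the two limits working together. The user name, file name, and numbers are purely illustrative; a re-login is required for the per-user limits to take effect.
```
# Per-user nofile limits via a drop-in under /etc/security/limits.d/
cat <<'EOF' | sudo tee /etc/security/limits.d/appuser-nofile.conf
appuser  soft  nofile  65536
appuser  hard  nofile  131072
EOF

# Keep the system-wide ceiling at least as high as the largest nofile limit
sudo sysctl -w fs.file-max=2097152

# Limits visible to the current shell
ulimit -Sn   # soft limit
ulimit -Hn   # hard limit
```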
vm.swappiness
vm.swappiness is a Linux kernel parameter that controls the relative weight given to swapping out runtime memory, as opposed to dropping pages from the system page cache.
Swappiness can have a value between 0 and 100, inclusive. A low value means the kernel will try to avoid swapping whenever possible, while a higher value means the kernel will swap memory pages more aggressively.
The exact behavior of the kernel depends on the current system load and the specific implementation in use, but in general terms:
- vm.swappiness = 0: The kernel will avoid swapping processes out of physical memory for as long as possible.
- vm.swappiness = 100: The kernel will aggressively swap processes out of physical memory and move them to the swap disk.
For example, a system with a swappiness value of 60 will swap pages more often than a system with a swappiness value of 10. Reducing the swappiness value will make the system wait longer before it starts using swap space, which usually improves overall system responsiveness.
Please note that setting vm.swappiness = 0 does not disable swap entirely. The system will still swap when necessary (i.e., when it runs out of memory).
The vm.swappiness parameter comes into play when the kernel decides whether to drop unused pages from the page cache or to swap out processes' runtime memory. If the system often turns out to be wrong in its guess (that is, it finds itself swapping back in pages that it had evicted), it may be beneficial to reduce swappiness.
Recommendation: For a high-performance server, you might set this as low as 10 to avoid swapping for as long as possible. However, setting it too low can cause the Out-Of-Memory (OOM) killer to kill processes even while swap space is still available.
Verification: Monitor swap usage (free -m). If swap is used frequently, consider adjusting the value or adding more RAM.
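A minimal check-and-tune sequence; the value 10 mirrors the recommendation above and is only a starting point.
```
# Current swappiness
cat /proc/sys/vm/swappiness

# Apply the lower value (runtime only)
sudo sysctl -w vm.swappiness=10

# Watch swap usage and swap-in/swap-out activity (si/so columns)
free -m
vmstat 5 3
```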
vm.dirty_ratio and vm.dirty_background_ratio
vm.dirty_ratio and vm.dirty_background_ratio are Linux kernel parameters that control the behavior of the kernel's disk writeback mechanism. They govern how the system manages dirty pages in memory, i.e. modified data pages that have not yet been written to disk. The two ratios define the thresholds at which the kernel initiates writeback to flush dirty pages to storage.
- vm.dirty_ratio: This parameter specifies the maximum percentage of total system memory that can be filled with dirty pages before the kernel starts flushing them to disk. It represents the threshold for "urgent" writeback: once dirty pages exceed it, the kernel prioritizes writing them to disk.
- vm.dirty_background_ratio: This parameter specifies the percentage of total system memory filled with dirty pages at which the background writeback process begins. Background writeback is less aggressive than the urgent writeback triggered by vm.dirty_ratio; this threshold initiates regular, asynchronous flushing.
- The values for both vm.dirty_ratio and vm.dirty_background_ratio range from 0 to 100.
- Higher values allow more dirty pages to accumulate in memory before writeback starts, which can improve performance by reducing the frequency of disk I/O operations. However, they also increase the risk of losing data in the event of a system failure or power loss.
- Lower values ensure that data is written to disk more frequently, reducing the risk of data loss but potentially increasing disk I/O and impacting performance.
- Both values are percentages of total system memory, so consider the actual memory capacity of the system when setting them.
Recommendation: Setting vm.dirty_ratio to around 60 and vm.dirty_background_ratio to around 2 can work well for a high-performance server, but the optimal values depend on your I/O load and storage speed. High-performance storage might benefit from higher values.
Verification: Monitor system I/O using tools like iostat. If there is a high volume of writes during peak loads, consider adjusting these parameters.
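To see how much dirty data is pending and to try the suggested thresholds (60 and 2 are the article's starting points, not universal defaults):
```
# Data currently dirty or being written back
grep -E 'Dirty|Writeback' /proc/meminfo

# Apply the suggested thresholds at runtime
sudo sysctl -w vm.dirty_ratio=60
sudo sysctl -w vm.dirty_background_ratio=2

# Observe write activity while the workload runs (requires the sysstat package)
iostat -x 5 3
```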
kernel.sched_migration_cost_ns
The kernel.sched_migration_cost_ns parameter is a setting in the Linux kernel that affects the scheduler's decision-making process when migrating tasks (processes/threads) between CPU cores. The scheduler is responsible for determining which CPU core should execute a particular task at any given time, and the migration cost represents the overhead associated with moving a task from one CPU core to another.
- kernel.sched_migration_cost_ns specifies the cost, in nanoseconds, that the scheduler considers when deciding whether to migrate a task to a different CPU core.
- The migration cost accounts for factors such as cache locality and the disruption caused by moving the task's execution context to a different core.
- Higher values indicate a higher cost associated with task migration, making the scheduler less likely to move tasks between CPU cores.
- Lower values indicate a lower migration cost, making the scheduler more willing to migrate tasks to optimize resource utilization and performance.
Recommendation: Increasing this to around 5000000 could help reduce unnecessary process migrations, but consider your workload characteristics. Real-time or CPU-intensive workloads might require adjustments.
Verification: Excessive context switching can indicate a need to tune this value. Monitor it using vmstat or sar.
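A rough way to correlate this setting with scheduler activity. Note that some newer kernels expose this knob under debugfs rather than sysctl, so the fallback path below is an assumption to verify on your kernel.
```
# Context switches per second appear in the "cs" column
vmstat 5 3

# Current migration cost: sysctl on older kernels, debugfs on newer ones
sysctl kernel.sched_migration_cost_ns 2>/dev/null \
  || cat /sys/kernel/debug/sched/migration_cost_ns

# Raise the migration cost as suggested above (illustrative value)
sudo sysctl -w kernel.sched_migration_cost_ns=5000000
```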
vm.overcommit_memory
This parameter is a setting in the Linux kernel that controls the memory overcommit behavior. Memory overcommit refers to the kernel's decision to allow memory allocations that exceed the total available physical memory and swap space.
vm.overcommit_memory values:
- 0 (default): Heuristic overcommit. The kernel allows memory allocations to exceed the sum of available physical memory and swap space, but uses a heuristic to refuse obviously excessive requests.
- 1: Always overcommit. The kernel grants every allocation request, regardless of available physical memory and swap space, without any checking.
- 2: Don't overcommit (strict mode). The kernel refuses allocations that would exceed the commit limit, which is swap space plus a fraction of RAM defined by vm.overcommit_ratio (or vm.overcommit_kbytes).
- Heuristic overcommit mode (vm.overcommit_memory=0):
  - The kernel allows memory allocations to exceed the physical memory and swap space, even if there might not be sufficient resources to fulfill all allocations at once.
  - It relies on the assumption that most memory allocations are never fully used and that the system can handle occasional allocation failures gracefully.
- Always-overcommit mode (vm.overcommit_memory=1):
  - The kernel grants all allocation requests and disregards memory limits entirely.
  - This mode can lead to out-of-memory situations if allocations are actually used, resulting in processes being killed by the OOM killer when memory resources are exhausted.
- Strict overcommit mode (vm.overcommit_memory=2):
  - The kernel applies a conservative accounting policy and only grants allocations that fit within the commit limit.
  - It reduces the risk of running out of memory abruptly, but applications may encounter allocation failures when the commit limit is reached.
Recommendation: The choice between these options depends on your specific use case. A value of 0 (the default) is appropriate for most systems. However, a value of 1 may be necessary for certain applications that allocate large amounts of memory but use only a small portion of it.
Verification: Use tools like top, free, or vmstat to monitor memory usage. If your system frequently runs out of memory or certain applications are not working as expected, consider adjusting this value.
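The kernel's commit accounting in /proc/meminfo helps judge which mode fits your workload; the change at the end is only an example.
```
# Current policy and the accounting the kernel keeps
sysctl vm.overcommit_memory vm.overcommit_ratio
grep -E 'CommitLimit|Committed_AS' /proc/meminfo

# Example: switch to always-overcommit (1), sometimes needed by applications
# that fork or reserve large, sparsely used address spaces
sudo sysctl -w vm.overcommit_memory=1
```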
vm.max_map_count
This parameter is a setting in the Linux kernel that defines the maximum number of memory map areas (mmap) that a process can have. Memory mapping allows processes to map files or device memory into their address space for efficient data access.
- vm.max_map_count sets an upper limit on the number of mmap areas that a process can have.
- The default value of vm.max_map_count is typically set by the Linux distribution and can vary.
- The parameter is specified in terms of the number of memory map areas, not the amount of memory consumed by those mappings.
- Increasing the value of vm.max_map_count allows processes to have a larger number of mmap areas available, which can benefit memory-intensive applications.
- Large-scale applications or workloads that involve many concurrent memory mappings may benefit from increasing vm.max_map_count.
- Applications that rely heavily on memory-mapped files, such as databases or search engines, may require a higher vm.max_map_count value to accommodate their mapping requirements.
- If you encounter errors or warnings related to "Too many open files" or "Out of memory" when running memory-intensive applications, increasing vm.max_map_count could help alleviate those issues.
Recommendation: The default value is typically 65530. Certain applications (such as databases or software like Elasticsearch) may require a higher value. Adjust this based on the requirements of your specific applications.
Verification: If an application fails due to a lack of memory map areas, you will typically find a message about this in your system logs or the application's logs. In such cases, consider increasing this value.
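To relate the limit to what a process actually uses, count its mappings in /proc; the Elasticsearch figure is the value that project documents, used here purely as an example.
```
# System-wide limit
sysctl vm.max_map_count

# Mappings used by one process (here: the current shell; substitute your application's PID)
wc -l < /proc/$$/maps

# Example: raise the limit to the value commonly documented for Elasticsearch
sudo sysctl -w vm.max_map_count=262144
```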
kernel.sched_*
The kernel.sched_* parameters are a group of settings in the Linux kernel that control the behavior and configuration of the process scheduler. The process scheduler determines how CPU time is allocated among the different processes and threads in the system.
Here are some commonly used kernel.sched_* parameters:
kernel.sched_min_granularity_ns:
- Defines the minimum time slice duration allocated to each process/thread.
- It represents the smallest unit of time that a process can run before it may be preempted by the scheduler.
- Smaller values can improve responsiveness but may increase scheduling overhead.
kernel.sched_wakeup_granularity_ns:
- Specifies the wake-up granularity for processes/threads.
- It determines the minimum time between wake-up events, allowing the scheduler to batch and optimize wake-ups.
- Larger values can reduce CPU wake-up events and save power but may decrease responsiveness.
kernel.sched_migration_cost_ns:
- Indicates the cost of migrating a process/thread from one CPU to another.
- Higher values make task migration between CPUs less frequent, reducing the overhead associated with migration.
- Lower values enable quicker task migration but may increase overhead in some cases.
kernel.sched_autogroup_enabled:
- Controls the autogroup feature in the scheduler.
- When enabled (set to 1), the scheduler groups tasks based on their session ID, which can improve CPU allocation fairness among user sessions.
- Disabling autogroup (set to 0) treats all tasks equally, regardless of session ID.
kernel.sched_features:
- A bitmask representing various features and optimizations of the scheduler.
- Different bits in the bitmask enable or disable specific scheduler features.
- Modifying this parameter is generally not recommended unless you have a deep understanding of the scheduler and its features.
Recommendation: The default values of these parameters are set based on a balance of performance and resource consumption. Only change these if you fully understand the implications and have a specific need, as defined by your system's workload and performance requirements.
Verification: Monitor system performance, CPU usage, and application responsiveness. If the system isn't meeting your performance requirements, consider adjusting these values, but always with caution and careful testing.
vm.nr_hugepages
This parameter is a setting in the Linux kernel that defines the number of hugepages available for allocation. Hugepages are larger memory pages that can improve system performance by reducing the overhead associated with managing a large number of smaller pages.
- vm.nr_hugepages specifies the total number of hugepages available system-wide.
- Hugepages are much larger than regular pages, typically 2 MB or 1 GB in size, depending on the system architecture and configuration.
- Applications that can take advantage of hugepages allocate memory using the larger hugepage size, resulting in reduced memory management overhead and improved performance.
When should you consider adjusting vm.nr_hugepages?
- Applications that require large memory allocations, such as databases or certain scientific workloads, may benefit from using hugepages to reduce memory fragmentation and improve performance.
- If you observe that memory-intensive applications are experiencing significant page table overhead or excessive TLB (Translation Lookaside Buffer) misses, increasing the number of hugepages may be beneficial.
It's important to note that the availability of hugepages is limited by the system's physical memory and the configuration of the kernel. If the requested number of hugepages exceeds what the system can support, the allocation may fail.
Recommendation: If an application like a database or virtual machine manager can use huge pages, configuring this parameter can potentially offer performance improvements. The number of hugepages depends on the specific memory demands of your applications.
Verification: Check huge page usage (grep HugePages_ /proc/meminfo). If allocated huge pages are not used, or if an application that can benefit from huge pages cannot allocate them, you may need to adjust this value.
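A short sequence for checking and reserving huge pages; 1024 pages (about 2 GB at the common 2 MB page size) is purely an example and must be sized to your application's documented needs.
```
# Current huge page counters and the huge page size
grep HugePages_ /proc/meminfo
grep Hugepagesize /proc/meminfo

# Reserve 1024 huge pages; the reservation can fall short if memory is
# fragmented, so re-check HugePages_Total afterwards
sudo sysctl -w vm.nr_hugepages=1024
grep HugePages_Total /proc/meminfo
```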
Network Security Options
net.ipv4.tcp_synack_retries
The net.ipv4.tcp_synack_retries parameter is a setting in the Linux kernel that determines the number of times the system retries sending a TCP SYN-ACK packet in response to a received TCP SYN packet. It applies specifically to the TCP handshake, which is used to establish a TCP connection between two endpoints.
- The TCP handshake involves a three-way communication: SYN (synchronize) packet, SYN-ACK (synchronize-acknowledge) packet, and ACK (acknowledge) packet.
- When a TCP SYN packet is received, the system responds with a SYN-ACK packet, indicating its willingness to establish a connection.
- If the originating host does not receive the SYN-ACK packet or does not respond with an ACK packet, the system may retry sending the SYN-ACK packet.
- net.ipv4.tcp_synack_retries determines the number of retries the system performs before giving up and assuming the connection attempt has failed.
- The value represents the number of retries, not the total number of attempts. For example, a value of 3 means the system sends the initial transmission plus three retries.
- Increasing the value allows for more retries, giving the system a better chance to establish a connection despite transient network issues or delays in receiving ACK packets.
- However, a higher value also increases the time needed to determine that a connection attempt has failed, delaying the feedback loop for failed attempts.
- Lowering the value reduces the time spent on failed connection attempts, allowing the system to recover more quickly and allocate resources to other tasks.
Recommendation: Lowering this to 2 can help mitigate SYN flood attacks by reducing the length of time a socket is in SYN_RECV state, but this might impact clients with high latency or packet loss.
Verification: If you have many dropped connections and there are no SYN flood attacks, consider increasing this value.
net.ipv4.tcp_rfc1337
The net.ipv4.tcp_rfc1337 parameter is a setting in the Linux kernel that enables or disables a TCP/IP behavior recommended by RFC 1337. RFC 1337, titled "TIME-WAIT Assassination Hazards in TCP," addresses a potential vulnerability in the TCP connection termination process.
- TCP uses a mechanism called TIME-WAIT state to ensure that any delayed or duplicate packets related to a terminated connection are properly handled.
- In some situations, an attacker could exploit the TIME-WAIT state by sending malicious packets that appear to belong to a previous connection, causing disruption or unauthorized access.
- RFC 1337 proposes a countermeasure to this potential attack by enforcing stricter rules for accepting packets in the TIME-WAIT state.
- The net.ipv4.tcp_rfc1337 parameter controls whether the recommended RFC 1337 behavior is enabled or disabled in the kernel.
- When net.ipv4.tcp_rfc1337 is set to 0 (disabled), the kernel handles packets in the TIME-WAIT state less strictly.
- When net.ipv4.tcp_rfc1337 is set to 1 (enabled), the kernel follows the RFC 1337 recommendations, rejecting certain packets in the TIME-WAIT state more aggressively and providing better protection against this class of attack.
Recommendation: Enable this (set to 1) to protect against old duplicate packets disrupting a new connection's established state.
Verification: Monitor your server for unexpected connection disruptions. If these occur, you may need to adjust this value.
net.ipv4.tcp_syncookies
This parameter is a setting in the Linux kernel that enables or disables the use of TCP SYN cookies. SYN cookies are a technique used to defend against SYN flood attacks, a type of denial-of-service (DoS) attack where an attacker floods a target server with a large number of TCP connection requests (SYN packets).
- When net.ipv4.tcp_syncookies is enabled (set to 1), the kernel uses SYN cookies to handle incoming TCP connection requests.
- SYN cookies allow the server to respond to a TCP SYN packet without allocating memory for the connection until it receives a valid ACK packet from the client.
- Using a cryptographic algorithm, the server encodes enough information into the SYN-ACK packet to recreate the server-side state when the ACK packet arrives.
- This allows the server to handle a large number of incoming connection requests without keeping a large number of half-open connections in memory.
- When net.ipv4.tcp_syncookies is disabled (set to 0), the SYN cookies mechanism is not used, and the server follows the traditional TCP handshake, which requires maintaining connection state for each SYN packet.
- The default value for net.ipv4.tcp_syncookies is set by the Linux distribution; on modern kernels and distributions it is commonly 1 (enabled).
- Enabling net.ipv4.tcp_syncookies helps protect against SYN flood attacks by allowing the server to keep accepting connection requests even under heavy load.
- However, SYN cookies may slightly affect compatibility with certain network equipment or firewalls that expect a traditional TCP handshake.
Recommendation and things to consider: You may want to enable this setting (1) if your server is exposed to the internet and could be a target for denial of service attacks.
Verification: This setting can be verified by monitoring your server for connection errors or SYN flood attacks.
net.ipv4.conf.all.rp_filter
This parameter is a setting in the Linux kernel that controls the Reverse Path Filtering (RPFilter) behavior for incoming packets. RPFilter is a security feature that helps prevent IP spoofing attacks by verifying the source IP address of incoming packets against the routing table.
- IP spoofing is a technique used in network attacks where an attacker modifies the source IP address of packets to make them appear to come from a different source.
- Reverse path filtering detects and drops incoming packets whose source addresses are not reachable back through the interface they arrived on, according to the routing table.
- net.ipv4.conf.all.rp_filter enables or disables reverse path filtering for all network interfaces in the system.
- The kernel default for net.ipv4.conf.all.rp_filter is 0 (disabled), but many distributions enable it by default.
- When net.ipv4.conf.all.rp_filter is set to 1 (strict mode), the kernel drops packets that fail the reverse path validation check.
- A value of 2 (loose mode) only requires that the source address be reachable via some interface, which can suit asymmetric routing.
- When net.ipv4.conf.all.rp_filter is set to 0 (disabled), the kernel performs no reverse path validation, which may be necessary in certain asymmetric routing or multi-homed network setups.
Recommendation and things to consider: This setting helps defend against IP spoofing. Unless your server acts as a router or uses complex or asymmetric routing policies, it is recommended to keep reverse path filtering enabled.
Verification: The need for adjustments to this setting might be indicated by undesired packet drops due to source validation failures. Monitor your server's traffic to verify.
net.ipv4.icmp_echo_ignore_all
This parameter is a setting in the Linux kernel that controls whether the system ignores all incoming ICMP echo requests, commonly known as ping requests. ICMP echo requests are used to determine the reachability and response time of a network host.
- When net.ipv4.icmp_echo_ignore_all is set to 1, the system ignores all incoming ICMP echo requests.
- ICMP echo requests are typically associated with the ping command and are used to test network connectivity between hosts.
- By ignoring ICMP echo requests, the system does not respond to ping, making it less visible and potentially more resistant to certain ICMP-based attacks.
- The default value for net.ipv4.icmp_echo_ignore_all is typically set by the Linux distribution and is commonly 0 (disabled), so echo requests are answered.
Recommendation: If your server is public-facing and doesn't need to respond to pings, you might set this to 1 to ignore all ping requests, which can help to hide your server from basic network scans. However, note that it might interfere with some network troubleshooting tools and utilities.
Verification: You can check if your server responds to pings. If it's still responding after setting this value to 1, there might be an issue with your network configuration.
net.ipv4.icmp_echo_ignore_broadcasts
This parameter is a setting in the Linux kernel that controls whether the system ignores ICMP echo requests sent to broadcast addresses. ICMP echo requests, commonly known as ping requests, are used to test the reachability and response time of network hosts.
- When net.ipv4.icmp_echo_ignore_broadcasts is set to 1, the system ignores ICMP echo requests sent to broadcast addresses.
- Broadcast addresses are IP addresses used to send packets to all hosts on a network segment.
- Ignoring ICMP echo requests to broadcast addresses prevents the system from generating responses that would be sent to every host on the segment, reducing network traffic and the risk of being used as an amplifier in Smurf-style attacks.
- The default value for net.ipv4.icmp_echo_ignore_broadcasts is typically 1 (enabled) on modern kernels and distributions, so broadcast echo requests are already ignored by default.
Recommendation: You should usually set this to 1 to ignore broadcast ping requests, which can prevent your server from participating in a Smurf DoS attack.
Verification: Similar to the previous setting, you can verify this by checking whether your server responds to broadcast pings.
net.ipv4.icmp_ignore_bogus_error_responses
This parameter is a setting in the Linux kernel that controls whether the system ignores ICMP error responses that do not correspond to any known outgoing request. It helps protect against certain types of ICMP-based attacks that attempt to exploit vulnerabilities in ICMP error handling.
- ICMP error responses are generated by routers or hosts to indicate problems encountered during the processing of IP packets.
- Some malicious actors may send fake or "bogus" ICMP error responses to probe or disrupt a system.
- When net.ipv4.icmp_ignore_bogus_error_responses is set to 1, the system ignores ICMP error responses that do not match any known outgoing request.
- By ignoring these "bogus" error responses, the system reduces the risk of acting on malformed traffic and avoids cluttering its logs.
- The default value for net.ipv4.icmp_ignore_bogus_error_responses is typically set by the Linux distribution and is commonly 1 (enabled).
Recommendation: It's generally recommended to set this to 1 to ignore bogus ICMP error messages, which can prevent log file clutter.
Verification: You can monitor your system logs for ICMP error messages to verify this setting.
net.ipv4.conf.all.accept_redirects and net.ipv4.conf.default.accept_redirects
These parameters are settings in the Linux kernel that control the acceptance of ICMP redirect messages. ICMP redirect messages are used by routers to inform hosts about better routes for specific destinations. However, accepting redirect messages from untrusted sources can potentially be exploited for network manipulation or attacks.
net.ipv4.conf.all.accept_redirects:
- When set to 0, this parameter disables the acceptance of ICMP redirect messages for all network interfaces in the system.
- ICMP redirect messages from any source will be ignored.
- The default value for net.ipv4.conf.all.accept_redirects is typically set by the Linux distribution and can vary; it is commonly 1 (enabled).
net.ipv4.conf.default.accept_redirects:
- This parameter sets the acceptance of ICMP redirect messages for newly created network interfaces or interfaces that do not have a specific value configured.
- The value of net.ipv4.conf.default.accept_redirects applies when no explicit configuration is set for a particular network interface.
- The default value for net.ipv4.conf.default.accept_redirects is typically set by the Linux distribution and can vary; it is commonly 1 (enabled).
Recommendation: It's generally recommended to disable these (set to 0) on servers, especially public-facing ones. ICMP redirect messages can be used maliciously to alter your system's routing tables.
Verification: You can use network monitoring tools to check whether your system's routing tables change unexpectedly due to ICMP redirects.
net.ipv4.conf.all.secure_redirects and net.ipv4.conf.default.secure_redirects
These parameters are settings in the Linux kernel that control when ICMP redirect messages are accepted. With secure redirects enabled, redirects are only accepted if they come from a gateway that is already listed as a default gateway for the interface.
net.ipv4.conf.all.secure_redirects:
- When set to 1, this parameter allows ICMP redirect messages to be accepted on all network interfaces, but only from gateways in the interface's default gateway list.
- Redirects from any other source are ignored.
- The default value for net.ipv4.conf.all.secure_redirects is typically set by the Linux distribution and can vary; it is commonly 1 (enabled).
net.ipv4.conf.default.secure_redirects:
- This parameter sets the acceptance of ICMP secure redirects for newly created network interfaces or interfaces that do not have a specific value configured.
- The value of net.ipv4.conf.default.secure_redirects applies when no explicit configuration is set for a particular network interface.
- The default value for net.ipv4.conf.default.secure_redirects is typically set by the Linux distribution and can vary; it is commonly 1 (enabled).
Recommendation: These should usually be enabled (set to 1) for additional security, unless you have a reason to disable them.
Verification: Similar to the previous setting, you can use network monitoring tools to verify whether your system is accepting secure ICMP redirects.
net.ipv4.conf.all.forwarding and net.ipv4.ip_forward
These parameters are settings in the Linux kernel that control IP forwarding, which enables the system to act as a router and forward network traffic between different network interfaces.
net.ipv4.conf.all.forwarding:
- When set to 1, this parameter enables IP forwarding for all network interfaces in the system.
- IP forwarding allows the system to forward packets between different network interfaces or subnets.
- The default value for net.ipv4.conf.all.forwarding is typically set by the Linux distribution and can vary; it is commonly 0 (disabled).
net.ipv4.ip_forward:
- This is the global IPv4 forwarding switch. Writing to it enables or disables forwarding on all interfaces at once by updating the per-interface forwarding settings.
- In practice, net.ipv4.ip_forward is the parameter most commonly used to turn a machine into (or out of) a router.
- The default value for net.ipv4.ip_forward is typically set by the Linux distribution and can vary; it is commonly 0 (disabled).
Recommendation: Enable these (set to 1) if your server is acting as a router or gateway. Otherwise, disable them (set to 0) for security purposes.
Verification: You can check if your server is forwarding packets by monitoring its network traffic.
net.ipv4.conf.all.log_martians
This parameter is a setting in the Linux kernel that controls the logging of suspicious incoming packets, known as "martians." Martians are packets with source or destination IP addresses that are considered invalid or unexpected based on the network configuration.
- When net.ipv4.conf.all.log_martians is set to 1, the kernel logs martian packets, i.e. packets that arrive on an interface where they should not, or that have an unexpected source or destination IP address.
- Martian packets often indicate network misconfigurations, spoofing attempts, or other network anomalies.
- By logging martians, administrators can identify and investigate unusual traffic patterns or potential security issues.
- The default value for net.ipv4.conf.all.log_martians is typically set by the Linux distribution and can vary; it is commonly 0 (disabled).
Recommendation: Enable this (set to 1) to log these kinds of packets, which can help detect certain kinds of malicious activity.
Verification: Check your system logs for these messages. If they are not appearing when expected, there might be an issue with this setting or your logging configuration.
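Pulling the security-related settings of this section together, a hardening drop-in might look like the sketch below. It assumes the server is not acting as a router, and every value simply mirrors the recommendations above, so review each line against your own environment before adopting it.
```
# Example only: /etc/sysctl.d/90-network-hardening.conf
cat <<'EOF' | sudo tee /etc/sysctl.d/90-network-hardening.conf
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_rfc1337 = 1
net.ipv4.tcp_syncookies = 1
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1
net.ipv4.icmp_echo_ignore_broadcasts = 1
net.ipv4.icmp_ignore_bogus_error_responses = 1
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.default.accept_redirects = 0
net.ipv4.conf.all.log_martians = 1
net.ipv4.ip_forward = 0
EOF
# net.ipv4.icmp_echo_ignore_all is left at its default here; enable it only
# if you can live without ping-based troubleshooting.

sudo sysctl --system
```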
Network Performance
net.ipv4.tcp_keepalive_time, net.ipv4.tcp_keepalive_probes, net.ipv4.tcp_keepalive_intvl
These parameters are settings in the Linux kernel that control the behavior of TCP keepalive packets. TCP keepalive is a mechanism used to detect if a connection is still alive and to prevent idle connections from timing out.
net.ipv4.tcp_keepalive_time:
- This parameter defines the time, in seconds, of inactivity after which TCP keepalive packets are sent.
- When a TCP connection remains idle (no data transmission) for longer than tcp_keepalive_time, the kernel starts sending keepalive packets to the remote endpoint.
- The default value of net.ipv4.tcp_keepalive_time is often 7200 seconds (2 hours).
- Modifying this parameter can be useful when long-lived connections need to be kept alive or when a shorter timeout is desired for faster detection of dead connections.
net.ipv4.tcp_keepalive_probes:
- This parameter determines the number of keepalive probes sent before the connection is considered unresponsive.
- Once keepalive packets are initiated, if no response is received after tcp_keepalive_probes probes, the connection is assumed to be dead and is closed.
- The default value for net.ipv4.tcp_keepalive_probes is typically 9 probes.
- Adjusting this parameter can help when the network experiences packet loss or unreliable connectivity and more probes are needed to detect dead connections accurately.
net.ipv4.tcp_keepalive_intvl:
- This parameter defines the time interval, in seconds, between successive keepalive probes.
- After the first keepalive probe, subsequent probes are sent at intervals of tcp_keepalive_intvl.
- The default value of net.ipv4.tcp_keepalive_intvl is commonly 75 seconds.
- Modifying this parameter adjusts the frequency of keepalive probes to match network conditions, application requirements, or resource-utilization goals.
Recommendation: The provided values (keepalive_time = 300, keepalive_probes = 5, keepalive_intvl = 15) make the kernel probe idle connections sooner and give up on dead ones faster, which frees resources held by stale connections on high-performance servers.
Verification: Monitor idle connections. If these are high and causing performance issues, consider adjusting these values.
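The keepalive trio can be checked and adjusted together; the numbers are the values suggested above.
```
# Current keepalive settings
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_probes net.ipv4.tcp_keepalive_intvl

# Apply the suggested values (runtime only)
sudo sysctl -w net.ipv4.tcp_keepalive_time=300
sudo sysctl -w net.ipv4.tcp_keepalive_probes=5
sudo sysctl -w net.ipv4.tcp_keepalive_intvl=15

# How many established connections keepalive will be covering
ss -tan state established | wc -l
```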
net.core.rmem_default, net.core.rmem_max, net.core.wmem_default, net.core.wmem_max
These parameters are settings in the Linux kernel that control the default and maximum values for the receive and send socket buffer sizes. These parameters determine the amount of memory allocated for buffering network data during reception and transmission.
net.core.rmem_default:
- Sets the default receive socket buffer size, in bytes, for all network connections.
- The receive buffer temporarily stores incoming data until the receiving application can process it.
- The default value is typically set by the Linux distribution and can vary.
- Larger receive buffers can help accommodate high network traffic or improve performance for applications that handle large amounts of incoming data.
net.core.rmem_max:
- Sets the maximum receive socket buffer size, in bytes, that can be allocated for a connection.
- It acts as an upper limit on the receive buffer size that individual applications can request.
- The default value is typically set by the Linux distribution and can vary.
- Raising it helps when larger receive buffers are needed to absorb bursts of traffic or satisfy specific application requirements.
net.core.wmem_default:
- Sets the default send socket buffer size, in bytes, for all network connections.
- The send buffer temporarily stores outgoing data until it is transmitted over the network.
- The default value is typically set by the Linux distribution and can vary.
- Larger send buffers help with high-rate data transmission or applications that generate large amounts of outgoing data.
net.core.wmem_max:
- Sets the maximum send socket buffer size, in bytes, that can be allocated for a connection.
- It acts as an upper limit on the send buffer size that individual applications can request.
- The default value is typically set by the Linux distribution and can vary.
- Raising it helps when larger send buffers are required for efficient transmission of large volumes of data or specific application needs.
Recommendation: Increasing these values can significantly improve network performance for I/O-intensive operations. Monitor the system performance and tune these parameters accordingly. The provided values (default: 31457280, max: 33554432) could be a good starting point for high-performance servers with sufficient memory.
Verification: Monitor network throughput and errors. If throughput is lower than expected or there are many retransmissions, consider adjusting these values.
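The starting values above can be applied and then observed per socket; interpreting the skmem output of ss takes some practice, so treat it as a rough indicator.
```
# Current defaults and ceilings
sysctl net.core.rmem_default net.core.rmem_max net.core.wmem_default net.core.wmem_max

# Apply the article's starting values (runtime only)
sudo sysctl -w net.core.rmem_default=31457280
sudo sysctl -w net.core.rmem_max=33554432
sudo sysctl -w net.core.wmem_default=31457280
sudo sysctl -w net.core.wmem_max=33554432

# Per-socket memory details for established TCP connections
ss -tm state established | head -20

# Retransmission counters as a coarse health check
netstat -s | grep -i retrans
```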
net.ipv4.tcp_fin_timeout
The net.ipv4.tcp_fin_timeout parameter is a setting in the Linux kernel that determines how long an orphaned TCP connection (one already closed by the local application) remains in the FIN-WAIT-2 state while waiting for the remote end's final FIN. Despite a common misconception, it does not control the length of the TIME-WAIT state, which is fixed in the kernel.
- After the local side closes a connection and its FIN has been acknowledged, the socket waits in FIN-WAIT-2 for the peer to finish closing its half of the connection.
- net.ipv4.tcp_fin_timeout specifies, in seconds, how long such an orphaned connection is kept before the kernel abandons it.
- The default value for net.ipv4.tcp_fin_timeout is usually set by the Linux distribution and can vary. It is commonly 60 seconds.
- A longer tcp_fin_timeout gives slow or misbehaving peers more time to complete the close, reducing the chance of tearing down connections prematurely.
- However, a longer timeout also means that system resources, such as memory and socket buffers, stay tied up longer for connections that may never finish closing.
- Adjusting net.ipv4.tcp_fin_timeout should be done with consideration of the specific network environment and the requirements of the applications running on the system.
- Lowering the value frees resources more quickly but may abandon connections to slow peers.
- Increasing the value handles slow closers more gracefully but keeps resources allocated for longer.
Recommendation: Lowering this to 15 can help clear out sockets in the FIN_WAIT_2 state faster, freeing up system resources.
Verification: Monitor the number of sockets in the FIN_WAIT_2 state. If these are high and the server is experiencing connection issues, consider adjusting this value.
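To judge whether this tuning is needed, count sockets sitting in FIN-WAIT-2 before and after the change.
```
# Connections currently in FIN-WAIT-2
ss -tan state fin-wait-2 | wc -l

# Apply the shorter timeout suggested above (runtime only)
sudo sysctl -w net.ipv4.tcp_fin_timeout=15
```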
net.ipv4.ip_local_port_range
The net.ipv4.ip_local_port_range parameter is a setting in the Linux kernel that defines the range of local port numbers available for outgoing network connections. When a program initiates a network connection, it needs a local port number for its end of the connection, and ip_local_port_range specifies the range of ports the kernel will allocate for this purpose.
- net.ipv4.ip_local_port_range defines the inclusive range of local port numbers the kernel uses for outgoing connections.
- The value consists of two numbers, the lowest and highest port in the range, separated by whitespace (for example, 32768 60999).
- When a program establishes a new outgoing connection, the kernel selects an available local port within the defined range.
- A larger port range allows a greater number of simultaneous outgoing connections from the system.
- If all ports in the range are in use, new outgoing connections fail with errors such as EADDRNOTAVAIL until ports are released.
- Adjusting net.ipv4.ip_local_port_range is useful when there is high demand for outgoing connections, such as on heavily loaded servers or applications that make numerous connections.
- When defining the range, balance it against your needs: a narrow range may limit the number of concurrent connections, while extending it down into low port numbers can collide with ports your own services listen on.
Recommendation: Setting this to 2000 65535 extends the range of ephemeral ports available for outgoing connections, which can benefit servers that make many outbound connections.
Verification: Monitor your server for EADDRNOTAVAIL errors. If these errors appear frequently, you may need to adjust this value.
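A quick look at the configured range versus actual ephemeral-port consumption:
```
# Configured range (two numbers separated by whitespace)
cat /proc/sys/net/ipv4/ip_local_port_range

# Rough count of ports in use by outgoing connections
ss -tan state established | wc -l

# Widen the range as suggested above (runtime only)
sudo sysctl -w net.ipv4.ip_local_port_range="2000 65535"
```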
net.core.somaxconn
This parameter is a setting in the Linux kernel that determines the maximum number of pending connections that can be queued for a listening socket. It applies to connection-oriented (stream) sockets such as TCP and controls the size of the listen backlog, which is the number of established incoming connections waiting to be accepted by the application.
- When a socket is in the listening state, it can accept incoming connections. If the application does not accept connections fast enough, they accumulate in a queue.
- net.core.somaxconn sets the maximum size of this queue, also known as the listen backlog; it also caps the backlog value an application passes to listen().
- The default value for net.core.somaxconn is usually set by the Linux distribution and can vary; it was historically 128, and newer kernels raise the default to 4096.
- Increasing net.core.somaxconn allows the kernel to queue more pending connections, accommodating high connection rates or bursty traffic.
- Note that net.core.somaxconn does not limit the total number of connections; it only controls how many connections can wait in the backlog.
- The actual number of connections that can be handled depends on other factors, such as the application's ability to accept connections and available system resources (file descriptors, memory, and so on).
Recommendation and things to consider: Increase this value if your system handles a high number of incoming connections. For a high-load web server, you might consider a value of 1024 or higher.
Verification: If your server experiences connection resets or other network instability and you have a high rate of incoming connections, consider increasing this value.
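For listening sockets, ss shows both the current accept-queue length and its limit, which makes it easy to see whether the backlog is the bottleneck. Keep in mind that the application must also pass a sufficiently large backlog to listen() for a raised somaxconn to matter.
```
# For sockets in LISTEN state, Recv-Q is the current accept-queue length
# and Send-Q is the configured backlog (capped by somaxconn)
ss -lnt

# Cumulative overflow/drop counters
netstat -s | grep -i listen

# Raise the ceiling (the application may need a restart to pick it up)
sudo sysctl -w net.core.somaxconn=1024
```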
net.core.netdev_max_backlog
This parameter is a setting in the Linux kernel that determines the maximum length of the per-CPU input packet queue (backlog). It controls how many incoming packets can be buffered while they wait to be processed by the kernel.
- When a network interface receives packets faster than the kernel can process them, the packets are placed in a backlog queue; this buffering absorbs short bursts.
- net.core.netdev_max_backlog sets the maximum length of this backlog queue.
- The default value for net.core.netdev_max_backlog is usually set by the Linux distribution and can vary; it is typically 1000.
- Increasing net.core.netdev_max_backlog allows a longer backlog, which helps handle bursts of incoming traffic or situations where the system temporarily cannot keep up with packet processing.
- However, a larger backlog consumes more memory, so consider the available system memory and whether the system can actually process the queued packets in a timely manner.
Recommendation and things to consider: For servers that handle large amounts of network traffic, consider increasing this value substantially (for example, to around 30000).
Verification: If your server is dropping packets, consider increasing this value.
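Drops caused by a full input backlog show up in /proc/net/softnet_stat, one row per CPU; a growing second column suggests raising the backlog or investigating why packet processing is slow.
```
# Each row is one CPU; the second column counts packets dropped because
# the backlog was full (values are hexadecimal)
cat /proc/net/softnet_stat

# Increase the backlog as suggested above
sudo sysctl -w net.core.netdev_max_backlog=30000
```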
net.ipv4.tcp_max_syn_backlog
This parameter is a setting in the Linux kernel that controls the maximum size of the listen backlog specifically for TCP SYN packets. It determines the number of incomplete TCP connection requests that can be queued on a listening socket before they are accepted or rejected by the system.
- When a TCP server socket is in the listening state, it can accept incoming connection requests (SYN packets) from clients.
- The listen backlog represents the maximum number of incomplete connection requests that can be queued before they are processed by the server application.
- net.ipv4.tcp_max_syn_backlog sets the maximum size of the queue for half-open connections, i.e. connection requests for which a SYN has been received but the handshake has not yet completed.
- The default value for net.ipv4.tcp_max_syn_backlog is typically set by the Linux distribution and may vary; common defaults range from 128 to 1024 or higher, often scaled with available memory.
- Increasing net.ipv4.tcp_max_syn_backlog allows a larger backlog, which can help handle bursts of incoming TCP connection requests or scenarios with high connection rates.
- A larger backlog helps prevent connection requests from being dropped due to insufficient backlog capacity.
- However, a larger backlog also consumes more memory, so consider available system memory and the system's capacity to handle concurrent connections.
Recommendation and things to consider: For high-load servers, increase this value (often to match net.core.somaxconn) so that bursts of legitimate connection requests are not dropped; during actual SYN floods, SYN cookies remain the primary defense.
Verification: High SYN_RECV state connections can be an indication to increase this value.
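Half-open connections sit in SYN-RECV, so counting them gives a feel for whether the backlog is large enough; the new value is only an example.
```
# Half-open (SYN-RECV) connections right now
ss -tan state syn-recv | wc -l

# Raise the SYN backlog, for example to match somaxconn (runtime only)
sudo sysctl -w net.ipv4.tcp_max_syn_backlog=1024
```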
net.ipv4.tcp_tw_reuse
This parameter is a setting in the Linux kernel that enables or disables the reuse of TIME-WAIT sockets for new outgoing connections. The TIME-WAIT state is a necessary part of the TCP connection termination process, and enabling net.ipv4.tcp_tw_reuse allows the kernel to reuse these sockets for new connections when certain conditions are met.
- When a TCP connection is closed, it enters the TIME-WAIT state to ensure that any delayed or duplicate packets related to the closed connection can be properly handled.
- By default, the kernel keeps these TIME-WAIT sockets reserved for a fixed duration before they are fully released and available for reuse.
- When net.ipv4.tcp_tw_reuse is enabled (set to 1), the kernel can reuse TIME-WAIT sockets for new outgoing connections, provided TCP timestamps are enabled and show that it is safe to do so.
- Reusing TIME-WAIT sockets helps alleviate local port exhaustion, especially in high-traffic or short-lived-connection scenarios.
- However, reuse carries a small risk of conflicts if packets from the previous connection are mistakenly associated with the new one, which is why the timestamp check is required.
- The default value for net.ipv4.tcp_tw_reuse is typically set by the Linux distribution and may vary; it is commonly 0, while newer kernels default to 2 (reuse enabled only for connections to the local host).
- Enabling net.ipv4.tcp_tw_reuse should be done with caution and with consideration of the specific network environment and application requirements.
Recommendation and things to consider: Consider enabling this on high-transaction servers, particularly those that open many short-lived outbound connections (for example, proxies or load balancers).
Verification: Monitor TIME_WAIT socket states. If these are high and the server is experiencing connection issues, consider enabling this option.
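Before enabling reuse, confirm that TIME-WAIT sockets are actually piling up; note that tcp_tw_reuse only takes effect when TCP timestamps are enabled.
```
# Sockets currently in TIME-WAIT
ss -tan state time-wait | wc -l

# tcp_tw_reuse requires TCP timestamps (should report 1)
sysctl net.ipv4.tcp_timestamps

# Enable reuse of TIME-WAIT sockets for new outgoing connections
sudo sysctl -w net.ipv4.tcp_tw_reuse=1
```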
Summary
Achieving optimal performance in a Linux server demands a deep understanding of how the kernel interacts with system hardware and software. Key kernel parameters, manageable through sysctl, affect aspects like networking, memory management, and file handling. This guide provided an overview of high-performance sysctl parameters, their recommended values, and methods for verifying their efficacy.
Key Takeaways
- Understanding sysctl parameters: sysctl parameters provide an interface to kernel parameters and are a powerful tool in configuring and optimizing a Linux server.
- Parameter configuration: Each parameter serves a unique role, affecting system aspects such as memory management, network performance, and file handling. Proper understanding of these parameters and their settings can enhance server performance.
- Consideration and Verification: Parameter settings may vary based on specific server workloads, available resources, and network characteristics. Therefore, any changes should be made after careful consideration and verification, assessing the impact on the system performance.
- Performance Tuning: The sysctl parameters presented here are a starting point. Regular monitoring and performance tracking should be maintained after the modifications are applied to ensure the system operates as expected and to facilitate necessary adjustments.
- Caution and Testing: Changes to system parameters can significantly impact system behavior and performance. Thus, modifications should always be implemented with caution, and thorough testing should be conducted in a non-production environment before applying any changes to the production systems.
Further Reading
Index of /doc/Documentation/sysctl/
sysctl