In my last article I gave an overview of SystemTap to help you trace and debug kernel modules. Before we look at how to improve disk IO performance, we should understand the basics of the IO flow in a Linux environment.

How to improve disk IO performance in Linux


The flow chart below gives you an overview of how an IO request flows between user space and the device.

[Figure: flow chart of the IO request path from user space to the device]

  • User space is where the applications run, and the bottom level is where the devices sit.
  • An application in user space generates IO requests, which are sent to the kernel's VFS (Virtual File System) interface.
  • VFS sits on top of filesystems such as ext4 and xfs. These are the underlying filesystems residing on the devices.
  • Before the IO reaches the device, it goes through the buffer cache.
  • Next, the IO request goes through the IO scheduler.
  • Finally, the IO scheduler passes the request on to the device.


Flow of I/O requests

  • First of all, in the application area you can implement features that make the I/O more efficient.
  • There is very little we can do in the VFS area to tune I/O performance.
  • Next, the buffer cache is one of the most important parts of IO optimization because this is the RAM that is reserved for read as well as write requests. If there is not enough buffer cache available, your IO speed will be highly impacted.
  • Next is the IO scheduler, which determines how your operating system talks to the disk. There are different types of IO scheduler, each with various tunables to optimize IO performance based on your requirements.
  • Last is the driver, which communicates with the device. We can also tune IO performance by adjusting the respective kernel modules of the driver, or at times even the BIOS configuration.
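To get a feel for how much RAM is currently serving as cache, the kernel's /proc/meminfo file can be inspected. The following is a minimal sketch, assuming the usual `Buffers`, `Cached` and `Dirty` fields are present; the `SAMPLE` text and the helper names are made up for illustration and are used only as a fallback when the file cannot be read.

```python
def parse_meminfo(text):
    """Return a dict mapping /proc/meminfo field names to their numeric value."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def cache_summary(fields):
    """Pick out the fields that describe RAM used for caching file data (kB)."""
    return {
        "buffers_kb": fields.get("Buffers", 0),
        "cached_kb": fields.get("Cached", 0),
        "dirty_kb": fields.get("Dirty", 0),  # written by apps, not yet on disk
    }


# Illustrative numbers only; a real system provides its own values.
SAMPLE = """MemTotal:        8000000 kB
MemFree:         1000000 kB
Buffers:          200000 kB
Cached:          3000000 kB
Dirty:              1200 kB
"""

if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as fh:
            text = fh.read()
    except OSError:
        text = SAMPLE
    print(cache_summary(parse_meminfo(text)))
```

A large and growing `dirty_kb` value while an application is writing suggests that data is arriving faster than the device can absorb it.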


Understanding I/O Challenges

To improve disk IO performance you must be clear on the IO challenges and issues your system is suffering from:

  • HDDs have a delay because the read/write head needs to move to the right position
  • Seek time is the time the hard drive takes to position the head over the right track
  • Rotational delay is the time the HDD waits for the right sector to pass under the head
  • If data is spread out all over the disk, a lot of time is lost
  • Disk head movements can be minimized by re-arranging disk requests
  • RHEL does this automatically, putting the requests in a queue and running an elevator algorithm
  • A badly designed elevator algorithm can cause starvation: only the floors in the middle get serviced
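The benefit of re-arranging requests can be illustrated with a small simulation. This is only a sketch: the sector numbers and the starting head position are made up, and the one-way elevator shown here is a simplified model, not the kernel's actual code.

```python
def total_seek_distance(start, order):
    """Sum of head movements (in sectors) when serving requests in `order`."""
    distance, position = 0, start
    for sector in order:
        distance += abs(sector - position)
        position = sector
    return distance


def one_way_elevator(start, requests):
    """Serve requests at or beyond the head in ascending order, then wrap
    around to the lowest pending sector and continue upwards."""
    ahead = sorted(s for s in requests if s >= start)
    behind = sorted(s for s in requests if s < start)
    return ahead + behind


arrival_order = [98, 183, 37, 122, 14, 124, 65, 67]  # made-up sector numbers
head = 53                                            # made-up head position

fifo_cost = total_seek_distance(head, arrival_order)
elevator_order = one_way_elevator(head, arrival_order)
elevator_cost = total_seek_distance(head, elevator_order)

print("arrival order :", arrival_order, "cost", fifo_cost)
print("elevator order:", elevator_order, "cost", elevator_cost)
```

With these numbers the head travels roughly half as far under the elevator ordering as it does when the requests are served in arrival order.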


How to improve disk IO performance using scheduler

Types of IO Scheduler

To improve disk IO performance, various I/O scheduler algorithms exist: deadline, completely fair queuing (cfq), noop and the anticipatory scheduler. These are now considered legacy; the newest as of the time of writing seem to be the mq-deadline and budget fair queuing (bfq) I/O schedulers, with bfq looking very promising for both heavy and light I/O workloads (bfq is a recent addition, merged in kernel version 4.12).


To check the currently active scheduler (it is the one shown in square brackets):

$ cat /sys/block/sda/queue/scheduler 
noop deadline [cfq]


Here, bfq is the I/O scheduler being used on my Fedora 28 system (with a more recent kernel):

$ cat /sys/block/sda/queue/scheduler 
mq-deadline [bfq] none
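The same check can be scripted. Below is a minimal sketch that parses the bracket notation shown above; the function names are hypothetical, and `sda` is simply the device used in this article.

```python
def parse_schedulers(text):
    """Return (active, available) parsed from the contents of a
    /sys/block/<device>/queue/scheduler file."""
    names = text.split()
    available = [name.strip("[]") for name in names]
    active = next(
        (name[1:-1] for name in names
         if name.startswith("[") and name.endswith("]")),
        None,
    )
    return active, available


def read_schedulers(device="sda"):
    """Read and parse the scheduler file of one block device."""
    with open(f"/sys/block/{device}/queue/scheduler") as fh:
        return parse_schedulers(fh.read())


if __name__ == "__main__":
    # The two outputs shown above, used as sample input.
    for sample in ("noop deadline [cfq]", "mq-deadline [bfq] none"):
        print(parse_schedulers(sample))
```

On a real system, `read_schedulers("sda")` returns the same kind of tuple for that disk, provided the device exists.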


The noop elevator

This is the simplest I/O scheduling algorithm. There is no ordered queue: new requests are always added either at the front or at the tail of the dispatch queue, and the next request to be processed is always the first request in the queue.
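The behaviour described above fits in a few lines. This is a toy model under the assumption that a request is just a sector number; the class and method names are made up for illustration.

```python
from collections import deque


class NoopQueue:
    """Dispatch queue without sorting: requests leave in the order they sit."""

    def __init__(self):
        self._dispatch = deque()

    def add(self, request, at_front=False):
        # New requests go to the tail, or to the front when asked to.
        if at_front:
            self._dispatch.appendleft(request)
        else:
            self._dispatch.append(request)

    def next_request(self):
        # Always hand out the first request in the queue, None when empty.
        return self._dispatch.popleft() if self._dispatch else None


queue = NoopQueue()
queue.add(900)
queue.add(10)
queue.add(500, at_front=True)
served = [queue.next_request() for _ in range(3)]
print(served)
```

Note that sector 10 is served last even though it is far away from sector 900: no attempt is made to reduce head movement.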


The CFQ elevator

The main goal of the “Completely Fair Queuing” (CFQ) elevator is to ensure a fair allocation of the disk I/O bandwidth among all the processes that trigger I/O requests. To achieve this, the elevator makes use of a large number of sorted queues (64 by default) that store the requests coming from the different processes. Whenever a request is handed to the elevator, the kernel invokes a hash function that converts the thread group identifier of the current process (which usually corresponds to the PID) into the index of a queue; the elevator then inserts the new request at the tail of this queue. Therefore, requests coming from the same process are always inserted in the same queue.

To refill the dispatch queue, the elevator essentially scans the I/O input queues in a round-robin fashion, selects the first nonempty queue, and moves a batch of requests from that queue into the tail of the dispatch queue.
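The queueing and refill steps described above can be sketched as follows. This is a simplified model, not the kernel implementation: the modulo operation stands in for the kernel's hash function, and the batch size is an arbitrary assumption.

```python
from collections import deque

NUM_QUEUES = 64  # default number of queues mentioned above


class CfqSketch:
    def __init__(self, num_queues=NUM_QUEUES, batch_size=4):
        self.queues = [deque() for _ in range(num_queues)]
        self.dispatch = deque()
        self.batch_size = batch_size  # assumed value, for illustration only
        self._next = 0                # where the round-robin scan resumes

    def queue_index(self, tgid):
        # Stand-in for the kernel's hash of the thread group identifier.
        return tgid % len(self.queues)

    def add_request(self, tgid, sector):
        # Requests from the same process always land in the same queue.
        self.queues[self.queue_index(tgid)].append((tgid, sector))

    def refill_dispatch(self):
        """Move one batch from the next nonempty queue to the dispatch queue.
        Returns the number of requests moved."""
        count = len(self.queues)
        for step in range(count):
            index = (self._next + step) % count
            queue = self.queues[index]
            if queue:
                moved = 0
                while queue and moved < self.batch_size:
                    self.dispatch.append(queue.popleft())
                    moved += 1
                self._next = (index + 1) % count
                return moved
        return 0


cfq = CfqSketch(batch_size=2)
for sector in (10, 11, 12):
    cfq.add_request(tgid=1, sector=sector)   # a busy process
cfq.add_request(tgid=2, sector=50)           # a process with a single request

while cfq.refill_dispatch():
    pass
print(list(cfq.dispatch))
```

The process with a single request gets its turn after the first batch of the busy process, instead of waiting behind all three of its requests.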


The deadline elevator

Besides the dispatch queue, the “Deadline” elevator makes use of four queues. Two of them, the sorted queues, include the read and write requests, respectively, ordered according to their initial sector numbers. The other two, the deadline queues, include the same read and write requests sorted according to their “deadlines.” These queues are introduced to avoid request starvation, which occurs when the elevator policy ignores a request for a very long time because it prefers to handle other requests that are closer to the last served one. A request deadline is essentially an expire timer that starts ticking when the request is passed to the elevator. By default, the expire time of read requests is 500 milliseconds, while the expire time for write requests is 5 seconds. Read requests are privileged over write requests because they usually block the processes that issued them. The deadline ensures that the scheduler looks at a request if it has been waiting a long time, even if it is low in the sort order.

When the elevator must replenish the dispatch queue, it first determines the data direction of the next request. If there are both read and write requests to be dispatched, the elevator chooses the “read” direction, unless the “write” direction has been discarded too many times (to avoid write request starvation).

Next, the elevator checks the deadline queue relative to the chosen direction: if the deadline of the first request in the queue has elapsed, the elevator moves that request to the tail of the dispatch queue; it also moves a batch of requests taken from the sorted queue, starting from the request following the expired one. The length of this batch is longer if the requests happen to be physically adjacent on disk, shorter otherwise.

Finally, if no request is expired, the elevator dispatches a batch of requests starting with the request following the last one taken from the sorted queue. When the cursor reaches the tail of the sorted queue, the search starts again from the top (“one-way elevator”).
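The dispatch logic of the last few paragraphs can be sketched as below. This is a simplified model under several assumptions: time is passed in explicitly as seconds, the batch length is fixed rather than depending on adjacency, and the write starvation limit of 2 is an arbitrary illustrative value.

```python
import bisect
from collections import deque

READ, WRITE = "read", "write"
EXPIRE = {READ: 0.5, WRITE: 5.0}  # seconds, the defaults quoted above


class DeadlineSketch:
    def __init__(self, batch_size=2, writes_starved_limit=2):
        self.sorted = {READ: [], WRITE: []}          # (sector, deadline) by sector
        self.fifo = {READ: deque(), WRITE: deque()}  # same requests by deadline
        self.cursor = {READ: -1, WRITE: -1}          # last sector dispatched
        self.batch_size = batch_size                 # assumed fixed length
        self.writes_starved_limit = writes_starved_limit  # assumed value
        self.starved = 0  # how often writes were passed over in favour of reads

    def add(self, direction, sector, now):
        request = (sector, now + EXPIRE[direction])
        bisect.insort(self.sorted[direction], request)
        self.fifo[direction].append(request)

    def _choose_direction(self):
        has_reads, has_writes = bool(self.fifo[READ]), bool(self.fifo[WRITE])
        if has_reads and has_writes:
            if self.starved >= self.writes_starved_limit:
                self.starved = 0
                return WRITE
            self.starved += 1
            return READ
        if has_reads:
            return READ
        if has_writes:
            self.starved = 0
            return WRITE
        return None

    def dispatch_batch(self, now):
        direction = self._choose_direction()
        if direction is None:
            return []
        queue = self.sorted[direction]
        oldest = self.fifo[direction][0]
        if oldest[1] <= now:
            # Deadline elapsed: start the batch at the expired request.
            start = queue.index(oldest)
        else:
            # Continue after the last dispatched sector, wrapping to the top.
            start = bisect.bisect_right(
                queue, (self.cursor[direction], float("inf")))
            if start == len(queue):
                start = 0
        batch = queue[start:start + self.batch_size]
        for request in batch:
            queue.remove(request)
            self.fifo[direction].remove(request)
        self.cursor[direction] = batch[-1][0]
        return [(direction, sector) for sector, _ in batch]


elevator = DeadlineSketch()
elevator.add(READ, 50, now=0.0)
elevator.add(READ, 10, now=0.4)
elevator.add(READ, 60, now=0.4)
first = elevator.dispatch_batch(now=0.6)   # the request for sector 50 has expired
second = elevator.dispatch_batch(now=0.7)  # nothing expired, cursor wraps around
print(first, second)
```

In this run the expired request for sector 50 is served first, together with its neighbour at sector 60, and the request for sector 10 is picked up once the cursor wraps back to the top of the sorted queue.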


Which elevator to choose to improve disk IO performance?

The noop scheduler, which just pushes data quickly toward the hardware, can improve performance if your hardware has its own large cache to worry about.

On desktop systems with little I/O capacity, the anticipatory scheduler can be helpful to make the most of the underlying disk(s) by better sorting read and write requests into batches. It’s unlikely to be suitable for a typical database server.

The other two options, CFQ and deadline, are impossible to suggest specific use cases for. The reason for this is that the exact behaviour depends on both the Linux kernel you’re using and the associated workload. There are kernel versions where CFQ has terrible performance, and deadline is clearly better because of bugs in the CFQ implementation in that version. In other situations, deadline will add latency (exactly the opposite of what people expect) when the system has a high level of concurrency. And you’re not going to be able to usefully compare them with any simple benchmark. The main differences between CFQ and deadline only show up when there are many concurrent read and write requests fighting for disk time. Which is optimal is completely dependent on that mix.

Anyone who tells you that either CFQ or deadline is always the right choice to improve disk IO performance doesn’t know what they’re talking about. It’s worth trying both when you have a reasonable simulation of your application running, to see if there is a gross difference due to something like a buggy kernel. Try to measure transaction latency, not just average throughput, to maximize your odds of making the correct choice here.
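As a starting point for measuring latency rather than average throughput, the sketch below times individual synchronous writes and reports percentiles. It is only an illustration: the file name, write size and sample count are arbitrary assumptions, and it is no substitute for running a realistic simulation of your application.

```python
import math
import os
import tempfile
import time


def measure_write_latencies(path, count=50, size=4096):
    """Time `count` synchronous writes of `size` bytes; returns seconds each."""
    payload = b"\0" * size
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # ask the kernel to flush the data to the device
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
    return latencies


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


# The temporary directory may live on a RAM-backed filesystem; point the
# path at the disk you actually want to test.
with tempfile.TemporaryDirectory() as tmp:
    samples = measure_write_latencies(os.path.join(tmp, "probe.dat"), count=20)

print(f"median {percentile(samples, 50) * 1000:.2f} ms, "
      f"p99 {percentile(samples, 99) * 1000:.2f} ms, "
      f"max {max(samples) * 1000:.2f} ms")
```

Running the same measurement under each scheduler while the real workload is active, and comparing the high percentiles rather than the median, gives a first hint of how each one treats the slowest requests.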


Lastly, I hope the steps in this article to improve disk IO performance on Linux were helpful. Let me know your suggestions and feedback using the comment section.


In my next article I will share the steps to change the active IO scheduler in Linux.
