In my last article I shared a step-by-step guide to changing the tmpfs partition size for /dev/shm, /run and others using systemd on Linux. This article starts with the basics: a cgroup, or control group, provides resource management and resource accounting for groups of processes. The kernel implementation of cgroups sits mostly in paths that are not performance-critical. The cgroups subsystem implements a new Virtual File System (VFS) type named “cgroups”. All cgroup actions are done through filesystem actions, such as creating cgroup directories in a cgroup filesystem, writing to or reading from entries in these directories, mounting cgroup filesystems, etc.
A few pointers on cgroups
- cgroups have been part of the Linux kernel since version 2.6.24 and are integrated with systemd in recent Linux distributions.
- Control groups place resources in controllers that represent the type of resource, i.e. you can define groups of available resources to make sure an application such as a web server has a guaranteed claim on resources.
- To do so, cgroups work with default controllers, which are cpu, memory and blkio.
- These controllers are divided into a tree structure where different weights or limits are applied to each branch.
- Each of these branches is a cgroup
- One or more processes are assigned to a cgroup
- cgroups can be applied from the command line or from systemd.
- Manual creation happens through the cgconfig service.
- In all cases, cgroup settings are written to /sys/fs/cgroup.
    # ls -l /sys/fs/cgroup/
    total 0
    drwxr-xr-x 2 root root  0 Nov 26 12:48 blkio
    lrwxrwxrwx 1 root root 11 Nov 26 12:48 cpu -> cpu,cpuacct
    lrwxrwxrwx 1 root root 11 Nov 26 12:48 cpuacct -> cpu,cpuacct
    drwxr-xr-x 2 root root  0 Nov 26 12:48 cpu,cpuacct
    drwxr-xr-x 2 root root  0 Nov 26 12:48 cpuset
    drwxr-xr-x 3 root root  0 Nov 26 12:50 devices
    drwxr-xr-x 2 root root  0 Nov 26 12:48 freezer
    drwxr-xr-x 2 root root  0 Nov 26 12:48 hugetlb
    drwxr-xr-x 2 root root  0 Nov 26 12:48 memory
    lrwxrwxrwx 1 root root 16 Nov 26 12:48 net_cls -> net_cls,net_prio
    drwxr-xr-x 2 root root  0 Nov 26 12:48 net_cls,net_prio
    lrwxrwxrwx 1 root root 16 Nov 26 12:48 net_prio -> net_cls,net_prio
    drwxr-xr-x 2 root root  0 Nov 26 12:48 perf_event
    drwxr-xr-x 2 root root  0 Nov 26 12:48 pids
    drwxr-xr-x 4 root root  0 Nov 26 12:48 systemd
These are the different controllers, which are created by the kernel itself. Each of these controllers has its own tunables, for example:
    # ls -l /sys/fs/cgroup/cpuacct/
    total 0
    -rw-r--r-- 1 root root 0 Nov 26 12:48 cgroup.clone_children
    --w--w---- 1 root root 0 Nov 26 12:48 cgroup.event_control
    -rw-r--r-- 1 root root 0 Nov 26 12:48 cgroup.procs
    -r--r--r-- 1 root root 0 Nov 26 12:48 cgroup.sane_behavior
    -r--r--r-- 1 root root 0 Nov 26 12:48 cpuacct.stat
    -rw-r--r-- 1 root root 0 Nov 26 12:48 cpuacct.usage
    -r--r--r-- 1 root root 0 Nov 26 12:48 cpuacct.usage_percpu
    -rw-r--r-- 1 root root 0 Nov 26 12:48 cpu.cfs_period_us
    -rw-r--r-- 1 root root 0 Nov 26 12:48 cpu.cfs_quota_us
    -rw-r--r-- 1 root root 0 Nov 26 12:48 cpu.rt_period_us
    -rw-r--r-- 1 root root 0 Nov 26 12:48 cpu.rt_runtime_us
    -rw-r--r-- 1 root root 0 Nov 26 12:48 cpu.shares
    -r--r--r-- 1 root root 0 Nov 26 12:48 cpu.stat
    -rw-r--r-- 1 root root 0 Nov 26 12:48 notify_on_release
    -rw-r--r-- 1 root root 0 Nov 26 12:48 release_agent
    -rw-r--r-- 1 root root 0 Nov 26 12:48 tasks
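Since every cgroup action is a filesystem action, membership can be inspected with ordinary text tools. The following is a minimal sketch, assuming the cgroup v1 layout shown above, where each line of /proc/<pid>/cgroup has the form hierarchy-ID:controller-list:path. The function name cgroup_path_for is hypothetical and not part of any tool.

```shell
# Print the cgroup path a process belongs to for one controller (cgroup v1).
# $1 = controller name, e.g. cpuacct
# $2 = file to read; defaults to /proc/self/cgroup (the current shell)
cgroup_path_for() {
    awk -v ctrl="$1" '
        {
            split($0, field, ":")               # field[2] = controller list
            n = split(field[2], names, ",")     # e.g. "cpu,cpuacct"
            for (i = 1; i <= n; i++) {
                if (names[i] == ctrl) {
                    sub(/^[^:]*:[^:]*:/, "")    # strip "ID:controllers:"
                    print                       # what remains is the path
                    exit
                }
            }
        }
    ' "${2:-/proc/self/cgroup}"
}
```

For example, `cgroup_path_for cpuacct /proc/1994/cgroup` would print the cpuacct cgroup of PID 1994, if such a process exists on your system.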
By default, systemd automatically creates a hierarchy of slice, scope and service units to provide a unified structure for the cgroup tree. Services, scopes, and slices are created manually by the system administrator or dynamically by programs. By default, the operating system defines a number of built-in services that are necessary to run the system. Also, there are four slices created by default:
- -.slice — the root slice;
- system.slice — the default place for all system services;
- user.slice — the default place for all user sessions;
- machine.slice — the default place for all virtual machines and Linux containers.
How are resources allocated in a slice?
Let us take an example of CPUShares. Assume we assign the following CPUShares values to the slices:
    system.slice  -> 1024
    user.slice    -> 256
    machine.slice -> 2048
What do these values mean?
Individually they mean nothing; they are only used as a comparison factor between the slices. The total here is 1024 + 256 + 2048 = 3328, so if the total CPU availability is 100% and every slice is busy, user.slice will get ~8%, system.slice will get 4 times the allocation of user.slice, i.e. ~31%, and machine.slice will get twice the allocation of system.slice, i.e. ~62% of the available CPU resource.
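The arithmetic behind these percentages can be reproduced with a small shell helper. This is only a sketch: share_percent is a hypothetical name, and it assumes that every listed slice is fully busy, so that each one receives its shares divided by the sum of all shares.

```shell
# Print each NAME=SHARES argument as a percentage of the total shares.
share_percent() {
    printf '%s\n' "$@" | awk -F= '
        { name[NR] = $1; share[NR] = $2; total += $2 }
        END {
            for (i = 1; i <= NR; i++)
                printf "%s %.1f%%\n", name[i], 100 * share[i] / total
        }'
}

share_percent system.slice=1024 user.slice=256 machine.slice=2048
```

With the values from the example this prints 30.8% for system.slice, 7.7% for user.slice and 61.5% for machine.slice.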
What if I create multiple services in system.slice?
This is a valid question. Assume I created three services inside system.slice with the CPUShares values defined below:
    service1 -> 1024
    service2 -> 256
    service3 -> 512
If we sum these up, the total (1792) is larger than the 1024 assigned to system.slice in the above example. Again, these values are only meant for comparison and mean nothing in absolute terms. Here service1 will get the largest amount of the available resource, i.e. if 100% of the resource is available to system.slice, then service1 will get ~57%, service2 will get ~14% and service3 will get ~29% of the available CPU.
This is how cgroup settings relate at the top level between different slices, and within a slice between different services.
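Combining both levels gives the share of the whole machine that a single service can expect. The helper below is a sketch under the assumption that all three slices and all three services are CPU-bound at the same time; absolute_percent is a hypothetical name, and the totals 3328 and 1792 are the sums of the CPUShares values used in the two examples.

```shell
# Share of the whole CPU for one service:
#   (slice shares / sum of slice shares) * (service shares / sum of service shares)
# $1 = slice shares, $2 = total of all slice shares,
# $3 = service shares, $4 = total of all service shares in that slice
absolute_percent() {
    awk -v s="$1" -v st="$2" -v u="$3" -v ut="$4" \
        'BEGIN { printf "%.1f%%\n", 100 * (s / st) * (u / ut) }'
}

absolute_percent 1024 3328 1024 1792   # service1 inside system.slice
absolute_percent 1024 3328 256 1792    # service2 inside system.slice
absolute_percent 1024 3328 512 1792    # service3 inside system.slice
```

The three results add up to the ~31% that system.slice receives as a whole.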
How to create a custom slice?
- Every name of a slice unit corresponds to the path to a location in the hierarchy.
- A child slice inherits the settings of its parent slice.
- The dash (“-”) character acts as a separator of the path components.
For example, if the name of a slice looks as follows:
    parent-name.slice
it means that a slice called parent-name.slice is a subslice of parent.slice. This slice can have its own subslice named parent-name-name2.slice, and so on.
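The naming rule can be illustrated by expanding a slice name into its position in the hierarchy. This is a sketch only; slice_to_path is a hypothetical helper written for this article, not a systemd command.

```shell
# Expand a slice unit name into its path in the slice hierarchy.
# The body runs in a subshell ( ... ) so the IFS change does not leak out.
slice_to_path() (
    if [ "$1" = "-.slice" ]; then      # the root slice maps to the root
        printf '/\n'
        exit 0
    fi
    name=${1%.slice}                   # drop the ".slice" suffix
    set -f                             # no globbing while splitting
    IFS=-                              # split the name on dashes
    path=""
    prefix=""
    for part in $name; do
        prefix=${prefix:+$prefix-}$part
        path="$path/$prefix.slice"
    done
    printf '%s\n' "$path"
)
```

Calling `slice_to_path parent-name-name2.slice` prints /parent.slice/parent-name.slice/parent-name-name2.slice, which shows that each dash adds one level to the tree.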
Test resource allocation using examples
Now we will create two systemd unit files, namely stress1.service and stress2.service. These services will try to utilise all the CPU on my system.
Using these systemd unit files I will put some CPU load on the system using dd.
    # cat /etc/systemd/system/stress1.service
    [Unit]
    Description=Put some stress

    [Service]
    Type=simple
    ExecStart=/usr/bin/dd if=/dev/zero of=/dev/null
This is my second unit file, with the same content, to stress the CPU:
    # cat /etc/systemd/system/stress2.service
    [Unit]
    Description=Put some stress

    [Service]
    Type=simple
    ExecStart=/usr/bin/dd if=/dev/zero of=/dev/null
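Because both unit files have the same content, they can also be generated from a single template. The function below is a sketch; write_stress_unit is a hypothetical name, and the optional second argument exists only so that the function can be tried in a scratch directory before writing to /etc/systemd/system.

```shell
# Write a dd-based stress unit.
# $1 = unit name without the .service suffix
# $2 = target directory; defaults to /etc/systemd/system (needs root)
write_stress_unit() {
    cat > "${2:-/etc/systemd/system}/$1.service" <<'EOF'
[Unit]
Description=Put some stress

[Service]
Type=simple
ExecStart=/usr/bin/dd if=/dev/zero of=/dev/null
EOF
}
```

Running `write_stress_unit stress1` and `write_stress_unit stress2` as root would create both files shown above.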
Start these services:
    # systemctl daemon-reload
    # systemctl start stress1
    # systemctl start stress2
Now validate the CPU usage using top:
      PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
     1994 root      20   0  107992    608    516 R 49.7  0.0   0:03.11 dd
     2001 root      20   0  107992    612    516 R 49.7  0.0   0:02.21 dd
As you can see, I have two processes which are trying to utilise the available CPU. Since both are in the system slice, they share the available resource equally, so both processes get ~50% of the CPU as expected.
Now let us add a new process in user.slice by running a while loop in the background:
    # while true; do true; done &
Next check the CPU usage. The available CPU is now equally divided between the 3 processes; there is no distinction between user.slice and system.slice:
      PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
     1983 root      20   0  116220   1404    152 R 32.9  0.0   1:53.28 bash
     2193 root      20   0  107992    608    516 R 32.9  0.0   0:07.59 dd
     2200 root      20   0  107992    612    516 R 32.9  0.0   0:07.13 dd
Now let us enable slicing by setting the values below in /etc/systemd/system.conf:
    DefaultCPUAccounting=yes
    DefaultBlockIOAccounting=yes
    DefaultMemoryAccounting=yes
Reboot the node to activate the changes.
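Before rebooting, it can be worth confirming that the keys are really set and not left commented out. The check below is a sketch: accounting_enabled is a hypothetical helper, it only recognises the exact form KEY=yes without surrounding spaces, and it does not look at drop-in directories.

```shell
# Succeed (exit 0) if KEY=yes is set in a systemd manager config file.
# $1 = key, e.g. DefaultCPUAccounting
# $2 = file to check; defaults to /etc/systemd/system.conf
accounting_enabled() {
    awk -F= -v key="$1" '
        $1 == key { value = $2 }       # the last assignment wins;
                                       # "#KEY=..." lines do not match
        END { if (value == "yes") exit 0; exit 1 }
    ' "${2:-/etc/systemd/system.conf}"
}
```

A possible use is `accounting_enabled DefaultCPUAccounting && echo "CPU accounting is on"`.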
Once the system is back up, we will again start our stress1 and stress2 services and a while loop in the bash shell:
    # systemctl start stress1
    # systemctl start stress2
    # while true; do true; done &
Now validate the CPU usage using top:
      PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
     2132 root      20   0  116220   1520    392 R 49.3  0.0   1:16.47 bash
     1994 root      20   0  107992    608    516 R 24.8  0.0   2:30.40 dd
     2001 root      20   0  107992    612    516 R 24.8  0.0   2:29.50 dd
As you can see, our slicing has now become effective. The user slice is able to claim 50% of the CPU, while the 50% that goes to the system slice is divided at ~25% for each of the stress services.
Let us now further reserve the CPU using CPUShares for our systemd unit files.
    # cat /etc/systemd/system/stress2.service
    [Unit]
    Description=Put some stress

    [Service]
    CPUShares=1024
    Type=simple
    ExecStart=/usr/bin/dd if=/dev/zero of=/dev/null

    # cat /etc/systemd/system/stress1.service
    [Unit]
    Description=Put some stress

    [Service]
    CPUShares=512
    Type=simple
    ExecStart=/usr/bin/dd if=/dev/zero of=/dev/null
In the above unit files I have given priority to stress2.service, so it will be allowed double the resource allocated to stress1.service.
Next restart the services and check the CPU usage again:
    # systemctl daemon-reload
    # systemctl restart stress1
    # systemctl restart stress2

      PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
     2132 root      20   0  116220   1520    392 R 49.7  0.0   2:43.11 bash
     2414 root      20   0  107992    612    516 R 33.1  0.0   0:04.85 dd
     2421 root      20   0  107992    608    516 R 16.6  0.0   0:01.95 dd
So as expected, out of the available 50% CPU resources for system.slice, stress2 gets double the CPU allocated to stress1 service.
Note that these shares only apply under contention: if no other slice is competing for the CPU, system.slice will be allowed to use up to 100% of the available CPU resource.
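To confirm that a CPUShares value has reached the kernel, the cpu.shares tunable of the unit's cgroup can be read directly. This is a sketch that assumes the cgroup v1 layout shown earlier and a unit that lives in system.slice; unit_cpu_shares is a hypothetical helper.

```shell
# Print the cpu.shares value the kernel holds for a unit in system.slice.
# $1 = unit name, e.g. stress2.service
# $2 = mount point of the cpu controller; defaults to /sys/fs/cgroup/cpu
unit_cpu_shares() {
    cat "${2:-/sys/fs/cgroup/cpu}/system.slice/$1/cpu.shares"
}
```

With the unit files above, `unit_cpu_shares stress2.service` should report 1024 and `unit_cpu_shares stress1.service` should report 512 once the services have been restarted.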
Monitor resource usage per slice
systemd-cgtop shows the top control groups of the local Linux control group hierarchy, ordered by their CPU, memory, or disk I/O load. The display is refreshed at regular intervals (by default every 1s), similar in style to the top command. Resource usage is only accounted for control groups in the relevant hierarchy, i.e. CPU usage is only accounted for control groups in the “cpuacct” hierarchy, memory usage only for those in “memory” and disk I/O usage for those in “blkio”.
    Path                                              Tasks   %CPU   Memory  Input/s Output/s
    /                                                    56  100.0   309.0M        -        -
    /system.slice                                         -   97.5   277.4M        -        -
    /system.slice/stress3.service                         1   59.9   104.0K        -        -
    /system.slice/stress1.service                         1   29.9   104.0K        -        -
    /system.slice/stress2.service                         1    7.5   108.0K        -        -
    /user.slice                                           -    1.7    10.7M        -        -
    /user.slice/user-0.slice                              -    1.7    10.4M        -        -
    /user.slice/user-0.slice/session-7.scope              3    1.7     4.6M        -        -
    /system.slice/pacemaker.service                       7    0.0    41.6M        -        -
    /system.slice/pcsd.service                            1    0.0    46.8M        -        -
    /system.slice/fail2ban.service                        1    0.0     9.0M        -        -
    /system.slice/dhcpd.service                           1    0.0     4.4M        -        -
    /system.slice/tuned.service                           1    0.0    11.8M        -        -
    /system.slice/NetworkManager.service                  3    0.0    11.3M        -        -
    /system.slice/httpd.service                           6    0.0     4.6M        -        -
    /system.slice/abrt-oops.service                       1    0.0     1.4M        -        -
    /system.slice/rsyslog.service                         1    0.0     1.5M        -        -
    /system.slice/rngd.service                            1    0.0   176.0K        -        -
    /system.slice/ModemManager.service                    1      -     3.6M        -        -
    /system.slice/NetworkManager-dispatcher.service       1      -   944.0K        -        -
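Output in this format is easy to post-process. The filter below is a sketch that reads systemd-cgtop style lines on standard input and prints the busiest services by %CPU; busiest_units is a hypothetical name, and the column positions are assumed to match the listing above.

```shell
# Read systemd-cgtop style output on stdin and print the busiest services.
# $1 = number of services to show (default 3)
busiest_units() {
    awk '$1 ~ /\.service$/ && $3 != "-" { print $3, $1 }' |
        LC_ALL=C sort -rn |            # highest %CPU first
        head -n "${1:-3}"
}
```

One possible use is to feed it from batch mode, for example `systemd-cgtop -b -n 2 | busiest_units 5`; more than one iteration is requested here on the assumption that the first iteration may not contain CPU figures yet.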
Lastly, I hope this article on understanding cgroups and slices with examples on Linux was helpful. Let me know your suggestions and feedback using the comment section.