In my last article I shared a step-by-step guide to changing the tmpfs partition size for /dev/shm, /run and others using fstab and systemd on Linux. Now to start with this article: cgroups (Control Groups) provide resource management and resource accounting for groups of processes. The kernel implementation of cgroups sits mostly in non-performance-critical paths. The cgroups subsystem implements a virtual file system type named "cgroup", and all cgroup actions are performed through file system operations: creating cgroup directories in a cgroup file system, reading from and writing to entries in those directories, mounting cgroup file systems, and so on.

Beginners Guide to cgroups and slices in Linux with examples

 

A few pointers on cgroups

  • cgroups have been part of the Linux kernel since version 2.6.24 and are integrated with systemd in recent Linux distributions.
  • Control groups place resources in controllers that represent the type of resource, i.e. you can define groups of available resources to make sure an application such as a web server has a guaranteed claim on resources.
  • To do so, cgroups work with default controllers, which are cpu, memory and blkio.
  • These controllers are organized in a tree structure where different weights or limits are applied to each branch:
    • Each of these branches is a cgroup.
    • One or more processes are assigned to a cgroup.
  • cgroups can be applied from the command line or from systemd:
    • Manual creation happens through the cgconfig service and the cgred process.
    • In all cases, cgroup settings are written to /sys/fs/cgroup.
# ls -l /sys/fs/cgroup/
total 0
drwxr-xr-x 2 root root  0 Nov 26 12:48 blkio
lrwxrwxrwx 1 root root 11 Nov 26 12:48 cpu -> cpu,cpuacct
lrwxrwxrwx 1 root root 11 Nov 26 12:48 cpuacct -> cpu,cpuacct
drwxr-xr-x 2 root root  0 Nov 26 12:48 cpu,cpuacct
drwxr-xr-x 2 root root  0 Nov 26 12:48 cpuset
drwxr-xr-x 3 root root  0 Nov 26 12:50 devices
drwxr-xr-x 2 root root  0 Nov 26 12:48 freezer
drwxr-xr-x 2 root root  0 Nov 26 12:48 hugetlb
drwxr-xr-x 2 root root  0 Nov 26 12:48 memory
lrwxrwxrwx 1 root root 16 Nov 26 12:48 net_cls -> net_cls,net_prio
drwxr-xr-x 2 root root  0 Nov 26 12:48 net_cls,net_prio
lrwxrwxrwx 1 root root 16 Nov 26 12:48 net_prio -> net_cls,net_prio
drwxr-xr-x 2 root root  0 Nov 26 12:48 perf_event
drwxr-xr-x 2 root root  0 Nov 26 12:48 pids
drwxr-xr-x 4 root root  0 Nov 26 12:48 systemd

These are the different controllers created by the kernel itself. Each of these controllers has its own set of tunables, for example (cpuacct is a symlink to the combined cpu,cpuacct hierarchy, which is why cpu.* files show up here as well):

# ls -l /sys/fs/cgroup/cpuacct/
total 0
-rw-r--r-- 1 root root 0 Nov 26 12:48 cgroup.clone_children
--w--w---- 1 root root 0 Nov 26 12:48 cgroup.event_control
-rw-r--r-- 1 root root 0 Nov 26 12:48 cgroup.procs
-r--r--r-- 1 root root 0 Nov 26 12:48 cgroup.sane_behavior
-r--r--r-- 1 root root 0 Nov 26 12:48 cpuacct.stat
-rw-r--r-- 1 root root 0 Nov 26 12:48 cpuacct.usage
-r--r--r-- 1 root root 0 Nov 26 12:48 cpuacct.usage_percpu
-rw-r--r-- 1 root root 0 Nov 26 12:48 cpu.cfs_period_us
-rw-r--r-- 1 root root 0 Nov 26 12:48 cpu.cfs_quota_us
-rw-r--r-- 1 root root 0 Nov 26 12:48 cpu.rt_period_us
-rw-r--r-- 1 root root 0 Nov 26 12:48 cpu.rt_runtime_us
-rw-r--r-- 1 root root 0 Nov 26 12:48 cpu.shares
-r--r--r-- 1 root root 0 Nov 26 12:48 cpu.stat
-rw-r--r-- 1 root root 0 Nov 26 12:48 notify_on_release
-rw-r--r-- 1 root root 0 Nov 26 12:48 release_agent
-rw-r--r-- 1 root root 0 Nov 26 12:48 tasks
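
Because cgroups are exposed as a file system, you can experiment with these tunables directly with nothing more than mkdir, echo and cat. A minimal sketch on a cgroup v1 system (the group name "demo" is made up for illustration; run as root):

# mkdir /sys/fs/cgroup/cpu/demo
# echo 512 > /sys/fs/cgroup/cpu/demo/cpu.shares
# echo $$ > /sys/fs/cgroup/cpu/demo/tasks
# grep cpu /proc/self/cgroup
# echo $$ > /sys/fs/cgroup/cpu/tasks
# rmdir /sys/fs/cgroup/cpu/demo

Here mkdir creates a child cgroup that is automatically populated with its own copy of the tunables, writing the shell's PID ($$) to tasks moves the shell into the group, moving it back out empties the group again, and an empty cgroup can simply be removed with rmdir.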

Understanding slices

By default, systemd automatically creates a hierarchy of slice, scope and service units to provide a unified structure for the cgroup tree. Services, scopes and slices can be created manually by the system administrator or dynamically by programs. By default, the operating system defines a number of built-in services that are necessary to run the system. In addition, four slices are created by default:

  • -.slice — the root slice;
  • system.slice — the default place for all system services;
  • user.slice — the default place for all user sessions;
  • machine.slice — the default place for all virtual machines and Linux containers.
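
You can view this hierarchy on a running system with systemd-cgls, which dumps the cgroup tree; system services appear under system.slice and login sessions under user.slice:

# systemd-cgls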

 

How are resources allocated among slices?

Let us take CPUShares as an example.

Now assume we assign the below CPUShares values to the default slices:

system.slice -> 1024
user.slice -> 256
machine.slice -> 2048

 

What do these values mean?

Individually they actually mean nothing; these values are only used as relative weights compared across all sibling slices. If the total CPU availability is 100%, the shares add up to 1024 + 256 + 2048 = 3328, so user.slice gets 256/3328 ≈ 8%, system.slice gets four times the allocation of user.slice (1024/3328 ≈ 31%), and machine.slice gets twice the allocation of system.slice (2048/3328 ≈ 62%) of the available CPU resource.
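
These weights do not have to live in unit files; on a systemd system they can also be set on the fly. A small sketch using the example values above (--runtime keeps the change non-persistent; on newer, cgroup-v2 based systems the equivalent knob is CPUWeight=):

# systemctl set-property --runtime system.slice CPUShares=1024
# systemctl set-property --runtime user.slice CPUShares=256
# systemctl set-property --runtime machine.slice CPUShares=2048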

 

What if I create multiple services in system.slice?

This is a valid question. Assume I created three services inside system.slice with the CPUShares values defined below:

service1 -> 1024
service2 -> 256
service3 -> 512

If we sum these up, the total (1792) becomes larger than the 1024 actually assigned to system.slice in the example above. Again, these values are only meant for comparison among siblings and mean nothing in absolute terms. Here service1 gets the largest share of the available resource: if 100% of a resource is available to system.slice, then service1 gets 1024/1792 ≈ 57%, service2 gets 256/1792 ≈ 14% and service3 gets 512/1792 ≈ 29% of the available CPU.

 

This is how cgroup settings relate at the top level between different slices, and within each slice between its services.

 

How to create a custom slice?

  • The name of a slice unit corresponds to its location in the hierarchy.
  • A child slice inherits the settings of its parent slice.
  • The dash (“-”) character acts as a separator of the path components.

For example, if the name of a slice looks as follows:

parent-name.slice

It means that the slice called parent-name.slice is a subslice of parent.slice. This slice can have its own subslice named parent-name-name2.slice, and so on.
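
As a minimal sketch, a custom parent slice and its subslice could look like the below unit files (the names parent.slice and parent-name.slice and the CPUShares values are only for illustration):

# cat /etc/systemd/system/parent.slice
[Unit]
Description=Example parent slice

[Slice]
CPUShares=1024

# cat /etc/systemd/system/parent-name.slice
[Unit]
Description=Example subslice of parent.slice

[Slice]
CPUShares=512

A service is then placed into the subslice by adding Slice=parent-name.slice to the [Service] section of its unit file.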

 

Test resource allocation using examples

Now we will create two systemd unit files, namely stress1.service and stress2.service. These services will try to utilise all the CPU on my system.

NOTE:
For demonstration I have disabled all other CPUs and enabled only one, because with more than one CPU the load would be distributed and I would not be able to demonstrate the resource allocation for CPUShares.
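
If you want to reproduce this single-CPU setup, secondary CPUs can usually be taken offline through sysfs (cpu1 here is just an example; cpu0 can typically not be taken offline):

# echo 0 > /sys/devices/system/cpu/cpu1/online
# grep -c ^processor /proc/cpuinfo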

Using these systemd unit files I will put some CPU load on system.slice.

# cat /etc/systemd/system/stress1.service
[Unit]
Description=Put some stress

[Service]
Type=simple
ExecStart=/usr/bin/dd if=/dev/zero of=/dev/null

This is my second unit file with the same content to stress the CPU:

# cat /etc/systemd/system/stress2.service
[Unit]
Description=Put some stress

[Service]
Type=simple
ExecStart=/usr/bin/dd if=/dev/zero of=/dev/null

Start these services

# systemctl daemon-reload
# systemctl start stress1
# systemctl start stress2

Now validate the CPU usage using the top command:

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 1994 root      20   0  107992    608    516 R 49.7  0.0   0:03.11 dd
 2001 root      20   0  107992    612    516 R 49.7  0.0   0:02.21 dd

As you can see, I have two processes trying to utilise the available CPU. Since both are in the system slice, they share the available resource equally, so each process gets ~50% of the CPU as expected.

 

Now let us try to add a new process in user.slice using a while loop in the background:

# while true; do true; done &

Next check the CPU usage; as expected, the available CPU is now equally divided among the 3 processes. There is no distinction between user.slice and system.slice.

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 1983 root      20   0  116220   1404    152 R 32.9  0.0   1:53.28 bash
 2193 root      20   0  107992    608    516 R 32.9  0.0   0:07.59 dd
 2200 root      20   0  107992    612    516 R 32.9  0.0   0:07.13 dd
NOTE:
Here both the user and system slice level processes get an equal amount of the available resource. That is because in our distribution DefaultCPUAccounting, DefaultBlockIOAccounting and DefaultMemoryAccounting are disabled by default.
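
You can check the current state of these defaults by querying the systemd manager properties; on this system all three report "no" at this point:

# systemctl show | grep -E 'DefaultCPUAccounting|DefaultBlockIOAccounting|DefaultMemoryAccounting'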

Now let us enable the slicing by enabling the below values in “/etc/systemd/system.conf”:

DefaultCPUAccounting=yes
DefaultBlockIOAccounting=yes
DefaultMemoryAccounting=yes

IMPORTANT NOTE:
It is important to reboot the node to activate these changes.

Once the system is back up, we will again start our stress1 and stress2 services and a while loop using the bash shell:

# systemctl start stress1
# systemctl start stress2

# while true; do true; done &

Now validate the CPU usage using the top command:

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 2132 root      20   0  116220   1520    392 R 49.3  0.0   1:16.47 bash
 1994 root      20   0  107992    608    516 R 24.8  0.0   2:30.40 dd
 2001 root      20   0  107992    612    516 R 24.8  0.0   2:29.50 dd

As you can see, our slicing has now become effective. The user slice is able to claim 50% of the CPU, while the system slice's share is split at ~25% for each stress service.

Let us now go a step further and prioritise the CPU allocation using CPUShares in our systemd unit files.

# cat /etc/systemd/system/stress2.service
[Unit]
Description=Put some stress

[Service]
CPUShares=1024
Type=simple
ExecStart=/usr/bin/dd if=/dev/zero of=/dev/null

# cat /etc/systemd/system/stress1.service
[Unit]
Description=Put some stress

[Service]
CPUShares=512
Type=simple
ExecStart=/usr/bin/dd if=/dev/zero of=/dev/null

Now in the above unit files I have given priority to stress2.service, so it will be allowed double the resources allocated to stress1.service.

NOTE:
The allowed range for CPUShares is 2 to 262144. Defaults to 1024.

Next restart the services

# systemctl daemon-reload
# systemctl restart stress1
# systemctl restart stress2

Validate the top output

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 2132 root      20   0  116220   1520    392 R 49.7  0.0   2:43.11 bash
 2414 root      20   0  107992    612    516 R 33.1  0.0   0:04.85 dd
 2421 root      20   0  107992    608    516 R 16.6  0.0   0:01.95 dd

So as expected, out of the ~50% of CPU resources available to system.slice, stress2 gets double the CPU allocated to the stress1 service.

NOTE:
If there are no active processes running in user.slice, then system.slice will be allowed to use up to 100% of the available CPU resource.

 

Monitor resource usage per slice

systemd-cgtop shows the top control groups of the local Linux control group hierarchy, ordered by their CPU, memory, or disk I/O load. The display is refreshed at regular intervals (by default every 1s), similar in style to the top command. Resource usage is only accounted for control groups in the relevant hierarchy, i.e. CPU usage is only accounted for control groups in the “cpuacct” hierarchy, memory usage only for those in “memory” and disk I/O usage for those in “blkio”.
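
Below is systemd-cgtop on the demo system while the stress services are running:

# systemd-cgtop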

Path                                                                               Tasks   %CPU   Memory  Input/s Output/s

/                                                                                     56  100.0   309.0M        -        -
/system.slice                                                                          -   97.5   277.4M        -        -
/system.slice/stress3.service                                                          1   59.9   104.0K        -        -
/system.slice/stress1.service                                                          1   29.9   104.0K        -        -
/system.slice/stress2.service                                                          1    7.5   108.0K        -        -
/user.slice                                                                            -    1.7    10.7M        -        -
/user.slice/user-0.slice                                                               -    1.7    10.4M        -        -
/user.slice/user-0.slice/session-7.scope                                               3    1.7     4.6M        -        -
/system.slice/pacemaker.service                                                        7    0.0    41.6M        -        -
/system.slice/pcsd.service                                                             1    0.0    46.8M        -        -
/system.slice/fail2ban.service                                                         1    0.0     9.0M        -        -
/system.slice/dhcpd.service                                                            1    0.0     4.4M        -        -
/system.slice/tuned.service                                                            1    0.0    11.8M        -        -
/system.slice/NetworkManager.service                                                   3    0.0    11.3M        -        -
/system.slice/httpd.service                                                            6    0.0     4.6M        -        -
/system.slice/abrt-oops.service                                                        1    0.0     1.4M        -        -
/system.slice/rsyslog.service                                                          1    0.0     1.5M        -        -
/system.slice/rngd.service                                                             1    0.0   176.0K        -        -
/system.slice/ModemManager.service                                                     1      -     3.6M        -        -
/system.slice/NetworkManager-dispatcher.service                                        1      -   944.0K        -        -

 

Lastly, I hope this article on understanding cgroups and slices in Linux with examples was helpful. Let me know your suggestions and feedback in the comment section.

 
