In the previous chapter, we talked about securing the API server. If an attacker gets access to the API server, they can run whatever they like by packaging their code into a container image and running it in a pod. But can they do any real damage? Aren’t containers isolated from other containers and from the node they’re running on?
Not necessarily. In this chapter, you’ll learn how to allow pods to access the resources of the node they’re running on.
Using the host node's namespace in a pod
Containers in a pod usually run under separate Linux namespaces, which isolate their processes from processes running in other containers or in the node’s default namespaces.
For example, we learned that each pod gets its own IP and port space, because it uses its own network namespace. Likewise, each pod has its own process tree, because it has its own PID namespace, and it also uses its own IPC namespace, allowing only processes in the same pod to communicate with each other through the Inter-Process Communication mechanism (IPC).
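To see what this looks like at the Linux level, you can inspect a process’s namespace memberships under /proc. This is a minimal sketch, assuming a Linux host; the inode numbers in the output will differ from machine to machine:

```shell
# Each namespace a process belongs to is exposed as a symlink under /proc/<pid>/ns.
# Two processes share a namespace only if these symlinks resolve to the same inode.
readlink /proc/self/ns/net   # network namespace (own IP and port space)
readlink /proc/self/ns/pid   # PID namespace (own process tree)
readlink /proc/self/ns/ipc   # IPC namespace (System V IPC, POSIX message queues)
```

Inside a pod that shares the host’s network namespace, the net symlink would resolve to the same inode seen by processes running directly on the node.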
Using the node’s network namespace in a pod
Certain pods (usually system pods) need to operate in the host’s default namespaces, allowing them to see and manipulate node-level resources and devices. For example, a pod may need to use the node’s network adapters instead of its own virtual network adapters. This can be achieved by setting the hostNetwork property in the pod spec to true.
Here we create one such pod with hostNetwork: true in its spec:
[root@controller ~]# cat pod-with-host-network.yml
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-host-network
spec:
  hostNetwork: true
  containers:
  - name: main
    image: alpine
    command: ["/bin/sleep", "999999"]
Next create the pod:
[root@controller ~]# kubectl create -f pod-with-host-network.yml
pod/pod-with-host-network created
And verify the status of the pod:
[root@controller ~]# kubectl get pods
NAME READY STATUS RESTARTS AGE
pod-with-host-network 1/1 Running 0 64s
sidecar-pod 2/2 Running 2 11h
Now we can connect to the container from this new pod and check the list of available interfaces:
[root@controller ~]# kubectl exec pod-with-host-network -- ifconfig
datapath: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1376
        ether 22:a9:66:98:e7:f3  txqueuelen 1000  (Ethernet)
        RX packets 31  bytes 1624 (1.5 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
        ether 02:42:29:35:5d:9b  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.2.15  netmask 255.255.255.0  broadcast 10.0.2.255
        ether 08:00:27:b6:89:b5  txqueuelen 1000  (Ethernet)
        RX packets 711  bytes 61270 (59.8 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 395  bytes 67922 (66.3 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
...
As expected, the interfaces from the host node are now visible inside the container.
Binding to a host port without using the host’s network namespace
A related feature allows pods to bind to a port in the node’s default namespace while still keeping their own network namespace. This is done using the hostPort property of one of the container’s ports defined in the spec.containers.ports field. The hostPort feature is primarily used for exposing system services, which are deployed to every node using DaemonSets.
It’s important to understand that if a pod is using a specific host port, only one instance of the pod can be scheduled to each node, because two processes can’t bind to the same host port.
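To make the DaemonSet connection concrete, here is a hedged sketch of such a manifest; the node-agent name, image, and port numbers are illustrative assumptions, not objects from this tutorial:

```yaml
# Hypothetical DaemonSet: one copy of this pod runs on every node, and each
# copy binds port 9100 of its own node while keeping its own network namespace.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent
spec:
  selector:
    matchLabels:
      app: node-agent
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      containers:
      - name: agent
        image: alpine
        command: ["/bin/sleep", "999999"]
        ports:
        - containerPort: 9100
          hostPort: 9100
          protocol: TCP
```

Because every replica claims the same hostPort, the DaemonSet’s one-pod-per-node placement is exactly what makes this scheme work without port conflicts.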
Here we create an nginx container with a hostPort, so the container can be reached from an external network through the node it is running on:
[root@controller ~]# cat nginx-lab.yml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-lab
  namespace: default
spec:
  containers:
  - name: nginx-lab
    image: nginx
    ports:
    - containerPort: 80
      hostPort: 9000
      protocol: TCP
With this configuration, the container can be reached on port 80 of the pod’s IP, and also on port 9000 of the node it’s deployed on.
Create the pod using this YAML file:
[root@controller ~]# kubectl create -f nginx-lab.yml
pod/nginx-lab created
Next, verify the status of the newly created pod to make sure it is in the Running state:
[root@controller ~]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx 1/1 Running 0 7m11s 10.44.0.1 worker-2.example.com <none> <none>
nginx-lab 1/1 Running 0 16s 10.44.0.2 worker-2.example.com <none> <none>
sidecar-pod 2/2 Running 2 11h 10.36.0.1 worker-1.example.com <none> <none>
Now we can access our container from the external network using the hostPort, i.e. 9000:
[root@worker-2 ~]# curl 127.0.0.1:9000
Using the node’s PID and IPC namespaces
Similar to the hostNetwork option are the hostPID and hostIPC pod spec properties. When you set them to true, the pod’s containers use the node’s PID and IPC namespaces, allowing processes running in the containers to see all the other processes on the node or communicate with them through IPC, respectively.
Here we will create another pod using the alpine image to demonstrate the behaviour of hostPID and hostIPC:
[root@controller ~]# cat pod-with-host-pid-and-ipc.yml
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-host-pid-and-ipc
spec:
  hostPID: true
  hostIPC: true
  containers:
  - name: main
    image: alpine
    command: ["/bin/sleep", "999999"]
If you run this pod and then list the processes from within its container, you’ll see all the processes running on the host node, not only the ones running in the container, as shown in the following listing.
[root@controller ~]# kubectl exec pod-with-host-pid-and-ipc -- ps aux
PID USER TIME COMMAND
1 root 0:03 /usr/lib/systemd/systemd --switched-root --system --deserialize 18
2 root 0:00 [kthreadd]
3 root 0:00 [rcu_gp]
4 root 0:00 [rcu_par_gp]
6 root 0:00 [kworker/0:0H-kb]
8 root 0:00 [mm_percpu_wq]
9 root 0:00 [ksoftirqd/0]
10 root 0:02 [rcu_sched]
11 root 0:00 [migration/0]
12 root 0:00 [watchdog/0]
13 root 0:00 [cpuhp/0]
14 root 0:00 [cpuhp/1]
15 root 0:00 [watchdog/1]
16 root 0:00 [migration/1]
...
Configuring a container's security context
Besides allowing the pod to use the host’s Linux namespaces, other security-related features can also be configured on the pod and its containers through the securityContext properties, which can be specified directly under the pod spec and inside the spec of individual containers.
Configuring the security context allows you to do various things:
- Specify the user (the user’s ID) under which the process in the container will run.
- Prevent the container from running as root (the default user a container runs as is usually defined in the container image itself, so you may want to prevent containers from running as root).
- Run the container in privileged mode, giving it full access to the node’s kernel.
- Configure fine-grained privileges, by adding or dropping capabilities—in contrast to giving the container all possible permissions by running it in privileged mode.
- Set SELinux (Security Enhanced Linux) options to strongly lock down a container.
- Prevent the process from writing to the container’s filesystem.
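Before going through these options one by one, here is a hedged sketch showing how several of them combine in a single pod manifest; the securityContext field names are real, but the pod name and the chosen values are examples only:

```yaml
# Illustrative pod combining several securityContext options covered below.
apiVersion: v1
kind: Pod
metadata:
  name: locked-down-pod
spec:
  containers:
  - name: main
    image: alpine
    command: ["/bin/sleep", "999999"]
    securityContext:
      runAsUser: 405                # run as a specific non-root user ID
      runAsNonRoot: true            # refuse to start if the image would run as root
      readOnlyRootFilesystem: true  # no writes to the container's filesystem
      capabilities:
        drop:
        - CHOWN                     # remove the ability to change file ownership
```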
Running a pod without specifying a security context
First, run a pod with the default security context options (by not specifying them at all), so you can see how it behaves compared to pods with a custom security context:
[root@controller ~]# kubectl run pod-with-defaults --image alpine --restart Never -- /bin/sleep 999999
pod/pod-with-defaults created
Once the container inside this Pod is created, let’s see what user and group ID the container is running as, and which groups it belongs to. You can see this by running the id command inside the container:
[root@controller ~]# kubectl exec pod-with-defaults -- id
uid=0(root) gid=0(root) groups=0(root),1(bin),2(daemon),3(sys),4(adm),6(disk),10(wheel),11(floppy),20(dialout),26(tape),27(video)
The container is running as user ID (uid) 0, which is root, and group ID (gid) 0 (also root). It’s also a member of multiple other groups.
Running a container as a specific user
To run a pod under a different user ID than the one baked into the container image, you’ll need to set the pod’s securityContext.runAsUser property. You’ll make the container run as the guest user, whose user ID in the alpine container image is 405, as shown in the following listing.
[root@controller ~]# cat pod-as-user-guest.yml
apiVersion: v1
kind: Pod
metadata:
  name: pod-as-user-guest
spec:
  containers:
  - name: main
    image: alpine
    command: ["/bin/sleep", "999999"]
    securityContext:
      runAsUser: 405
You must specify a user ID, not a username (ID 405 corresponds to the guest user).
Next create the pod and verify the status:
[root@controller ~]# kubectl create -f pod-as-user-guest.yml
pod/pod-as-user-guest created

[root@controller ~]# kubectl get pods -o wide
NAME                        READY   STATUS              RESTARTS   AGE     IP          NODE                   NOMINATED NODE   READINESS GATES
nginx                       1/1     Running             0          44m     10.44.0.1   worker-2.example.com   <none>           <none>
nginx-lab                   1/1     Running             0          37m     10.44.0.2   worker-2.example.com   <none>           <none>
pod-as-user-guest           0/1     ContainerCreating   0          3s      <none>      worker-2.example.com   <none>           <none>
pod-with-defaults           1/1     Running             0          6m16s   10.36.0.2   worker-1.example.com   <none>           <none>
pod-with-host-pid-and-ipc   1/1     Running             0          15m     10.36.0.1   worker-1.example.com   <none>           <none>
Once the status changes from ContainerCreating to Running, you can execute the following command to verify the user:
[root@controller ~]# kubectl exec pod-as-user-guest -- id
uid=405(guest) gid=100(users)
As requested, the container is running as the guest user.
Preventing a container from running as root
What if you don’t care what user the container runs as, but you still want to prevent it from running as root? To prevent the attack scenario described previously, you can specify that the pod’s container needs to run as a non-root user, as shown in the following listing.
[root@controller ~]# cat pod-run-as-non-root.yml
apiVersion: v1
kind: Pod
metadata:
  name: pod-run-as-non-root
spec:
  containers:
  - name: main
    image: alpine
    command: ["/bin/sleep", "999999"]
    securityContext:
      runAsNonRoot: true
Next create the pod and verify the status:
[root@controller ~]# kubectl create -f pod-run-as-non-root.yml
pod/pod-run-as-non-root created
If you check the status, you’ll see that the pod fails with CreateContainerConfigError:
[root@controller ~]# kubectl get po pod-run-as-non-root
NAME READY STATUS RESTARTS AGE
pod-run-as-non-root 0/1 CreateContainerConfigError 0 2m1s
We can get more details on this using kubectl describe (output is snipped):
[root@controller ~]# kubectl describe pods pod-run-as-non-root
..
Warning Failed 54s (x8 over 2m58s) kubelet Error: container has runAsNonRoot and image will run as root
..
As expected, the container failed to start because the image would have run as the root user.
Running pods in privileged mode
Sometimes pods need to do everything that the node they’re running on can do, such as use protected system devices or other kernel features, which aren’t accessible to regular containers. To get full access to the node’s kernel, the pod’s container runs in privileged mode. This is achieved by setting the privileged property in the container’s securityContext to true:
[root@controller ~]# cat pod-privileged.yml
apiVersion: v1
kind: Pod
metadata:
  name: pod-privileged
spec:
  containers:
  - name: main
    image: alpine
    command: ["/bin/sleep", "999999"]
    securityContext:
      privileged: true
Next create this pod and verify the status is Running:
[root@controller ~]# kubectl create -f pod-privileged.yml
pod/pod-privileged created
If you’re familiar with Linux, you may know it has a special file directory called /dev, which contains device files for all the devices on the system. These aren’t regular files on disk, but are special files used to communicate with devices. Let’s see what devices are visible in the non-privileged container you deployed earlier (the pod-with-defaults
pod), by listing files in its /dev directory
[root@controller ~]# kubectl exec -it pod-with-defaults -- ls /dev
core null shm termination-log
fd ptmx stderr tty
full pts stdin urandom
mqueue random stdout zero
The listing shows only the handful of devices exposed to a regular container; the list is fairly short. Now, compare this with the following listing, which shows the device files your privileged pod can see:
[root@controller ~]# kubectl exec -it pod-privileged -- ls /dev
autofs stderr tty5
bsg stdin tty50
bus stdout tty51
core termination-log tty52
cpu tty tty53
cpu_dma_latency tty0 tty54
dm-0 tty1 tty55
dm-1 tty10 tty56
dri tty11 tty57
fb0 tty12 tty58
... ... ...
Adding individual kernel capabilities to a container
Instead of making a container privileged and giving it unlimited permissions, a much safer method (from a security perspective) is to give it access only to the kernel features it really requires. Kubernetes allows you to add capabilities to each container or drop part of them, which allows you to fine-tune the container’s permissions and limit the impact of a potential intrusion by an attacker.
For example, a container usually isn’t allowed to change the system time (the hardware clock’s time). You can confirm this by trying to set the time in your pod-with-defaults pod:
[root@controller ~]# kubectl exec -it pod-with-defaults -- date +%T -s "12:00:00"
date: can't set date: Operation not permitted
12:00:00
If you want to allow the container to change the system time, you can add a capability called CAP_SYS_TIME to the container’s capabilities list, as shown in the following listing (Linux capability names carry a CAP_ prefix, but Kubernetes omits the prefix in manifests, so you list it as SYS_TIME).
[root@controller ~]# cat pod-add-settime-capability.yml
apiVersion: v1
kind: Pod
metadata:
  name: pod-add-settime-capability
spec:
  containers:
  - name: main
    image: alpine
    command: ["/bin/sleep", "999999"]
    securityContext:
      capabilities:
        add:
        - SYS_TIME
Now create this pod and attempt to change the date once the container is created:
[root@controller ~]# kubectl create -f pod-add-settime-capability.yml
pod/pod-add-settime-capability created

[root@controller ~]# kubectl exec -it pod-add-settime-capability -- date +%T -s "12:00:00"
12:00:00

[root@controller ~]# kubectl exec -it pod-add-settime-capability -- date
Sat Nov 28 12:00:20 UTC 2020
You can confirm the node’s time has been changed by checking the time on the node running the pod.
Dropping capabilities from a container
You can also drop capabilities that may otherwise be available to the container. For example, the default capabilities given to a container include the CAP_CHOWN capability, which allows processes to change the ownership of files in the filesystem.
You can see that’s the case by changing the ownership of the /tmp directory in your pod-with-defaults pod to the guest user, for example:
[root@controller ~]# kubectl exec pod-with-defaults -- chown guest /tmp
[root@controller ~]# kubectl exec pod-with-defaults -- ls -la / | grep tmp
drwxrwxrwt 1 guest root 4096 Oct 21 09:23 tmp
To prevent the container from doing that, you need to drop the capability by listing it under the container’s securityContext.capabilities.drop property, as shown in the following listing.
[root@controller ~]# cat pod-drop-chown-capability.yml
apiVersion: v1
kind: Pod
metadata:
  name: pod-drop-chown-capability
spec:
  containers:
  - name: main
    image: alpine
    command: ["/bin/sleep", "999999"]
    securityContext:
      capabilities:
        drop:
        - CHOWN
By dropping the CHOWN capability, you’re not allowed to change the owner of the /tmp directory in this pod:
[root@controller ~]# kubectl exec pod-drop-chown-capability -- chown guest /tmp
chown: /tmp: Operation not permitted
command terminated with exit code 1
Preventing processes from writing to the container’s filesystem
You may want to prevent the processes running in the container from writing to the container’s filesystem, and only allow them to write to mounted volumes. This is done by setting the container’s securityContext.readOnlyRootFilesystem property to true, as shown in the following listing.
[root@controller ~]# cat pod-with-readonly-filesystem.yml
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-readonly-filesystem
spec:
  containers:
  - name: main
    image: alpine
    command: ["/bin/sleep", "999999"]
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
    - name: my-volume
      mountPath: /volume
      readOnly: false
  volumes:
  - name: my-volume
    emptyDir: {}
In this container’s filesystem can’t be written to but writing to /volume
is allowed, becase a volume is mounted there.
Next create this pod and make sure the containers are created before we do any verification:
[root@controller ~]# kubectl create -f pod-with-readonly-filesystem.yml
pod/pod-with-readonly-filesystem created
When you deploy this pod, the container is running as root, which has write permissions to the / directory, but trying to write a file there fails:
[root@controller ~]# kubectl exec -it pod-with-readonly-filesystem -- touch /new-file
touch: /new-file: Read-only file system
On the other hand, writing to the mounted volume is allowed:
[root@controller ~]# kubectl exec -it pod-with-readonly-filesystem -- touch /volume/newfile
[root@controller ~]# kubectl exec -it pod-with-readonly-filesystem -- ls -la /volume/newfile
-rw-r--r-- 1 root root 0 Nov 28 19:11 /volume/newfile
Conclusion
In this Kubernetes tutorial we covered different areas of securing cluster nodes. This is a vast topic with many more areas that could not be covered here; I may write another tutorial covering the remaining parts of securing cluster nodes, such as RBAC ClusterRoles and ClusterRoleBindings, and the NetworkPolicy resources used to limit a pod’s inbound and/or outbound traffic. The following topics were covered in this tutorial:
- Containers can be configured to run as a different user and/or group than the one defined in the container image.
- Containers can also run in privileged mode, allowing them to access the node’s devices that are otherwise not exposed to pods.
- Containers can be run as read-only, preventing processes from writing to the container’s filesystem (and only allowing them to write to mounted volumes).