In the previous chapter, we talked about securing the API server. If an attacker gets access to the API server, they can run whatever they like by packaging their code into a container image and running it in a pod. But can they do any real damage? Aren’t containers isolated from other containers and from the node they’re running on?
Not necessarily. In this chapter, you’ll learn how to allow pods to access the resources of the node they’re running on.
Using the host node's namespace in a pod
Containers in a pod usually run under separate Linux namespaces, which isolate their processes from processes running in other containers or in the node’s default namespaces.
For example, we learned that each pod gets its own IP and port space, because it uses its own network namespace. Likewise, each pod has its own process tree, because it has its own PID namespace, and it also uses its own IPC namespace, allowing only processes in the same pod to communicate with each other through the Inter-Process Communication mechanism (IPC).
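To see what this looks like at the Linux level, you can inspect a process’s namespace memberships under /proc. This is a minimal sketch, assuming a Linux host; the inode numbers in the output will differ from machine to machine:

```shell
# Each namespace a process belongs to is exposed as a symlink under /proc/<pid>/ns.
# Two processes share a namespace only if these symlinks resolve to the same inode.
readlink /proc/self/ns/net   # network namespace (own IP and port space)
readlink /proc/self/ns/pid   # PID namespace (own process tree)
readlink /proc/self/ns/ipc   # IPC namespace (System V IPC, POSIX message queues)
```

Inside a pod that shares the host’s network namespace, the net symlink would resolve to the same inode seen by processes running directly on the node.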
Using the node’s network namespace in a pod
Certain pods (usually system pods) need to operate in the host’s default namespaces, allowing them to see and manipulate node-level resources and devices. For example, a pod may need to use the node’s network adapters instead of its own virtual network adapters. This can be achieved by setting the hostNetwork property in the pod spec to true.
Here we create one such pod with hostNetwork: true in its spec:
[root@controller ~]# cat pod-with-host-network.yml
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-host-network
spec:
  hostNetwork: true
  containers:
  - name: main
    image: alpine
    command: ["/bin/sleep", "999999"]
Next create the pod:
[root@controller ~]# kubectl create -f pod-with-host-network.yml
pod/pod-with-host-network created
And verify the status of the pod:
[root@controller ~]# kubectl get pods
NAME READY STATUS RESTARTS AGE
pod-with-host-network 1/1 Running 0 64s
sidecar-pod 2/2 Running 2 11h
Now we can connect to the container from this new pod and check the list of available interfaces:
[root@controller ~]# kubectl exec pod-with-host-network -- ifconfig
datapath: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1376
        ether 22:a9:66:98:e7:f3  txqueuelen 1000  (Ethernet)
        RX packets 31  bytes 1624 (1.5 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
        ether 02:42:29:35:5d:9b  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.2.15  netmask 255.255.255.0  broadcast 10.0.2.255
        ether 08:00:27:b6:89:b5  txqueuelen 1000  (Ethernet)
        RX packets 711  bytes 61270 (59.8 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 395  bytes 67922 (66.3 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
...
As expected, the interfaces from the host node are now visible inside the container.
Binding to a host port without using the host’s network namespace
A related feature allows pods to bind to a port in the node’s default namespace while still keeping their own network namespace. This is done using the hostPort property of one of the container’s ports defined in the spec.containers.ports field. The hostPort feature is primarily used for exposing system services, which are deployed to every node using DaemonSets.
It’s important to understand that if a pod is using a specific host port, only one instance of the pod can be scheduled to each node, because two processes can’t bind to the same host port.
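To make the DaemonSet connection concrete, here is a hedged sketch of such a manifest; the node-agent name, image, and port numbers are illustrative assumptions, not objects from this tutorial:

```yaml
# Hypothetical DaemonSet: one copy of this pod runs on every node, and each
# copy binds port 9100 of its own node while keeping its own network namespace.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent
spec:
  selector:
    matchLabels:
      app: node-agent
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      containers:
      - name: agent
        image: alpine
        command: ["/bin/sleep", "999999"]
        ports:
        - containerPort: 9100
          hostPort: 9100
          protocol: TCP
```

Because every replica claims the same hostPort, the DaemonSet’s one-pod-per-node placement is exactly what makes this scheme work without port conflicts.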
Here we create an nginx container with a hostPort, so the container can be reached from an external network through the node it is running on:
[root@controller ~]# cat nginx-lab.yml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-lab
  namespace: default
spec:
  containers:
  - name: nginx-lab
    image: nginx
    ports:
    - containerPort: 80
      hostPort: 9000
      protocol: TCP
With this configuration, the container can be reached on port 80 of the pod’s IP, and also on port 9000 of the node it’s deployed on.
Create the pod using this YAML file:
[root@controller ~]# kubectl create -f nginx-lab.yml
pod/nginx-lab created
Next, verify the status of the newly created pod to make sure it is in the Running state:
[root@controller ~]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx 1/1 Running 0 7m11s 10.44.0.1 worker-2.example.com <none> <none>
nginx-lab 1/1 Running 0 16s 10.44.0.2 worker-2.example.com <none> <none>
sidecar-pod 2/2 Running 2 11h 10.36.0.1 worker-1.example.com <none> <none>
Now we can access our container from the external network using the hostPort, i.e. 9000:
[root@worker-2 ~]# curl 127.0.0.1:9000
Using the node’s PID and IPC namespaces
Similar to the hostNetwork option are the hostPID and hostIPC pod spec properties. When you set them to true, the pod’s containers use the node’s PID and IPC namespaces, allowing processes running in the containers to see all the other processes on the node or communicate with them through IPC, respectively.
Here we will create another pod using the alpine image to demonstrate the behaviour of hostPID and hostIPC:
[root@controller ~]# cat pod-with-host-pid-and-ipc.yml
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-host-pid-and-ipc
spec:
  hostPID: true
  hostIPC: true
  containers:
  - name: main
    image: alpine
    command: ["/bin/sleep", "999999"]
If you run this pod and then list the processes from within its container, you’ll see all the processes running on the host node, not only the ones running in the container, as shown in the following listing.
[root@controller ~]# kubectl exec pod-with-host-pid-and-ipc -- ps aux
PID USER TIME COMMAND
1 root 0:03 /usr/lib/systemd/systemd --switched-root --system --deserialize 18
2 root 0:00 [kthreadd]
3 root 0:00 [rcu_gp]
4 root 0:00 [rcu_par_gp]
6 root 0:00 [kworker/0:0H-kb]
8 root 0:00 [mm_percpu_wq]
9 root 0:00 [ksoftirqd/0]
10 root 0:02 [rcu_sched]
11 root 0:00 [migration/0]
12 root 0:00 [watchdog/0]
13 root 0:00 [cpuhp/0]
14 root 0:00 [cpuhp/1]
15 root 0:00 [watchdog/1]
16 root 0:00 [migration/1]
...
Configuring a container's security context
Besides allowing the pod to use the host’s Linux namespaces, other security-related features can also be configured on the pod and its containers through the securityContext properties, which can be specified directly under the pod spec and inside the spec of individual containers.
Configuring the security context allows you to do various things:
- Specify the user (the user’s ID) under which the process in the container will run.
- Prevent the container from running as root (the default user a container runs as is usually defined in the container image itself, so you may want to prevent containers from running as root).
- Run the container in privileged mode, giving it full access to the node’s kernel.
- Configure fine-grained privileges, by adding or dropping capabilities—in contrast to giving the container all possible permissions by running it in privileged mode.
- Set SELinux (Security Enhanced Linux) options to strongly lock down a container.
- Prevent the process from writing to the container’s filesystem.
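Before going through these options one by one, here is a hedged sketch showing how several of them combine in a single pod manifest; the securityContext field names are real, but the pod name and the chosen values are examples only:

```yaml
# Illustrative pod combining several securityContext options covered below.
apiVersion: v1
kind: Pod
metadata:
  name: locked-down-pod
spec:
  containers:
  - name: main
    image: alpine
    command: ["/bin/sleep", "999999"]
    securityContext:
      runAsUser: 405                # run as a specific non-root user ID
      runAsNonRoot: true            # refuse to start if the image would run as root
      readOnlyRootFilesystem: true  # no writes to the container's filesystem
      capabilities:
        drop:
        - CHOWN                     # remove the ability to change file ownership
```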
Running a pod without specifying a security context
First, run a pod with the default security context options (by not specifying them at all), so you can see how it behaves compared to pods with a custom security context:
[root@controller ~]# kubectl run pod-with-defaults --image alpine --restart Never -- /bin/sleep 999999
pod/pod-with-defaults created
Once the container inside this Pod is created, let’s see what user and group ID the container is running as, and which groups it belongs to. You can see this by running the id command inside the container:
[root@controller ~]# kubectl exec pod-with-defaults -- id
uid=0(root) gid=0(root) groups=0(root),1(bin),2(daemon),3(sys),4(adm),6(disk),10(wheel),11(floppy),20(dialout),26(tape),27(video)
The container is running as user ID (uid) 0, which is root, and group ID (gid) 0 (also root). It’s also a member of multiple other groups.
Running a container as a specific user
To run a pod under a different user ID than the one baked into the container image, you’ll need to set the pod’s securityContext.runAsUser property. You’ll make the container run as the guest user, whose user ID in the alpine container image is 405, as shown in the following listing.
[root@controller ~]# cat pod-as-user-guest.yml
apiVersion: v1
kind: Pod
metadata:
  name: pod-as-user-guest
spec:
  containers:
  - name: main
    image: alpine
    command: ["/bin/sleep", "999999"]
    securityContext:
      runAsUser: 405
You must specify a user ID, not a username (ID 405 corresponds to the guest user).
Next create the pod and verify the status:
[root@controller ~]# kubectl create -f pod-as-user-guest.yml
pod/pod-as-user-guest created

[root@controller ~]# kubectl get pods -o wide
NAME                        READY   STATUS              RESTARTS   AGE     IP          NODE                   NOMINATED NODE   READINESS GATES
nginx                       1/1     Running             0          44m     10.44.0.1   worker-2.example.com   <none>           <none>
nginx-lab                   1/1     Running             0          37m     10.44.0.2   worker-2.example.com   <none>           <none>
pod-as-user-guest           0/1     ContainerCreating   0          3s      <none>      worker-2.example.com   <none>           <none>
pod-with-defaults           1/1     Running             0          6m16s   10.36.0.2   worker-1.example.com   <none>           <none>
pod-with-host-pid-and-ipc   1/1     Running             0          15m     10.36.0.1   worker-1.example.com   <none>           <none>
Once the status changes from ContainerCreating to Running, you can execute the following command to verify the user:
[root@controller ~]# kubectl exec pod-as-user-guest -- id
uid=405(guest) gid=100(users)
As requested, the container is running as the guest user.
Preventing a container from running as root
What if you don’t care what user the container runs as, but you still want to prevent it from running as root? To prevent the attack scenario described previously, you can specify that the pod’s container needs to run as a non-root user, as shown in the following listing.
[root@controller ~]# cat pod-run-as-non-root.yml
apiVersion: v1
kind: Pod
metadata:
  name: pod-run-as-non-root
spec:
  containers:
  - name: main
    image: alpine
    command: ["/bin/sleep", "999999"]
    securityContext:
      runAsNonRoot: true
Next create the pod and verify the status:
[root@controller ~]# kubectl create -f pod-run-as-non-root.yml
pod/pod-run-as-non-root created
If you check the status, you’ll see that the pod fails with CreateContainerConfigError:
[root@controller ~]# kubectl get po pod-run-as-non-root
NAME READY STATUS RESTARTS AGE
pod-run-as-non-root 0/1 CreateContainerConfigError 0 2m1s
We can get more details on this using kubectl describe (output is snipped):
[root@controller ~]# kubectl describe pods pod-run-as-non-root
..
Warning Failed 54s (x8 over 2m58s) kubelet Error: container has runAsNonRoot and image will run as root
..
As expected, the container failed to start because the image would have run as the root user.
Running pods in privileged mode
Sometimes pods need to do everything that the node they’re running on can do, such as use protected system devices or other kernel features, which aren’t accessible to regular containers. To get full access to the node’s kernel, the pod’s container runs in privileged mode. This is achieved by setting the privileged property in the container’s securityContext to true:
[root@controller ~]# cat pod-privileged.yml
apiVersion: v1
kind: Pod
metadata:
  name: pod-privileged
spec:
  containers:
  - name: main
    image: alpine
    command: ["/bin/sleep", "999999"]
    securityContext:
      privileged: true
Next create this pod and verify the status is Running:
[root@controller ~]# kubectl create -f pod-privileged.yml
pod/pod-privileged created
If you’re familiar with Linux, you may know it has a special file directory called /dev, which contains device files for all the devices on the system. These aren’t regular files on disk, but are special files used to communicate with devices. Let’s see what devices are visible in the non-privileged container you deployed earlier (the pod-with-defaults
pod), by listing files in its /dev directory
[root@controller ~]# kubectl exec -it pod-with-defaults -- ls /dev
core null shm termination-log
fd ptmx stderr tty
full pts stdin urandom
mqueue random stdout zero
The listing shows only the handful of devices exposed to a regular container; the list is fairly short. Now, compare this with the following listing, which shows the device files your privileged pod can see:
[root@controller ~]# kubectl exec -it pod-privileged -- ls /dev
autofs stderr tty5
bsg stdin tty50
bus stdout tty51
core termination-log tty52
cpu tty tty53
cpu_dma_latency tty0 tty54
dm-0 tty1 tty55
dm-1 tty10 tty56
dri tty11 tty57
fb0 tty12 tty58
... ... ...
Adding individual kernel capabilities to a container
Instead of making a container privileged and giving it unlimited permissions, a much safer method (from a security perspective) is to give it access only to the kernel features it really requires. Kubernetes allows you to add capabilities to each container or drop part of them, which allows you to fine-tune the container’s permissions and limit the impact of a potential intrusion by an attacker.
For example, a container usually isn’t allowed to change the system time (the hardware clock’s time). You can confirm this by trying to set the time in your pod-with-defaults pod:
[root@controller ~]# kubectl exec -it pod-with-defaults -- date +%T -s "12:00:00"
date: can't set date: Operation not permitted
12:00:00
If you want to allow the container to change the system time, you can add a capability called CAP_SYS_TIME to the container’s capabilities list, as shown in the following listing (Linux capability names carry a CAP_ prefix, but Kubernetes omits the prefix in manifests, so you list it as SYS_TIME).
[root@controller ~]# cat pod-add-settime-capability.yml
apiVersion: v1
kind: Pod
metadata:
  name: pod-add-settime-capability
spec:
  containers:
  - name: main
    image: alpine
    command: ["/bin/sleep", "999999"]
    securityContext:
      capabilities:
        add:
        - SYS_TIME
Now create this pod and attempt to change the date once the container is created:
[root@controller ~]# kubectl create -f pod-add-settime-capability.yml
pod/pod-add-settime-capability created

[root@controller ~]# kubectl exec -it pod-add-settime-capability -- date +%T -s "12:00:00"
12:00:00

[root@controller ~]# kubectl exec -it pod-add-settime-capability -- date
Sat Nov 28 12:00:20 UTC 2020
You can confirm the node’s time has been changed by checking the time on the node running the pod.
Dropping capabilities from a container
You can also drop capabilities that may otherwise be available to the container. For example, the default capabilities given to a container include the CAP_CHOWN capability, which allows processes to change the ownership of files in the filesystem.
You can see that’s the case by changing the ownership of the /tmp directory in your pod-with-defaults pod to the guest user, for example:
[root@controller ~]# kubectl exec pod-with-defaults -- chown guest /tmp
[root@controller ~]# kubectl exec pod-with-defaults -- ls -la / | grep tmp
drwxrwxrwt 1 guest root 4096 Oct 21 09:23 tmp
To prevent the container from doing that, you need to drop the capability by listing it under the container’s securityContext.capabilities.drop property, as shown in the following listing.
[root@controller ~]# cat pod-drop-chown-capability.yml
apiVersion: v1
kind: Pod
metadata:
  name: pod-drop-chown-capability
spec:
  containers:
  - name: main
    image: alpine
    command: ["/bin/sleep", "999999"]
    securityContext:
      capabilities:
        drop:
        - CHOWN
By dropping the CHOWN capability, you’re not allowed to change the owner of the /tmp directory in this pod:
[root@controller ~]# kubectl exec pod-drop-chown-capability -- chown guest /tmp
chown: /tmp: Operation not permitted
command terminated with exit code 1
Preventing processes from writing to the container’s filesystem
You may want to prevent the processes running in the container from writing to the container’s filesystem, and only allow them to write to mounted volumes. This is done by setting the container’s securityContext.readOnlyRootFilesystem property to true, as shown in the following listing.
[root@controller ~]# cat pod-with-readonly-filesystem.yml
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-readonly-filesystem
spec:
  containers:
  - name: main
    image: alpine
    command: ["/bin/sleep", "999999"]
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
    - name: my-volume
      mountPath: /volume
      readOnly: false
  volumes:
  - name: my-volume
    emptyDir: {}
In this container’s filesystem can’t be written to but writing to /volume
is allowed, becase a volume is mounted there.
Next create this pod and make sure the containers are created before we do any verification:
[root@controller ~]# kubectl create -f pod-with-readonly-filesystem.yml
pod/pod-with-readonly-filesystem created
When you deploy this pod, the container is running as root, which has write permissions to the / directory, but trying to write a file there fails:
[root@controller ~]# kubectl exec -it pod-with-readonly-filesystem -- touch /new-file
touch: /new-file: Read-only file system
On the other hand, writing to the mounted volume is allowed:
[root@controller ~]# kubectl exec -it pod-with-readonly-filesystem -- touch /volume/newfile
[root@controller ~]# kubectl exec -it pod-with-readonly-filesystem -- ls -la /volume/newfile
-rw-r--r-- 1 root root 0 Nov 28 19:11 /volume/newfile
Conclusion
In this Kubernetes tutorial we covered different areas of securing cluster nodes. This is a vast topic with many more areas that could not be covered here; I may write another tutorial covering the remaining parts of securing cluster nodes, such as RBAC ClusterRoles and ClusterRoleBindings, and the NetworkPolicy resources used to limit a pod’s inbound and/or outbound traffic. The following topics were covered in this tutorial:
- Containers can be configured to run as a different user and/or group than the one defined in the container image.
- Containers can also run in privileged mode, allowing them to access the node’s devices that are otherwise not exposed to pods.
- Containers can be run as read-only, preventing processes from writing to the container’s filesystem (and only allowing them to write to mounted volumes).