Introduction to Kubernetes Node Affinity
The pod scheduler is one of the core components of Kubernetes. Whenever an application pod is created in response to a user request, the scheduler determines which worker node in the cluster the pod should run on. The scheduler is flexible and can be customized for advanced scheduling scenarios. Before scheduling a pod on a worker node, the scheduler takes the following points into consideration:
- check whether a node selector is defined in the pod definition; if it is, nodes whose labels do not match are not considered
- check whether the worker nodes have enough compute, memory, and storage resources
- check if nodes have any taints and if the pod to be scheduled has toleration for these taints
- check affinity and anti-affinity rules
Based on these factors, the scheduler assigns a weighted score to each node, and the node with the highest score is selected to run the pod.
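For example, you can inspect most of these factors for a given node with kubectl describe, whose output includes the node's labels, taints, and allocatable resources. The node name below is the one used later in this article:

kubectl describe node worker-n1.k8s.local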
Kubernetes Node Affinity is the Successor of nodeSelector
Node affinity rules are used to influence which node a pod is scheduled to. In earlier Kubernetes versions, this mechanism was implemented through the nodeSelector field in the pod specification. The node had to include all the labels specified in that field to become eligible for hosting the pod.
Node affinity is a more sophisticated form of nodeSelector, as it offers much broader selection criteria. Each pod can specify its preferences and requirements by defining its own node affinity rules. Based on these rules, the Kubernetes scheduler will try to place the pod on one of the nodes matching the defined criteria.
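For comparison, here is a minimal sketch of the older nodeSelector approach; it assumes a node labeled tier=gold, the same label we will apply later in this article:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  nodeSelector:
    tier: gold
  containers:
  - image: nginx
    name: nginx

With nodeSelector, the match is a plain equality check on every listed label; node affinity adds operators, multiple terms, and soft preferences on top of this.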
Prerequisites
- You must have a working Kubernetes cluster with at least two worker nodes to implement these scenarios. Single-node clusters, such as Minikube, will not work for this demo.
- You must have a working knowledge of Kubernetes and YAML.
Add Labels to Worker Nodes
We'll first proceed to label the two worker nodes in our cluster as follows:
kubectl label nodes worker-n1.k8s.local tier=gold
kubectl label nodes worker-n2.k8s.local tier=silver
[root@masternode ~]# kubectl label nodes worker-n1.k8s.local tier=gold
node/worker-n1.k8s.local labeled
[root@masternode ~]# kubectl label nodes worker-n2.k8s.local tier=silver
node/worker-n2.k8s.local labeled
[root@masternode ~]#
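To confirm that the labels were applied, you can list the nodes with the tier label shown as a separate column (output omitted here):

kubectl get nodes -L tier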
Create a Deployment with Node Affinity
We're now going to create a sample deployment and enforce node affinity rules using the node labels defined earlier.
Create a sample nginx deployment as follows:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: tier
                operator: In
                values:
                - gold
      containers:
      - image: nginx
        name: nginx
Create this deployment using kubectl:
kubectl create -f nginx.yml
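Optionally, you can wait for the rollout to complete before checking where the pod landed:

kubectl rollout status deployment/nginx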
Kubernetes Node Affinity in Action
Let's walk through the node affinity specification. The "requiredDuringSchedulingIgnoredDuringExecution" directive can be broken down into two parts:
- requiredDuringScheduling means that rules under this field specify the labels the node must have for the pod to be scheduled to the node
- IgnoredDuringExecution means the affinity rules will not affect pods that are already running on the node
Setting this directive ensures that affinity only affects the scheduling of a new pod and never causes a pod to be evicted from a node. The nodeSelectorTerms and matchExpressions fields define the values that the node's label must match for the pod to be scheduled to the node. In our case, it means that the node must have a label "tier" whose value is set to "gold".
[root@masternode ~]# kubectl get pods -o wide
NAME                     READY   STATUS    RESTARTS   AGE   IP               NODE                  NOMINATED NODE   READINESS GATES
nginx-846f64db7c-4hh46   1/1     Running   0          16m   192.168.52.139   worker-n1.k8s.local   <none>           <none>
[root@masternode ~]# kubectl get node worker-n1.k8s.local -L tier
NAME                  STATUS   ROLES    AGE    VERSION   TIER
worker-n1.k8s.local   Ready    <none>   4d7h   v1.24.2   gold
Even if we scale our deployment to multiple replicas, the resulting pods will always be scheduled on the same node, i.e. the one with the label "tier=gold".
kubectl scale --replicas=3 deployment.apps/nginx
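One quick way to confirm that all three replicas landed on the same node is to print each pod's name alongside the node it was assigned to (output omitted here):

kubectl get pods -l app=nginx -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName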
Let's take this one step further. We'll now remove the label "tier=gold" from the worker node and then scale our deployment to four replicas.
[root@masternode ~]# kubectl label nodes worker-n1.k8s.local tier-
node/worker-n1.k8s.local unlabeled
[root@masternode ~]#
[root@masternode ~]# kubectl scale --replicas=4 deployment.apps/nginx
deployment.apps/nginx scaled
[root@masternode ~]#
[root@masternode ~]# kubectl get pods -o wide
NAME                     READY   STATUS    RESTARTS   AGE     IP               NODE                  NOMINATED NODE   READINESS GATES
nginx-846f64db7c-4hh46   1/1     Running   0          23m     192.168.52.139   worker-n1.k8s.local   <none>           <none>
nginx-846f64db7c-88jb7   1/1     Running   0          3m8s    192.168.52.145   worker-n1.k8s.local   <none>           <none>
nginx-846f64db7c-r8g4b   0/1     Pending   0          27s     <none>           <none>                <none>           <none>
nginx-846f64db7c-v75bc   1/1     Running   0          3m19s   192.168.52.144   worker-n1.k8s.local   <none>           <none>
[root@masternode ~]#
As can be seen, the new pod goes into the Pending state. The old pods are still running, which makes sense as the deployment contains the "requiredDuringSchedulingIgnoredDuringExecution" field, which ensures that running pods are not evicted. If we look into the events, we can see that the new pod could not be scheduled because no node with a matching label was available.
[root@masternode ~]# kubectl describe pod/nginx-846f64db7c-r8g4b |tail -4
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 3m26s default-scheduler 0/3 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
[root@masternode ~]#
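To unblock the Pending pod, it should be enough to re-apply the label that the affinity rule requires; the scheduler re-evaluates unschedulable pods and should place the fourth replica on worker-n1 shortly afterwards:

kubectl label nodes worker-n1.k8s.local tier=gold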
Kubernetes Node Anti-Affinity in Action
Similar to node affinity, node anti-affinity rules can be defined to ensure that a pod is not assigned to a particular group of nodes. These rules define which nodes should not be considered when scheduling a pod. Let's consider the same nginx deployment configuration which we used for node affinity. We only need to update the operator field in the spec section.
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: tier
            operator: NotIn
            values:
            - gold
The "NotIn" value in the operator field is defining the anti-affinity behavior here. This is going to ensure that the pod is not scheduled on any node which has the label "tier=gold
", as evident from the output below.
kubectl create -f nginx.yml
[root@masternode ~]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-7547cdb594-w9k7n 1/1 Running 0 17s 192.168.0.3 worker-n2.k8s.local <none> <none>
[root@masternode ~]# kubectl get nodes -L tier
NAME                   STATUS   ROLES           AGE    VERSION   TIER
worker-n1.k8s.local    Ready    <none>          4d4h   v1.24.2   gold
worker-n2.k8s.local    Ready    <none>          4d4h   v1.24.2   silver
masternode.k8s.local   Ready    control-plane   4d4h   v1.24.2
[root@masternode ~]#
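Besides NotIn, the operator field also accepts DoesNotExist, which excludes every node that carries the given label key regardless of its value. In our demo cluster this would rule out both labeled workers, so the snippet below is only a syntax illustration; note that no values list is allowed with this operator:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: tier
            operator: DoesNotExist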
Kubernetes Node Affinity Weight in Action
If you have a cluster with multiple worker nodes, then it is quite possible that more than one node matches the defined affinity rules. In this case, to schedule the pod on a node of our choice, we can assign a weighted score to each affinity rule and prioritize among the selected nodes. This is done through the preferredDuringSchedulingIgnoredDuringExecution field. Let's explain this using an example.
Create an nginx deployment as follows:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: tier
                operator: In
                values:
                - gold
                - silver
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 10
            preference:
              matchExpressions:
              - key: tier
                operator: In
                values:
                - gold
          - weight: 50
            preference:
              matchExpressions:
              - key: tier
                operator: In
                values:
                - silver
      containers:
      - image: nginx
        name: nginx
The "preferredDuringSchedulingIgnoredDuringExecution
" directive can be broken down into two parts:
- preferredDuringScheduling means that rules under this field are preferred for the pod to be scheduled to the node. Note that we're specifying preferences, not hard requirements.
- IgnoredDuringExecution means the affinity rules will not affect pods that are already running on the node
The requiredDuringSchedulingIgnoredDuringExecution section dictates that the pods under this deployment should be scheduled on nodes which have either the label tier=gold or tier=silver. Under the preferredDuringSchedulingIgnoredDuringExecution section, we define our scheduling preferences and assign a weighted score to each of them.
Because of the hard requirement in the requiredDuringSchedulingIgnoredDuringExecution section, the scheduler will only consider nodes with the label tier=gold or tier=silver for running the pod. Based on the preferences defined under preferredDuringSchedulingIgnoredDuringExecution, the scheduler will iterate through the rules and assign a weighted score to each one: 10 when tier=gold and 50 when tier=silver. The scheduler adds this score to its other priority functions and selects the node with the highest score to run the pod. In this case, the node with the label tier=silver, i.e. worker-n2, will be selected.
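As a side note, each weight must be a value between 1 and 100, and if a node happens to match several preference terms, their weights are summed before being combined with the scheduler's other scoring functions. For example, if worker-n2 matched both a weight-50 and a weight-20 preference, it would contribute 70 points from node affinity alone; this is an illustrative calculation, not part of the deployment above.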
kubectl create -f nginx.yml
[root@masternode ~]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-5898d7588b-nnlgb 1/1 Running 0 4s 192.168.0.11 worker-n2.k8s.local <none> <none>
[root@masternode ~]#
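Once you're done experimenting, you can clean up by deleting the deployment and removing the tier labels from the worker nodes (skip the label removal for any node that no longer carries it):

kubectl delete -f nginx.yml
kubectl label nodes worker-n1.k8s.local tier-
kubectl label nodes worker-n2.k8s.local tier-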
Conclusion
The pod scheduler in Kubernetes offers a lot of flexibility in scheduling application pods as per user requirements. To determine which nodes are acceptable for scheduling a pod, the scheduler evaluates each node against multiple factors. The end user can dictate or prioritize the nodes when running pods by defining node affinity and anti-affinity rules.