Introduction to Kubernetes Node Affinity
The pod scheduler is one of the core components of Kubernetes. Whenever an application pod is created in response to a user request, the scheduler determines which worker node in the cluster the pod should run on. The scheduler is flexible and can be customized for advanced scheduling scenarios. Before scheduling a pod on a worker node, the scheduler takes the following points into consideration:
- check whether a node selector is defined in the pod definition; if it is, nodes whose labels do not match are not considered
- check whether the worker nodes have enough compute, memory, and storage resources
- check if nodes have any taints and if the pod to be scheduled has toleration for these taints
- check affinity and anti-affinity rules
Based on these factors, the scheduler assigns a weighted score to each node, and the node with the highest score is selected to run the pod.
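For example, you can inspect most of these factors for a given node with kubectl describe, whose output includes the node's labels, taints, and allocatable resources. The node name below is the one used later in this article:

kubectl describe node worker-n1.k8s.local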
Kubernetes Node Affinity is the Successor of nodeSelector
Node affinity rules are used to influence which node a pod is scheduled to. In earlier Kubernetes versions, this mechanism was implemented through the nodeSelector field in the pod specification. The node had to include all the labels specified in that field to become eligible for hosting the pod.
Node affinity is a more sophisticated form of nodeSelector, as it offers much broader selection criteria. Each pod can specify its preferences and requirements by defining its own node affinity rules. Based on these rules, the Kubernetes scheduler will try to place the pod on one of the nodes matching the defined criteria.
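For comparison, here is a minimal sketch of the older nodeSelector approach; it assumes a node labeled tier=gold, the same label we will apply later in this article:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  nodeSelector:
    tier: gold
  containers:
  - image: nginx
    name: nginx

With nodeSelector, the match is a plain equality check on every listed label; node affinity adds operators, multiple terms, and soft preferences on top of this.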
Prerequisites
- You must have a working Kubernetes cluster with at least two worker nodes to implement these scenarios. Single-node clusters, such as Minikube, will not work for this demo.
- You must have a working knowledge of Kubernetes and YAML.
Add Labels to Worker Nodes
We'll first proceed to label the two worker nodes in our cluster as follows:
kubectl label nodes worker-n1.k8s.local tier=gold
kubectl label nodes worker-n2.k8s.local tier=silver
[root@masternode ~]# kubectl label nodes worker-n1.k8s.local tier=gold
node/worker-n1.k8s.local labeled
[root@masternode ~]# kubectl label nodes worker-n2.k8s.local tier=silver
node/worker-n2.k8s.local labeled
[root@masternode ~]#
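To confirm that the labels were applied, you can list the nodes with the tier label shown as a separate column (output omitted here):

kubectl get nodes -L tier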
Create a Deployment with Node Affinity
We're now going to create a sample deployment and enforce node affinity rules using the node labels defined earlier.
Create a sample nginx deployment as follows:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: tier
                operator: In
                values:
                - gold
      containers:
      - image: nginx
        name: nginx
Create this deployment using kubectl:
kubectl create -f nginx.yml
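Optionally, you can wait for the rollout to complete before checking where the pod landed:

kubectl rollout status deployment/nginx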
Kubernetes Node Affinity in Action
Let's walk through the node affinity specification. The "requiredDuringSchedulingIgnoredDuringExecution" directive can be broken down into two parts:
- requiredDuringScheduling means that rules under this field specify the labels the node must have for the pod to be scheduled to the node
- IgnoredDuringExecution means the affinity rules will not affect pods that are already running on the node
Setting this directive ensures that affinity only affects the scheduling of a new pod and never causes a pod to be evicted from a node. The nodeSelectorTerms and matchExpressions fields define the values that the node's label must match for the pod to be scheduled to the node. In our case, it means that the node must have a label "tier" whose value is set to "gold".
[root@masternode ~]# kubectl get pods -o wide
NAME                     READY   STATUS    RESTARTS   AGE   IP               NODE                  NOMINATED NODE   READINESS GATES
nginx-846f64db7c-4hh46   1/1     Running   0          16m   192.168.52.139   worker-n1.k8s.local   <none>           <none>
[root@masternode ~]# kubectl get node worker-n1.k8s.local -L tier
NAME                  STATUS   ROLES    AGE    VERSION   TIER
worker-n1.k8s.local   Ready    <none>   4d7h   v1.24.2   gold
Even if we scale our deployment to multiple replicas, the resulting pods will always be scheduled on the same node, i.e. the one with the label "tier=gold".
kubectl scale --replicas=3 deployment.apps/nginx
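One quick way to confirm that all three replicas landed on the same node is to print each pod's name alongside the node it was assigned to (output omitted here):

kubectl get pods -l app=nginx -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName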
Let's take this one step further. We'll now remove the label "tier=gold" from the worker node and then scale our deployment to four replicas.
[root@masternode ~]# kubectl label nodes worker-n1.k8s.local tier-
node/worker-n1.k8s.local unlabeled
[root@masternode ~]#
[root@masternode ~]# kubectl scale --replicas=4 deployment.apps/nginx
deployment.apps/nginx scaled
[root@masternode ~]#
[root@masternode ~]# kubectl get pods -o wide
NAME                     READY   STATUS    RESTARTS   AGE     IP               NODE                  NOMINATED NODE   READINESS GATES
nginx-846f64db7c-4hh46   1/1     Running   0          23m     192.168.52.139   worker-n1.k8s.local   <none>           <none>
nginx-846f64db7c-88jb7   1/1     Running   0          3m8s    192.168.52.145   worker-n1.k8s.local   <none>           <none>
nginx-846f64db7c-r8g4b   0/1     Pending   0          27s     <none>           <none>                <none>           <none>
nginx-846f64db7c-v75bc   1/1     Running   0          3m19s   192.168.52.144   worker-n1.k8s.local   <none>           <none>
[root@masternode ~]#
As can be seen, the new pod goes into the Pending state. The old pods are still running, which makes sense as the deployment contains the "requiredDuringSchedulingIgnoredDuringExecution" field, which ensures that running pods are not evicted. If we look into the events, we can see that the new pod could not be scheduled because no node with a matching label was available.
[root@masternode ~]# kubectl describe pod/nginx-846f64db7c-r8g4b |tail -4
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 3m26s default-scheduler 0/3 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
[root@masternode ~]#
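To unblock the Pending pod, it should be enough to re-apply the label that the affinity rule requires; the scheduler re-evaluates unschedulable pods and should place the fourth replica on worker-n1 shortly afterwards:

kubectl label nodes worker-n1.k8s.local tier=gold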
Kubernetes Node Anti-Affinity in Action
Similar to node affinity, node anti-affinity rules can be defined to ensure that a pod is not assigned to a particular group of nodes. These rules define which nodes should not be considered when scheduling a pod. Let's consider the same nginx deployment configuration which we used for node affinity. We only need to update the operator field in the spec section.
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: tier
            operator: NotIn
            values:
            - gold
The "NotIn" value in the operator field is defining the anti-affinity behavior here. This is going to ensure that the pod is not scheduled on any node which has the label "tier=gold
", as evident from the output below.
kubectl create -f nginx.yml
[root@masternode ~]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-7547cdb594-w9k7n 1/1 Running 0 17s 192.168.0.3 worker-n2.k8s.local <none> <none>
[root@masternode ~]# kubectl get nodes -L tier
NAME                   STATUS   ROLES           AGE    VERSION   TIER
worker-n1.k8s.local    Ready    <none>          4d4h   v1.24.2   gold
worker-n2.k8s.local    Ready    <none>          4d4h   v1.24.2   silver
masternode.k8s.local   Ready    control-plane   4d4h   v1.24.2
[root@masternode ~]#
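Besides NotIn, the operator field also accepts DoesNotExist, which excludes every node that carries the given label key regardless of its value. In our demo cluster this would rule out both labeled workers, so the snippet below is only a syntax illustration; note that no values list is allowed with this operator:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: tier
            operator: DoesNotExist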
Kubernetes Node Affinity Weight in Action
If you have a cluster with multiple worker nodes, then it is quite possible that more than one node matches the defined affinity rules. In this case, to schedule the pod on a node of our choice, we can assign a weighted score to each affinity rule and prioritize among the selected nodes. This is done through the preferredDuringSchedulingIgnoredDuringExecution field. Let's explain this using an example.
Create an nginx deployment as follows:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: tier
                operator: In
                values:
                - gold
                - silver
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 10
            preference:
              matchExpressions:
              - key: tier
                operator: In
                values:
                - gold
          - weight: 50
            preference:
              matchExpressions:
              - key: tier
                operator: In
                values:
                - silver
      containers:
      - image: nginx
        name: nginx
The "preferredDuringSchedulingIgnoredDuringExecution
" directive can be broken down into two parts:
- preferredDuringScheduling means that rules under this field are preferred for the pod to be scheduled to the node. Note that we're specifying preferences, not hard requirements.
- IgnoredDuringExecution means the affinity rules will not affect pods that are already running on the node
The requiredDuringSchedulingIgnoredDuringExecution section dictates that the pods under this deployment should be scheduled on nodes which have either the label tier=gold or tier=silver. Under the preferredDuringSchedulingIgnoredDuringExecution section, we define our scheduling preferences and assign a weighted score to each of them.
Because of the hard requirement in the requiredDuringSchedulingIgnoredDuringExecution section, the scheduler will only consider nodes with the label tier=gold or tier=silver for running the pod. Based on the preferences defined under preferredDuringSchedulingIgnoredDuringExecution, the scheduler will iterate through the rules and assign a weighted score to each one: 10 when tier=gold and 50 when tier=silver. The scheduler adds this score to its other priority functions and selects the node with the highest score to run the pod. In this case, the node with the label tier=silver, i.e. worker-n2, will be selected.
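As a side note, each weight must be a value between 1 and 100, and if a node happens to match several preference terms, their weights are summed before being combined with the scheduler's other scoring functions. For example, if worker-n2 matched both a weight-50 and a weight-20 preference, it would contribute 70 points from node affinity alone; this is an illustrative calculation, not part of the deployment above.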
kubectl create -f nginx.yml
[root@masternode ~]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-5898d7588b-nnlgb 1/1 Running 0 4s 192.168.0.11 worker-n2.k8s.local <none> <none>
[root@masternode ~]#
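Once you're done experimenting, you can clean up by deleting the deployment and removing the tier labels from the worker nodes (skip the label removal for any node that no longer carries it):

kubectl delete -f nginx.yml
kubectl label nodes worker-n1.k8s.local tier-
kubectl label nodes worker-n2.k8s.local tier-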
Conclusion
The pod scheduler in Kubernetes offers a lot of flexibility in scheduling application pods as per user requirements. To determine which nodes are acceptable for scheduling a pod, the scheduler evaluates each node against multiple factors. The end user can dictate or prioritize the nodes when running pods by defining node affinity and anti-affinity rules.