Kubernetes scheduling constraints
Affinity and anti-affinity rules allow you to fine-tune your Kubernetes deployments, optimizing resource utilization and enhancing reliability.
Pod Affinity
- Definition: Pod affinity expresses scheduling constraints based on the labels of Pods that are already running on candidate Nodes.
- Purpose: It encourages Pods to be colocated in the same topology domain (for example, the same Node) when they communicate frequently over the network.
- Example: Imagine a microservices architecture where two Pods, ServiceA and ServiceB, interact frequently. You can set up pod affinity so that both ServiceA and ServiceB prefer to run on the same Node. This keeps their traffic on one Node and makes communication more efficient.
- Description: The affinity rule ensures that Pods with a specific label are scheduled onto a Node that already hosts a Pod with the same label.
The manifest below requires all nginx Pods to be scheduled on the same Node, using the kubernetes.io/hostname label as the topology key.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx-deployment
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          labels:
            app: nginx
        spec:
          affinity:
            podAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: app
                    operator: In
                    values:
                    - nginx
                topologyKey: "kubernetes.io/hostname"
          containers:
          - name: nginx
            image: nginx

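The ServiceA/ServiceB scenario speaks of Pods that prefer to share a Node, which corresponds to the soft form of the rule. A minimal sketch of that form, assuming the ServiceA Pods carry a hypothetical app: service-a label; the Pod name, labels, and image are placeholders, not taken from a real deployment:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: service-b              # hypothetical name
  labels:
    app: service-b             # hypothetical label
spec:
  affinity:
    podAffinity:
      # Soft rule: the scheduler favors matching Nodes but may still place the Pod elsewhere.
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100            # 1-100; higher weights count more when Nodes are scored
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: service-a   # assumed label on the ServiceA Pods
          topologyKey: kubernetes.io/hostname
  containers:
  - name: service-b
    image: nginx               # placeholder image
```

Unlike the required form, this Pod is still scheduled when no Node currently runs a ServiceA Pod.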
Pod Anti-Affinity
- Definition: Pod anti-affinity prevents (or, in its preferred form, discourages) scheduling Pods onto Nodes that already run Pods with certain labels.
- Purpose: It helps distribute workloads across different Nodes, promoting fault tolerance and resilience.
- Example: Consider a scenario where you have two Pods, Frontend and Backend, serving a web application. You can set up pod anti-affinity so that Frontend and Backend avoid running on the same Node. This way, if one Node fails, the other Node can still handle requests.
- Description: In the required form, the anti-affinity rule ensures that a Pod with a specific label is never scheduled on a Node that already hosts a Pod with the same label.
The manifest below ensures that no two nginx Pods are scheduled on the same Node, using the kubernetes.io/hostname label as the topology key. With replicas: 3 this needs at least three schedulable Nodes; otherwise the surplus Pods stay Pending.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx-deployment
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          labels:
            app: nginx
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: app
                    operator: In
                    values:
                    - nginx
                topologyKey: "kubernetes.io/hostname"
          containers:
          - name: nginx
            image: nginx

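The Frontend/Backend scenario targets the label of a different workload rather than the Pod's own label. A minimal sketch of that case, assuming the Backend Pods carry a hypothetical app: backend label; all names and the image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: frontend               # hypothetical name
  labels:
    app: frontend              # hypothetical label
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: backend       # assumed label on the Backend Pods
        # Nodes that already run a matching Backend Pod are excluded for this Pod.
        topologyKey: kubernetes.io/hostname
  containers:
  - name: frontend
    image: nginx               # placeholder image
```

Changing the topologyKey to topology.kubernetes.io/zone would keep the two workloads in different zones instead of merely on different Nodes.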
Node Affinity
- Definition: Node affinity constrains which Nodes can receive a Pod by matching labels on those Nodes.
- Purpose: It allows you to specify an affinity toward a group of Nodes based on their labels.
- Example: Suppose you have a set of high-memory Nodes labeled memory=high and you want to run memory-intensive Pods on them. You can define node affinity so that these Pods are scheduled only on Nodes carrying the memory=high label.
- Description: Node affinity can be a hard requirement (requiredDuringSchedulingIgnoredDuringExecution) or a preference (preferredDuringSchedulingIgnoredDuringExecution). As a preference, it tells the scheduler to use a Node with the specified characteristics if one is available.
The manifest below ensures that the nginx Pod is scheduled only on a Node with the disktype=ssd label.

    apiVersion: v1
    kind: Pod
    metadata:
      name: nginx
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
      containers:
      - name: nginx
        image: nginx

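For the memory=high scenario, the preference form of node affinity might look like the sketch below. The Pod name, weight, and image are illustrative assumptions; only the memory=high label comes from the example in the text:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-intensive       # hypothetical name
spec:
  affinity:
    nodeAffinity:
      # Soft rule: Nodes labeled memory=high score higher, other Nodes remain eligible.
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50             # 1-100; relative importance among preferences
        preference:
          matchExpressions:
          - key: memory
            operator: In
            values:
            - high
  containers:
  - name: app
    image: nginx               # placeholder image
```

If the Pod must never land on an ordinary Node, the required form shown above with key memory and value high is the stricter alternative.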
Node Anti-Affinity
- Definition: Kubernetes has no dedicated nodeAntiAffinity field. Node anti-affinity is expressed through nodeAffinity with the negative operators NotIn or DoesNotExist, which keep Pods away from Nodes carrying specific labels.
- Purpose: It steers workloads away from particular Nodes, for example to keep scarce hardware free for the Pods that actually need it.
- Example: Imagine a cluster in which a few Nodes have GPUs. You can use node anti-affinity to keep Pods that do not need a GPU off those Nodes. Keeping CPU-intensive Pods away from each other, by contrast, is a job for pod anti-affinity.
- Description: Node anti-affinity acts as a repelling rule. In its required form the scheduler never places the Pod on Nodes with the specified label; in its preferred form it makes such a placement less likely.
The manifest below ensures that the nginx Pod avoids Nodes with the gpu=true label.

    apiVersion: v1
    kind: Pod
    metadata:
      name: nginx
    spec:
      affinity:
        nodeAffinity:            # there is no nodeAntiAffinity field; NotIn inverts the match
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: gpu
                operator: NotIn
                values:
                - "true"         # quoted: label values are strings, not booleans
      containers:
      - name: nginx
        image: nginx

requiredDuringSchedulingIgnoredDuringExecution
requiredDuringSchedulingIgnoredDuringExecution can be broken into two parts:
- requiredDuringScheduling: A Pod is scheduled onto a Node only if the Node satisfies the rule. In other words, the Node must meet the specified conditions for the Pod to be placed there during the initial scheduling process.
- IgnoredDuringExecution: This part comes into play after a Pod is already scheduled and running on a Node.
  - If the labels on that Node change during the Pod's execution (for example, due to an update), the existing Pod is not evicted because of these label changes.
  - Instead, only newly scheduled Pods are required to match the updated criteria.
In summary, **requiredDuringSchedulingIgnoredDuringExecution** ensures that Pods are initially placed on suitable Nodes and avoids unnecessary evictions during runtime due to label changes on the Node. It is a way to maintain stability and predictability in your Kubernetes cluster.
topologyKey
topologyKey is the key of a Node label that the scheduler uses to define topology domains for Pod placement: all Nodes that share the same value for that label form one domain. For example, with pod affinity, the scheduler places a Pod in the same domain (topology) as other Pods that match the label selector.
Common values for **topologyKey** include:
- topology.kubernetes.io/zone: Pods are scheduled in the same zone as other Pods with matching labels.
- kubernetes.io/hostname: Pods are scheduled on the same Node as other Pods with matching labels.

    apiVersion: v1
    kind: Pod
    metadata:
      name: with-pod-affinity
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: security
                operator: In
                values:
                - S1
            topologyKey: topology.kubernetes.io/zone
      containers:
      - name: with-pod-affinity
        image: k8s.gcr.io/pause:2.0

topologySpreadConstraints
**topologySpreadConstraints** allow you to control how Pods are distributed across your cluster among failure domains such as regions, zones, nodes, and other user-defined topology domains. The goal is to achieve both high availability and efficient resource utilization.
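For example, spreading Pods across Nodes avoids depending on any single Node. The YAML below spreads Pods carrying the app: example-app label evenly across all Nodes.

    apiVersion: v1
    kind: Pod
    metadata:
      name: example-pod
      labels:
        app: example-app       # illustrative label, counted by the labelSelector below
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:         # without a selector, no Pods are counted
          matchLabels:
            app: example-app
      containers:              # a Pod needs at least one container; image is a placeholder
      - name: example
        image: nginx
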
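**maxSkew** defines the maximum allowed imbalance in the number of matching Pods across topology domains, which helps maintain an even spread and thereby improves reliability and performance. The skew of a domain is the number of matching Pods in it minus the smallest number of matching Pods in any eligible domain. With maxSkew set to 1 and whenUnsatisfiable: DoNotSchedule, no domain may hold more than one Pod above the least-loaded domain.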
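To make the maxSkew arithmetic concrete, here is a small worked sketch. The zone names, the current Pod counts, and the app: web label are assumptions made up for illustration:

```yaml
# Assumed current state: matching Pods (app=web) per zone
#   zone-a: 2    zone-b: 2    zone-c: 1     -> smallest count = 1
apiVersion: v1
kind: Pod
metadata:
  name: web-extra              # hypothetical name
  labels:
    app: web                   # hypothetical label
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web
  containers:
  - name: web
    image: nginx               # placeholder image
# Placing this Pod in zone-a or zone-b: 3 - 1 = 2 >  maxSkew -> those zones are filtered out
# Placing this Pod in zone-c:           2 - 1 = 1 <= maxSkew -> allowed, counts become 2/2/2
```

With whenUnsatisfiable: ScheduleAnyway the same calculation would only lower the score of zone-a and zone-b instead of excluding them.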
topologySpreadConstraints are ideal for hierarchical topologies (where nodes are spread across logical domains), while pod/node affinity is suitable for linear topologies (where all nodes are on the same level). topologySpreadConstraints provide more expressive control over pod scheduling across broader topological domains, and combining them with other affinity rules allows you to fine-tune your workload placement.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app
    spec:
      replicas: 5
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: app
                    operator: In
                    values:
                    - my-app
                topologyKey: kubernetes.io/hostname
          topologySpreadConstraints:
          - labelSelector:
              matchLabels:
                app: my-app
            maxSkew: 1
            topologyKey: topology.kubernetes.io/zone
            whenUnsatisfiable: ScheduleAnyway
          containers:            # a Pod template needs at least one container; image is a placeholder
          - name: my-app
            image: nginx

In this example, the Pods of my-app are spread across different zones (based on the topology.kubernetes.io/zone label), while the pod anti-affinity rule keeps any two of them off the same Node.
You may notice the labelSelector inside topologySpreadConstraints. The behavior differs depending on whether it is present.
1. With labelSelector:
- When you define a topologySpreadConstraints entry with a labelSelector, the Pods matching that selector are counted to determine the number of Pods in each topology domain (such as Nodes, zones, or other user-defined domains).
- The labelSelector lets you control which Pods the spreading applies to, so that Pods with specific labels are distributed evenly or according to your desired criteria.
- For example, to avoid piling up multiple Pods with the same label on a single Node, select that label and use kubernetes.io/hostname as the topologyKey.
2. Without labelSelector:
- In a constraint defined on the Pod itself, an omitted labelSelector matches no Pods, so there is nothing to count and the constraint has no practical effect.
- The automatic behavior applies to cluster-level default constraints configured in the scheduler. There the labelSelector must be left empty, and the selector is calculated from the Services, ReplicationControllers, ReplicaSets, or StatefulSets that the Pod belongs to. Default constraints apply only to Pods that define no topologySpreadConstraints of their own.
- This is a more automatic approach, but it does not provide fine-grained control over which Pods are counted.
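The automatic selection described for the case without a labelSelector is what cluster-level default constraints provide. A sketch of such a scheduler configuration, assuming you are able to pass a configuration file to kube-scheduler (managed clusters often do not allow this); the constraint values are illustrative:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: PodTopologySpread
    args:
      # No labelSelector here: it is derived from the Services, ReplicaSets,
      # StatefulSets, or ReplicationControllers each Pod belongs to.
      defaultConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
      defaultingType: List     # use the list above instead of the built-in defaults
```

These defaults are applied only to Pods that do not declare their own topologySpreadConstraints.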
Taints and Tolerations
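Taints are applied to Nodes to mark them as “tainted” with a key, an optional value, and an effect. The scheduler will not place a Pod on a Node tainted with NoSchedule or NoExecute unless the Pod has a matching toleration.
Tolerations are set on Pods to allow them to be scheduled onto Nodes with matching taints. For taints with the NoExecute effect, the optional tolerationSeconds field additionally defines how long an already running Pod may stay on the Node after the taint is added.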
Add a taint with the NoSchedule effect to a Node:

    kubectl taint nodes node1 key1=value1:NoSchedule

The allowed values for the effect field are:
- NoExecute: This affects Pods that are already running on the Node as follows:
  - Pods that do not tolerate the taint are evicted immediately.
  - Pods that tolerate the taint without specifying tolerationSeconds in their toleration specification remain bound forever.
  - Pods that tolerate the taint with a specified tolerationSeconds remain bound for the specified amount of time. After that time elapses, the node lifecycle controller evicts the Pods from the Node.
- NoSchedule: No new Pods will be scheduled on the tainted Node unless they have a matching toleration. Pods currently running on the Node are not evicted.
- PreferNoSchedule: A “preference” or “soft” version of NoSchedule. The control plane will try to avoid placing a Pod that does not tolerate the taint on the Node, but it is not guaranteed.
Remove the taint from the Node:

    kubectl taint nodes node1 key1=value1:NoSchedule-

Get the Node's taint info:

    kubectl get node/node1 -o json | jq .spec.taints

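For orientation, a taint added with kubectl taint is stored on the Node object under spec.taints, which is the field the jq query reads. A trimmed sketch of that part of the Node in YAML, using the node1 and key1=value1 names from the commands and omitting all other fields:

```yaml
apiVersion: v1
kind: Node
metadata:
  name: node1
spec:
  taints:
  - key: key1
    value: value1
    effect: NoSchedule
```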
Tolerations are usually declared in a Pod spec or in the Pod template of a Deployment. In the YAML below, the Pods tolerate the taint with key "hardware", value "gpu", and effect NoSchedule, so they can be scheduled onto Nodes carrying that taint. The node affinity rule additionally restricts them to the Nodes named big-gpu and expensive-gpu.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-deployment
    spec:
      replicas: 3
      selector:                  # required for apps/v1 Deployments
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
        spec:
          containers:
          - name: ai
            image: skynet:1997-08-29
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values: ["big-gpu", "expensive-gpu"]
          tolerations:
          - key: "hardware"
            operator: "Equal"
            value: "gpu"
            effect: "NoSchedule"
            # no tolerationSeconds here: that field is only valid with the NoExecute effect

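The tolerationSeconds field only takes effect together with the NoExecute effect. A minimal sketch of such a toleration, plus one using the Exists operator, which matches a key regardless of its value. The Pod name, the maintenance key, and the image are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tolerant-pod           # hypothetical name
spec:
  containers:
  - name: app
    image: nginx               # placeholder image
  tolerations:
  # Stay on a Node for up to one hour after a hardware=gpu:NoExecute taint appears,
  # then get evicted.
  - key: "hardware"
    operator: "Equal"
    value: "gpu"
    effect: "NoExecute"
    tolerationSeconds: 3600
  # Tolerate any NoSchedule taint with the key "maintenance", whatever its value.
  - key: "maintenance"         # hypothetical taint key
    operator: "Exists"
    effect: "NoSchedule"
```

A toleration only permits placement on a tainted Node; it does not attract the Pod to it. Pair it with node affinity when the Pod must run there.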
Horizontal Pod Autoscaler (HPA)
Horizontal Pod Autoscaler (HPA) is a Kubernetes resource and controller that automates the scaling of pods based on observed metrics, such as CPU utilization, memory utilization, or custom metrics.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: envbin
    spec:
      selector:
        matchLabels:
          app: envbin
      template:
        metadata:
          labels:
            app: envbin
        spec:
          containers:
          - name: envbin
            image: mtinside/envbin:latest
            imagePullPolicy: Always
            resources:
              requests:
                cpu: 100m        # Utilization targets are measured against this request
              limits:
                cpu: 100m
    ---
    # autoscaling/v2 is stable since Kubernetes 1.23; v2beta2 was removed in 1.26
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: envbin
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: envbin
      minReplicas: 1
      maxReplicas: 5
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 80

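Since the HPA can also act on memory utilization, a variant of the autoscaler above with a second metric might look like the following sketch. It assumes the envbin containers additionally declare a memory request, which the Deployment shown here does not; the 70 percent target is an arbitrary illustration:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: envbin                 # same name: this replaces the CPU-only HPA rather than running beside it
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: envbin
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
  - type: Resource
    resource:
      name: memory             # requires resources.requests.memory on the containers
      target:
        type: Utilization
        averageUtilization: 70
```

When several metrics are listed, the HPA computes a desired replica count for each and uses the largest.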
Attention: Some of the sample YAML was generated by ChatGPT.