An Analysis of How the Kubernetes Scheduler Works


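This article walks through the Kubernetes Scheduler's algorithm and the principles behind it. It focuses on the predicates (filtering) and priorities (scoring) steps and covers the default policies configured in DefaultProvider. A follow-up post will analyze the Kubernetes Scheduler source code, look at the implementation details, and show how to develop a policy.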

## The Scheduler and Its Algorithm

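The Kubernetes Scheduler is a component of the Kubernetes Master. It is usually deployed on the same node as the API Server and the Controller Manager, and together they form the three core components of the Master.

The Scheduler's job in one sentence: it takes the Pods whose PodSpec.NodeName is empty, one at a time, and runs each through two steps, predicates and priorities, to pick the most suitable Node as that Pod's destination.

Expanding these two steps gives the Scheduler's algorithm: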

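Predicates: using the configured Predicates Policies (by default, the set of default predicates policies defined in DefaultProvider), filter out the Nodes that fail any of these policies. The remaining Nodes become the input to the priorities step.

Priorities: using the configured Priorities Policies (by default, the set of default priorities policies defined in DefaultProvider), score and rank the Nodes that passed the predicates step. The highest-scoring Node is the best fit, and the Pod is bound to it.

If several Nodes tie for the highest score after ranking, the scheduler picks one of them at random as the target Node.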
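The two steps above can be sketched in a few lines of Python. This is only an illustration of the flow described in this section, not the scheduler's actual code: the function name `schedule`, the dict-based Pod and Node shapes, and the `(function, weight)` pairs are all assumptions made for the example.

```python
import random


def schedule(pod, nodes, predicates, priorities, rng=random):
    """Pick a node for `pod`, or return None if no node is feasible.

    predicates: list of functions (pod, node) -> bool   (hypothetical shape)
    priorities: list of (function, weight) pairs, where each function
                returns a score from 0 to 10            (hypothetical shape)
    """
    # Step 1 (predicates): keep only nodes that pass every predicate.
    feasible = [n for n in nodes if all(p(pod, n) for p in predicates)]
    if not feasible:
        return None

    # Step 2 (priorities): weighted sum of every policy's score per node.
    totals = [sum(w * f(pod, n) for f, w in priorities) for n in feasible]

    # Ties for the highest total are broken at random.
    best = max(totals)
    top = [n for n, t in zip(feasible, totals) if t == best]
    return rng.choice(top)
```

For example, with one predicate that checks free CPU and one priority that prefers emptier nodes, the emptiest node that still fits the Pod is returned.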

So the algorithm behind the whole scheduling process is very simple. What matters is the logic of the individual policies, so let's look at Kubernetes' Predicates and Priorities Policies.

## Predicates and Priorities Policies

### Predicates Policies

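Predicates Policies are what the Scheduler uses to filter out the Nodes that meet the defined conditions. Nodes are evaluated concurrently (with at most 16 goroutines), and every configured Predicates Policy is run against each Node. A Node that fails even one policy is eliminated immediately.

Note: the number of concurrent goroutines equals the total number of Nodes, capped at 16 and controlled by a queue.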
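To make the concurrency limit concrete, here is a rough Python sketch that uses a thread pool in place of goroutines. The cap of 16 comes from the text above; the function name `filter_nodes` and the predicate signature are assumptions for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 16  # cap on concurrent workers, as stated in the text


def filter_nodes(pod, nodes, predicates):
    """Return the nodes that pass every predicate, preserving input order."""
    if not nodes:
        return []

    def fits(node):
        # all() stops at the first failing predicate, so one failure
        # is enough to reject the node.
        return all(pred(pod, node) for pred in predicates)

    # One worker per node, but never more than MAX_WORKERS; the pool's
    # internal work queue feeds the remaining nodes to idle workers.
    workers = min(MAX_WORKERS, len(nodes))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        verdicts = list(pool.map(fits, nodes))
    return [n for n, ok in zip(nodes, verdicts) if ok]
```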

Kubernetes provides the Predicates Policies defined below. You can pass --policy-config-file to kube-scheduler at startup to specify which set of policies to apply, for example:

{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {"name": "PodFitsPorts"},
    {"name": "PodFitsResources"},
    {"name": "NoDiskConflict"},
    {"name": "NoVolumeZoneConflict"},
    {"name": "MatchNodeSelector"},
    {"name": "HostName"}
  ],
  "priorities": []
}

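NoDiskConflict: evaluates whether a pod can fit given the volumes it requests and those that are already mounted. The currently supported volumes are AWS EBS, GCE PD, iSCSI, and Ceph RBD. Only persistent volume claims for these supported types are checked. Persistent volumes added directly to pods are not evaluated and are not constrained by this policy.

NoVolumeZoneConflict: evaluates whether the volumes a pod requests are available on the node, given the zone restrictions.

PodFitsResources: checks whether the free resources (CPU and memory) meet the Pod's requirements. Free resources are measured as the node's capacity minus the sum of the requests of all Pods on the node. To learn more about resource QoS in Kubernetes, see the QoS proposal.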
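As a minimal sketch of the PodFitsResources rule (free = capacity minus the sum of requests), assuming requests are plain dicts with CPU in millicores and memory in bytes. The function name and data shapes are hypothetical and chosen only for this example:

```python
def pod_fits_resources(pod_request, node_capacity, running_requests):
    """Return True if the pod's CPU and memory requests fit in the free space.

    pod_request / node_capacity: dicts such as {"cpu": 500, "memory": 2**30}
    running_requests: list of request dicts for pods already on the node
    """
    for resource in ("cpu", "memory"):
        # Free = capacity minus the sum of requests of all pods on the node.
        used = sum(r.get(resource, 0) for r in running_requests)
        free = node_capacity[resource] - used
        if pod_request.get(resource, 0) > free:
            return False
    return True
```

Note that the check is based on requests, not on actual usage, which matches the description above.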

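PodFitsHostPorts: checks whether any host port required by the Pod is already occupied on the node.

HostName: filters out all nodes except the one specified in the PodSpec's NodeName field.

MatchNodeSelector: checks whether the node's labels match the labels specified in the Pod's nodeSelector field and, as of Kubernetes v1.2, also match the scheduler.alpha.kubernetes.io/affinity Pod annotation if present. The Kubernetes documentation on node selection covers both in more detail.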
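These three checks are simple enough to sketch directly. The following is an illustrative approximation, not the scheduler's implementation: the function names and argument shapes are made up for the example, the annotation-based affinity check is left out, and treating an empty NodeName as "fits anywhere" is an assumption.

```python
def pod_fits_host_ports(pod_ports, used_ports):
    """True if none of the pod's host ports is already taken on the node."""
    return not (set(pod_ports) & set(used_ports))


def host_name(pod_node_name, node_name):
    """True if the pod is not pinned to a node, or is pinned to this one."""
    # Assumption: an empty NodeName places no restriction on the node.
    return not pod_node_name or pod_node_name == node_name


def match_node_selector(node_selector, node_labels):
    """True if every key/value in the pod's nodeSelector is a node label."""
    return all(node_labels.get(k) == v for k, v in node_selector.items())
```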

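MaxEBSVolumeCount: ensures that the number of attached ElasticBlockStore volumes does not exceed a maximum value. The default is 39, because Amazon recommends a maximum of 40 and one of those is reserved for the root volume (see Amazon's documentation). The maximum can be controlled by setting the KUBE_MAX_PD_VOLS environment variable.

MaxGCEPDVolumeCount: ensures that the number of attached GCE PersistentDisk volumes does not exceed a maximum value. The default is 16, which is the maximum GCE allows. The maximum can be controlled by setting the KUBE_MAX_PD_VOLS environment variable.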
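A small sketch of how such a limit could be resolved and applied. The defaults (39 and 16) and the KUBE_MAX_PD_VOLS variable come from the descriptions above; the helper names, the fallback to the default on an invalid or non-positive value, and counting distinct volume IDs are assumptions for illustration.

```python
import os

DEFAULT_MAX_EBS_VOLUMES = 39     # default stated for MaxEBSVolumeCount
DEFAULT_MAX_GCE_PD_VOLUMES = 16  # default stated for MaxGCEPDVolumeCount


def max_volumes(default, environ=os.environ):
    """Return the limit from KUBE_MAX_PD_VOLS, or `default` if unset/invalid."""
    raw = environ.get("KUBE_MAX_PD_VOLS", "")
    try:
        value = int(raw)
    except ValueError:
        return default
    # Assumption: a non-positive override is ignored.
    return value if value > 0 else default


def within_volume_limit(attached, requested, limit):
    """True if the distinct volumes (attached plus requested) fit the limit."""
    return len(set(attached) | set(requested)) <= limit
```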

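CheckNodeMemoryPressure: checks whether a pod can be scheduled on a node that reports a memory pressure condition. Currently, no BestEffort pod should be placed on a node under memory pressure, because the kubelet would evict it automatically.

CheckNodeDiskPressure: checks whether a pod can be scheduled on a node that reports a disk pressure condition. Currently, no pod should be placed on a node under disk pressure, because the kubelet would evict it automatically.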

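The default DefaultProvider selects the following Predicates Policies:

NoVolumeZoneConflict

MaxEBSVolumeCount

MaxGCEPDVolumeCount

MatchInterPodAffinity

Note: fit is determined by inter-pod affinity. AffinityAnnotationKey represents the key of affinity data (JSON serialized) in the annotations of a Pod.

AffinityAnnotationKey string = "scheduler.alpha.kubernetes.io/affinity"

NoDiskConflict

GeneralPredicates, which bundles PodFitsResources, PodFitsHost, PodFitsHostPorts, and PodSelectorMatches. PodFitsResources checks the following resources:

pods, in number

cpu, in cores

memory, in bytes

alpha.kubernetes.io/nvidia-gpu, in devices (as of v1.4, each node supports at most 1 GPU)

PodToleratesNodeTaints

CheckNodeMemoryPressure

CheckNodeDiskPressure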

### Priorities Policies

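Nodes that survive the predicates step move on to the priorities step. Here the Scheduler concurrently starts one goroutine per Priorities Policy. Each goroutine runs its policy's implementation over all the pre-selected Nodes and scores each one. Every policy scores every Node from 0 to 10, where 0 is the lowest and 10 the highest. Once all the policy goroutines have finished, each Node's per-policy scores are multiplied by the weight configured for the corresponding policy and summed to give the Node's final score.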

finalScoreNodeA = (weight1 * priorityFunc1) + (weight2 * priorityFunc2)
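The weighted sum can be sketched as follows, with one worker thread per policy standing in for one goroutine per policy. The function name `score_nodes` and the `(score_fn, weight)` pairs are hypothetical; only the "one worker per policy, then weighted sum" structure is taken from the text.

```python
from concurrent.futures import ThreadPoolExecutor


def score_nodes(pod, nodes, priorities):
    """Return the final weighted score of each node, aligned with `nodes`.

    priorities: list of (score_fn, weight); score_fn(pod, node) -> 0..10
    """
    if not nodes or not priorities:
        return [0 for _ in nodes]

    def run_policy(entry):
        # One policy scores every node, then applies its own weight.
        score_fn, weight = entry
        return [weight * score_fn(pod, node) for node in nodes]

    # One worker per policy, with no upper bound on the worker count.
    with ThreadPoolExecutor(max_workers=len(priorities)) as pool:
        per_policy = list(pool.map(run_policy, priorities))

    # Sum the weighted scores column by column, i.e. per node.
    return [sum(column) for column in zip(*per_policy)]
```

For instance, if node A scores 7 and 9 on two policies with weights 1 and 2, its final score is 1 * 7 + 2 * 9 = 25.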

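Note: the number of concurrent goroutines here equals the number of Priorities Policies, with no queue and no upper bound. In practice, though, nobody configures more than ten or twenty policies.

A thought: if no Node passes the predicates step, the Scheduler returns a FailedPredicates error directly and never triggers the prioritizing phase, which is reasonable. But if exactly one Node passes, prioritizing is still triggered and follows the same flow as it does for multiple Nodes. In that case the priorities step could simply return that Node as the final scheduling result without running the whole scoring flow.

As noted earlier, if several Nodes tie for the highest score after ranking, the scheduler picks one of them at random as the target Node.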

Kubernetes provides the Priorities Policies defined below. You can pass --policy-config-file to kube-scheduler at startup to specify which set of policies to apply, for example:

{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [],
  "priorities": [
    {"name": "LeastRequestedPriority", "weight": 1},
    {"name": "BalancedResourceAllocation", "weight": 1},
    {"name": "ServiceSpreadingPriority", "weight": 1},
    {"name": "EqualPriority", "weight": 1}
  ]
}

LeastRequestedPriority: The node is prioritized based on the fraction of the node that would be free if the new Pod were scheduled onto the node. (In other words, (capacity – sum of requests of all Pods already on the node – request of Pod that is being scheduled) / capacity). CPU and memory are equally weighted. The node with the highest free fraction is the most preferred. Note that this priority function has the effect of spreading Pods across the nodes with respect to resource consumption.
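The free-fraction formula can be turned into a score roughly like this. The formula (capacity minus all requests, divided by capacity) and the equal weighting of CPU and memory come from the description; scaling to the 0 to 10 range with integer division, and the function names, are assumptions made for this sketch.

```python
def least_requested_score(requested, capacity):
    """Score one resource from 0 to 10 by the fraction that would stay free."""
    if capacity <= 0 or requested > capacity:
        return 0
    return (capacity - requested) * 10 // capacity


def least_requested_priority(pod, node_requested, node_capacity):
    """Average the CPU and memory scores (both weighted equally)."""
    scores = []
    for resource in ("cpu", "memory"):
        # Requests already on the node plus the pod being scheduled.
        total = node_requested[resource] + pod[resource]
        scores.append(least_requested_score(total, node_capacity[resource]))
    return sum(scores) // 2
```

A node that would be half free on both CPU and memory after placement scores 5, while an empty node receiving a zero-request pod scores 10.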

BalancedResourceAllocation: This priority function tries to put the Pod on a node such that the CPU and Memory utilization rate is balanced after the Pod is deployed.
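The description does not give a formula, so the sketch below assumes one plausible choice: the closer the CPU and memory utilization fractions are to each other, the higher the score, computed as 10 minus ten times their absolute difference. Both the formula and the function name are assumptions, not something stated in this article.

```python
def balanced_resource_score(cpu_requested, cpu_capacity,
                            mem_requested, mem_capacity):
    """Score 0..10; higher when CPU and memory utilization are closer."""
    if cpu_capacity <= 0 or mem_capacity <= 0:
        return 0
    cpu_fraction = cpu_requested / cpu_capacity
    mem_fraction = mem_requested / mem_capacity
    # A node that would be fully used on either resource gets the lowest score.
    if cpu_fraction >= 1 or mem_fraction >= 1:
        return 0
    # Assumed formula: penalize the gap between the two utilization rates.
    return int(10 - abs(cpu_fraction - mem_fraction) * 10)
```

Under this assumption, a node at 50% CPU and 50% memory scores 10, and a node at 25% CPU and 75% memory scores 5.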

SelectorSpreadPriority: Spread Pods by minimizing the number of Pods belonging to the same service, replication controller, or replica set on the same node. If zone information is present on the nodes, the priority will be adjusted so that pods are spread across zones and nodes.

CalculateAntiAffinityPriority: Spread Pods by minimizing the number of Pods belonging to the same service on nodes with the same value for a particular label.

ImageLocalityPriority: Nodes are prioritized based on locality of images requested by a pod. Nodes with larger size of already-installed packages required by the pod will be preferred over nodes with no already-installed packages required by the pod or a small total size of already-installed packages required by the pod.

NodeAffinityPriority: (Kubernetes v1.2) Implements preferredDuringSchedulingIgnoredDuringExecution node affinity; see here for more details.

The default DefaultProvider selects the following Priorities Policies:

SelectorSpreadPriority, default weight 1

InterPodAffinityPriority, default weight 1

Pods should be placed in the same topological domain (e.g. same node, same rack, same zone, same power domain, etc.) as some other pods, or, conversely, should not be placed in the same topological domain as some other pods. AffinityAnnotationKey represents the key of affinity data (JSON serialized) in the annotations of a Pod.

scheduler.alpha.kubernetes.io/affinity= …

LeastRequestedPriority, default weight 1

BalancedResourceAllocation, default weight 1

NodePreferAvoidPodsPriority, default weight 10000

Note: the weight here is deliberately large (10000). If a Node's score for this policy is non-zero, its weighted final score will be very high. If the score is 0, the Node is bound to lose against the Nodes with such high scores. The analysis is as follows.

If the Node's annotations do not set the key-value pair

scheduler.alpha.kubernetes.io/preferAvoidPods= …

then the Node scores 10 for this policy. Multiplied by the weight of 10000, this policy alone contributes 100,000, so the Node's final score is at least 100,000.

If the Node's annotations do set

scheduler.alpha.kubernetes.io/preferAvoidPods= …

and the Pod's controller is a ReplicationController or ReplicaSet, then the Node scores 0 for this policy. Its final score is then absurdly low compared with that of any Node without the annotation, so this Node is certain to lose whenever such a Node is available.

NodeAffinityPriority, default weight 1

TaintTolerationPriority, default weight 1
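The effect of the 10000 weight is easy to check with a little arithmetic. The weights below are the defaults listed above; the helper `final_score` and the two example score sets are made up for illustration.

```python
DEFAULT_WEIGHTS = {
    "SelectorSpreadPriority": 1,
    "InterPodAffinityPriority": 1,
    "LeastRequestedPriority": 1,
    "BalancedResourceAllocation": 1,
    "NodePreferAvoidPodsPriority": 10000,
    "NodeAffinityPriority": 1,
    "TaintTolerationPriority": 1,
}


def final_score(per_policy_scores):
    """Weighted sum of a node's per-policy scores (each from 0 to 10)."""
    return sum(DEFAULT_WEIGHTS[name] * score
               for name, score in per_policy_scores.items())


# Best case for a node that is marked "prefer avoid": 10 everywhere else.
avoided = {name: 10 for name in DEFAULT_WEIGHTS}
avoided["NodePreferAvoidPodsPriority"] = 0

# Worst case for an unmarked node: 0 everywhere else.
unmarked = {name: 0 for name in DEFAULT_WEIGHTS}
unmarked["NodePreferAvoidPodsPriority"] = 10

print(final_score(avoided))   # 6 policies * 10 * weight 1 = 60
print(final_score(unmarked))  # 10 * weight 10000 = 100000
```

Even in the most favorable case, the other six policies add up to only 60, which can never make up for the 100,000 contributed by NodePreferAvoidPodsPriority.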

## Scheduler Algorithm Flowchart

## Summary

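The Kubernetes scheduler's job is to schedule each pod onto the most suitable Node.

The scheduling process has two steps: predicates and priorities.

The default scheduling configuration is DefaultProvider, and the policies it includes are listed above.

You can use the kube-scheduler startup parameter --policy-config-file to point to a custom JSON file in which you assemble your own Predicates and Priorities Policies in the format shown earlier.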
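If you prefer to generate the policy file rather than write it by hand, a few lines of Python are enough. The JSON layout follows the examples in this article; the helper names `build_policy` and `policy_to_json` are hypothetical, and the script does not check that the policy names are ones your kube-scheduler version recognizes.

```python
import json


def build_policy(predicates, priorities):
    """Assemble a policy document.

    predicates: list of predicate names
    priorities: list of (priority name, weight) pairs
    """
    return {
        "kind": "Policy",
        "apiVersion": "v1",
        "predicates": [{"name": name} for name in predicates],
        "priorities": [{"name": name, "weight": weight}
                       for name, weight in priorities],
    }


def policy_to_json(policy):
    """Serialize the policy as indented JSON text."""
    return json.dumps(policy, indent=2)


policy = build_policy(
    ["PodFitsResources", "NoDiskConflict", "MatchNodeSelector"],
    [("LeastRequestedPriority", 1), ("BalancedResourceAllocation", 1)],
)
print(policy_to_json(policy))
```

Save the printed output to a file and pass its path to kube-scheduler with --policy-config-file.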
