K8S二次开发系列之五:深度解析scheduler原理


调度器概述

k8s scheduler 是一个控制面进程,它分配 Pod 到 Nodes。根据限制和可用资源,scheduler 确定哪些节点符合调度队列里的 Pod。然后对这些符合的节点进行打分,然后把Pod绑定到合适的节点上。一个集群内可以存在多个 scheduler,而 kube-scheduler 是一个参考实现。

所以 scheduler 主要要做的事就是根据 Nodes 当前状态和Pod对资源的需求,按照顺序运行一系列指定的算法来挑选出一个 Node。

我们可以通过下图,对上述说的列算法有一个初步的认识。

图中每一个绿色箭头在 k8s 中叫扩展点(extension point),从图中可以看到一共有 10 个扩展点。

用户可以编写自己的调度插件(scheduler plugins)注册到这些扩展点来实现想要的调度逻辑。 每个扩展点上一般会有多个 plugins,按注册顺序依次执行。

扩展点根据是否影响调度决策,可以分为两类。

  • 影响调度决策的扩展点
    大部分扩展点是影响调度决策的,在插件函数的返回值中包括一个成功/失败字段,决定了是允许还是拒绝这个 pod 进入下一处理阶段,任何一个扩展点失败了,这个 pod 的调度就失败了。

  • 不影响调度决策的扩展点(informational)
    少数几个扩展点是 informational 的,这些函数没有返回值,因此不能影响调度决策,在这里面可以修改 pod/node 等信息,或者执行清理操作。

扩展点分类说明

插件分类说明

根据插件是否维护在 k8s 代码仓库本身,分为两类。

in-tree plugins

维护在 k8s 代码目录 pkg/scheduler/framework/plugins 中, 跟内置调度器一起编译。里面有十几个调度插件,大部分都是常用和在用的。

$ ll pkg/scheduler/framework/plugins
defaultbinder/
defaultpreemption/
dynamicresources/
feature/
imagelocality/
interpodaffinity/
names/
nodeaffinity/
nodename/
nodeports/
noderesources/
nodeunschedulable/
nodevolumelimits/
podtopologyspread/
queuesort/
schedulinggates/
selectorspread/
tainttoleration/
volumebinding/
volumerestrictions/
volumezone/

in-tree 方式每次要添加新插件,或者修改原有插件,都需要修改 kube-scheduler 代码然后编译和 重新部署 kube-scheduler,比较重量级。

out-of-tree plugins

out-of-tree plugins 由用户自己编写和维护,独立部署, 不需要对 k8s 做任何代码或配置改动。

本质上 out-of-tree plugins 也是跟 kube-scheduler 代码一起编译的,不过 kube-scheduler 相关代码已经抽出来作为一个独立项目:

github.com/kubernetes-sigs/scheduler-plugins

用户只需要引用这个包,编写自己的调度器插件,然后以普通 pod 方式部署就行(其他部署方式也行,比如 binary 方式部署)。 编译之后是个包含默认调度器和所有 out-of-tree 插件的总调度器程序,有内置调度器的功能,也包括了 out-of-tree 调度器的功能。用法有两种:

  1. 跟现有调度器并行部署,只管理特定的某些 pods
  2. 取代现有调度器,因为它功能也是全的。

上面我们主要从扩展点和插件方面说明了scheduler的架构。下面我们从源码架构说说scheduler是怎么工作。

下图是 kube-scheduler 代码的主要框架:

我们来看看,当一个新 Pod创建出来后,这个流程是怎么走的:

  1. informer 监听到了有新建Pod,根据Pod的优先级把Pod加入到activeQ中适当位置(即执行sort插件);
  2. scheduler 从activeQ队头取一个Pod(如果队列没有Pod可取,则会一直阻塞;此时假设就是上述说的新建的Pod),开始调度;
  3. 执行filter类型扩展点(包括preFilter,filter,postFilter)插件,选出所有符合Pod的Node,如果无法找到符合的Node, 则把Pod加入unscheduleableQ中,此次调度结束;
  4. 执行score扩展点插件,找出最符合Pod的那个Node;
  5. assume Pod。这一步就是乐观假设Pod已经调度成功,更新缓存中Node和PodStats信息,到了这里scheduling cycle就已经结束了,然后会开启新的一轮调度。至于真正的绑定,则会新起一个协程。
  6. 执行reserve插件;
  7. 启动协程绑定Pod到Node上。实际上就是修改Pod.spec.nodeName: 选定的node名字,然后调用kube-apiserver接口写入etcd。如果绑定失败了,那么移除缓存中此前加入的信息,然后把Pod放入activeQ中,后续重新调度。
  8. 执行postBinding,该步没有实现的插件没所以没有做任何事。

主要扩展点源码解析

调度阶段

Prefilter

pod 预处理和检查,不符合预期就提前结束调度。

// https://github.com/kubernetes/kubernetes/blob/v1.28.4/pkg/scheduler/framework/interface.go#L349-L367

// PreFilterPlugin is an interface that must be implemented by "PreFilter" plugins.
// These plugins are called at the beginning of the scheduling cycle.
type PreFilterPlugin interface {
    // PreFilter is called at the beginning of the scheduling cycle. All PreFilter
    // plugins must return success or the pod will be rejected. PreFilter could optionally
    // return a PreFilterResult to influence which nodes to evaluate downstream. This is useful
    // for cases where it is possible to determine the subset of nodes to process in O(1) time.
    // When it returns Skip status, returned PreFilterResult and other fields in status are just ignored,
    // and coupled Filter plugin/PreFilterExtensions() will be skipped in this scheduling cycle.
    PreFilter(ctx , state *CycleState, p *v1.Pod) (*PreFilterResult, *Status)

    // PreFilterExtensions returns a PreFilterExtensions interface if the plugin implements one,
    // or nil if it does not. A Pre-filter plugin can provide extensions to incrementally
    // modify its pre-processed info. The framework guarantees that the extensions
    // AddPod/RemovePod will only be called after PreFilter, possibly on a cloned
    // CycleState, and may call those functions more than once before calling
    // Filter again on a specific node.
    PreFilterExtensions() PreFilterExtensions
}
  • 输入
    • p *v1.Pod 是待调度的 pod;
    • 第二个参数 state 可用于保存一些状态信息,然后在后面的扩展点(例如 Filter() 阶段)拿出来用;
  • 输出
    • 只要有任何一个 plugin 返回失败,这个 pod 的调度就失败了;
    • 换句话说,所有已经注册的 PreFilter plugins 都成功之后,pod 才会进入到下一个环节。

Filter

排除所有不符合要求的 node。

这里的插件可以过滤掉那些不满足要求的 node(equivalent of Predicates in a scheduling Policy)

  • 针对每个 node,调度器会按配置顺序依次执行 filter plugins;
  • 任何一个插件 返回失败,这个 node 就被排除了;
// https://github.com/kubernetes/kubernetes/blob/v1.28.4/pkg/scheduler/framework/interface.go#L349C1-L367C2

// FilterPlugin is an interface for Filter plugins. These plugins are called at the
// filter extension point for filtering out hosts that cannot run a pod.
// This concept used to be called 'predicate' in the original scheduler.
// These plugins should return "Success", "Unschedulable" or "Error" in Status.code.
// However, the scheduler accepts other valid codes as well.
// Anything other than "Success" will lead to exclusion of the given host from running the pod.
type FilterPlugin interface {
    Plugin
    // Filter is called by the scheduling framework.
    // All FilterPlugins should return "Success" to declare that
    // the given node fits the pod. If Filter doesn't return "Success",
    // it will return "Unschedulable", "UnschedulableAndUnresolvable" or "Error".
    // For the node being evaluated, Filter plugins should look at the passed
    // nodeInfo reference for this particular node's information (e.g., pods
    // considered to be running on the node) instead of looking it up in the
    // NodeInfoSnapshot because we don't guarantee that they will be the same.
    // For example, during preemption, we may pass a copy of the original
    // nodeInfo object that has some pods removed from it to evaluate the
    // possibility of preempting them to schedule the target pod.
    Filter(ctx , state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status
}
  • 输入
    • nodeInfo 是当前给定的 node 的信息,Filter() 程序判断这个 node 是否符合要求;
  • 输出
    • 放行或拒绝。

对于给定 node,如果所有 Filter plugins 都返回成功,这个 node 才算通过筛选, 成为备选 node 之一。

PostFilter

Filter 之后没有 node 剩下,补救阶段。

如果 Filter 阶段之后,所有 nodes 都被筛掉了,一个都没剩,才会执行这个阶段;否则不会执行这个阶段的 plugins。

// https://github.com/kubernetes/kubernetes/blob/v1.28.4/pkg/scheduler/framework/interface.go#L392C1-L407C2

// PostFilterPlugin is an interface for "PostFilter" plugins. These plugins are called after a pod cannot be scheduled.
type PostFilterPlugin interface {
    // A PostFilter plugin should return one of the following statuses:
    // - Unschedulable: the plugin gets executed successfully but the pod cannot be made schedulable.
    // - Success: the plugin gets executed successfully and the pod can be made schedulable.
    // - Error: the plugin aborts due to some internal error.
    //
    // Informational plugins should be configured ahead of other ones, and always return Unschedulable status.
    // Optionally, a non-nil PostFilterResult may be returned along with a Success status. For example,
    // a preemption plugin may choose to return nominatedNodeName, so that framework can reuse that to update the
    // preemptor pod's .spec.status.nominatedNodeName field.
    PostFilter(ctx , state *CycleState, pod *v1.Pod, filteredNodeStatusMap NodeToStatusMap) (*PostFilterResult, *Status)
}

按 plugin 顺序依次执行,任何一个插件将 node 标记为 Schedulable 就算成功,不再执行剩下的 PostFilter plugins。

典型例子:preemptiontoleration, Filter() 之后已经没有可用 node 了,在这个阶段就挑一个 pod/node,抢占它的资源。

PreScore

PreScore/Score/NormalizeScore 都是给 node 打分的,以最终选出一个最合适的 node,这里就不展开了。

Score

针对每个 node 依次调用 scoring plugin,得到一个分数。

NormalizeScore

给每个 node 打的分 最终必须转换为一个 [0-100] 的范围,在这一步完成该操作。

Reserve

Informational,维护 plugin 状态信息

// https://github.com/kubernetes/kubernetes/blob/v1.28.4/pkg/scheduler/framework/interface.go#L444C1-L462C2

// ReservePlugin is an interface for plugins with Reserve and Unreserve
// methods. These are meant to update the state of the plugin. This concept
// used to be called 'assume' in the original scheduler. These plugins should
// return only Success or Error in Status.code. However, the scheduler accepts
// other valid codes as well. Anything other than Success will lead to
// rejection of the pod.
type ReservePlugin interface {
    // Reserve is called by the scheduling framework when the scheduler cache is
    // updated. If this method returns a failed Status, the scheduler will call
    // the Unreserve method for all enabled ReservePlugins.
    Reserve(ctx , state *CycleState, p *v1.Pod, nodeName string) *Status
    // Unreserve is called by the scheduling framework when a reserved pod was
    // rejected, an error occurred during reservation of subsequent plugins, or
    // in a later phase. The Unreserve method implementation must be idempotent
    // and may be called by the scheduler even if the corresponding Reserve
    // method for the same plugin was not called.
    Unreserve(ctx , state *CycleState, p *v1.Pod, nodeName string)
}

这里有两个方法,都是 informational,也就是不影响调度决策; 维护了 runtime state (aka “stateful plugins”) 的插件,可以通过这两个方法 接收 scheduler 传来的信息:

  • Reserve
    用来避免 scheduler 等待 bind 操作结束期间,因 race condition 导致的错误。 只有当所有 Reserve plugins 都成功后,才会进入下一阶段,否则 scheduling cycle 就中止了。
  • Unreserve
    调度失败,这个阶段回滚时执行。Unreserve() 必须幂等,且不能 fail。

Permit

允许/拒绝/等待 进入 binding cycle。

这是 scheduling cycle 的最后一个扩展点了,可以阻止或延迟将一个 pod binding 到 candidate node。

// PermitPlugin is an interface that must be implemented by "Permit" plugins.
// These plugins are called before a pod is bound to a node.
type PermitPlugin interface {
    // Permit is called before binding a pod (and before prebind plugins). Permit
    // plugins are used to prevent or delay the binding of a Pod. A permit plugin
    // must return success or wait with timeout duration, or the pod will be rejected.
    // The pod will also be rejected if the wait timeout or the pod is rejected while
    // waiting. Note that if the plugin returns "wait", the framework will wait only
    // after running the remaining plugins given that no other plugin rejects the pod.
    Permit(ctx , state *CycleState, p *v1.Pod, nodeName string) (*Status, time.Duration)
}

三种结果:

  • approve:所有 Permit plugins 都 appove 之后,这个 pod 就进入下面的 binding 阶段;
  • deny:任何一个 Permit plugin deny 之后,就无法进入 binding 阶段。这会触发 Reserve plugins 的 Unreserve() 方法;
  • wait (with a timeout):如果有 Permit plugin 返回 “wait”,这个 pod 就会进入一个 internal “waiting” Pods list;

绑定阶段

PreBind

Bind 之前的预处理,例如到 node 上去挂载 volume。

例如,在将 pod 调度到一个 node 之前,先给这个 pod 在那台 node 上挂载一个 network volume。

// PreBindPlugin is an interface that must be implemented by "PreBind" plugins.
// These plugins are called before a pod being scheduled.
type PreBindPlugin interface {
    // PreBind is called before binding a pod. All prebind plugins must return
    // success or the pod will be rejected and won't be sent for binding.
    PreBind(ctx , state *CycleState, p *v1.Pod, nodeName string) *Status
}

任何一个 PreBind plugin 失败,都会导致 pod 被 reject,进入到 reserve plugins 的 Unreserve() 方法。

Bind

将 pod 关联到 node,所有 PreBind 完成之后才会进入 Bind。

// https://github.com/kubernetes/kubernetes/blob/v1.28.4/pkg/scheduler/framework/interface.go#L497

// Bind plugins are used to bind a pod to a Node.
type BindPlugin interface {
    // Bind plugins will not be called until all pre-bind plugins have completed. Each
    // bind plugin is called in the configured order. A bind plugin may choose whether
    // or not to handle the given Pod. If a bind plugin chooses to handle a Pod, the
    // remaining bind plugins are skipped. When a bind plugin does not handle a pod,
    // it must return Skip in its Status code. If a bind plugin returns an Error, the
    // pod is rejected and will not be bound.
    Bind(ctx , state *CycleState, p *v1.Pod, nodeName string) *Status
}
  • 所有 plugin 按配置顺序依次执行;
  • 每个 plugin 可以选择是否要处理一个给定的 pod;
  • 如果选择处理,后面剩下的 plugins 会跳过。也就是最多只有一个 bind plugin 会执行。

PostBind

informational,可选,执行清理操作。

这是一个 informational extension point,也就是无法影响调度决策(没有返回值)。

// https://github.com/kubernetes/kubernetes/blob/v1.28.4/pkg/scheduler/framework/interface.go#L473

// PostBindPlugin is an interface that must be implemented by "PostBind" plugins.
// These plugins are called after a pod is successfully bound to a node.
type PostBindPlugin interface {
    // PostBind is called after a pod is successfully bound. These plugins are informational.
    // A common application of this extension point is for cleaning
    // up. If a plugin needs to clean-up its state after a pod is scheduled and
    // bound, PostBind is the extension point that it should register.
    PostBind(ctx , state *CycleState, p *v1.Pod, nodeName string)
}

以上就是 kube-scheduler 的基本原理。


文章作者: 洪宇轩
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 洪宇轩 !
评论
  目录