Depending on whether you need to write code, there are two ways to build a custom scheduler:
- Without writing code: recombine the existing default plugins to define a new scheduler
- By writing code: implement the plugin interfaces and develop your own scheduler
This article covers the second approach: writing a filter-type scheduler plugin.
As explained in the earlier article 深度解析scheduler原理, a filter-type scheduler mainly has to implement two interfaces, FilterPlugin and Plugin:
// https://github.com/kubernetes/kubernetes/blob/v1.28.4/pkg/scheduler/framework/interface.go#L349C1-L367C2
// FilterPlugin is an interface for Filter plugins. These plugins are called at the
// filter extension point for filtering out hosts that cannot run a pod.
// This concept used to be called 'predicate' in the original scheduler.
// These plugins should return "Success", "Unschedulable" or "Error" in Status.code.
// However, the scheduler accepts other valid codes as well.
// Anything other than "Success" will lead to exclusion of the given host from running the pod.
type FilterPlugin interface {
    Plugin
    // Filter is called by the scheduling framework.
    // All FilterPlugins should return "Success" to declare that
    // the given node fits the pod. If Filter doesn't return "Success",
    // it will return "Unschedulable", "UnschedulableAndUnresolvable" or "Error".
    // For the node being evaluated, Filter plugins should look at the passed
    // nodeInfo reference for this particular node's information (e.g., pods
    // considered to be running on the node) instead of looking it up in the
    // NodeInfoSnapshot because we don't guarantee that they will be the same.
    // For example, during preemption, we may pass a copy of the original
    // nodeInfo object that has some pods removed from it to evaluate the
    // possibility of preempting them to schedule the target pod.
    Filter(ctx context.Context, state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status
}

type Plugin interface {
    Name() string
}
In other words, we need to implement two methods: Filter and Name.
Next, let's write a simple scheduler that filters out Nodes carrying the label node=hello.
Writing the custom scheduler
Create an empty directory scheduler/node-filter-label and initialize the Go module:
mkdir -p scheduler/node-filter-label
cd scheduler && go mod init myscheduler
Create a manifests directory for the KubeSchedulerConfiguration and a plugins directory for the plugin code. The final directory layout looks like this:
tree scheduler/
scheduler/
├── go.mod
├── go.sum
├── node-filter-label
│ ├── main.go
│ ├── manifests
│ │ └── nodelabelfilter.yaml
│ └── plugins
│ └── node_filter_label.go
Writing node_filter_label.go
package plugins

import (
    "context"

    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/klog/v2"
    "k8s.io/kubernetes/pkg/scheduler/framework"
)

// SchedulerName is the plugin name used for registration and referenced in the
// KubeSchedulerConfiguration.
const SchedulerName = "NodeFilterLabel"

type NodeFilterLabel struct{}

func (pl *NodeFilterLabel) Name() string {
    return SchedulerName
}

// Filter rejects any node that carries the label node=hello.
func (pl *NodeFilterLabel) Filter(ctx context.Context, _ *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
    for k, v := range nodeInfo.Node().ObjectMeta.Labels {
        if k == "node" && v == "hello" {
            klog.InfoS("node failed to pass NodeFilterLabel filter", "pod_name", pod.Name, "current node", nodeInfo.Node().Name)
            return framework.NewStatus(framework.UnschedulableAndUnresolvable, "node has label node=hello")
        }
    }
    klog.InfoS("node pass NodeFilterLabel filter", "pod_name", pod.Name, "current node", nodeInfo.Node().Name)
    // A nil status means Success: the node is allowed to run the pod.
    return nil
}

// New is the plugin factory registered with the scheduler framework.
func New(_ context.Context, _ runtime.Object, _ framework.Handle) (framework.Plugin, error) {
    return &NodeFilterLabel{}, nil
}
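Before wiring the plugin into a binary, you can exercise the Filter logic on its own. The following is a minimal unit-test sketch (the test file, say node_filter_label_test.go in the same plugins package, and the node and pod names are illustrative, not part of the original walkthrough); it uses framework.NewNodeInfo and NodeInfo.SetNode to construct the nodeInfo argument:

package plugins

import (
    "context"
    "testing"

    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/kubernetes/pkg/scheduler/framework"
)

func TestNodeFilterLabel(t *testing.T) {
    pl := &NodeFilterLabel{}
    pod := &v1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "test-pod"}}

    // A node carrying node=hello must be rejected.
    labeled := framework.NewNodeInfo()
    labeled.SetNode(&v1.Node{ObjectMeta: metav1.ObjectMeta{
        Name:   "node-hello",
        Labels: map[string]string{"node": "hello"},
    }})
    if st := pl.Filter(context.Background(), nil, pod, labeled); st.IsSuccess() {
        t.Fatalf("expected node with label node=hello to be filtered out")
    }

    // A node without the label must pass (a nil status means Success).
    plain := framework.NewNodeInfo()
    plain.SetNode(&v1.Node{ObjectMeta: metav1.ObjectMeta{Name: "node-plain"}})
    if st := pl.Filter(context.Background(), nil, pod, plain); !st.IsSuccess() {
        t.Fatalf("expected unlabeled node to pass, got %v", st)
    }
}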
Writing main.go
package main

import (
    "os"

    "myscheduler/node-filter-label/plugins"

    "k8s.io/component-base/cli"
    "k8s.io/component-base/logs"
    _ "k8s.io/component-base/metrics/prometheus/clientgo" // for client-go metric registration
    _ "k8s.io/component-base/metrics/prometheus/version"  // for version metric registration
    "k8s.io/kubernetes/cmd/kube-scheduler/app"
)

func main() {
    // Register the out-of-tree plugin with the standard kube-scheduler command.
    command := app.NewSchedulerCommand(app.WithPlugin(plugins.SchedulerName, plugins.New))

    logs.InitLogs()
    defer logs.FlushLogs()

    code := cli.Run(command)
    os.Exit(code)
}
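app.NewSchedulerCommand together with app.WithPlugin simply registers our out-of-tree plugin in the scheduler framework's plugin registry; everything else (command-line flags, leader election, the --config flag we use below) is inherited from the stock kube-scheduler, so the resulting binary behaves like a normal kube-scheduler that also knows about NodeFilterLabel.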
Writing the KubeSchedulerConfiguration
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: "/etc/kubernetes/scheduler.conf"
profiles:
  - schedulerName: nodelabelfilter
    plugins:
      filter:
        enabled:
          - name: NodeFilterLabel
        disabled:
          - name: "*"
Building
Note that the go.mod and go.sum produced by a plain go mod tidy cannot be used directly: k8s.io/kubernetes is not meant to be consumed as a library, and its staging dependencies (k8s.io/api, k8s.io/apimachinery, and so on) resolve to invalid v0.0.0 versions, so the build fails.
Instead, go to kubernetes-sigs/scheduler-plugins, find the branch matching your Kubernetes version (for example 1.29), and copy that branch's go.mod, which carries the required replace directives, into the project.
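For reference, what makes the copied go.mod work is its replace block, which pins the k8s.io staging repositories to real versions instead of the unusable v0.0.0 placeholders. A heavily shortened, illustrative sketch (the version numbers are examples; keep the module line as myscheduler and take the actual versions from the scheduler-plugins branch you copied):

module myscheduler

go 1.21

require k8s.io/kubernetes v1.29.0

replace (
    k8s.io/api => k8s.io/api v0.29.0
    k8s.io/apimachinery => k8s.io/apimachinery v0.29.0
    k8s.io/apiserver => k8s.io/apiserver v0.29.0
    k8s.io/client-go => k8s.io/client-go v0.29.0
    k8s.io/component-base => k8s.io/component-base v0.29.0
    k8s.io/kube-scheduler => k8s.io/kube-scheduler v0.29.0
    // ...plus one replace line for every other k8s.io staging repo,
    // exactly as listed in the scheduler-plugins go.mod
)

After copying, you will typically need to run go mod tidy again so go.sum is regenerated against the pinned versions. With the dependencies sorted out, the binary builds normally: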
go build -o nodelabelfilter main.go
Running locally for debugging
Once the binary is built, you can debug it by running it directly on any machine; there is no need to package it as a workload or Pod inside the cluster. The machine only needs a copy of scheduler.conf (which can be taken from a kube-master node) and network access to the Kubernetes API server.
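For example, to fetch the kubeconfig from the master (the host name here is illustrative):
scp root@k8s-master:/etc/kubernetes/scheduler.conf /etc/kubernetes/scheduler.conf
Then start the scheduler, pointing it at the configuration file written earlier: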
./nodelabelfilter --leader-elect=false --config nodelabelfilter.yaml
I1117 09:03:19.700978 25785 serving.go:380] Generated self-signed cert in-memory
W1117 09:03:20.057352 25785 authentication.go:339] No authentication-kubeconfig provided in order to lookup client-ca-file in configmap/extension-apiserver-authentication in kube-system, so client certificate authentication won't work.
W1117 09:03:20.057402 25785 authentication.go:363] No authentication-kubeconfig provided in order to lookup requestheader-client-ca-file in configmap/extension-apiserver-authentication in kube-system, so request-header client certificate authentication won't work.
W1117 09:03:20.057418 25785 authorization.go:193] No authorization-kubeconfig provided, so SubjectAccessReview of authorization tokens won't work.
I1117 09:03:20.073613 25785 server.go:154] "Starting Kubernetes Scheduler" version="v0.0.0-master+$Format:%H$"
I1117 09:03:20.073977 25785 server.go:156] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
I1117 09:03:20.079517 25785 secure_serving.go:213] Serving securely on [::]:10259
I1117 09:03:20.081147 25785 tlsconfig.go:240] "Starting DynamicServingCertificateController"
Testing
Assume the cluster has one master and two worker nodes. Add the label node=hello to the master and to node2:
kubectl label nodes k8s-master node=hello
kubectl label nodes k8s-node2 node=hello
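You can verify the labels with kubectl; the -L flag adds a column showing the value of the node label:
kubectl get nodes -L node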
Create a Pod and watch where it gets scheduled:
apiVersion: v1
kind: Pod
metadata:
  name: nginx-labelfilter
spec:
  schedulerName: nodelabelfilter
  containers:
    - image: registry.cn-beijing.aliyuncs.com/fpf_devops/nginx:1.24
      name: nginx
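Apply the manifest and check which node the Pod lands on (the manifest file name here is illustrative):
kubectl apply -f nginx-labelfilter.yaml
kubectl get pod nginx-labelfilter -o wide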
Check the custom plugin's logs:
I1117 09:08:32.807891 25785 node_filter_label.go:27] "node pass NodeFilterLabel filter" pod_name="nginx-labelfilter" current node="k8s-node1"
I1117 09:08:32.807932 25785 node_filter_label.go:23] "node failed to pass NodeFilterLabel filter" pod_name="nginx-labelfilter" current node="k8s-node2"
I1117 09:08:32.807888 25785 node_filter_label.go:23] "node failed to pass NodeFilterLabel filter" pod_name="nginx-labelfilter" current node="k8s-master"
As you can see, the master and node2 were filtered out and cannot host the Pod, so it can only be scheduled onto node1.
Deploying the custom scheduler
Building the image
Starting with v1.24, Kubernetes dropped Docker (dockershim) in favor of containerd as the container runtime, so the familiar docker build command is no longer available on cluster nodes; we need a containerd-compatible build tool instead.
containerd has a sub-project, nerdctl, which provides a Docker-compatible CLI for managing local images and containers.
nerdctl ships in two flavors:
- The minimal release contains only the nerdctl binary; nerdctl build is not available and running it will fail.
- The full release additionally bundles containerd-related binaries such as buildkitd, buildctl, ctr, and runc, as well as the CNI plugin binaries.
After downloading and extracting the full release, copy or symlink nerdctl and buildkitd from bin/ into /usr/local/bin/, copy lib/systemd/system/buildkit.service into /etc/systemd/system/, and start buildkitd:
systemctl enable buildkit.service --now
Write the Dockerfile:
FROM registry.cn-beijing.aliyuncs.com/fpf_devops/golang:1.23.0
WORKDIR /opt
ADD nodelabelfilter /opt
ENTRYPOINT ["/opt/nodelabelfilter", "--leader-elect=false", "--config=/opt/nodelabelfilter.yaml"]
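One thing to note: the ENTRYPOINT expects /opt/nodelabelfilter.yaml inside the container, but the Dockerfile above only adds the binary. You either need to add the configuration file as well (for example ADD manifests/nodelabelfilter.yaml /opt/, assuming you build from the node-filter-label directory) or mount it into the container at deploy time, for instance from a ConfigMap.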
Build the image:
nerdctl build -t nodelabelfilter:v1.0.0 .
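If you build the image directly on a cluster node and want kubelet to use it without going through a registry, build it into containerd's k8s.io namespace (nerdctl --namespace k8s.io build -t nodelabelfilter:v1.0.0 .); otherwise tag and push it to a registry the cluster can pull from.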
Deploying
Deploy the image as a workload in the cluster; see the follow-up article 用 Deployment 部署自定义调度器.