Depending on whether you need to write code, there are two ways to customize the scheduler:
- Without writing code: recombine and reconfigure the existing default plugins to define a new scheduler
- With code: implement the framework interfaces and develop your own scheduler plugin

This article walks through the second approach and builds a filter-type scheduler plugin.
As explained in the earlier article 深度解析scheduler原理 (an in-depth look at how the scheduler works), implementing a filter-type plugin mainly means implementing two interfaces, FilterPlugin and Plugin:
// https://github.com/kubernetes/kubernetes/blob/v1.28.4/pkg/scheduler/framework/interface.go#L349C1-L367C2
// FilterPlugin is an interface for Filter plugins. These plugins are called at the
// filter extension point for filtering out hosts that cannot run a pod.
// This concept used to be called 'predicate' in the original scheduler.
// These plugins should return "Success", "Unschedulable" or "Error" in Status.code.
// However, the scheduler accepts other valid codes as well.
// Anything other than "Success" will lead to exclusion of the given host from running the pod.
type FilterPlugin interface {
	Plugin
	// Filter is called by the scheduling framework.
	// All FilterPlugins should return "Success" to declare that
	// the given node fits the pod. If Filter doesn't return "Success",
	// it will return "Unschedulable", "UnschedulableAndUnresolvable" or "Error".
	// For the node being evaluated, Filter plugins should look at the passed
	// nodeInfo reference for this particular node's information (e.g., pods
	// considered to be running on the node) instead of looking it up in the
	// NodeInfoSnapshot because we don't guarantee that they will be the same.
	// For example, during preemption, we may pass a copy of the original
	// nodeInfo object that has some pods removed from it to evaluate the
	// possibility of preempting them to schedule the target pod.
	Filter(ctx context.Context, state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status
}

// Plugin is the parent type for all the scheduling framework plugins.
type Plugin interface {
	Name() string
}
In other words, our plugin has to implement the Filter and Name methods.
Next we will write a simple scheduler that filters out any Node carrying the label node=hello.
Writing the custom scheduler
Create an empty directory scheduler/node-filter-label and initialize the Go module:
mkdir -p scheduler/node-filter-label
cd scheduler && go mod init myscheduler
Create a manifests directory for the KubeSchedulerConfiguration and a plugins directory for the plugin code; the final layout looks like this:
tree scheduler/
scheduler/
├── go.mod
├── go.sum
├── node-filter-label
│ ├── main.go
│ ├── manifests
│ │ └── nodelabelfilter.yaml
│ └── plugins
│ └── node_filter_label.go
Write node_filter_label.go
package plugins

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/klog/v2"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const SchedulerName = "NodeFilterLabel"

type NodeFilterLabel struct{}

func (pl *NodeFilterLabel) Name() string {
	return SchedulerName
}

func (pl *NodeFilterLabel) Filter(ctx context.Context, _ *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	for k, v := range nodeInfo.Node().ObjectMeta.Labels {
		if k == "node" && v == "hello" {
			klog.InfoS("node failed to pass NodeFilterLabel filter", "pod_name", pod.Name, "current node", nodeInfo.Node().Name)
			return framework.NewStatus(framework.UnschedulableAndUnresolvable, "node has label node=hello")
		}
	}
	klog.InfoS("node pass NodeFilterLabel filter", "pod_name", pod.Name, "current node", nodeInfo.Node().Name)
	// A nil Status is treated as Success, i.e. the node passes this filter.
	return nil
}

func New(_ context.Context, _ runtime.Object, _ framework.Handle) (framework.Plugin, error) {
	return &NodeFilterLabel{}, nil
}
Write main.go
package main

import (
	"os"

	"myscheduler/node-filter-label/plugins"

	"k8s.io/component-base/cli"
	"k8s.io/component-base/logs"
	_ "k8s.io/component-base/metrics/prometheus/clientgo"
	_ "k8s.io/component-base/metrics/prometheus/version" // for version metric registration
	"k8s.io/kubernetes/cmd/kube-scheduler/app"
)

func main() {
	// Register our plugin with the scheduler framework under its name.
	command := app.NewSchedulerCommand(app.WithPlugin(plugins.SchedulerName, plugins.New))

	logs.InitLogs()
	defer logs.FlushLogs()

	code := cli.Run(command)
	os.Exit(code)
}
Write the KubeSchedulerConfiguration
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: "/etc/kubernetes/scheduler.conf"
profiles:
  - schedulerName: nodelabelfilter
    plugins:
      filter:
        enabled:
          - name: NodeFilterLabel
        disabled:
          - name: "*"
The disabled "*" entry turns off all default filter plugins for this profile, so only NodeFilterLabel runs at the filter extension point; Pods opt in to this profile by setting spec.schedulerName: nodelabelfilter.
Build
Note that the go.mod and go.sum generated by go mod tidy cannot be used directly: the build fails because k8s.io/kubernetes pins its staging modules (k8s.io/api, k8s.io/apimachinery, and so on) to the placeholder version v0.0.0, which cannot be resolved without replace directives.
Instead, go to kubernetes-sigs/scheduler-plugins, find the branch matching your k8s version (e.g. 1.29), and copy that branch's go.mod over.
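For reference, the relevant part of such a go.mod looks roughly like the sketch below. The module list and versions are illustrative assumptions; the real values must come from the scheduler-plugins branch you copy (assumed here to be the one tracking k8s 1.29), which pins every k8s.io staging module to a concrete v0.29.x release:
require k8s.io/kubernetes v1.29.0

replace (
	k8s.io/api => k8s.io/api v0.29.0
	k8s.io/apimachinery => k8s.io/apimachinery v0.29.0
	k8s.io/apiserver => k8s.io/apiserver v0.29.0
	k8s.io/client-go => k8s.io/client-go v0.29.0
	k8s.io/component-base => k8s.io/component-base v0.29.0
	k8s.io/kube-scheduler => k8s.io/kube-scheduler v0.29.0
	// ...plus one replace entry for every other k8s.io staging module...
)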
go build -o nodelabelfilter main.go
Running and debugging locally
Once the binary is built, you can debug it by running it directly on any machine; there is no need to package it as a workload or Pod inside the cluster. The machine only needs a scheduler.conf (which can be copied from a kube-master node) and network access to the k8s cluster.
./nodelabelfilter --leader-elect=false --config nodelabelfilter.yaml
I1117 09:03:19.700978 25785 serving.go:380] Generated self-signed cert in-memory
W1117 09:03:20.057352 25785 authentication.go:339] No authentication-kubeconfig provided in order to lookup client-ca-file in configmap/extension-apiserver-authentication in kube-system, so client certificate authentication won't work.
W1117 09:03:20.057402 25785 authentication.go:363] No authentication-kubeconfig provided in order to lookup requestheader-client-ca-file in configmap/extension-apiserver-authentication in kube-system, so request-header client certificate authentication won't work.
W1117 09:03:20.057418 25785 authorization.go:193] No authorization-kubeconfig provided, so SubjectAccessReview of authorization tokens won't work.
I1117 09:03:20.073613 25785 server.go:154] "Starting Kubernetes Scheduler" version="v0.0.0-master+$Format:%H$"
I1117 09:03:20.073977 25785 server.go:156] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
I1117 09:03:20.079517 25785 secure_serving.go:213] Serving securely on [::]:10259
I1117 09:03:20.081147 25785 tlsconfig.go:240] "Starting DynamicServingCertificateController"
Testing
Suppose the cluster has 1 master and 2 worker nodes. Add the label node=hello to the master and node2:
kubectl label nodes k8s-master node=hello
kubectl label nodes k8s-node2 node=hello
Create a Pod and watch how it gets scheduled:
apiVersion: v1
kind: Pod
metadata:
  name: nginx-labelfilter
spec:
  schedulerName: nodelabelfilter
  containers:
    - image: registry.cn-beijing.aliyuncs.com/fpf_devops/nginx:1.24
      name: nginx
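Assuming the manifest above is saved as nginx-labelfilter.yaml (the filename is arbitrary), apply it and check which node the Pod landed on:
kubectl apply -f nginx-labelfilter.yaml
kubectl get pod nginx-labelfilter -o wide   # the NODE column should show k8s-node1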
Check the custom plugin's logs:
I1117 09:08:32.807891 25785 node_filter_label.go:27] "node pass NodeFilterLabel filter" pod_name="nginx-labelfilter" current node="k8s-node1"
I1117 09:08:32.807932 25785 node_filter_label.go:23] "node failed to pass NodeFilterLabel filter" pod_name="nginx-labelfilter" current node="k8s-node2"
I1117 09:08:32.807888 25785 node_filter_label.go:23] "node failed to pass NodeFilterLabel filter" pod_name="nginx-labelfilter" current node="k8s-master"
As the logs show, the master and node2 were filtered out and cannot run the Pod; it can only be scheduled onto node1.
Deploying the custom scheduler
Build the image
Starting with v1.24, Kubernetes replaced docker with containerd as the underlying runtime, so the old docker build command is no longer available and we need the build tooling that goes with containerd.
containerd has a sub-project called nerdctl that is compatible with the docker CLI, letting you manage local images and containers just like the docker command.
nerdctl ships in a minimal and a full release:
- The minimal release contains only the nerdctl binary; nerdctl build is not available and running it fails with an error
- The full release contains not only nerdctl but also buildkitd, buildctl, ctr, runc and other containerd-related binaries, as well as the cni plugin binaries
Download the full release and extract it. Copy or symlink nerdctl and buildkitd from bin/ into /usr/local/bin/, copy systemd/system/buildkit.service from lib/ into /etc/systemd/system/, then enable and start buildkitd:
systemctl enable buildkit.service --now
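Putting these steps together, a sketch of the installation (the release version, architecture and extraction path are assumptions; substitute the latest full release for your platform):
wget https://github.com/containerd/nerdctl/releases/download/v1.7.6/nerdctl-full-1.7.6-linux-amd64.tar.gz
mkdir -p /opt/nerdctl-full
tar -xzf nerdctl-full-1.7.6-linux-amd64.tar.gz -C /opt/nerdctl-full
# Only nerdctl and buildkitd are needed for building images
ln -sf /opt/nerdctl-full/bin/nerdctl /usr/local/bin/nerdctl
ln -sf /opt/nerdctl-full/bin/buildkitd /usr/local/bin/buildkitd
cp /opt/nerdctl-full/lib/systemd/system/buildkit.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable buildkit.service --now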
Write the Dockerfile:
FROM registry.cn-beijing.aliyuncs.com/fpf_devops/golang:1.23.0
WORKDIR /opt
ADD nodelabelfilter /opt
ENTRYPOINT ["/opt/nodelabelfilter", "--leader-elect=false", "--config=/opt/nodelabelfilter.yaml"]
Build the image:
nerdctl build -t nodelabelfilter:v1.0.0 .
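Note that nerdctl keeps images in containerd namespaces, while the kubelet only looks in the k8s.io namespace, so an image built into the default namespace is not visible to the cluster. Either build directly into the k8s.io namespace on the node that will run the scheduler, or push the image to a registry the cluster can pull from (the registry address below is a placeholder):
# Option 1: build into the namespace the kubelet uses on this node
nerdctl --namespace k8s.io build -t nodelabelfilter:v1.0.0 .

# Option 2: push to a registry
nerdctl tag nodelabelfilter:v1.0.0 registry.example.com/nodelabelfilter:v1.0.0
nerdctl push registry.example.com/nodelabelfilter:v1.0.0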
Deploy
Run the image as a workload in the cluster; see the follow-up article 用 Deployment 部署自定义调度器 (deploying the custom scheduler with a Deployment). A rough sketch of such a Deployment follows below.
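The sketch below is not the referenced article's manifest, just a minimal illustration. It assumes the KubeSchedulerConfiguration is shipped in a hypothetical ConfigMap named nodelabelfilter-config (the Dockerfile above does not copy nodelabelfilter.yaml into the image), that the Pod runs on a control-plane node so it can mount /etc/kubernetes/scheduler.conf from the host, and that the image is visible to that node:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nodelabelfilter
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nodelabelfilter
  template:
    metadata:
      labels:
        app: nodelabelfilter
    spec:
      # Pin to a control-plane node so the host's scheduler.conf can be mounted
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
      containers:
        - name: nodelabelfilter
          image: nodelabelfilter:v1.0.0
          volumeMounts:
            - name: config            # hypothetical ConfigMap holding nodelabelfilter.yaml
              mountPath: /opt/nodelabelfilter.yaml
              subPath: nodelabelfilter.yaml
            - name: kubeconfig
              mountPath: /etc/kubernetes/scheduler.conf
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: nodelabelfilter-config
        - name: kubeconfig
          hostPath:
            path: /etc/kubernetes/scheduler.conf
            type: File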