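This article walks through deploying a complete Jaeger stack on Kubernetes (K8S) for distributed tracing. Because the Monitor tab in jaeger-ui depends on the OpenTelemetry Collector, the article also deploys that component and explains how it works.
Jaeger architecture
Jaeger consists of four components: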
- jaeger-agent
- jaeger-collector
- jaeger-query (UI)
- jaeger-spark-dependencies
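jaeger-agent runs next to the application, either on the same host or in the same pod. In practice its role is usually already covered by the application framework (go-zero, for example, ships with Jaeger support) or by the OpenTelemetry SDK (for Kotlin or Spring Boot applications), so a separate jaeger-agent does not need to be deployed.
jaeger-collector receives the traces. It supports several ingestion protocols; this article uses two of them: Zipkin (listening on 9411) and Jaeger over HTTP (listening on 14268).
Jaeger persists traces in a storage backend. Several backends are supported, such as Cassandra and Elasticsearch; this article uses elasticsearch-7.15.
Jaeger deployment
Deploying jaeger-collector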
kind: Deployment
apiVersion: apps/v1
metadata:
  name: jaeger-collector
  namespace: jaeger-test
  labels:
    app: jaeger
    app.kubernetes.io/component: collector
    app.kubernetes.io/name: jaeger
  annotations:
    deployment.kubernetes.io/revision: '15'
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
      app.kubernetes.io/component: collector
      app.kubernetes.io/name: jaeger
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: jaeger
        app.kubernetes.io/component: collector
        app.kubernetes.io/name: jaeger
      annotations:
        kubesphere.io/restartedAt: '2023-04-04T07:05:11.313Z'
    spec:
      volumes:
        - name: jaeger-configuration-volume
          configMap:
            name: jaeger-configuration
            items:
              - key: collector
                path: collector.yaml
            defaultMode: 420
      containers:
        - name: jaeger-collector
          image: jaegertracing/jaeger-collector:1.43
          args:
            - '--config-file=/conf/collector.yaml'
          ports:
            - containerPort: 14267
              protocol: TCP
            - containerPort: 14268
              protocol: TCP
            - containerPort: 9411
              protocol: TCP
            - containerPort: 14250
              protocol: TCP
          env:
            - name: SPAN_STORAGE_TYPE
              valueFrom:
                configMapKeyRef:
                  name: jaeger-configuration
                  key: span-storage-type
          resources: {}
          volumeMounts:
            - name: jaeger-configuration-volume
              mountPath: /conf
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      serviceAccountName: default
      serviceAccount: default
      securityContext: {}
      schedulerName: default-scheduler
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600
---
kind: Service
apiVersion: v1
metadata:
  name: jaeger-collector
  namespace: jaeger-test
  labels:
    app: jaeger-collector
  annotations:
spec:
  ports:
    - name: jaeger-collector-tchannel
      protocol: TCP
      port: 14267
      targetPort: 14267
    - name: jaeger-collector-http
      protocol: TCP
      port: 14268 ## jaeger clients send traces to this port over HTTP
      targetPort: 14268
    - name: jaeger-collector-zipkin
      protocol: TCP
      port: 9411 ## the OpenTelemetry agent sends traces to this port over HTTP (Zipkin format)
      targetPort: 9411
    - name: jaeger-grpc
      protocol: TCP
      port: 14250 ## the OpenTelemetry Collector sends traces to this port over gRPC
      targetPort: 14250
  selector:
    app: jaeger
    app.kubernetes.io/component: collector
    app.kubernetes.io/name: jaeger
  clusterIP: 10.96.73.35
  type: ClusterIP
  sessionAffinity: None
The collector.yaml referenced above is mounted from a ConfigMap into the container at /conf/collector.yaml:
## collector.yaml
es:
  server-urls: http://es.xxx.io ## an existing Elasticsearch instance is required
  username: elastic
  password: '***********'
exporters:
  opentelemetry:
    endpoint: "otel-collector:55678"
collector:
  zipkin:
    http-port: 9411 ## accept traces in Zipkin format
jaeger:
  http:
    host-port: 0.0.0.0:14269 ## jaeger admin HTTP server (not used)
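The Deployments in this article read their settings from a ConfigMap named jaeger-configuration, which the text never shows. The following is only a sketch of what it might look like, inferred from the keys the manifests reference (span-storage-type, collector, query, and the otel-collector-config.yml subPath used by the otel-collector Deployment). The content of the query key is an assumption that simply mirrors the collector's es settings.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: jaeger-configuration
  namespace: jaeger-test
data:
  # Read by the collector and query Deployments as SPAN_STORAGE_TYPE.
  span-storage-type: elasticsearch
  # Mounted as /conf/collector.yaml (abbreviated here; the full file is shown above).
  collector: |
    es:
      server-urls: http://es.xxx.io
      username: elastic
      password: '***********'
    collector:
      zipkin:
        http-port: 9411
  # Mounted as /conf/query.yaml. Assumed content: the article does not show this file.
  query: |
    es:
      server-urls: http://es.xxx.io
      username: elastic
      password: '***********'
  # Mounted into the otel-collector pod via subPath.
  otel-collector-config.yml: |
    # placeholder: paste the otel-collector configuration from the SPM section here
```

Keeping all components in one ConfigMap matches how the manifests select individual keys through items and subPath, but splitting it per component would work just as well.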
Deploying jaeger-query
jaeger-query serves the front-end UI.
kind: Deployment
apiVersion: apps/v1
metadata:
  name: jaeger-query
  namespace: jaeger-test
  labels:
    app: jaeger
    jaeger-infra: query-deployment
  annotations:
    deployment.kubernetes.io/revision: '5'
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
      jaeger-infra: query-pod
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: jaeger
        jaeger-infra: query-pod
      annotations:
        prometheus.io/port: '16686'
        prometheus.io/scrape: 'true'
    spec:
      volumes:
        - name: jaeger-configuration-volume
          configMap:
            name: jaeger-configuration
            items:
              - key: query
                path: query.yaml
            defaultMode: 420
      containers:
        - name: jaeger-query
          image: jaegertracing/jaeger-query:1.43
          args:
            - '--config-file=/conf/query.yaml'
          ports:
            - containerPort: 16686
              protocol: TCP
            - containerPort: 16685
              protocol: TCP
            - containerPort: 16687
              protocol: TCP
          env:
            - name: SPAN_STORAGE_TYPE
              valueFrom:
                configMapKeyRef:
                  name: jaeger-configuration
                  key: span-storage-type
            - name: METRICS_STORAGE_TYPE
              value: prometheus
            - name: PROMETHEUS_SERVER_URL
              value: 'http://10.50.89.17:8080'
          resources: {}
          volumeMounts:
            - name: jaeger-configuration-volume
              mountPath: /conf
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      securityContext: {}
      schedulerName: default-scheduler
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600
---
kind: Service
apiVersion: v1
metadata:
  name: jaeger-query
  namespace: jaeger-test
  labels:
    app: jaeger-query
  annotations:
spec:
  ports:
    - name: jaeger-grpc
      protocol: TCP
      port: 16685
      targetPort: 16685
    - name: jaeger-query
      protocol: TCP
      port: 16686
      targetPort: 16686
    - name: jaeger-admin
      protocol: TCP
      port: 16687
      targetPort: 16687
  selector:
    app: jaeger
    jaeger-infra: query-pod
  clusterIP: 10.96.161.39
  type: ClusterIP
  sessionAffinity: None
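Two of the environment variables matter in particular:
METRICS_STORAGE_TYPE: must be set to prometheus to enable the Monitor tab in the UI
PROMETHEUS_SERVER_URL: the Prometheus address from which the Monitor data is read
Finally, expose port 16686 of jaeger-query through an Ingress or a NodePort Service.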
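As an illustration of the NodePort option, a second Service along the following lines could expose the UI. This is a minimal sketch: the Service name and the nodePort value are assumptions, while the selector and port are taken from the jaeger-query manifests above.

```yaml
kind: Service
apiVersion: v1
metadata:
  name: jaeger-query-nodeport   # hypothetical name
  namespace: jaeger-test
spec:
  type: NodePort
  selector:
    app: jaeger
    jaeger-infra: query-pod
  ports:
    - name: jaeger-query
      protocol: TCP
      port: 16686
      targetPort: 16686
      nodePort: 30686   # assumed value; must lie in the cluster's NodePort range (30000-32767 by default)
```

With this in place the UI should be reachable at http://NODE_IP:30686 on any node.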
At this point there is no trace data yet.
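Deploying jaeger-spark-dependencies
This component computes the service dependency graph. It periodically reads traces from Elasticsearch, derives the dependency links between services, and writes the result back to the jaeger-dependencies index in Elasticsearch.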
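Since the job has to run periodically, one way to deploy it is as a CronJob. The manifest below is a sketch under assumptions: the resource name, schedule and untagged image reference are placeholders, and the STORAGE, ES_NODES, ES_USERNAME and ES_PASSWORD variables follow the spark-dependencies project's README as I understand it, so verify them against the image version you use.

```yaml
apiVersion: batch/v1          # CronJob in batch/v1 needs Kubernetes 1.21 or later
kind: CronJob
metadata:
  name: jaeger-spark-dependencies   # hypothetical name
  namespace: jaeger-test
spec:
  schedule: "55 23 * * *"     # assumed schedule: once a day, shortly before midnight
  concurrencyPolicy: Forbid   # never run two aggregation jobs at once
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: spark-dependencies
              image: jaegertracing/spark-dependencies   # pin a tag that suits your cluster
              env:
                - name: STORAGE
                  value: elasticsearch
                - name: ES_NODES
                  value: http://es.xxx.io
                - name: ES_USERNAME
                  value: elastic
                - name: ES_PASSWORD
                  value: '***********'
```

Running late in the day is a deliberate choice here, on the assumption that the job aggregates the current day's traces by default.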
Instrumenting applications
go-zero applications
Add the Jaeger-related section (Telemetry) to the configuration file:
Name: Devops
Host: 0.0.0.0
Port: 8888
Timeout: 60000
Log:
  Encoding: plain
  Level: debug
Prometheus:
  Host: 0.0.0.0
  Port: 10990
  Path: /metrics
Namespace: jaeger-test
Kubeconfig: ./kubeconfig
Environment: local
Telemetry:
  Name: fiops-devops
  Endpoint: http://jaeger-collector.jaeger-test:14268/api/traces
  Sampler: 1.0
  Batcher: jaeger
Then check the traces in the UI.
OpenTelemetry applications
Add the following to the JVM startup options:
-Dotel.propagators=b3
-Dotel.instrumentation.common.default-enabled=true
-javaagent:/path/to/your/opentelemetry-javaagent.jar
-Dotel.instrumentation.common.db-statement-sanitizer.enabled=false
-Dotel.instrumentation.redisson.enabled=false
-Dotel.metrics.exporter=none
-Dotel.traces.exporter=zipkin
-Dotel.exporter.zipkin.endpoint=http://jaeger-collector.jaeger-test:9411/api/v2/spans
Note that the final option, zipkin.endpoint, points at port 9411 of jaeger-collector.
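When the application runs in a pod, these options can be passed without changing the image's start command by setting the JVM's JAVA_TOOL_OPTIONS environment variable. The fragment below is a sketch: the container name and image are hypothetical, only a subset of the flags is repeated, and the agent jar is assumed to be present in the image already.

```yaml
# Fragment of the application's Deployment (container spec only).
containers:
  - name: my-app                                  # hypothetical container name
    image: registry.example.com/my-app:latest     # hypothetical image
    env:
      - name: JAVA_TOOL_OPTIONS                   # picked up by the JVM at startup
        # The folded scalar (>-) joins the lines below with single spaces.
        value: >-
          -javaagent:/path/to/your/opentelemetry-javaagent.jar
          -Dotel.propagators=b3
          -Dotel.metrics.exporter=none
          -Dotel.traces.exporter=zipkin
          -Dotel.exporter.zipkin.endpoint=http://jaeger-collector.jaeger-test:9411/api/v2/spans
```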
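Then check the traces in the UI.
Elasticsearch configuration
Jaeger creates three index templates in Elasticsearch:
- jaeger-service: stores every serviceName collected so far
- jaeger-span: stores the trace data; each document is one span
- jaeger-dependencies: stores the dependency graph computed by the Spark job
Index management is covered here in two parts: the default date-based indices (with rollover as an option) and ILM.
Date-based indices and rollover (default)
By default Jaeger creates one date-suffixed index per day, for example jaeger-span-2023-04-04. A document in it looks like this: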
{
"_index" : "jaeger-span-2023-04-04",
"_type" : "_doc",
"_id" : "mu72SocBT2RQOvh7w_S0",
"_score" : 1.0,
"_source" : {
"traceID" : "8323e388f8a172e5e404e6873906f4e4",
"spanID" : "229debb8161b9693",
"operationName" : "/shelltrade/(authenticate jwttokenfilterwithoutobject)/loadbycriteria",
"references" : [
{
"refType" : "CHILD_OF",
"traceID" : "8323e388f8a172e5e404e6873906f4e4",
"spanID" : "453b0488ff74aca4"
}
],
"startTime" : 1680589895699132,
"startTimeMillis" : 1680589895699,
"duration" : 26024,
"tags" : [
{
"key" : "http.user_agent",
"type" : "string",
"value" : "Ktor client"
},
{
"key" : "net.host.name",
"type" : "string",
"value" : "trd-http-server-6644f77644-gdx4x"
}
],
"logs" : [ ],
"process" : {
"serviceName" : "trd-http-server",
"tags" : [
{
"key" : "ip",
"type" : "int64",
"value" : "168464409"
}
]
}
}
}
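Rollover, which switches to a new index based on conditions such as index age instead of the calendar day, is configured separately. Reference:
https://www.jaegertracing.io/docs/1.43/deployment/#elasticsearch-rollover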
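For readers who choose rollover rather than ILM, the rollover action has to be triggered periodically. The CronJob below is a sketch based on my reading of the Jaeger documentation linked above: the rollover action and the CONDITIONS variable come from there, while the resource name, schedule and condition value are assumptions. It also presumes the indices were initialized with the init action and that collector and query run with --es.use-aliases=true.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: jaeger-es-rollover-cron   # hypothetical name
  namespace: jaeger-test
spec:
  schedule: "0 * * * *"           # assumed: evaluate the conditions once an hour
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: es-rollover
              image: jaegertracing/jaeger-es-rollover:1.43
              args:
                - '--es.username=elastic'
                - '--es.password=***********'
                - rollover            # roll the write alias over when CONDITIONS are met
                - 'http://es.xxx.io'
              env:
                - name: CONDITIONS
                  value: '{"max_age": "1d"}'   # assumed rollover condition
```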
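ILM (recommended)
Creating the policy
Manually create an ILM policy in ES. It must be named jaeger-ilm-policy. The policy below rolls the write index over once it is one day old and deletes an index three days after it has been rolled over: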
curl -X PUT http://ESHOST:9200/_ilm/policy/jaeger-ilm-policy \
-H 'Content-Type: application/json; charset=utf-8' \
--data-binary @- << EOF
{
"policy": {
"phases": {
"hot": {
"min_age": "0ms",
"actions": {
"rollover": {
"max_age": "1d"
},
"set_priority": {
"priority": 100
}
}
},
"delete": {
"min_age": "3d",
"actions": {
"delete": {}
}
}
}
}
}
EOF
Initializing the indices
Create a Job in K8S with the following configuration:
apiVersion: batch/v1
kind: Job
metadata:
  name: jaeger-es-rollover
  namespace: jaeger-test
spec:
  template:
    spec:
      containers:
        - name: es-rollover-container
          image: jaegertracing/jaeger-es-rollover:1.43
          args:
            - '--es.username=elastic'
            - '--es.password=***********'
            - init
            - 'http://es.xxx.io'
          env:
            - name: ES_USE_ILM
              value: 'true'
          resources:
            limits:
              cpu: 200m
              memory: 200Mi
            requests:
              cpu: 100m
              memory: 100Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: Always
      restartPolicy: Never
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      serviceAccountName: default
      serviceAccount: default
      securityContext: {}
      schedulerName: default-scheduler
  parallelism: 1
  completions: 1
  backoffLimit: 6
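Configuring collector and query
First scale the jaeger-collector Deployment down to 0 replicas so that it stops writing to ES, then delete the three existing index templates and every index whose name starts with jaeger-. Next, add --es.use-ilm=true and --es.use-aliases=true to the startup arguments of both jaeger-collector and jaeger-query, or add the following to their YAML configuration files: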
es:
  use-ilm: true
  use-aliases: true
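After these steps, scale jaeger-collector back up and check that ES has recreated the three index templates and that they are automatically associated with the ILM policy created manually in the previous step.
Once indices with a -000001 suffix appear, the configuration is working.
ILM configuration reference:
https://www.jaegertracing.io/docs/1.43/deployment/#elasticsearch-ilm-support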
Service Performance Monitoring (SPM)
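Jaeger can also aggregate traces into RED (Request, Error, Duration) metrics and display them on the Monitor tab of the UI.
This feature depends on one more component, the OpenTelemetry Collector. Once it is in place, the overall flow is as follows:
Applications no longer send traces directly to jaeger-collector. They send them to the OpenTelemetry Collector (otel-collector below), which fans out in two directions. On one path, traces are forwarded unchanged to jaeger-collector, which stores them for display. On the other, an internal pipeline derives metrics such as calls_total and latency_bucket and exposes them at :8889/metrics. A Prometheus job scrapes that endpoint, and jaeger-query then reads the metrics back from Prometheus and renders them on the Monitor tab.
Reference: https://github.com/jaegertracing/jaeger/tree/main/docker-compose/monitor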
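On the application side, switching to this flow only means changing the export endpoint. For the go-zero service configured earlier, the Telemetry section might become the following. This is a sketch that assumes the otel-collector Service from the next section lives in the jaeger-test namespace and that its Jaeger thrift_http receiver (port 14278) accepts the same /api/traces path as jaeger-collector.

```yaml
# go-zero configuration fragment
Telemetry:
  Name: fiops-devops
  # Previously http://jaeger-collector.jaeger-test:14268/api/traces
  Endpoint: http://otel-collector.jaeger-test:14278/api/traces
  Sampler: 1.0
  Batcher: jaeger
```

Java applications using the OpenTelemetry agent would change otel.exporter.zipkin.endpoint in the same way, pointing it at port 9411 of otel-collector.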
Deploying the OpenTelemetry Collector
kind: Deployment
apiVersion: apps/v1
metadata:
  name: otel-collector
  namespace: jaeger-test
  labels:
    app: otel-collector
  annotations:
    deployment.kubernetes.io/revision: '51'
spec:
  replicas: 1
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: otel-collector
      annotations:
        kubesphere.io/restartedAt: '2023-04-04T09:05:57.517Z'
    spec:
      volumes:
        - name: otel-collector-volume
          configMap:
            name: jaeger-configuration
            defaultMode: 420
      containers:
        - name: otel-collector-container
          image: otel/opentelemetry-collector-contrib:0.74.0
          args:
            - '--config'
            - /etc/otel-collector-config.yml
          ports:
            - name: otel-port
              containerPort: 4317 ## not used in this article
              protocol: TCP
            - name: metrics
              containerPort: 8888 ## otel-collector's own metrics
              protocol: TCP
            - name: exporter ## the span metrics computed by otel-collector are exposed on this port
              containerPort: 8889
              protocol: TCP
            - name: collector ## applications send Jaeger-format traces to otel-collector:14278
              containerPort: 14278
              protocol: TCP
            - name: zipkin ## applications send Zipkin-format traces to 9411
              containerPort: 9411
              protocol: TCP
          resources:
            limits:
              cpu: 200m
              memory: 500Mi
            requests:
              cpu: 100m
              memory: 200Mi
          volumeMounts:
            - name: otel-collector-volume
              mountPath: /etc/otel-collector-config.yml
              subPath: otel-collector-config.yml
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      serviceAccountName: default
      serviceAccount: default
      securityContext: {}
      schedulerName: default-scheduler
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600
---
kind: Service
apiVersion: v1
metadata:
  name: otel-collector
  namespace: jaeger-test
  labels:
    app: otel-collector
  annotations:
    kubesphere.io/creator: fanpengfei
spec:
  ports:
    - name: http-4317
      protocol: TCP
      port: 4317
      targetPort: 4317
    - name: http-8888
      protocol: TCP
      port: 8888
      targetPort: 8888
    - name: http-14278
      protocol: TCP
      port: 14278
      targetPort: 14278
    - name: http-8889
      protocol: TCP
      port: 8889
      targetPort: 8889
    - name: http-55678
      protocol: TCP
      port: 55678
      targetPort: 55678
    - name: http-zipkin
      protocol: TCP
      port: 9411
      targetPort: 9411
  selector:
    app: otel-collector
  clusterIP: 10.96.7.76
  type: ClusterIP
  sessionAffinity: None
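The otel-collector configuration file is shown below. It is built from three kinds of components, receivers, exporters and processors, which are wired together under service.pipelines. The receivers section defines a Jaeger input (listening on 14278), a Zipkin input (listening on 9411) and an OTLP input (not used in this article). The exporters section defines three outputs: prometheus (Prometheus scrapes the metrics from http://xx:8889/metrics on this collector), zipkin (pointing at the collector's own local 9411 listener, intended for the metric computation of traces that arrive in Zipkin format), and jaeger (traces are forwarded unchanged to jaeger-collector). Note that the traces pipeline in this file only references the jaeger exporter.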
## otel-collector-config.yml
receivers:
  jaeger:
    protocols:
      thrift_http:
        endpoint: "0.0.0.0:14278"
  zipkin:
    endpoint: "0.0.0.0:9411"
  otlp:
    protocols:
      grpc:
      http:
  ## Dummy receiver that's never used, because a pipeline is required to have one.
  otlp/spanmetrics:
    protocols:
      grpc:
        endpoint: "localhost:65535"
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  zipkin:
    endpoint: "http://localhost:9411/api/v2/spans"
  jaeger:
    endpoint: "jaeger-collector:14250"
    tls:
      insecure: true
processors:
  batch:
  spanmetrics:
    metrics_exporter: prometheus
service:
  pipelines:
    traces:
      receivers: [jaeger, zipkin]
      processors: [spanmetrics, batch]
      exporters: [jaeger]
    ## The exporter name in this pipeline must match the spanmetrics.metrics_exporter name.
    ## The receiver is just a dummy and never used; added to pass validation requiring at least one receiver in a pipeline.
    metrics/spanmetrics:
      receivers: [otlp/spanmetrics]
      exporters: [prometheus]
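The spanmetrics processor above runs with its defaults. If the default histogram buckets or labels do not suit your services, the processor section can be tuned. The fragment below is a sketch: latency_histogram_buckets and dimensions are options of the contrib spanmetrics processor as far as I know for this collector version, and the concrete bucket values and attribute names are assumptions chosen for illustration.

```yaml
processors:
  batch:
  spanmetrics:
    metrics_exporter: prometheus
    # Assumed bucket boundaries; pick values that match your latency profile.
    latency_histogram_buckets: [5ms, 10ms, 50ms, 100ms, 250ms, 500ms, 1s, 5s]
    # Optional: extra span attributes to expose as metric labels.
    dimensions:
      - name: http.method
      - name: http.status_code
```

Each extra dimension multiplies the number of time series, so it is worth adding them sparingly.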
Connecting Prometheus
Once applications are sending traces to otel-collector, metrics appear on port 8889 of otel-collector, which shows that the aggregation is working:
curl http://otel-collector:8889/metrics
# HELP calls_total
# TYPE calls_total counter
calls_total{operation="/callback",service_name="auth-http-server",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET"} 1
calls_total{operation="/checkjwtinredis",service_name="auth-http-server",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET"} 840
calls_total{operation="/getobjectlist",service_name="auth-http-server",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET"} 74
# HELP latency
# TYPE latency histogram
latency_bucket{operation="/callback",service_name="auth-http-server",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET",le="2"} 0
latency_bucket{operation="/callback",service_name="auth-http-server",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET",le="4"} 0
latency_bucket{operation="/callback",service_name="auth-http-server",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET",le="6"} 0
latency_sum{operation="/callback",service_name="auth-http-server",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET"} 192.146
latency_count{operation="/callback",service_name="auth-http-server",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET"} 1
latency_bucket{operation="/checkjwtinredis",service_name="auth-http-server",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET",le="2"} 816
latency_bucket{operation="/checkjwtinredis",service_name="auth-http-server",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET",le="4"} 832
latency_bucket{operation="/checkjwtinredis",service_name="auth-http-server",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET",le="6"} 834
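The scrape job can now be added to prometheus.yaml:
## prometheus.yaml
scrape_configs:
  - job_name: aggregated-trace-metrics
    static_configs:
      - targets: ['otel-exporter.jaeger-test.xxx.io']
otel-exporter.jaeger-test.xxx.io is the Ingress address that my K8S cluster exposes for otel-collector:8889.
After Prometheus reloads its configuration, the data can be viewed on the Monitor tab of jaeger-ui.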
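To sanity-check the scraped data outside of jaeger-ui, the RED metrics can also be derived directly in Prometheus. The rule file below is a sketch that uses only the metric and label names visible in the sample output above; the file name and record names are hypothetical, and STATUS_CODE_ERROR is assumed to be the label value the processor emits for failed spans.

```yaml
## red-rules.yaml (hypothetical file, to be listed under rule_files in prometheus.yaml)
groups:
  - name: jaeger-spm-red
    rules:
      # Request rate per service and operation.
      - record: service_operation:calls:rate5m
        expr: sum by (service_name, operation) (rate(calls_total[5m]))
      # Error ratio: share of calls whose span status is an error.
      - record: service_operation:calls_error:ratio5m
        expr: >-
          sum by (service_name, operation) (rate(calls_total{status_code="STATUS_CODE_ERROR"}[5m]))
          /
          sum by (service_name, operation) (rate(calls_total[5m]))
      # 95th percentile duration, estimated from the latency histogram buckets.
      - record: service_operation:latency:p95_5m
        expr: histogram_quantile(0.95, sum by (le, service_name, operation) (rate(latency_bucket[5m])))
```

The same expressions can be pasted into the Prometheus expression browser without defining any rules.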