Deploying the Cloud-Native Tracing Stack Jaeger + OpenTelemetry Collector: A Detailed Guide


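Distributed tracing, also called full-link monitoring or APM (Application Performance Management), is a familiar topic, so this article skips the concepts and principles. Open source options include SkyWalking, Zipkin and Elastic APM, and commercial products include Tingyun (基调听云) among others. In the cloud-native space there is also Jaeger, a CNCF graduated project that is developing just as quickly.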

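This article walks through deploying a complete Jaeger setup on Kubernetes to get distributed tracing. Because the Monitor tab in jaeger-ui depends on the OpenTelemetry Collector, the article deploys that as well and explains how it works.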

Jaeger architecture

Jaeger consists of four components:

  • jaeger-agent
  • jaeger-collector
  • jaeger-query(ui)
  • jaeger-spark-dependencies

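jaeger-agent runs on the same host as the application, or in the same pod. In practice its role is usually already covered by the application framework (go-zero, for example, ships with Jaeger support) or by the OpenTelemetry SDK (for Kotlin or Spring Boot applications, for example), so there is no need to deploy jaeger-agent separately.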

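jaeger-collector is the component that receives traces. It supports several ingestion protocols; this article uses two of them: Zipkin (listening on 9411) and Jaeger (listening on 14268).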

Jaeger persists traces in a storage backend and supports several of them. This article uses Elasticsearch 7.15 for storage.

Deploying Jaeger

Deploying jaeger-collector

kind: Deployment
apiVersion: apps/v1
metadata:
  name: jaeger-collector
  namespace: jaeger-test
  labels:
    app: jaeger
    app.kubernetes.io/component: collector
    app.kubernetes.io/name: jaeger
  annotations:
    deployment.kubernetes.io/revision: '15'
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
      app.kubernetes.io/component: collector
      app.kubernetes.io/name: jaeger
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: jaeger
        app.kubernetes.io/component: collector
        app.kubernetes.io/name: jaeger
      annotations:
        kubesphere.io/restartedAt: '2023-04-04T07:05:11.313Z'
    spec:
      volumes:
        - name: jaeger-configuration-volume
          configMap:
            name: jaeger-configuration
            items:
              - key: collector
                path: collector.yaml
            defaultMode: 420
      containers:
        - name: jaeger-collector
          image: >-
            jaegertracing/jaeger-collector:1.43
          args:
            - '--config-file=/conf/collector.yaml'
          ports:
            - containerPort: 14267
              protocol: TCP
            - containerPort: 14268
              protocol: TCP
            - containerPort: 9411
              protocol: TCP
            - containerPort: 14250
              protocol: TCP
          env:
            - name: SPAN_STORAGE_TYPE
              valueFrom:
                configMapKeyRef:
                  name: jaeger-configuration
                  key: span-storage-type
          resources: {}
          volumeMounts:
            - name: jaeger-configuration-volume
              mountPath: /conf
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      serviceAccountName: default
      serviceAccount: default
      securityContext: {}
      schedulerName: default-scheduler
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600
---
kind: Service
apiVersion: v1
metadata:
  name: jaeger-collector
  namespace: jaeger-test
  labels:
    app: jaeger-collector
  annotations:
spec:
  ports:
    - name: jaeger-collector-tchannel
      protocol: TCP
      port: 14267
      targetPort: 14267
    - name: jaeger-collector-http
      protocol: TCP
      port: 14268  ## jaeger agent sends traces to this port over HTTP
      targetPort: 14268
    - name: jaeger-collector-zipkin
      protocol: TCP
      port: 9411  ## OpenTelemetry sends traces to this port over HTTP
      targetPort: 9411
    - name: jaeger-grpc
      protocol: TCP
      port: 14250  ## the OpenTelemetry Collector sends traces to this port over gRPC
      targetPort: 14250
  selector:
    app: jaeger
    app.kubernetes.io/component: collector
    app.kubernetes.io/name: jaeger
  clusterIP: 10.96.73.35
  type: ClusterIP
  sessionAffinity: None

The collector.yaml referenced above is mounted into the container at /conf/collector.yaml from a ConfigMap.

## collector.yaml
es:
  server-urls: http://es.xxx.io ## an existing Elasticsearch cluster is required
  username: elastic
  password: '***********'
exporters:
  opentelemetry:
    endpoint: "otel-collector:55678"
collector:
  zipkin:
    http-port: 9411  ## collect traces in Zipkin format
  jaeger:
    http:
      host-port: 0.0.0.0:14269  ## jaeger admin HTTP server (not used here)

Deploying jaeger-query

jaeger-query serves the web UI.

kind: Deployment
apiVersion: apps/v1
metadata:
  name: jaeger-query
  namespace: jaeger-test
  labels:
    app: jaeger
    jaeger-infra: query-deployment
  annotations:
    deployment.kubernetes.io/revision: '5'
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
      jaeger-infra: query-pod
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: jaeger
        jaeger-infra: query-pod
      annotations:
        prometheus.io/port: '16686'
        prometheus.io/scrape: 'true'
    spec:
      volumes:
        - name: jaeger-configuration-volume
          configMap:
            name: jaeger-configuration
            items:
              - key: query
                path: query.yaml
            defaultMode: 420
      containers:
        - name: jaeger-query
          image: >-
            jaegertracing/jaeger-query:1.43
          args:
            - '--config-file=/conf/query.yaml'
          ports:
            - containerPort: 16686
              protocol: TCP
            - containerPort: 16685
              protocol: TCP
            - containerPort: 16687
              protocol: TCP
          env:
            - name: SPAN_STORAGE_TYPE
              valueFrom:
                configMapKeyRef:
                  name: jaeger-configuration
                  key: span-storage-type
            - name: METRICS_STORAGE_TYPE
              value: prometheus
            - name: PROMETHEUS_SERVER_URL
              value: 'http://10.50.89.17:8080'
          resources: {}
          volumeMounts:
            - name: jaeger-configuration-volume
              mountPath: /conf
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      securityContext: {}
      schedulerName: default-scheduler
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600
---
kind: Service
apiVersion: v1
metadata:
  name: jaeger-query
  namespace: jaeger-test
  labels:
    app: jaeger-query
  annotations:
spec:
  ports:
    - name: jaeger-grpc
      protocol: TCP
      port: 16685
      targetPort: 16685
    - name: jaeger-query
      protocol: TCP
      port: 16686
      targetPort: 16686
    - name: jaeger-admin
      protocol: TCP
      port: 16687
      targetPort: 16687
  selector:
    app: jaeger
    jaeger-infra: query-pod
  clusterIP: 10.96.161.39
  type: ClusterIP
  sessionAffinity: None

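Two of the environment variables matter here:

  • METRICS_STORAGE_TYPE: must be set to prometheus to enable the Monitor tab in the UI
  • PROMETHEUS_SERVER_URL: the address of the Prometheus server that the Monitor data is read from

Then expose port 16686 of jaeger-query through an Ingress or a NodePort.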

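At this point there is no trace data yet.

Deploying jaeger-spark-dependencies

This component computes the service dependency graph. It periodically reads traces from Elasticsearch, computes the dependency links between services, and writes the result back to the jaeger-dependencies index in Elasticsearch.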
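
To make "computing dependency links" concrete, here is a minimal Python sketch of the idea. It is not the actual Spark job. It assumes spans shaped like the documents Jaeger stores (spanID, references, process.serviceName); the function name dependency_links and the sample spans are hypothetical.

```python
from collections import Counter

def dependency_links(spans):
    """Count parent-service -> child-service calls across a batch of spans."""
    service_of = {s["spanID"]: s["process"]["serviceName"] for s in spans}
    links = Counter()
    for span in spans:
        child = span["process"]["serviceName"]
        for ref in span.get("references", []):
            parent = service_of.get(ref["spanID"])
            # Skip unknown parents and calls that stay inside one service.
            if ref["refType"] == "CHILD_OF" and parent and parent != child:
                links[(parent, child)] += 1
    return links

# Hypothetical spans: a gateway calling two backend services.
spans = [
    {"spanID": "a1", "references": [], "process": {"serviceName": "gateway"}},
    {"spanID": "b2", "references": [{"refType": "CHILD_OF", "spanID": "a1"}],
     "process": {"serviceName": "auth-http-server"}},
    {"spanID": "c3", "references": [{"refType": "CHILD_OF", "spanID": "a1"}],
     "process": {"serviceName": "trd-http-server"}},
    {"spanID": "d4", "references": [{"refType": "CHILD_OF", "spanID": "c3"}],
     "process": {"serviceName": "trd-http-server"}},
]

for (parent, child), calls in sorted(dependency_links(spans).items()):
    print(f"{parent} -> {child}: {calls}")
```

The internal call inside trd-http-server (span d4) is deliberately ignored, because a dependency graph only needs the edges that cross service boundaries.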

Instrumenting applications

go-zero applications

Add the Jaeger-related Telemetry block to the configuration file:

Name: Devops
Host: 0.0.0.0
Port: 8888
Timeout: 60000
Log:
  Encoding: plain
  Level: debug 
Prometheus:
  Host: 0.0.0.0
  Port: 10990
  Path: /metrics
Namespace: jaeger-test
Kubeconfig: ./kubeconfig
Environment: local
Telemetry:
  Name: fiops-devops
  Endpoint: http://jaeger-collector.jaeger-test:14268/api/traces
  Sampler: 1.0
  Batcher: jaeger

Then check the traces in the UI.

Applications using OpenTelemetry

Add the following startup arguments:

-Dotel.propagators=b3
-Dotel.instrumentation.common.default-enabled=true
-javaagent:/path/to/your/opentelemetry-javaagent.jar
-Dotel.instrumentation.common.db-statement-sanitizer.enabled=false
-Dotel.instrumentation.redisson.enabled=false
-Dotel.metrics.exporter=none 
-Dotel.traces.exporter=zipkin 
-Dotel.exporter.zipkin.endpoint=http://jaeger-collector.jaeger-test:9411/api/v2/spans

Note that the final zipkin.endpoint points at port 9411 of jaeger-collector.
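
Before instrumenting a real application, it can help to push one hand-made span at the same endpoint. The following Python sketch builds a minimal Zipkin v2 JSON span and posts it. The service name smoke-test, the operation /ping and the helper names are hypothetical, and the POST only works from a pod that can resolve jaeger-collector.jaeger-test.

```python
import json
import secrets
import time
import urllib.request

ZIPKIN_ENDPOINT = "http://jaeger-collector.jaeger-test:9411/api/v2/spans"

def build_test_span(service_name, operation, duration_us=1500):
    """Return a one-element list holding a minimal Zipkin v2 span."""
    return [{
        "traceId": secrets.token_hex(16),            # 32 lowercase hex characters
        "id": secrets.token_hex(8),                  # 16 lowercase hex characters
        "name": operation,
        "kind": "SERVER",
        "timestamp": int(time.time() * 1_000_000),   # epoch microseconds
        "duration": duration_us,                     # microseconds
        "localEndpoint": {"serviceName": service_name},
    }]

def send_spans(spans, endpoint=ZIPKIN_ENDPOINT):
    """POST the spans as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(spans).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status

if __name__ == "__main__":
    payload = build_test_span("smoke-test", "/ping")
    print(json.dumps(payload, indent=2))
    # Uncomment when running somewhere that can reach the collector:
    # print(send_spans(payload))
```

If the span is accepted, a service named smoke-test should become selectable in the Jaeger UI shortly afterwards.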

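Check the result in the UI.

Elasticsearch configuration

Jaeger creates three index templates in Elasticsearch:

  • jaeger-service: stores the serviceName of every service seen so far
  • jaeger-span: stores the trace data, one document per span
  • jaeger-dependencies: stores the dependency graph computed by the Spark job

Jaeger offers two ways to manage these indices: rollover and ILM.

Rollover (the default)

In this mode Jaeger by default creates one index per day with a date suffix, for example jaeger-span-2023-04-04. A document in it looks like this: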

{
    "_index" : "jaeger-span-2023-04-04",
    "_type" : "_doc",
    "_id" : "mu72SocBT2RQOvh7w_S0",
    "_score" : 1.0,
    "_source" : {
      "traceID" : "8323e388f8a172e5e404e6873906f4e4",
      "spanID" : "229debb8161b9693",
      "operationName" : "/shelltrade/(authenticate jwttokenfilterwithoutobject)/loadbycriteria",
      "references" : [
        {
          "refType" : "CHILD_OF",
          "traceID" : "8323e388f8a172e5e404e6873906f4e4",
          "spanID" : "453b0488ff74aca4"
        }
      ],
      "startTime" : 1680589895699132,
      "startTimeMillis" : 1680589895699,
      "duration" : 26024,
      "tags" : [
        {
          "key" : "http.user_agent",
          "type" : "string",
          "value" : "Ktor client"
        },
        {
          "key" : "net.host.name",
          "type" : "string",
          "value" : "trd-http-server-6644f77644-gdx4x"
        }
      ],
      "logs" : [ ],
      "process" : {
        "serviceName" : "trd-http-server",
        "tags" : [
          {
            "key" : "ip",
            "type" : "int64",
            "value" : "168464409"
          }
        ]
      }
    }
  }
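
Several fields in this document are raw numbers that are easier to read once decoded. The Python sketch below assumes that startTime and duration are in microseconds (consistent with startTimeMillis above), that the ip tag is an IPv4 address packed into an integer, and that the daily index date is taken in UTC. The helper name describe_span is hypothetical.

```python
import ipaddress
from datetime import datetime, timezone

def describe_span(source, index_prefix="jaeger-span"):
    """Turn the raw numeric fields of a stored span into readable values."""
    # Integer arithmetic avoids float rounding on the microsecond part.
    seconds, micros = divmod(source["startTime"], 1_000_000)
    start = datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=micros)
    tags = {tag["key"]: tag["value"] for tag in source["process"]["tags"]}
    return {
        "start": start.isoformat(),
        "duration_ms": source["duration"] / 1000,
        "ip": str(ipaddress.ip_address(int(tags["ip"]))),
        "daily_index": f"{index_prefix}-{start:%Y-%m-%d}",
    }

# Fields copied from the _source of the document shown above.
source = {
    "startTime": 1680589895699132,
    "duration": 26024,
    "process": {"tags": [{"key": "ip", "type": "int64", "value": "168464409"}]},
}

for key, value in describe_span(source).items():
    print(f"{key}: {value}")
```

For this span the decoded start date falls on 2023-04-04, which matches the index the document was found in.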

For rollover configuration, see:
https://www.jaegertracing.io/docs/1.43/deployment/#elasticsearch-rollover

ILM (recommended)

Create the policy

Create an ILM policy in Elasticsearch by hand. It must be named jaeger-ilm-policy:

curl -X PUT http://ESHOST:9200/_ilm/policy/jaeger-ilm-policy \
-H 'Content-Type: application/json; charset=utf-8' \
--data-binary @- << EOF
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_age": "1d"
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "delete": {
        "min_age": "3d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
EOF
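
It is worth working out how long data actually lives under this policy. Once an index has rolled over, Elasticsearch measures the min_age of later phases from the rollover time, not from index creation. The Python sketch below does that arithmetic on the same policy; parse_duration and oldest_data_age are hypothetical helper names, and the result is approximate because ILM evaluates policies on a polling interval.

```python
from datetime import timedelta

# Same phases as the policy above (set_priority omitted, it does not affect retention).
POLICY = {"policy": {"phases": {
    "hot": {"min_age": "0ms", "actions": {"rollover": {"max_age": "1d"}}},
    "delete": {"min_age": "3d", "actions": {"delete": {}}},
}}}

_UNITS = {"ms": "milliseconds", "d": "days", "h": "hours", "m": "minutes", "s": "seconds"}

def parse_duration(text):
    """Parse an Elasticsearch time value such as '1d' or '0ms'."""
    for suffix, unit in _UNITS.items():   # "ms" is checked before "m" and "s"
        if text.endswith(suffix):
            return timedelta(**{unit: int(text[:-len(suffix)])})
    raise ValueError(f"unsupported duration: {text!r}")

def oldest_data_age(policy):
    """Upper bound on the age of a span just before its index is deleted."""
    phases = policy["policy"]["phases"]
    rollover_after = parse_duration(phases["hot"]["actions"]["rollover"]["max_age"])
    delete_after = parse_duration(phases["delete"]["min_age"])
    # A span written right after index creation waits for rollover, then for deletion.
    return rollover_after + delete_after

print(oldest_data_age(POLICY))
```

So with rollover at 1d and delete at 3d, the oldest spans are kept for about four days, not three.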

Initialize the indices

Create a Job in Kubernetes with the following configuration:

apiVersion: batch/v1
kind: Job
metadata:
  name: jaeger-es-rollover
  namespace: jaeger-test
spec:
  template:
    spec:
      containers:
      - name: es-rollover-container
        image: jaegertracing/jaeger-es-rollover:1.43
        args:
          - '--es.username=elastic'
          - '--es.password=***********'
          - init
          - 'http://es.xxx.io'
        env:
          - name: ES_USE_ILM
            value: 'true'
        resources:
          limits:
            cpu: 200m
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 100Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        imagePullPolicy: Always
      restartPolicy: Never
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      serviceAccountName: default
      serviceAccount: default
      securityContext: {}
      schedulerName: default-scheduler
  parallelism: 1
  completions: 1
  backoffLimit: 6

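Configure the collector and query flags

First scale the jaeger-collector Deployment down to 0 replicas so that the collector stops writing to Elasticsearch, then delete the three existing index templates and every index whose name starts with jaeger-. Next, add --es.use-ilm=true --es.use-aliases=true to the startup arguments of jaeger-collector and jaeger-query, or add the following to their YAML configuration files: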

es:
  use-ilm: true
  use-aliases: true

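After these steps, scale jaeger-collector back up and check that Elasticsearch has recreated the three index templates and that they are already linked to the ILM policy created by hand earlier.

Once indices with the -000001 suffix appear, the configuration has taken effect.

For ILM configuration, see: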
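
The numeric suffix follows Elasticsearch's rollover naming: when an index name ends in a dash and a number, each rollover increments that number and zero-pads it to six digits. A small Python sketch of the rule, with a hypothetical helper name:

```python
import re

def next_rollover_index(name):
    """Return the index name Elasticsearch would create on the next rollover."""
    match = re.fullmatch(r"(.+)-(\d+)", name)
    if match is None:
        raise ValueError(f"{name!r} does not end in -<number>")
    prefix, counter = match.groups()
    # The counter is always padded to six digits.
    return f"{prefix}-{int(counter) + 1:06d}"

print(next_rollover_index("jaeger-span-000001"))
```

With the one-day rollover configured in the policy, a new numbered index should therefore appear roughly once a day.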
https://www.jaegertracing.io/docs/1.43/deployment/#elasticsearch-ilm-support

Service Performance Monitoring (SPM)

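Jaeger can also aggregate traces into RED (Request, Error, Duration) metrics and display them on the Monitor tab of the UI.
This feature depends on another component, the OpenTelemetry Collector. Once it is wired in, the overall flow is as follows: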

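Applications no longer send traces to jaeger-collector directly. They send them to the OpenTelemetry Collector (otel-collector below), which splits the data two ways. One path forwards the traces unchanged to jaeger-collector, which handles trace display. The other path runs through an internal pipeline in otel-collector that computes metrics such as calls_total and latency_bucket and exposes them on its own port 8889 at /metrics. Prometheus scrapes that endpoint through a dedicated job, and jaeger-query reads the metrics back from Prometheus and renders them on the Monitor tab.

Reference: https://github.com/jaegertracing/jaeger/tree/main/docker-compose/monitor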
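
The aggregation step can be pictured with a small Python sketch. It is a simplified model of what a span-metrics stage does, not the collector's implementation: every span increments a call counter and lands in one latency bucket, and buckets are reported cumulatively the way Prometheus histograms are. The class name and the bucket bounds of 2, 4 and 6 ms are illustrative assumptions.

```python
from bisect import bisect_left
from collections import defaultdict

BUCKETS_MS = [2, 4, 6]   # illustrative upper bounds; +Inf is implicit

class SpanMetrics:
    def __init__(self, bounds=BUCKETS_MS):
        self.bounds = list(bounds)
        self.calls_total = defaultdict(int)
        self.latency_sum = defaultdict(float)
        self.bucket_counts = defaultdict(lambda: [0] * (len(self.bounds) + 1))

    def observe(self, service, operation, duration_ms):
        """Record one finished span."""
        key = (service, operation)
        self.calls_total[key] += 1
        self.latency_sum[key] += duration_ms
        # bisect_left finds the first bound >= duration, i.e. its "le" bucket.
        self.bucket_counts[key][bisect_left(self.bounds, duration_ms)] += 1

    def latency_bucket(self, service, operation):
        """Cumulative counts keyed by le label."""
        labels = [str(b) for b in self.bounds] + ["+Inf"]
        running, result = 0, {}
        for label, count in zip(labels, self.bucket_counts[(service, operation)]):
            running += count
            result[label] = running
        return result

metrics = SpanMetrics()
metrics.observe("auth-http-server", "/callback", 192.146)
metrics.observe("auth-http-server", "/checkjwtinredis", 1.2)
metrics.observe("auth-http-server", "/checkjwtinredis", 3.7)

print(dict(metrics.calls_total))
print(metrics.latency_bucket("auth-http-server", "/checkjwtinredis"))
```

Because only counters and bucket counts are kept, the metrics stay small no matter how many spans pass through, which is why they can be scraped by Prometheus while the full traces go to Jaeger.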

Deploying the OpenTelemetry Collector

kind: Deployment
apiVersion: apps/v1
metadata:
  name: otel-collector
  namespace: jaeger-test
  labels:
    app: otel-collector
  annotations:
    deployment.kubernetes.io/revision: '51'
spec:
  replicas: 1
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: otel-collector
      annotations:
        kubesphere.io/restartedAt: '2023-04-04T09:05:57.517Z'
    spec:
      volumes:
        - name: otel-collector-volume
          configMap:
            name: jaeger-configuration
            defaultMode: 420
      containers:
        - name: otel-collector-container
          image: >-
            otel/opentelemetry-collector-contrib:0.74.0
          args:
            - '--config'
            - /etc/otel-collector-config.yml
          ports:
            - name: otel-port
              containerPort: 4317  ## not used in this setup
              protocol: TCP
            - name: metrics
              containerPort: 8888  ## otel-collector's own metrics
              protocol: TCP
            - name: exporter       ## span metrics computed by otel-collector are exposed on this port
              containerPort: 8889
              protocol: TCP
            - name: collector      ## applications send traces to otel-collector:14278
              containerPort: 14278
              protocol: TCP
            - name: zipkin         ## applications send traces in Zipkin format to 9411
              containerPort: 9411
              protocol: TCP
          resources:
            limits:
              cpu: 200m
              memory: 500Mi
            requests:
              cpu: 100m
              memory: 200Mi
          volumeMounts:
            - name: otel-collector-volume
              mountPath: /etc/otel-collector-config.yml
              subPath: otel-collector-config.yml
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      serviceAccountName: default
      serviceAccount: default
      securityContext: {}
      schedulerName: default-scheduler
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600
---
kind: Service
apiVersion: v1
metadata:
  name: otel-collector
  namespace: jaeger-test
  labels:
    app: otel-collector
  annotations:
    kubesphere.io/creator: fanpengfei
spec:
  ports:
    - name: http-4317
      protocol: TCP
      port: 4317
      targetPort: 4317
    - name: http-8888
      protocol: TCP
      port: 8888
      targetPort: 8888
    - name: http-14278
      protocol: TCP
      port: 14278
      targetPort: 14278
    - name: http-8889
      protocol: TCP
      port: 8889
      targetPort: 8889
    - name: http-55678
      protocol: TCP
      port: 55678
      targetPort: 55678
    - name: http-zipkin
      protocol: TCP
      port: 9411
      targetPort: 9411
  selector:
    app: otel-collector
  clusterIP: 10.96.7.76
  type: ClusterIP
  sessionAffinity: None

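The otel-collector configuration file follows. It defines receivers, exporters and processors, and a service section wires them into pipelines. The receivers are jaeger (listening on 14278), zipkin (listening on 9411) and otlp (not used in this article). Three exporters are defined: prometheus (Prometheus scrapes the metrics from http://xx:8889/metrics on this collector), zipkin (points at port 9411 on the local host, for traces that arrive in Zipkin format) and jaeger (forwards traces unchanged to Jaeger). Note that the traces pipeline in this file references only the jaeger exporter.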

## otel-collector-config.yml
receivers:
  jaeger:
    protocols:
      thrift_http:
        endpoint: "0.0.0.0:14278"
  zipkin:
    endpoint: "0.0.0.0:9411"
  otlp:
    protocols:
      grpc:
      http:

  ## Dummy receiver that's never used, because a pipeline is required to have one.
  otlp/spanmetrics:
    protocols:
      grpc:
        endpoint: "localhost:65535"

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  zipkin:
    endpoint: "http://localhost:9411/api/v2/spans"
  jaeger:
    endpoint: "jaeger-collector:14250"
    tls:
      insecure: true

processors:
  batch:
  spanmetrics:
    metrics_exporter: prometheus

service:
  pipelines:
    traces:
      receivers: [jaeger, zipkin]
      processors: [spanmetrics, batch]
      exporters: [jaeger]
    ## The exporter name in this pipeline must match the spanmetrics.metrics_exporter name.
    ## The receiver is just a dummy and never used; added to pass validation requiring at least one receiver in a pipeline.
    metrics/spanmetrics:
      receivers: [otlp/spanmetrics]
      exporters: [prometheus]
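
A quick way to sanity-check a configuration like the one above is to compare what each pipeline references with what is defined. The Python sketch below mirrors the component names from this file in a plain dictionary (no YAML parser is used) and reports names that are referenced but undefined, and names that are defined but never used. The function name check_wiring is hypothetical.

```python
CONFIG = {
    "receivers": ["jaeger", "zipkin", "otlp", "otlp/spanmetrics"],
    "processors": ["batch", "spanmetrics"],
    "exporters": ["prometheus", "zipkin", "jaeger"],
    "pipelines": {
        "traces": {
            "receivers": ["jaeger", "zipkin"],
            "processors": ["spanmetrics", "batch"],
            "exporters": ["jaeger"],
        },
        "metrics/spanmetrics": {
            "receivers": ["otlp/spanmetrics"],
            "exporters": ["prometheus"],
        },
    },
}

def check_wiring(config):
    """Return (undefined references, unused definitions) for a collector config."""
    kinds = ("receivers", "processors", "exporters")
    used = {kind: set() for kind in kinds}
    undefined = []
    for pipeline, parts in config["pipelines"].items():
        for kind in kinds:
            for name in parts.get(kind, []):
                used[kind].add(name)
                if name not in config[kind]:
                    undefined.append((pipeline, kind, name))
    unused = {kind: sorted(set(config[kind]) - used[kind]) for kind in kinds}
    return undefined, unused

undefined, unused = check_wiring(CONFIG)
print("undefined:", undefined)
print("unused:", unused)
```

For this file the check finds nothing undefined, while the otlp receiver and the zipkin exporter are defined without being attached to any pipeline.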

Integrating with Prometheus

Once applications send traces to otel-collector, the metrics show up on port 8889 of otel-collector, which confirms that aggregation is working:

curl http://otel-collector:8889/metrics
# HELP calls_total 
# TYPE calls_total counter
calls_total{operation="/callback",service_name="auth-http-server",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET"} 1
calls_total{operation="/checkjwtinredis",service_name="auth-http-server",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET"} 840
calls_total{operation="/getobjectlist",service_name="auth-http-server",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET"} 74
# HELP latency 
# TYPE latency histogram
latency_bucket{operation="/callback",service_name="auth-http-server",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET",le="2"} 0
latency_bucket{operation="/callback",service_name="auth-http-server",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET",le="4"} 0
latency_bucket{operation="/callback",service_name="auth-http-server",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET",le="6"} 0
latency_sum{operation="/callback",service_name="auth-http-server",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET"} 192.146
latency_count{operation="/callback",service_name="auth-http-server",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET"} 1
latency_bucket{operation="/checkjwtinredis",service_name="auth-http-server",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET",le="2"} 816
latency_bucket{operation="/checkjwtinredis",service_name="auth-http-server",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET",le="4"} 832
latency_bucket{operation="/checkjwtinredis",service_name="auth-http-server",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET",le="6"} 834
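
These numbers can be read directly. The Python sketch below parses a few of the lines above and computes what share of /checkjwtinredis calls finished within each bucket bound. It assumes the le bounds are in milliseconds; the regular expressions cover only the simple label syntax seen here, not the full Prometheus exposition format, and the helper names are hypothetical.

```python
import re

SAMPLE = r'''
# HELP calls_total
# TYPE calls_total counter
calls_total{operation="/checkjwtinredis",service_name="auth-http-server",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET"} 840
latency_bucket{operation="/checkjwtinredis",service_name="auth-http-server",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET",le="2"} 816
latency_bucket{operation="/checkjwtinredis",service_name="auth-http-server",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET",le="4"} 832
latency_bucket{operation="/checkjwtinredis",service_name="auth-http-server",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET",le="6"} 834
'''

LINE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')

def parse(text):
    """Return (metric name, labels dict, value) for every sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue   # skip blanks and HELP/TYPE comments
        match = LINE.match(line)
        if match:
            labels = dict(LABEL.findall(match["labels"]))
            samples.append((match["name"], labels, float(match["value"])))
    return samples

def share_within(samples, operation, le):
    """Fraction of calls to `operation` whose latency was <= the given bound."""
    total = next(v for n, l, v in samples
                 if n == "calls_total" and l["operation"] == operation)
    within = next(v for n, l, v in samples
                  if n == "latency_bucket" and l["operation"] == operation and l.get("le") == le)
    return within / total

samples = parse(SAMPLE)
for bound in ("2", "4", "6"):
    print(f"le={bound}: {share_within(samples, '/checkjwtinredis', bound):.2%}")
```

816 of the 840 recorded calls fall in the first bucket, so roughly 97% of /checkjwtinredis requests completed within the lowest bound.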

Now add a scrape job to prometheus.yaml:

## prometheus.yaml
scrape_configs:
  - job_name: aggregated-trace-metrics
    static_configs:
    - targets: ['otel-exporter.jaeger-test.xxx.io']

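otel-exporter.jaeger-test.xxx.io is the Ingress address that my Kubernetes cluster exposes for otel-collector:8889.
After reloading the Prometheus configuration, the metrics are visible on the Monitor tab of jaeger-ui.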
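
If the Monitor tab stays empty, it helps to confirm that Prometheus itself holds the span metrics before suspecting jaeger-query. The Python sketch below builds a PromQL expression for the per-operation request rate and the matching URL for the Prometheus instant-query endpoint /api/v1/query. The base URL is the PROMETHEUS_SERVER_URL value from the jaeger-query Deployment; the helper names and the 5m window are assumptions, and this is not necessarily the query Jaeger issues internally.

```python
from urllib.parse import urlencode

PROMETHEUS_SERVER_URL = "http://10.50.89.17:8080"   # same value jaeger-query is given

def call_rate_query(service, window="5m"):
    """PromQL for the per-operation request rate of one service."""
    selector = f'calls_total{{service_name="{service}",span_kind="SPAN_KIND_SERVER"}}'
    return f"sum(rate({selector}[{window}])) by (operation)"

def instant_query_url(promql, base=PROMETHEUS_SERVER_URL):
    """URL for the Prometheus instant-query HTTP API."""
    params = urlencode({"query": promql})
    return f"{base}/api/v1/query?{params}"

query = call_rate_query("auth-http-server")
print(query)
print(instant_query_url(query))
```

Opening the printed URL with curl or a browser should return a JSON result with one series per operation once the scrape job has collected a few samples.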


Author: 洪宇轩
License: Unless stated otherwise, all articles on this blog are licensed under CC BY 4.0. Please credit 洪宇轩 when republishing.