# Tail Sampling Scheme
The tail sampling processor samples traces according to a set of defined policies. However, all spans of a trace must be received by the same collector instance for the sampling decision to be effective. The Global OpenTelemetry Collector architecture of Insight therefore needs to be adjusted to implement tail sampling.
## Specific Changes

Introduce an OTel Collector with load-balancing (LB) capability in front of the Global OpenTelemetry Collector, so that all spans belonging to the same trace are routed to the same downstream collector instance.
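The core of this LB layer is the `loadbalancing` exporter, which routes on the trace ID so that every span of a trace lands on the same backend, paired with a Kubernetes resolver that tracks the Global Collector's endpoints. The fragment below is condensed from the full manifest in the next section:

```yaml
exporters:
  loadbalancing:
    routing_key: "traceID" # route all spans of the same trace to one backend
    protocol:
      otlp: {} # spans are forwarded to the resolved backends via OTLP
    resolver:
      k8s:
        service: insight-opentelemetry-collector # the Global Collector Service
        ports:
          - 4317
```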
## Steps for Changes

### Deploy the OTel Collector Component with LB Capability
Refer to the following YAML to deploy the component.
```yaml
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: insight-otel-collector-lb
rules:
  - apiGroups: [""]
    resources: ["endpoints"]
    verbs: ["get", "watch", "list"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: insight-otel-collector-lb
  namespace: insight-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: insight-otel-collector-lb
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: insight-otel-collector-lb
subjects:
  - kind: ServiceAccount
    name: insight-otel-collector-lb
    namespace: insight-system
---
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    app.kubernetes.io/component: opentelemetry-collector
    app.kubernetes.io/instance: insight-otel-collector-lb
    app.kubernetes.io/name: insight-otel-collector-lb
  name: insight-otel-collector-lb-collector
  namespace: insight-system
data:
  collector.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
      jaeger:
        protocols:
          grpc:
    processors:
    extensions:
      health_check:
      pprof:
        endpoint: :1888
      zpages:
        endpoint: :55679
    exporters:
      logging:
      loadbalancing:
        routing_key: "traceID"
        protocol:
          otlp:
            # all options from the OTLP exporter are supported
            # except the endpoint
            timeout: 1s
            tls:
              insecure: true
        resolver:
          k8s:
            service: insight-opentelemetry-collector
            ports:
              - 4317
    service:
      extensions: [pprof, zpages, health_check]
      pipelines:
        traces:
          receivers: [otlp, jaeger]
          exporters: [loadbalancing]
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/component: opentelemetry-collector
    app.kubernetes.io/instance: insight-otel-collector-lb
    app.kubernetes.io/name: insight-otel-collector-lb
  name: insight-otel-collector-lb
  namespace: insight-system
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/component: opentelemetry-collector
      app.kubernetes.io/instance: insight-otel-collector-lb
      app.kubernetes.io/name: insight-otel-collector-lb
  template:
    metadata:
      labels:
        app.kubernetes.io/component: opentelemetry-collector
        app.kubernetes.io/instance: insight-otel-collector-lb
        app.kubernetes.io/name: insight-otel-collector-lb
    spec:
      containers:
        - args:
            - --config=/conf/collector.yaml
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
          image: ghcr.m.daocloud.io/openinsight-proj/opentelemetry-collector-contrib:5baef686672cfe5551e03b5c19d3072c432b6f33
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 13133
              scheme: HTTP
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          name: otc-container
          resources:
            limits:
              cpu: '1'
              memory: 2Gi
            requests:
              cpu: 100m
              memory: 400Mi
          ports:
            - containerPort: 14250
              name: jaeger-grpc
              protocol: TCP
            - containerPort: 8888
              name: metrics
              protocol: TCP
            - containerPort: 4317
              name: otlp-grpc
              protocol: TCP
            - containerPort: 4318
              name: otlp-http
              protocol: TCP
            - containerPort: 55679
              name: zpages
              protocol: TCP
          volumeMounts:
            - mountPath: /conf
              name: otc-internal
      serviceAccount: insight-otel-collector-lb
      serviceAccountName: insight-otel-collector-lb
      volumes:
        - configMap:
            defaultMode: 420
            items:
              - key: collector.yaml
                path: collector.yaml
            name: insight-otel-collector-lb-collector
          name: otc-internal
---
kind: Service
apiVersion: v1
metadata:
  name: insight-opentelemetry-collector-lb
  namespace: insight-system
  labels:
    app.kubernetes.io/component: opentelemetry-collector
    app.kubernetes.io/instance: insight-otel-collector-lb
    app.kubernetes.io/name: insight-otel-collector-lb
spec:
  ports:
    - name: fluentforward
      protocol: TCP
      port: 8006
      targetPort: 8006
    - name: jaeger-compact
      protocol: UDP
      port: 6831
      targetPort: 6831
    - name: jaeger-grpc
      protocol: TCP
      port: 14250
      targetPort: 14250
    - name: jaeger-thrift
      protocol: TCP
      port: 14268
      targetPort: 14268
    - name: metrics
      protocol: TCP
      port: 8888
      targetPort: 8888
    - name: otlp
      protocol: TCP
      appProtocol: grpc
      port: 4317
      targetPort: 4317
    - name: otlp-http
      protocol: TCP
      port: 4318
      targetPort: 4318
    - name: zipkin
      protocol: TCP
      port: 9411
      targetPort: 9411
    - name: zpages
      protocol: TCP
      port: 55679
      targetPort: 55679
  selector:
    app.kubernetes.io/component: opentelemetry-collector
    app.kubernetes.io/instance: insight-otel-collector-lb
    app.kubernetes.io/name: insight-otel-collector-lb
```
### Configure Tail Sampling Rules

Note

Tail sampling rules need to be added to the existing `insight-otel-collector-config` ConfigMap.

1. Add the following content to the `processors` section and adjust the specific rules as needed; refer to the OTel official example for more options:

    ```yaml
    ......
    tail_sampling:
      decision_wait: 10s # Wait for 10 seconds; traces older than 10 seconds are no longer processed
      num_traces: 1500000 # Number of traces kept in memory. Assuming 1,000 traces per second, this should not be
                          # less than 1000 * decision_wait * 2 (i.e. 20,000 here). Setting it too large may consume
                          # too much memory; setting it too small may cause some traces to be dropped.
      expected_new_traces_per_sec: 10
      policies: # Reporting policies
        [
          {
            name: latency-policy,
            type: latency, # Report traces that take longer than 500ms
            latency: {threshold_ms: 500}
          },
          {
            name: status_code-policy,
            type: status_code, # Report traces with an ERROR status code
            status_code: {status_codes: [ERROR]}
          }
        ]
    ......
    ```

    Policies can also be combined. The following composite policy reports 1% of the ERROR traces from a specific cluster:

    ```yaml
    ......
    tail_sampling: # Composite sampling
      decision_wait: 10s
      num_traces: 1500000
      expected_new_traces_per_sec: 10
      policies:
        [
          {
            name: debug-worker-cluster-sample-policy,
            type: and,
            and: {
              and_sub_policy:
                [
                  {
                    name: service-name-policy,
                    type: string_attribute,
                    string_attribute: { key: k8s.cluster.id, values: [xxxxxxx] },
                  },
                  {
                    name: trace-status-policy,
                    type: status_code,
                    status_code: { status_codes: [ERROR] },
                  },
                  {
                    name: probabilistic-policy,
                    type: probabilistic,
                    probabilistic: { sampling_percentage: 1 },
                  }
                ]
            }
          }
        ]
    ......
    ```
2. Activate this processor in the pipeline of the OTel collector, as sketched below.
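    The original pipeline definition is not reproduced in this document, so the following is only a minimal sketch: `tail_sampling` is inserted into the `traces` pipeline, and the receiver, exporter, and `batch` names are placeholders for whatever `insight-opentelemetry-collector` already configures.

    ```yaml
    service:
      pipelines:
        traces:
          receivers: [otlp, jaeger]          # placeholder: keep the existing receivers
          processors: [tail_sampling, batch] # run tail_sampling before the existing batch processor
          exporters: [otlp]                  # placeholder: keep the existing exporters
    ```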
3. Restart the `insight-opentelemetry-collector` component.
4. When deploying insight-agent, change the reporting address for trace data to the `4317` port of the `otel-col` LB service, as illustrated below.
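    As an illustration only: an in-cluster OTLP gRPC client pointed at the LB would use an endpoint like the one below. The endpoint value follows from the Service manifest above; the exact configuration key inside insight-agent is not shown in this document, so the surrounding structure is a hypothetical, generic OTLP exporter block.

    ```yaml
    exporters:
      otlp:
        # LB Service name + namespace + OTLP gRPC port from the manifest above
        endpoint: insight-opentelemetry-collector-lb.insight-system.svc.cluster.local:4317
        tls:
          insecure: true # the LB manifest above does not enable TLS on the receiver
    ```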