
Implementing Distributed Tracing with OpenTelemetry: From Basics to Production

In a microservices architecture, a single request traverses multiple services. When problems occur, pinpointing which service and which operation is causing latency is far from trivial. OpenTelemetry, a CNCF incubating project, addresses this challenge by standardizing how traces, metrics, and logs are generated, collected, and exported. Even on lightweight K3s-based Kubernetes environments like Kubo, OpenTelemetry provides visibility into inter-service request flows.

Distributed Tracing Fundamentals

Why Distributed Tracing Matters

As The New Stack points out, metrics and logs alone are insufficient in Kubernetes microservice environments. To understand the complete picture as a request crosses pods, nodes, and namespaces, distributed tracing is essential.

OpenTelemetry Telemetry Signals

OpenTelemetry is a vendor-neutral observability framework that unifies three telemetry signals:

  • Traces: Track the end-to-end flow of requests. Parent-child relationships between Spans represent service call chains
  • Metrics: Quantitative measurements such as request counts, latency, and error rates
  • Logs: Event records that, when correlated with trace IDs, identify logs related to specific requests
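
The log correlation described in the last bullet can be sketched with Python's standard logging module alone. This is a minimal illustration, not the OpenTelemetry logging API: `current_trace_id`, `TraceIdFilter`, and `ListHandler` are hypothetical names, and a real service would read the ID from the active span rather than set it by hand.

```python
import contextvars
import logging

# Hypothetical holder for the active trace ID; a real service would read it from the current span.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


class ListHandler(logging.Handler):
    """Collect formatted lines in memory so the example is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


handler = ListHandler()
handler.setFormatter(logging.Formatter("trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# Simulate handling one request under a known trace ID.
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.info("payment authorized")
logger.info("order persisted")

# Filtering by trace ID recovers every log line written while handling that request.
related = [line for line in handler.lines if "4bf92f3577b34da6a3ce929d0e0e4736" in line]
```

Because the ID travels in a context variable, application code does not have to pass it to every logging call.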

Tracing Building Blocks

text
Trace
├── Span A (API Gateway, 150ms)
│   ├── Span B (Auth Service, 20ms)
│   └── Span C (Product Service, 100ms)
│       ├── Span D (Database Query, 40ms)
│       └── Span E (Cache Lookup, 5ms)

  • Trace: The overall processing flow of a single request
  • Span: An individual unit of work within a Trace, with start time, end time, attributes, and events
  • Context: Information containing Trace ID and Span ID, propagated between services
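
To make the parent-child structure concrete, the example trace above can be modeled as a small tree. This is a sketch for illustration only: the `Span` class here is a hypothetical stand-in, not the OpenTelemetry SDK type, and the self-time calculation assumes child spans run one after another rather than concurrently.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: int
    children: list = field(default_factory=list)

    def self_time_ms(self):
        """Time spent in this span that is not covered by its direct children.

        Assumes children run sequentially; concurrent children would need interval merging.
        """
        return self.duration_ms - sum(child.duration_ms for child in self.children)


def walk(span, depth=0):
    """Yield (depth, span) pairs in depth-first order."""
    yield depth, span
    for child in span.children:
        yield from walk(child, depth + 1)


# The trace from the diagram: names and durations are copied from it.
trace = Span("API Gateway", 150, [
    Span("Auth Service", 20),
    Span("Product Service", 100, [
        Span("Database Query", 40),
        Span("Cache Lookup", 5),
    ]),
])

self_times = {span.name: span.self_time_ms() for _, span in walk(trace)}
slowest = max(self_times, key=self_times.get)
```

Under those assumptions the Product Service spends 55ms in its own code, more than any other span, which is the kind of answer a trace view is meant to surface.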

Captain.AI uses AI to analyze tracing data, automatically detecting performance bottlenecks and suggesting optimizations.

Configuring the OpenTelemetry Collector

The OpenTelemetry Collector is a vendor-agnostic component responsible for receiving, processing, and exporting telemetry data. This section draws from Uptrace and the Logit.io implementation guide.

Collector Architecture

text
Receivers → Processors → Exporters
  (OTLP)     (batch)      (Jaeger)
  (Zipkin)   (filter)     (Tempo)
  (Jaeger)   (sampling)   (OTLP)
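
The receive, process, export flow can be pictured as plain function composition. The sketch below is a toy model, not Collector code: spans are plain dicts, the two processors and the `/healthz` and `user.email` names are hypothetical, and the "backends" are in-memory lists.

```python
def drop_health_checks(spans):
    """Filter processor: discard spans for a hypothetical /healthz route."""
    return [s for s in spans if s.get("http.route") != "/healthz"]


def redact_user_email(spans):
    """Attribute processor: remove a hypothetical sensitive attribute."""
    return [{k: v for k, v in s.items() if k != "user.email"} for s in spans]


def run_pipeline(received, processors, exporters):
    """Pass received spans through each processor in order, then fan out to every exporter."""
    spans = list(received)
    for process in processors:
        spans = process(spans)
    for export in exporters:
        export(spans)
    return spans


tempo_backend, jaeger_backend = [], []  # in-memory stand-ins for real backends

received = [
    {"name": "GET /products", "http.route": "/products", "user.email": "a@example.com"},
    {"name": "GET /healthz", "http.route": "/healthz"},
]

exported = run_pipeline(
    received,
    processors=[drop_health_checks, redact_user_email],
    exporters=[tempo_backend.extend, jaeger_backend.extend],
)
```

As in the Collector, processor order matters: each processor only sees what the previous one let through.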

Kubernetes Deployment (DaemonSet)

yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: observability
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
      - name: collector
        image: otel/opentelemetry-collector-contrib:0.98.0
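        args: ["--config=/conf/otel-collector-config.yaml"]
        ports:
        - name: otlp-grpc
          containerPort: 4317   # matches the OTLP gRPC receiver
        - name: otlp-http
          containerPort: 4318   # matches the OTLP HTTP receiver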
        volumeMounts:
        - name: config
          mountPath: /conf
      volumes:
      - name: config
        configMap:
          name: otel-collector-config

Collector Configuration

yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: observability
data:
  otel-collector-config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

    processors:
      batch:
        timeout: 5s
        send_batch_size: 1024
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
      tail_sampling:
        decision_wait: 10s
        policies:
        - name: errors-policy
          type: status_code
          status_code: {status_codes: [ERROR]}
        - name: slow-traces
          type: latency
          latency: {threshold_ms: 1000}
        - name: probabilistic
          type: probabilistic
          probabilistic: {sampling_percentage: 10}

    exporters:
      otlp/tempo:
        endpoint: tempo:4317
        tls:
          insecure: true
      otlp/jaeger:
        endpoint: jaeger-collector:4317
        tls:
          insecure: true

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, tail_sampling, batch]
          exporters: [otlp/tempo]  # append otlp/jaeger to also export to Jaeger

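For K3s clusters running on Kubo, the DaemonSet approach is recommended: all pods on a node share one Collector, which avoids the per-pod overhead of sidecars. Note that tail sampling evaluates whole traces, so every span of a trace must reach the same Collector instance. On a multi-node cluster, per-node Collectors typically forward to a central gateway Collector that runs the tail_sampling processor.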

Application Instrumentation

Zero-Code Instrumentation

OpenTelemetry's auto-instrumentation allows you to add tracing without modifying code.

Python example:

bash
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

bash
# Configuration via environment variables
export OTEL_SERVICE_NAME=my-python-service
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
export OTEL_TRACES_EXPORTER=otlp
export OTEL_METRICS_EXPORTER=otlp

# Launch with auto-instrumentation
opentelemetry-instrument python app.py

Java example:

bash
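# Auto-instrumentation with Java Agent
# Port 4317 is OTLP/gRPC; agent 2.x defaults to http/protobuf, so set the protocol explicitly
java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=my-java-service \
  -Dotel.exporter.otlp.protocol=grpc \
  -Dotel.exporter.otlp.endpoint=http://otel-collector:4317 \
  -jar my-app.jar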

Automatic Injection with Kubernetes Operator

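As explained in this Medium implementation article, the OpenTelemetry Operator injects auto-instrumentation into Pods that carry the matching annotation. The Operator must be installed and an Instrumentation resource must exist for the annotation to take effect: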

yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  annotations:
    instrumentation.opentelemetry.io/inject-python: "true"
spec:
  containers:
  - name: my-app
    image: my-python-app:latest

Manual Instrumentation Example (Go)

go
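import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
)

// handleRequest traces one request. userID and queryDatabase stand in for
// application code that is defined elsewhere.
func handleRequest(ctx context.Context, userID string) {
    tracer := otel.Tracer("my-service")
    ctx, span := tracer.Start(ctx, "handleRequest")
    defer span.End()

    // Add custom attributes
    span.SetAttributes(
        attribute.String("user.id", userID),
        attribute.Int("http.status_code", 200),
    )

    // Create a child span; passing ctx makes it a child of handleRequest
    dbCtx, childSpan := tracer.Start(ctx, "database-query")
    _ = queryDatabase(dbCtx) // result is unused in this snippet
    childSpan.End()
}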

Combining Captain.AI with OpenTelemetry enables AI to automatically identify inter-service bottlenecks from trace data and generate improvement recommendations.

Backend Integration: Jaeger and Grafana Tempo

Jaeger Integration

Jaeger is a CNCF Graduated distributed tracing backend. Following the step-by-step guide on Medium:

yaml
# Deploy with Jaeger Operator
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: observability
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es.server-urls: http://elasticsearch:9200

Grafana Tempo Integration

As the Civo practical guide details, Grafana Tempo is a backend optimized for storing large-scale trace data:

yaml
# Tempo configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: tempo-config
data:
  tempo.yaml: |
    server:
      http_listen_port: 3200
    distributor:
      receivers:
        otlp:
          protocols:
            grpc:
              endpoint: 0.0.0.0:4317   # listen on all interfaces so the Collector can connect
    storage:
      trace:
        backend: s3
        s3:
          bucket: tempo-traces
          endpoint: minio:9000
          insecure: true   # assumes MinIO is served over plain HTTP; credentials omitted here

Context Propagation

As dasroot.net emphasizes, the most critical aspect of distributed tracing is context propagation. The W3C Trace Context standard defines the headers used:

text
traceparent: 00-{trace-id}-{span-id}-{flags}
tracestate: vendor-specific-data

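When every service forwards these headers on its outgoing calls, spans from all services share one trace ID and the end-to-end trace is complete. A single service that drops them splits the request into disconnected traces.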
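
To show what propagation involves at the header level, here is a minimal sketch that parses an incoming `traceparent` value and builds the one to send downstream. It handles only the version `00` layout (32 hex characters of trace ID, 16 of parent span ID, 2 of flags). The function names are hypothetical, and in practice the OpenTelemetry SDK propagators do this work.

```python
import re
import secrets

# version - trace-id - parent-id - trace-flags, all lowercase hex
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return the header fields as a dict, or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are explicitly invalid in W3C Trace Context.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "flags": flags,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming):
    """Build the header for an outgoing call: same trace ID, fresh span ID."""
    ctx = parse_traceparent(incoming) if incoming else None
    if ctx is None:
        # No usable context arrived, so this service starts a new trace.
        trace_id, flags = secrets.token_hex(16), "01"
    else:
        trace_id, flags = ctx["trace_id"], ctx["flags"]
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parsed = parse_traceparent(incoming)
outgoing = child_traceparent(incoming)
```

The trace ID is carried over unchanged while the span ID is replaced, which is what lets the backend stitch spans from different services into one trace.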

Sampling Strategies and Performance Optimization

Tail Sampling

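Storing every trace leads to explosive storage costs. The markaicode implementation guide recommends tail sampling, where the Collector buffers each trace for the decision_wait period and decides whether to keep it only after seeing its spans: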

yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
    # Keep 100% of traces with errors
    - name: errors
      type: status_code
      status_code: {status_codes: [ERROR]}
    # Keep 100% of traces over 1 second
    - name: slow-traces
      type: latency
      latency: {threshold_ms: 1000}
    # Sample 10% of remaining traces
    - name: probabilistic
      type: probabilistic
      probabilistic: {sampling_percentage: 10}
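
The three policies above combine as "keep the trace if any policy says keep". The sketch below mirrors that logic under stated assumptions: spans are plain dicts with hypothetical `status`, `start_ms`, and `end_ms` keys, and the hash-based probabilistic step is an illustrative choice. The Collector's exact comparison and hashing may differ.

```python
import hashlib


def keep_trace(trace_id, spans, threshold_ms=1000, sampling_percentage=10):
    """Return True if any of the three policies votes to keep the trace."""
    # Policy 1: keep every trace that contains an error span.
    if any(span["status"] == "ERROR" for span in spans):
        return True

    # Policy 2: keep traces whose overall duration reaches the latency threshold.
    start = min(span["start_ms"] for span in spans)
    end = max(span["end_ms"] for span in spans)
    if end - start >= threshold_ms:
        return True

    # Policy 3: keep a fixed percentage of the rest. Hashing the trace ID makes the
    # decision deterministic, so the same trace always gets the same answer.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 100
    return bucket < sampling_percentage


fast_ok = [{"status": "OK", "start_ms": 0, "end_ms": 120}]
failed = [{"status": "OK", "start_ms": 0, "end_ms": 80},
          {"status": "ERROR", "start_ms": 10, "end_ms": 60}]
slow = [{"status": "OK", "start_ms": 0, "end_ms": 1500}]

kept_share = sum(keep_trace(f"{i:032x}", fast_ok) for i in range(10_000)) / 10_000
```

The decision needs every span of the trace, which is why it can only be made after the trace has been buffered.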

Performance Optimization Tips

  • Batch processing: Set appropriate batch sizes and timeouts in the Collector
  • Memory limiter: Configure memory limits to prevent OOM conditions
  • DaemonSet deployment: More resource-efficient than sidecar patterns
  • Attribute optimization: Remove unnecessary attributes to reduce payload size
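
The first tip is easier to tune with a mental model of what batching does. The sketch below is a simplified model, not the Collector's implementation: the `BatchProcessor` class and its `add` and `tick` methods are hypothetical, time is passed in explicitly, and the timeout is measured from the first span queued in the current batch.

```python
class BatchProcessor:
    """Flush when the batch is full or when the oldest queued span has waited too long."""

    def __init__(self, send_batch_size=1024, timeout_s=5.0):
        self.send_batch_size = send_batch_size
        self.timeout_s = timeout_s
        self._pending = []
        self._first_at = None  # arrival time of the oldest span in the current batch

    def add(self, span, now):
        """Queue a span; return a full batch if the size limit is reached, else None."""
        if not self._pending:
            self._first_at = now
        self._pending.append(span)
        if len(self._pending) >= self.send_batch_size:
            return self._flush()
        return None

    def tick(self, now):
        """Return a partial batch once the timeout has elapsed, else None."""
        if self._pending and now - self._first_at >= self.timeout_s:
            return self._flush()
        return None

    def _flush(self):
        batch, self._pending = self._pending, []
        self._first_at = None
        return batch


# Small limits so the behavior is easy to follow.
processor = BatchProcessor(send_batch_size=3, timeout_s=5.0)
results = [processor.add(name, now=0.0) for name in ("a", "b", "c")]
```

A larger batch size means fewer export calls but more memory held in the Collector, while the timeout bounds how long a span can wait on a quiet node.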

According to Andrew Odendaal's guide, a well-designed sampling strategy can reduce trace data volume by 90% while retaining 100% of critical traces.
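
Whether the reduction lands near 90% depends on how many traces are errors or slow, since those are always kept. A rough calculation, assuming the error and slow groups do not overlap and using made-up rates of 1% and 2%:

```python
def retained_fraction(error_rate, slow_rate, sampling_percentage=10.0):
    """Fraction of traces kept when error and slow traces are always kept
    and the remainder is sampled at sampling_percentage.

    Assumes the error and slow groups are disjoint.
    """
    always_kept = error_rate + slow_rate
    return always_kept + (1.0 - always_kept) * sampling_percentage / 100.0


# Illustrative rates only: 1% of traces contain errors, 2% exceed the latency threshold.
kept = retained_fraction(error_rate=0.01, slow_rate=0.02)
reduction = 1.0 - kept
```

With these inputs about 12.7% of traces are stored, a reduction of roughly 87%. The figure approaches 90% as the share of error and slow traces approaches zero.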

Conclusion

Distributed tracing with OpenTelemetry fundamentally improves observability in microservice environments. The key takeaways are:

  1. OpenTelemetry is a vendor-neutral framework unifying traces, metrics, and logs
  2. The Collector enables flexible telemetry pipeline construction
  3. Auto-instrumentation adds tracing without code changes
  4. Jaeger and Grafana Tempo provide trace storage and visualization
  5. Tail sampling controls costs while retaining important traces

Kubo is built on K3s with strong affinity for the CNCF ecosystem, and OpenTelemetry deployment dramatically improves visibility in microservice environments. If you are working on distributed system observability, explore Kubo.

For AI-powered trace data analysis, see Captain.AI for details. For consultations, reach out through our contact page.
