Implementing Full-Stack Context Propagation from APISIX to an Algolia Client via a Go-Kit Service


The incident report was unambiguous: P95 search latency had breached its SLO for 45 minutes during peak traffic. The dashboard showed elevated response times at the API gateway, Apache APISIX, but the logs from the downstream Go-Kit search service showed normal processing times. The Algolia monitoring dashboard also reported nominal API performance. We were flying blind. Each component claimed innocence, and we had no data to stitch the user’s journey together. The core problem was a complete lack of context propagation across our stack, turning a simple performance degradation into a protracted post-mortem and a clear business impact.

Our immediate goal was to build a continuous, unified view of a request’s lifecycle. The architecture is straightforward: a client hits an APISIX route, which forwards the request to a Go-Kit microservice. This service then queries Algolia’s Search API to fetch results. To solve the visibility problem, we settled on OpenTelemetry, not because it’s new, but because it provides a standardized, vendor-agnostic protocol for context propagation (W3C Trace Context) and data export. The plan was to instrument every layer: configure APISIX’s native OpenTelemetry plugin, inject middleware into our Go-Kit service, and, most critically, wrap the Algolia client to trace the final outbound call.

The initial setup involved a docker-compose.yml to create a local, reproducible environment. This is non-negotiable for any real-world project; it ensures that development and CI environments mirror production as closely as possible.

# docker-compose.yml
version: '3.8'

services:
  apisix:
    image: apache/apisix:3.5.0-debian
    restart: always
    volumes:
      - ./apisix/config.yaml:/usr/local/apisix/conf/config.yaml:ro
      - ./apisix/apisix.yaml:/usr/local/apisix/conf/apisix.yaml:ro
    ports:
      - "9180:9180"
      - "9080:9080"
    depends_on:
      - search_service
      - jaeger

  search_service:
    build:
      context: ./search_service
      dockerfile: Dockerfile
    restart: always
    ports:
      - "8080:8080"
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4318
      - ALGOLIA_APP_ID=${ALGOLIA_APP_ID} # Pass through from host
      - ALGOLIA_API_KEY=${ALGOLIA_API_KEY} # Pass through from host
    depends_on:
      - jaeger

  jaeger:
    image: jaegertracing/all-in-one:1.50
    restart: always
    ports:
      - "16686:16686" # Jaeger UI
      - "4317:4317"   # OTLP gRPC receiver
      - "4318:4318"   # OTLP HTTP receiver

For APISIX, the configuration needs to enable the opentelemetry plugin in config.yaml, with plugin_attr defining the resource attributes and the collector that the batch span processor exports to — in our case, the Jaeger container's OTLP/HTTP receiver. A common mistake here is neglecting the sampler. It is configured per route on the plugin itself and defaults to always_off, which means the gateway silently emits no spans at all. For debugging, always_on is fine, but in production it would generate an overwhelming and costly volume of traces.

# apisix/config.yaml

# ... other apisix configurations
plugin_attr:
  opentelemetry:
    resource:
      service.name: "apisix-gateway"
    collector:
      address: jaeger:4318 # OTLP/HTTP receiver; the plugin posts to /v1/traces on this address
      request_timeout: 3
    # NOTE: the sampler is configured per route on the plugin itself (see the
    # route definition below), not under plugin_attr.

With APISIX configured, we define a route that proxies requests to our Go service and attaches the opentelemetry plugin. This is done via the Admin API. The collector and resource settings are inherited from config.yaml; at the route level we only set the sampler to always_on so that every request through this route is traced while we debug. The crucial part is that the plugin automatically inspects incoming requests for a traceparent header, continuing an existing trace or starting a new one if none exists. It then injects the traceparent header into the upstream request sent to our search_service.
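Concretely, the context travels in the traceparent header defined by W3C Trace Context. The IDs below are the illustrative values from the specification, not ones captured from this stack:

# W3C Trace Context header, forwarded or injected at each hop:
#   traceparent: <version>-<trace-id (32 hex)>-<parent span-id (16 hex)>-<trace-flags>
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01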

# A one-time setup command to configure the APISIX route
curl -i "http://127.0.0.1:9180/apisix/admin/routes/1" -H 'X-API-KEY: edd1c9f034335f136f87ad84b625c8f1' -X PUT -d '
{
  "uri": "/search",
  "plugins": {
    "opentelemetry": {}
  },
  "upstream": {
    "nodes": {
      "search_service:8080": 1
    },
    "type": "roundrobin"
  }
}'
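For production, the same route-level block would carry a ratio-based sampler instead of always_on. A sketch following the plugin's documented schema (worth verifying against your APISIX version), keeping roughly 5% of traces:

"plugins": {
  "opentelemetry": {
    "sampler": {
      "name": "trace_id_ratio",
      "options": { "fraction": 0.05 }
    }
  }
}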

The next layer is the Go-Kit service. We started with a barebones structure: service interface, implementation, and HTTP transport.

// search_service/pkg/service/service.go
package service

import (
	"context"
	"log"
	"os"

	"github.com/algolia/algoliasearch-client-go/v3/algolia/search"
)

// SearchService defines the contract for our search operations.
type SearchService interface {
	Query(ctx context.Context, q string) ([]map[string]interface{}, error)
}

type basicSearchService struct {
	algoliaClient *search.Client
	index         *search.Index
	logger        *log.Logger
}

// NewBasicSearchService creates a new instance of the search service.
func NewBasicSearchService(logger *log.Logger) SearchService {
	appID := os.Getenv("ALGOLIA_APP_ID")
	apiKey := os.Getenv("ALGOLIA_API_KEY")
	if appID == "" || apiKey == "" {
		logger.Fatal("ALGOLIA_APP_ID and ALGOLIA_API_KEY must be set")
	}

	client := search.NewClient(appID, apiKey)
	index := client.InitIndex("products") // Assuming an index named 'products'

	return &basicSearchService{
		algoliaClient: client,
		index:         index,
		logger:        logger,
	}
}

func (s *basicSearchService) Query(ctx context.Context, q string) ([]map[string]interface{}, error) {
	s.logger.Printf("service: received query: %s", q)

	res, err := s.index.Search(q)
	if err != nil {
		s.logger.Printf("service: Algolia search failed: %v", err)
		return nil, err
	}

	s.logger.Printf("service: found %d hits from Algolia", res.NbHits)
	return res.Hits, nil
}
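The HTTP transport layer is not reproduced in full here. A minimal sketch of what NewHTTPHandler could look like, assuming go-kit's httptransport server and the q query parameter used later; the request/response types and JSON shape are illustrative, not the original file:

// search_service/pkg/transport/http.go
package transport

import (
	"context"
	"encoding/json"
	"log"
	"net/http"

	httptransport "github.com/go-kit/kit/transport/http"

	"search_service/pkg/service"
)

type queryRequest struct {
	Q string
}

type queryResponse struct {
	Hits []map[string]interface{} `json:"hits"`
	Err  string                   `json:"error,omitempty"`
}

// NewHTTPHandler exposes the SearchService over HTTP. go-kit's server passes
// the request's context.Context into the endpoint, which is what later allows
// the span injected by otelhttp to reach the service layer.
func NewHTTPHandler(svc service.SearchService, logger *log.Logger) http.Handler {
	queryEndpoint := func(ctx context.Context, request interface{}) (interface{}, error) {
		req := request.(queryRequest)
		hits, err := svc.Query(ctx, req.Q)
		if err != nil {
			return queryResponse{Err: err.Error()}, nil
		}
		return queryResponse{Hits: hits}, nil
	}

	mux := http.NewServeMux()
	mux.Handle("/search", httptransport.NewServer(
		queryEndpoint,
		func(_ context.Context, r *http.Request) (interface{}, error) {
			// Decode the query string parameter used throughout this post.
			return queryRequest{Q: r.URL.Query().Get("q")}, nil
		},
		func(_ context.Context, w http.ResponseWriter, response interface{}) error {
			w.Header().Set("Content-Type", "application/json")
			return json.NewEncoder(w).Encode(response)
		},
	))
	return mux
}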

The initial implementation worked, but it was an observability black box. To fix this, we instrumented it. First, a dedicated tracing.go file to encapsulate the OpenTelemetry SDK setup. This is a critical separation of concerns. The application code shouldn’t be cluttered with tracer initialization logic.

// search_service/pkg/tracing/tracing.go
package tracing

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

// InitTracerProvider initializes and registers an OTLP-based tracer provider.
func InitTracerProvider(ctx context.Context, serviceName string, logger *log.Logger) (*sdktrace.TracerProvider, error) {
	// OTLP/HTTP exporter. The endpoint comes from OTEL_EXPORTER_OTLP_ENDPOINT,
	// which docker-compose points at the Jaeger container.
	exporter, err := otlptracehttp.New(ctx, otlptracehttp.WithInsecure())
	if err != nil {
		return nil, err
	}

	// Resource definition for the service
	res, err := resource.Merge(
		resource.Default(),
		resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String(serviceName),
		),
	)
	if err != nil {
		return nil, err
	}

	// Tracer provider with a batch span processor
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(res),
		sdktrace.WithSampler(sdktrace.AlwaysSample()), // Sample all traces in dev
	)

	// Set the global tracer provider and propagator
	otel.SetTracerProvider(tp)
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(propagation.TraceContext{}, propagation.Baggage{}))

	logger.Println("Tracer provider initialized.")
	return tp, nil
}

This provider was then initialized in main.go. Next, we needed to wrap our HTTP handler to extract the incoming trace context. The go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp package provides the necessary middleware.

// search_service/cmd/main.go
package main

import (
    // ... other imports
	"github.com/oklog/run"
	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	
	"search_service/pkg/service"
	"search_service/pkg/tracing"
	"search_service/pkg/transport"
)

func main() {
    // ... logger setup

	// Initialize OpenTelemetry Tracer Provider
	ctx := context.Background()
	tp, err := tracing.InitTracerProvider(ctx, "search-service", logger)
	if err != nil {
		logger.Fatalf("failed to initialize tracer provider: %v", err)
	}
	defer func() {
		if err := tp.Shutdown(context.Background()); err != nil {
			logger.Printf("error shutting down tracer provider: %v", err)
		}
	}()

	var searchSvc service.SearchService
	searchSvc = service.NewBasicSearchService(logger)
    // Here we will add tracing middleware to the service later

	var h http.Handler
	h = transport.NewHTTPHandler(searchSvc, logger)
    
    // The crucial step: wrap the main handler with OTEL middleware
    // This middleware reads the 'traceparent' header from APISIX
    // and injects the span context into the request's context.Context
    h = otelhttp.NewHandler(h, "http.server")

	var g run.Group
	{
		httpListener, err := net.Listen("tcp", ":8080")
		if err != nil {
			logger.Fatalf("transport=HTTP listen err=%v", err)
		}
		g.Add(func() error {
			logger.Println("transport=HTTP addr=:8080")
			return http.Serve(httpListener, h)
		}, func(error) {
			httpListener.Close()
		})
	}
	// ... signal handling
	logger.Fatal(g.Run())
}

At this point, we ran a test. We sent a request to APISIX, which forwarded it to the service. Jaeger showed two connected spans: one from apisix-gateway and a child span from search-service named http.server. This confirmed that context propagation from the gateway to the service was working correctly.
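The request was sent through the gateway's proxy port (9080) rather than the service's port 8080, so APISIX could start the trace:

# Exercise the full path: APISIX -> search_service -> Algolia
curl -i "http://127.0.0.1:9080/search?q=test"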

However, the trace was incomplete. The http.server span showed a duration, but we had no insight into what happened inside that duration. Specifically, the call to Algolia was invisible. This is a classic pitfall: instrumenting the entry point is necessary but not sufficient. We needed to instrument the Algolia client.

The Algolia Go client is built on a standard http.Client. This is a huge advantage. It means we can inject our own http.RoundTripper (or http.Transport) to intercept outgoing requests. otelhttp provides otelhttp.NewTransport, which does exactly this. It wraps an existing transport, starts a new client span for each outgoing request, and injects the traceparent header.

We modified our service constructor to accept an http.Client, wired the Algolia client to send its requests through it, and updated Query to pass the request context into the Algolia call. Without that context, otelhttp.NewTransport has no parent span to attach to, and the outbound call would show up as a disconnected trace.

// search_service/pkg/service/service.go
package service

import (
	"context"
	"log"
	"net/http" // Import http
	"os"

	"github.com/algolia/algoliasearch-client-go/v3/algolia/search"
	"github.com/algolia/algoliasearch-client-go/v3/algolia/transport"
)

// ... SearchService interface remains the same

type basicSearchService struct {
	// ... other fields
}

// instrumentedRequester satisfies the Algolia client's transport.Requester
// interface by delegating every request to the OpenTelemetry-instrumented
// *http.Client injected from main. The type is ours; the v3 client only asks
// for something implementing Requester.
type instrumentedRequester struct {
	client *http.Client
}

func (r *instrumentedRequester) Request(req *http.Request) (*http.Response, error) {
	return r.client.Do(req)
}

var _ transport.Requester = (*instrumentedRequester)(nil)

// NewBasicSearchService now accepts an instrumented HTTP client.
func NewBasicSearchService(logger *log.Logger, httpClient *http.Client) SearchService {
	appID := os.Getenv("ALGOLIA_APP_ID")
	apiKey := os.Getenv("ALGOLIA_API_KEY")

	// Route all Algolia traffic through the injected client so that every
	// outgoing request gets a client span and a traceparent header from otelhttp.
	cfg := search.Configuration{
		AppID:     appID,
		APIKey:    apiKey,
		Requester: &instrumentedRequester{client: httpClient},
	}
	client := search.NewClientWithConfig(cfg)
	index := client.InitIndex("products")

	return &basicSearchService{
		algoliaClient: client,
		index:         index,
		logger:        logger,
	}
}

// Query now hands the request context to the Algolia call. The v3 client
// accepts a context.Context among its variadic options; without it, the
// instrumented transport would have no parent span to attach to.
func (s *basicSearchService) Query(ctx context.Context, q string) ([]map[string]interface{}, error) {
	res, err := s.index.Search(q, ctx)
	if err != nil {
		s.logger.Printf("service: Algolia search failed: %v", err)
		return nil, err
	}
	return res.Hits, nil
}

And in main.go, we create the instrumented client and pass it in.

// search_service/cmd/main.go

func main() {
    // ... logger and tracer setup

    // Create an HTTP client instrumented with OpenTelemetry.
    // This client will automatically create spans for outgoing requests.
    instrumentedClient := &http.Client{
        Transport: otelhttp.NewTransport(http.DefaultTransport),
    }

	var searchSvc service.SearchService
	searchSvc = service.NewBasicSearchService(logger, instrumentedClient)
    
    // ... rest of the main function
}

After redeploying, the trace in Jaeger was transformed. We now saw three nested spans:

  1. apisix-gateway (root span)
  2. http.server (from our service’s entry point)
  3. HTTP POST (the child span created by otelhttp.NewTransport for the call to Algolia)

This was a massive improvement. We could now precisely measure the latency contribution of each component: the gateway, our service’s internal processing, and the external network call to Algolia.

The final piece of the puzzle was adding custom, business-logic-aware spans and attributes. Tracing the boundaries of network calls is good, but tracing key internal operations is better. We created a tracing middleware for our service layer itself.

// search_service/pkg/service/middleware.go
package service

import (
	"context"
	"fmt"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)

type tracingMiddleware struct {
	next   SearchService
	tracer trace.Tracer
	logger *log.Logger
}

// NewTracingMiddleware wraps a service with tracing capabilities.
func NewTracingMiddleware(logger *log.Logger, s SearchService) SearchService {
	return &tracingMiddleware{
		next:   s,
		tracer: otel.Tracer("search.service"),
		logger: logger,
	}
}

func (mw *tracingMiddleware) Query(ctx context.Context, q string) (hits []map[string]interface{}, err error) {
	// Start a new span for the service layer operation.
	// This becomes a child of the HTTP server span.
	ctx, span := mw.tracer.Start(ctx, "service.Query")
	defer span.End()

	// Add meaningful attributes to the span. These are invaluable for debugging.
	span.SetAttributes(
		attribute.String("search.query", q),
		attribute.Int("search.query_length", len(q)),
	)
	mw.logger.Printf("traceID=%s spanID=%s service.Query started",
		span.SpanContext().TraceID().String(),
		span.SpanContext().SpanID().String(),
	)

	// In a real-world project, you might have more internal spans here.
	// For example, one for input validation, another for result transformation.
	// Deliberately do not overwrite ctx here: the downstream Algolia call
	// should be parented to service.Query, not to this short-lived span.
	_, validationSpan := mw.tracer.Start(ctx, "service.Query.validate")
	if len(q) < 3 {
		err = fmt.Errorf("query must be at least 3 characters long")
		validationSpan.RecordError(err)
		validationSpan.SetStatus(codes.Error, err.Error())
		validationSpan.End()

		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return nil, err
	}
	validationSpan.End()

	// Call the next service in the chain.
	hits, err = mw.next.Query(ctx, q)

	// Record the outcome of the operation on the span.
	if err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
	} else {
		span.SetAttributes(attribute.Int("search.results_count", len(hits)))
		span.SetStatus(codes.Ok, "")
	}

	return hits, err
}

We applied this middleware in main.go:

// search_service/cmd/main.go
func main() {
    // ...
    instrumentedClient := &http.Client{...}

	var searchSvc service.SearchService
	searchSvc = service.NewBasicSearchService(logger, instrumentedClient)
    // Wrap the core service with our tracing middleware
    searchSvc = service.NewTracingMiddleware(logger, searchSvc)

    // ...
}

The final trace provided a complete narrative.

graph TD
    A[Client] --> B(Span 1: APISIX Gateway);
    B --> C(Span 2: Search Service - http.server);
    subgraph Search Service
        C --> D(Span 3: Search Service - service.Query);
        D --> E(Span 4: Search Service - service.Query.validate);
        D --> F(Span 5: Algolia Client - HTTP POST);
    end
    F --> D;
    D --> C;
    C --> B;
    B --> A;

A request for /search?q=test now generated a rich trace in Jaeger. We could see the total duration at APISIX, the time spent inside our Go service handler, the duration of the specific service.Query method call, the time for the validation sub-task, and the exact network latency of the Algolia API call. The query “test” and the number of results were attached as queryable attributes. We had moved from a state of complete opacity to one of deep, actionable insight. The original P95 latency issue could now be diagnosed in minutes, not hours.

This implementation, while functional, operates with a 100% sampling rate, which is unsustainable in production due to cost and performance overhead. The logical next step is to implement more sophisticated sampling, likely head-based sampling at the gateway (e.g., sample 5% of all requests) or, even better, a tail-based sampling strategy using an OpenTelemetry Collector. Tail-based sampling allows decisions to be made with the full trace context, enabling policies like “sample 100% of traces with errors, and 1% of the rest.” Furthermore, while we’ve traced the system, the next maturity level involves correlating these traces with structured logs (by injecting the trace_id and span_id into log entries) and relevant metrics, providing a truly unified observability platform.
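On the Go side, the head-based option is a one-line change in tracing.go. A minimal sketch, assuming we want the service to honor the sampling decision APISIX propagates and fall back to a ratio for locally rooted traces (the 0.05 fraction is illustrative):

// Replace the development-only AlwaysSample in tracing.go with a parent-based
// sampler: respect the sampled flag carried in traceparent, and sample 5% of
// traces that start in this service.
tp := sdktrace.NewTracerProvider(
	sdktrace.WithBatcher(exporter),
	sdktrace.WithResource(res),
	sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.05))),
)

The log-correlation half is already seeded in the tracing middleware above, which prints the trace and span IDs alongside each log line.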

