Implementing Distributed Tracing for Node.js gRPC Services Within a VPC


The system went dark. A user-facing request to our api-gateway service timed out. The logs for that service showed a request came in, and an outbound gRPC call was made to user-service. The logs for user-service showed nothing. Between the two services, running in our private VPC, the request simply vanished. Without a holistic view of the request’s lifecycle across our Node.js microservices, we were effectively debugging with isolated, context-less log entries—a painfully slow and inefficient process. This incident was the final catalyst for building a proper distributed tracing pipeline.

Our architecture is standard for many cloud-native applications: multiple Node.js services communicating exclusively over gRPC for performance, all isolated from the public internet within a VPC for security. This setup is efficient and secure, but it creates an observability black box. The challenge was to inject visibility into this system without compromising the existing architecture or littering our business logic with tracing code.

The immediate technology choices were clear. OpenTelemetry stood out as the vendor-neutral standard for instrumentation. For injecting the tracing logic into our gRPC calls, interceptors are the only sane, non-invasive approach. They act as middleware for RPCs, allowing us to wrap every request and response with the necessary context-propagation logic. The goal was to trace a request from the moment it hits our gateway, through every subsequent internal gRPC call, and see it as a single, unified trace.

The Foundation: A Multi-Service gRPC Application

Before instrumenting anything, we need a baseline application that simulates our production environment. This consists of three services:

  1. API Gateway (gateway-service): A public-facing (within the VPC) service that receives HTTP requests and orchestrates calls to internal services.
  2. User Service (user-service): Manages user data.
  3. Auth Service (auth-service): A dependency of the User Service, responsible for validating credentials.

All inter-service communication is via gRPC. Let’s start with the Protocol Buffer definitions.

protos/user.proto

syntax = "proto3";

package user;

service UserService {
  rpc GetUser(GetUserRequest) returns (GetUserResponse);
}

message GetUserRequest {
  string user_id = 1;
}

message GetUserResponse {
  string user_id = 1;
  string name = 2;
  string email = 3;
}

protos/auth.proto

syntax = "proto3";

package auth;

service AuthService {
  rpc ValidateCredentials(ValidateCredentialsRequest) returns (ValidateCredentialsResponse);
}

message ValidateCredentialsRequest {
  string user_id = 1;
}

message ValidateCredentialsResponse {
  bool is_valid = 1;
}

This defines the contract between our services. The implementation follows a typical pattern for gRPC in Node.js using @grpc/grpc-js and @grpc/proto-loader.

Here’s the core server logic for user-service, which itself acts as a client to auth-service. This nested dependency is crucial for demonstrating multi-hop trace propagation.

user-service/server.js

const path = require('path');
const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

// Configuration
const USER_PROTO_PATH = path.join(__dirname, '../protos/user.proto');
const AUTH_PROTO_PATH = path.join(__dirname, '../protos/auth.proto');
const AUTH_SERVICE_ADDRESS = process.env.AUTH_SERVICE_ADDRESS || 'localhost:50052';
const PORT = process.env.PORT || 50051;

// Load Protobufs
const userPackageDefinition = protoLoader.loadSync(USER_PROTO_PATH, {
  keepCase: true,
  longs: String,
  enums: String,
  defaults: true,
  oneofs: true,
});
const authPackageDefinition = protoLoader.loadSync(AUTH_PROTO_PATH, {
  keepCase: true,
  longs: String,
  enums: String,
  defaults: true,
  oneofs: true,
});

const user_proto = grpc.loadPackageDefinition(userPackageDefinition).user;
const auth_proto = grpc.loadPackageDefinition(authPackageDefinition).auth;

// gRPC Client for Auth Service
const authClient = new auth_proto.AuthService(AUTH_SERVICE_ADDRESS, grpc.credentials.createInsecure());

// Service Implementation
const getUser = (call, callback) => {
  console.log(`[User-Service] Received GetUser request for user_id: ${call.request.user_id}`);

  // In a real-world project, this is where you'd validate the request.
  if (!call.request.user_id) {
    return callback({
      code: grpc.status.INVALID_ARGUMENT,
      message: 'User ID is required.',
    });
  }

  // Call the Auth service to validate
  authClient.validateCredentials({ user_id: call.request.user_id }, (err, response) => {
    if (err) {
      console.error('[User-Service] Error calling Auth service:', err);
      return callback({
        code: grpc.status.INTERNAL,
        message: 'Failed to communicate with authentication service.',
      });
    }

    if (!response.is_valid) {
      console.warn(`[User-Service] Auth service reported invalid credentials for user_id: ${call.request.user_id}`);
      return callback({
        code: grpc.status.UNAUTHENTICATED,
        message: 'Invalid credentials.',
      });
    }

    console.log(`[User-Service] Credentials validated for user_id: ${call.request.user_id}`);
    
    // Mock user data lookup
    const user = {
      user_id: call.request.user_id,
      name: 'Jane Doe',
      email: 'jane.doe@example.com',
    };

    callback(null, user);
  });
};

// Start the server
function main() {
  const server = new grpc.Server();
  server.addService(user_proto.UserService.service, { getUser });
  server.bindAsync(`0.0.0.0:${PORT}`, grpc.ServerCredentials.createInsecure(), (err, port) => {
    if (err) {
      console.error('Failed to bind server:', err);
      return;
    }
    console.log(`[User-Service] gRPC server listening on 0.0.0.0:${port}`);
    server.start();
  });
}

main();
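
The auth-service mirrors this structure. Its implementation isn't central to the incident write-up, so here is a minimal sketch of what its server might look like; the mock validation logic and exact file layout are assumptions.

auth-service/server.js

const path = require('path');
const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

// Configuration
const AUTH_PROTO_PATH = path.join(__dirname, '../protos/auth.proto');
const PORT = process.env.PORT || 50052;

// Load Protobuf
const authPackageDefinition = protoLoader.loadSync(AUTH_PROTO_PATH, {
  keepCase: true,
  longs: String,
  enums: String,
  defaults: true,
  oneofs: true,
});
const auth_proto = grpc.loadPackageDefinition(authPackageDefinition).auth;

// Service Implementation: treat any non-empty user_id as valid (mock logic)
const validateCredentials = (call, callback) => {
  console.log(`[Auth-Service] Validating credentials for user_id: ${call.request.user_id}`);
  callback(null, { is_valid: Boolean(call.request.user_id) });
};

// Start the server
function main() {
  const server = new grpc.Server();
  server.addService(auth_proto.AuthService.service, { validateCredentials });
  server.bindAsync(`0.0.0.0:${PORT}`, grpc.ServerCredentials.createInsecure(), (err, port) => {
    if (err) {
      console.error('Failed to bind server:', err);
      return;
    }
    console.log(`[Auth-Service] gRPC server listening on 0.0.0.0:${port}`);
    server.start();
  });
}

main();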

Without tracing, each service is an island. A failure in the authClient.validateCredentials call would log an error in user-service, but we wouldn’t know which initial request to the gateway-service triggered it.
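
For completeness, here is a minimal sketch of the gateway's HTTP front door over its gRPC client. The real service may well use Express; this version sticks to Node's built-in http module, and the route matching and error handling are assumptions.

gateway-service/server.js

const path = require('path');
const http = require('http');
const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

// Configuration
const USER_PROTO_PATH = path.join(__dirname, '../protos/user.proto');
const USER_SERVICE_ADDRESS = process.env.USER_SERVICE_ADDRESS || 'localhost:50051';
const PORT = process.env.PORT || 8080;

// Load Protobuf and create the gRPC client for the User service
const userPackageDefinition = protoLoader.loadSync(USER_PROTO_PATH, {
  keepCase: true,
  longs: String,
  enums: String,
  defaults: true,
  oneofs: true,
});
const user_proto = grpc.loadPackageDefinition(userPackageDefinition).user;
const userClient = new user_proto.UserService(USER_SERVICE_ADDRESS, grpc.credentials.createInsecure());

// GET /user/:id -> gRPC GetUser on user-service
const server = http.createServer((req, res) => {
  const match = req.url.match(/^\/user\/([^/]+)$/);
  if (!match) {
    res.writeHead(404, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({ error: 'Not found' }));
    return;
  }

  userClient.getUser({ user_id: match[1] }, (err, response) => {
    if (err) {
      console.error('[Gateway-Service] Error calling User service:', err);
      res.writeHead(502, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ error: 'Upstream failure' }));
      return;
    }
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify(response));
  });
});

server.listen(PORT, () => console.log(`[Gateway-Service] HTTP server listening on port ${PORT}`));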

Instrumenting with OpenTelemetry

The first step is to establish a standardized way to create, manage, and export trace data. We’ll create a shared tracing.js module that each service can initialize. This is a critical design decision in a real-world project; centralizing the tracing setup ensures consistency and simplifies updates.

common/tracing.js

const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const { GrpcInstrumentation } = require('@opentelemetry/instrumentation-grpc');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');

// A common mistake is not providing a service name.
// Without it, it's impossible to distinguish traces from different services.
const initTracer = (serviceName) => {
  // The OTLP (OpenTelemetry Protocol) exporter sends data to a collector.
  // In our VPC setup, this will be an OTEL Collector instance running
  // as another service in the same network.
  const traceExporter = new OTLPTraceExporter({
    // URL is configurable via environment variables.
    // Default points to the standard OTLP gRPC port.
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317',
  });

  const sdk = new NodeSDK({
    resource: new Resource({
      [SemanticResourceAttributes.SERVICE_NAME]: serviceName,
    }),
    traceExporter,
    // Auto-instrumentation is powerful but can feel like magic.
    // It works by patching the `require` calls of popular libraries.
    // Here we enable it for gRPC and HTTP.
    instrumentations: [new GrpcInstrumentation(), new HttpInstrumentation()],
  });

  // Graceful shutdown
  process.on('SIGTERM', () => {
    sdk.shutdown()
      .then(() => console.log('Tracing terminated'))
      .catch((error) => console.log('Error terminating tracing', error))
      .finally(() => process.exit(0));
  });
  
  try {
    sdk.start();
    console.log(`Tracing initialized for service: ${serviceName}`);
  } catch (error) {
    console.log('Error initializing tracing', error);
  }

  return sdk;
};

module.exports = { initTracer };

To use this, we initialize it at the very top of each service’s entry point (server.js), before @grpc/grpc-js or any other instrumented module is required, so the auto-instrumentation can patch those libraries as they load.

For user-service:

// At the top of user-service/server.js
const { initTracer } = require('../common/tracing');
initTracer('user-service');

// ... rest of the server code

For auth-service:

// At the top of auth-service/server.js
const { initTracer } = require('../common/tracing');
initTracer('auth-service');

// ... rest of the server code
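
And for gateway-service, assuming the same project layout:

// At the top of gateway-service/server.js
const { initTracer } = require('../common/tracing');
initTracer('gateway-service');

// ... rest of the server code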

This automatic instrumentation is what handles the magic of context propagation. When user-service makes a gRPC call to auth-service, the @opentelemetry/instrumentation-grpc package automatically:

  1. Retrieves the active trace context (which includes the traceId and parent spanId).
  2. Serializes this context and injects it into the gRPC metadata of the outgoing request, typically using the W3C Trace Context headers (traceparent).
  3. On the receiving end (auth-service), the instrumentation extracts this context from the metadata and creates a new “child” span, linking it to the parent span from user-service.

This happens transparently, without any changes to our business logic code.
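
To make the mechanism concrete, here is a minimal illustrative sketch of peeking at the propagated header inside a receiving handler. The instrumentation already consumes this value for you, and note that grpc.Metadata.get returns an array of values.

// Inside the auth-service handler, purely for illustration
const validateCredentials = (call, callback) => {
  // W3C Trace Context value injected on the user-service side,
  // e.g. "00-<32-hex-trace-id>-<16-hex-span-id>-01"
  const [traceparent] = call.metadata.get('traceparent');
  console.log('[Auth-Service] Incoming traceparent:', traceparent);

  // ... normal validation logic continues as before
  callback(null, { is_valid: Boolean(call.request.user_id) });
};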

Simulating the VPC with Docker Compose

To test this in an environment that mimics a VPC, we use Docker Compose. It creates a dedicated network, and services can only communicate using their service names, just like with internal DNS in a real VPC. We also need an OpenTelemetry Collector and a backend like Jaeger to visualize the traces.

graph TD
    subgraph "Virtual Private Cloud (Simulated by Docker Network)"
        A[External Request] --> B(gateway-service);
        B --gRPC call--> C(user-service);
        C --gRPC call--> D(auth-service);
        
        subgraph "Observability Stack"
            B --OTLP--> E(otel-collector);
            C --OTLP--> E;
            D --OTLP--> E;
            E --Export--> F(jaeger-all-in-one);
        end
    end
    
    G[Developer] -->|View Traces| F;

Here’s the configuration for the collector. It receives traces via OTLP gRPC and exports them to Jaeger. This is a critical piece of infrastructure. In a production environment, the collector can also be used for batching, sampling, and adding resource attributes before exporting to a managed backend.

otel-collector-config.yaml

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:

exporters:
  # The dedicated Jaeger exporter has been removed from recent collector
  # releases, so we ship spans to Jaeger over OTLP, which it accepts natively.
  otlp/jaeger:
    endpoint: jaeger-all-in-one:4317
    tls:
      insecure: true
  logging:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger, logging]

And the docker-compose.yml that ties it all together.

docker-compose.yml

version: '3.8'

services:
  gateway-service:
    build: ./gateway-service
    ports:
      - "8080:8080"
    environment:
      - PORT=8080
      - USER_SERVICE_ADDRESS=user-service:50051
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
    depends_on:
      - user-service
      - otel-collector
    networks:
      - app-net

  user-service:
    build: ./user-service
    environment:
      - PORT=50051
      - AUTH_SERVICE_ADDRESS=auth-service:50052
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
    depends_on:
      - auth-service
      - otel-collector
    networks:
      - app-net

  auth-service:
    build: ./auth-service
    environment:
      - PORT=50052
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
    depends_on:
      - otel-collector
    networks:
      - app-net

  otel-collector:
    image: otel/opentelemetry-collector:0.87.0
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317" # OTLP gRPC
      - "4318:4318" # OTLP HTTP
    networks:
      - app-net

  jaeger-all-in-one:
    image: jaegertracing/all-in-one:1.49
    environment:
      - COLLECTOR_OTLP_ENABLED=true # accept OTLP directly from the collector
    ports:
      - "16686:16686" # Jaeger UI
    networks:
      - app-net

networks:
  app-net:
    driver: bridge

After running docker-compose up --build, we can send a request to the gateway:
curl http://localhost:8080/user/123

Navigating to the Jaeger UI at http://localhost:16686, we’ll find a single trace spanning all three services, including:

  1. HTTP GET /user/:id from gateway-service.
  2. grpc.user.UserService/GetUser from user-service, correctly shown as a child of the gateway span.
  3. grpc.auth.AuthService/ValidateCredentials from auth-service, correctly shown as a child of the user-service span.

The black box is gone. We can now see the exact duration of each step and pinpoint which service is responsible for any latency or errors.

The Problem with “Magic”: Manual Interceptors for Control

Auto-instrumentation is excellent for getting started, but in a production system, you often need more control. What if we want to add custom attributes to a span? Or what if a library we use isn’t supported by auto-instrumentation? This is where manual instrumentation via custom interceptors becomes necessary. Understanding how to write them is crucial for any senior engineer working with gRPC.

Let’s write a client interceptor to demonstrate the principle. This interceptor will manually propagate the OpenTelemetry context. While the auto-instrumentation already does this, building it ourselves reveals the underlying mechanism.

common/interceptors.js

const grpc = require('@grpc/grpc-js');
const { context, propagation } = require('@opentelemetry/api');

/**
 * A gRPC client interceptor that injects the current OpenTelemetry
 * context into the outgoing request metadata.
 */
const otelClientInterceptor = (options, nextCall) => {
  return new grpc.InterceptingCall(nextCall(options), {
    start: function(metadata, listener, next) {
      // Get the current active context. If there's an active span,
      // this context will contain its information.
      const activeContext = context.active();
      
      // The propagator's `inject` method serializes the context into the
      // outgoing carrier. grpc.Metadata is not a plain object, so we pass a
      // setter that writes through its `set` method; the metadata object is
      // modified in place.
      propagation.inject(activeContext, metadata, {
        set: (carrier, key, value) => carrier.set(key, value),
      });
      
      console.log('[OtelClientInterceptor] Injected context into metadata.');

      // Proceed with the original `start` call.
      next(metadata, listener);
    },
  });
};

module.exports = { otelClientInterceptor };

To use this, we would modify our gRPC client creation in user-service to include the interceptor.

Modified user-service/server.js client creation

// ... imports
const { otelClientInterceptor } = require('../common/interceptors');

// ...
// gRPC Client for Auth Service with Interceptor
const authClient = new auth_proto.AuthService(
  AUTH_SERVICE_ADDRESS,
  grpc.credentials.createInsecure(),
  { interceptors: [otelClientInterceptor] }
);
// ...

In a real-world scenario, you wouldn’t rewrite the context propagation logic. Instead, you’d use interceptors for tasks like:

  • Adding business-specific attributes to spans: For example, adding user.id or tenant.id to every span for better filterability (see the sketch after this list).
  • Creating child spans for internal operations: If a single RPC handler performs several distinct, time-consuming steps (e.g., cache check, database query, external API call), you can create a child span for each to get a more granular performance breakdown.
  • Implementing custom metrics: Counting RPCs, measuring payload sizes, and emitting them as OpenTelemetry metrics.
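
Here is a minimal sketch of the first two items, using the @opentelemetry/api package inside the user-service handler; the attribute keys and span name are illustrative.

const { trace } = require('@opentelemetry/api');

const tracer = trace.getTracer('user-service');

const getUser = (call, callback) => {
  // Enrich the span the auto-instrumentation created for this RPC
  // with business context.
  const rpcSpan = trace.getActiveSpan();
  rpcSpan?.setAttribute('user.id', call.request.user_id);

  // Wrap a distinct internal step in its own child span for a
  // finer-grained performance breakdown.
  tracer.startActiveSpan('user-db-lookup', (span) => {
    try {
      // ... perform the (mock) user lookup here ...
    } finally {
      span.end();
    }
  });

  // ... rest of the handler as before
};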

The key takeaway is that interceptors provide a clean, aspect-oriented way to enhance RPCs without touching the core business logic.

Lingering Issues and Future Iterations

This implementation provides a solid foundation for tracing within a VPC, but it’s not the end of the story. A production-grade observability system requires more. The current setup only traces gRPC and HTTP communication; database queries, cache interactions, and message queue operations are still invisible. The next logical step would be to add their respective instrumentations (@opentelemetry/instrumentation-pg, @opentelemetry/instrumentation-redis, etc.) to the tracing.js module to achieve a truly full-stack trace.
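
The change would be confined to common/tracing.js. A sketch, assuming those instrumentation packages are installed as dependencies:

// common/tracing.js (additions)
const { PgInstrumentation } = require('@opentelemetry/instrumentation-pg');
const { RedisInstrumentation } = require('@opentelemetry/instrumentation-redis');

// ... inside initTracer, extend the SDK configuration:
const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: serviceName,
  }),
  traceExporter,
  instrumentations: [
    new GrpcInstrumentation(),
    new HttpInstrumentation(),
    new PgInstrumentation(),
    new RedisInstrumentation(),
  ],
});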

Furthermore, we’re currently tracing every single request. For a high-traffic service, this would be prohibitively expensive in terms of performance overhead and storage costs for the trace data. The solution is to implement a sampling strategy. This could be a simple probability-based sampler configured in the Node.js SDK (TraceIdRatioBasedSampler) or a more sophisticated, tail-based sampling strategy configured in the OpenTelemetry Collector, which makes sampling decisions based on the complete trace after all spans have been received.

Finally, tracing is only one of the three pillars of observability. To fully debug complex issues, traces must be correlated with metrics and logs. This involves configuring the collector to scrape Prometheus metrics from our services and setting up a logging pipeline (e.g., Fluentd) that ensures every log line is enriched with the corresponding traceId and spanId from the active context. This allows an engineer to jump directly from a slow span in a trace to the exact logs and system metrics from that point in time, providing the complete context needed for rapid-fire debugging.
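
Even before a full logging pipeline exists, the active context can be read directly from the API. A sketch of a hypothetical helper that stamps each log line with the current IDs:

const { trace } = require('@opentelemetry/api');

// Hypothetical helper: append the active span's traceId/spanId to each line.
const logWithTrace = (message) => {
  const spanContext = trace.getActiveSpan()?.spanContext();
  const suffix = spanContext
    ? ` traceId=${spanContext.traceId} spanId=${spanContext.spanId}`
    : '';
  console.log(`${new Date().toISOString()} ${message}${suffix}`);
};

// Example: inside the user-service handler
logWithTrace('[User-Service] Credentials validated');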

