Achieving Correlated Full-Stack Tracing in an Event Sourced C# System with Linkerd and Pinia


The system went live, and for a few weeks, everything was deceptively calm. We had built a reactive UI using Vue and Pinia, backed by a set of C# microservices following a strict CQRS and Event Sourcing pattern. Then the first critical bug report arrived: “I updated the product name, the screen spun for a second, then went back to the old name. Nothing changed.” The user’s session logs showed a successful HTTP 202 Accepted response from our API. The command service logs showed the command was validated and an event was persisted. The projection service logs, hours later, showed no record of processing that specific event. Somewhere, in the asynchronous void between the command being handled and the projection being updated, a message was lost. Finding the root cause took two engineers half a day of manually correlating timestamps across five different log streams. This was unsustainable. Our distributed architecture had become a black box, and we needed a light.

Our initial goal was simple: achieve a single, unified trace for any user interaction, from the moment a button is clicked in the browser to the final database write in our read model projector. The technology stack was chosen for specific reasons. C# with Event Sourcing gave us a full audit log and the ability to rebuild state, which was a business requirement. Pinia offered a clean, centralized state management pattern for our complex Vue front-end. The problem was the connective tissue. The system’s very nature—decoupled services communicating asynchronously over a message bus—was what made it so difficult to observe.

This led us to Linkerd. The promise of a service mesh was compelling: full observability with zero code changes. It would transparently inject a proxy sidecar into our Kubernetes pods, handle mTLS, and capture golden metrics for all traffic; with the linkerd-jaeger extension enabled, its proxies would also emit spans for the HTTP and gRPC requests they carried. We decided to adopt it, expecting it to solve our observability problem out of the box.

Initial Architecture and the Linkerd Honeymoon

Our core architecture was straightforward for a CQRS system.

sequenceDiagram
    participant User as Browser (Pinia)
    participant GW as API Gateway
    participant CmdSvc as Command Service (C#)
    participant EStore as Event Store
    participant Bus as Message Bus (RabbitMQ)
    participant ProjSvc as Projection Service (C#)
    participant ReadDB as Read Model DB
    participant Notifier as Notification Service

    User->>+GW: POST /products/update-name (Command)
    GW->>+CmdSvc: Forward Request
    CmdSvc->>+EStore: Persist(ProductNameUpdated Event)
    EStore-->>-CmdSvc: Success
    CmdSvc->>Bus: Publish(ProductNameUpdated)
    CmdSvc-->>-GW: HTTP 202 Accepted
    GW-->>-User: HTTP 202 Accepted

    Bus->>ProjSvc: Deliver(ProductNameUpdated)
    ProjSvc->>+ReadDB: Update Product Read Model
    ReadDB-->>-ProjSvc: Success
    ProjSvc->>Notifier: Notify(productId)
    Notifier-->>User: WebSocket push

The C# services were standard ASP.NET Core applications. Here’s a stripped-down version of the ProductCommandService.

// Program.cs - Command Service
using System.Diagnostics;
using MediatR;
using Serilog; // provides the static Log used below

var builder = WebApplication.CreateBuilder(args);
// ... DI setup for MediatR, EventStore, MessageBus

var app = builder.Build();
app.MapPost("/products/{id}/changename", async (Guid id, ChangeProductName command, IMediator mediator) =>
{
    // The Activity.Current here is automatically populated by ASP.NET Core
    // from incoming HTTP headers (like b3 or traceparent)
    // This is the "magic" of framework integration with System.Diagnostics.
    Log.Information("Handling ChangeProductName command. TraceId: {TraceId}", Activity.Current?.Id);
    await mediator.Send(new ChangeProductNameCommand(id, command.NewName));
    return Results.Accepted();
});
app.Run();

// ChangeProductNameCommandHandler.cs
public class ChangeProductNameCommandHandler : IRequestHandler<ChangeProductNameCommand>
{
    private readonly IEventStoreRepository<Product> _repository;
    private readonly IMessageBus _messageBus;

    public ChangeProductNameCommandHandler(IEventStoreRepository<Product> repository, IMessageBus messageBus)
    {
        _repository = repository;
        _messageBus = messageBus;
    }

    public async Task<Unit> Handle(ChangeProductNameCommand request, CancellationToken cancellationToken)
    {
        var product = await _repository.LoadAsync(request.Id);
        if (product == null)
        {
            throw new ProductNotFoundException(request.Id);
        }

        // Business logic creates the event
        var @event = product.ChangeName(request.NewName);

        // Persist the event to our event store
        await _repository.SaveAsync(product);

        // This is the critical point where the trace context is often lost
        await _messageBus.PublishAsync(@event);

        return Unit.Value;
    }
}

Deploying this to Kubernetes and injecting Linkerd was trivial. We added an annotation to our deployment manifests:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: command-service
spec:
  template:
    metadata:
      annotations:
        linkerd.io/inject: enabled
    # ... rest of the manifest

Instantly, we got results. Using the linkerd viz CLI, we could see traffic flowing between the API Gateway and the Command Service. We could see success rates, latencies, and, crucially, traces for the synchronous HTTP leg of the request. It felt like a massive win. A user action in the browser would trigger a Pinia store action, which made an Axios request. A trace would be started at the ingress, the Linkerd proxies would contribute their spans, and ASP.NET Core would pick the propagated context headers back up in the Command Service. Everything up to the _messageBus.PublishAsync(@event) call was part of a single, beautiful trace.

And then, nothing. The trace ended there. The Projection Service would later process the message, but it would start a brand new trace with no connection to the original user request. We had only solved half the problem.

The Asynchronous Chasm and the Failure of “Zero-Code”

The core issue is that Linkerd, and service meshes in general, operate at L4/L7. They understand HTTP, gRPC, and raw TCP streams; they do not, by default, understand the semantics of a message queue protocol like AMQP. The Linkerd proxy on the Command Service pod sees an outbound TCP connection to RabbitMQ, and the proxy on the Projection Service pod sees an inbound TCP connection. It has no way of knowing that a message consumed by the second service is causally related to a message produced by the first. The trace context, which lives in HTTP headers (traceparent and tracestate for W3C Trace Context, or x-b3-traceid and x-b3-spanid in B3 setups), is never transferred to the AMQP message’s headers.
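
For reference, the context that has to survive this hop is tiny. With ASP.NET Core’s default W3C ID format, Activity.Current.Id is already a complete traceparent string, so it can be copied into a message header verbatim and parsed back on the other side; a minimal illustration:

using System.Diagnostics;

// A W3C traceparent is "version-traceid-parentspanid-flags":
const string traceparent = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";

// The consumer side can turn it back into an ActivityContext:
if (ActivityContext.TryParse(traceparent, traceState: null, out var parentContext))
{
    Console.WriteLine(parentContext.TraceId); // 4bf92f3577b34da6a3ce929d0e0e4736
}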

This isn’t a failure of Linkerd; it’s a fundamental reality of distributed systems. The “zero-code” promise only applies to the protocols the mesh understands natively. For everything else, the responsibility for context propagation falls back to the application developer.

Our task was now clear: we had to manually grab the tracing context before publishing a message and re-inject it before consuming a message. This had to be done reliably, without polluting our core domain logic.

First, we defined a standard message envelope. Instead of just publishing the raw event, we’d wrap it in an object that contained metadata.

// MessageEnvelope.cs
public class MessageEnvelope<T>
{
    public T Payload { get; }
    public Dictionary<string, string> Metadata { get; }

    public MessageEnvelope(T payload, Dictionary<string, string> metadata)
    {
        Payload = payload;
        Metadata = metadata ?? new Dictionary<string, string>();
    }
}
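
The consumer code later in this post unwraps the deserialized envelope through a non-generic IEnvelope view, so it can reach the payload without knowing the concrete event type at compile time. The original interface isn’t shown, but under that assumption it needs little more than this, with MessageEnvelope<T> declaring that it implements it:

// IEnvelope.cs - non-generic view of MessageEnvelope<T> (assumed; the later
// `as IEnvelope` cast in the consumer relies on MessageEnvelope<T> implementing it).
public interface IEnvelope
{
    object Payload { get; }
    IReadOnlyDictionary<string, string> Metadata { get; }
}

// e.g. added to MessageEnvelope<T>:
//   object IEnvelope.Payload => Payload!;
//   IReadOnlyDictionary<string, string> IEnvelope.Metadata => Metadata;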

Next, we created a wrapper around our IMessageBus implementation that would handle the context propagation transparently. This is a crucial design decision in a real-world project. By using the Decorator pattern, we keep the observability concern separate from the raw message publishing logic.

// TracingMessageBusDecorator.cs
using System.Diagnostics;
using System.Text.Json;

public class TracingMessageBusDecorator : IMessageBus
{
    private readonly IMessageBus _innerBus;
    private readonly ILogger<TracingMessageBusDecorator> _logger;

    public TracingMessageBusDecorator(IMessageBus innerBus, ILogger<TracingMessageBusDecorator> logger)
    {
        _innerBus = innerBus;
        _logger = logger;
    }

    public async Task PublishAsync<T>(T message) where T : class
    {
        var metadata = new Dictionary<string, string>();
        
        // This is the core logic: extract the current activity's context.
        // Activity.Current is the magic context object managed by System.Diagnostics.
        // It's populated by ASP.NET Core on ingress and we need to carry it forward.
        if (Activity.Current != null)
        {
            metadata["traceparent"] = Activity.Current.Id;
            if (Activity.Current.TraceStateString != null)
            {
                metadata["tracestate"] = Activity.Current.TraceStateString;
            }

            // We can also propagate any baggage items.
            foreach (var baggage in Activity.Current.Baggage)
            {
                metadata[baggage.Key] = baggage.Value;
            }
        }

        _logger.LogInformation("Publishing message with trace context: {TraceParent}", metadata.GetValueOrDefault("traceparent"));
        
        var envelope = new MessageEnvelope<T>(message, metadata);
        await _innerBus.PublishAsync(envelope);
    }
}

We then registered this decorator in our Dependency Injection container. The ChangeProductNameCommandHandler requires no changes; it still just injects IMessageBus, but now it gets our tracing-aware version.

// In Program.cs for the Command Service
// Assume RabbitMqMessageBus is the concrete implementation
builder.Services.AddSingleton<RabbitMqMessageBus>();
builder.Services.AddSingleton<IMessageBus>(provider => 
    new TracingMessageBusDecorator(
        provider.GetRequiredService<RabbitMqMessageBus>(),
        provider.GetRequiredService<ILogger<TracingMessageBusDecorator>>()
    )
);
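
One thing the decorator deliberately does not do is talk to RabbitMQ: it only fills the envelope’s metadata. The concrete bus still has to copy those entries onto the AMQP message so the consumer can read them back from BasicProperties.Headers. A minimal sketch of that half, assuming the RabbitMQ.Client library (the exchange name and connection setup are illustrative):

// RabbitMqMessageBus.cs (sketch) - lifts envelope metadata into AMQP headers
using System.Text;
using System.Text.Json;
using RabbitMQ.Client;

public class RabbitMqMessageBus : IMessageBus
{
    private readonly IModel _channel; // created from an IConnection during startup (omitted)

    public RabbitMqMessageBus(IModel channel) => _channel = channel;

    public Task PublishAsync<T>(T message) where T : class
    {
        var props = _channel.CreateBasicProperties();
        props.Headers = new Dictionary<string, object>();

        // If the message is an envelope, copy traceparent/tracestate/baggage/etc.
        // into the AMQP headers — this is what the consumer's GetHeader() reads.
        if (message is IEnvelope envelope)
        {
            foreach (var (key, value) in envelope.Metadata)
            {
                props.Headers[key] = Encoding.UTF8.GetBytes(value);
            }
        }

        var body = JsonSerializer.SerializeToUtf8Bytes(message, message.GetType());
        _channel.BasicPublish(exchange: "product_events",      // illustrative exchange name
                              routingKey: string.Empty,
                              basicProperties: props,
                              body: body);
        return Task.CompletedTask;
    }
}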

On the consumer side (the Projection Service), we need to do the reverse. We need middleware in our message consumption pipeline that reads the headers and reconstitutes the Activity context before the actual message handler is invoked.

// RabbitMqConsumerService.cs - RabbitMQ consumer hosted in the Projection Service
using System.Diagnostics;
using System.Text;
using System.Text.Json;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;
using Serilog;

public class RabbitMqConsumerService : BackgroundService
{
    private readonly IConnection _connection;
    private readonly IModel _channel;
    private readonly IServiceProvider _serviceProvider;

    // One ActivitySource per process, not per message. Something must listen to
    // this source (e.g. the OpenTelemetry SDK) or StartActivity will return null.
    private static readonly ActivitySource ConsumerSource = new("ProjectionService.MessageConsumer");

    // ... connection setup in constructor ...

    protected override Task ExecuteAsync(CancellationToken stoppingToken)
    {
        var consumer = new EventingBasicConsumer(_channel);
        consumer.Received += async (model, ea) =>
        {
            var body = ea.Body.ToArray();
            var messageJson = Encoding.UTF8.GetString(body);
            
            // This is where we re-establish the trace
            await ProcessMessageWithTraceContext(messageJson, ea.BasicProperties.Headers);
        };

        // autoAck:true keeps the example short; in production use manual acks so a failed
        // handler leads to redelivery or a dead-letter queue instead of a silently lost message.
        _channel.BasicConsume(queue: "product_events", autoAck: true, consumer: consumer);
        return Task.CompletedTask;
    }

    private async Task ProcessMessageWithTraceContext(string messageJson, IDictionary<string, object> headers)
    {
        // A common mistake is to process the message and then start the activity.
        // The activity must be started *before* any work is done, so that all
        // subsequent logging and downstream calls are part of this trace span.

        var parentContext = new ActivityContext();
        string traceParent = GetHeader(headers, "traceparent");
        string traceState = GetHeader(headers, "tracestate");

        if (!string.IsNullOrEmpty(traceParent))
        {
            // Attempt to parse the W3C Trace Context from the headers.
            if (ActivityContext.TryParse(traceParent, traceState, out var context))
            {
                parentContext = context;
            }
        }

        // Start a new Activity, but link it to the parent context from the message.
        // (Remember: StartActivity returns null unless a listener such as the OpenTelemetry
        // SDK is subscribed to ConsumerSource — see the registration sketch below.)
        using var activity = ConsumerSource.StartActivity("ProcessProductEvent", ActivityKind.Consumer, parentContext);

        try
        {
            // Now, inside this 'using' block, Activity.Current is correctly set.
            // Any logging or database calls made by the handler will be automatically
            // correlated with this activity and its parent trace.
            Log.Information("Processing message with rehydrated TraceId: {TraceId}", activity?.Id);

            // Deserialize the *envelope*, not the raw event. Resolving the concrete
            // event type (e.g. from a type header on the message) is elided here.
            var envelopeType = typeof(MessageEnvelope<>).MakeGenericType(ResolveEventType()); // simplified
            var envelope = JsonSerializer.Deserialize(messageJson, envelopeType) as IEnvelope
                           ?? throw new InvalidOperationException("Message is not a MessageEnvelope<T>.");
            
            using var scope = _serviceProvider.CreateScope();
            var handler = scope.ServiceProvider.GetRequiredService<IEventHandler>();
            await handler.HandleAsync(envelope.Payload);
        }
        catch (Exception ex)
        {
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            // RecordException is an extension method from the OpenTelemetry.Api package;
            // newer .NET versions also offer a built-in Activity.AddException.
            activity?.RecordException(ex);
            // In a real project, implement retry/dead-letter queue logic here.
            throw;
        }
    }
    
    private string GetHeader(IDictionary<string, object> headers, string key)
    {
        if (headers != null && headers.TryGetValue(key, out var value))
        {
            return value is byte[] bytes ? Encoding.UTF8.GetString(bytes) : value.ToString();
        }
        return null;
    }
}
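
One easy-to-miss detail: ActivitySource.StartActivity returns null unless a listener is subscribed to the source, so the activity above silently disappears without one. A minimal sketch of wiring the OpenTelemetry SDK up as that listener, assuming the standard OpenTelemetry packages (the exporter endpoint is an assumption):

// Program.cs - Projection Service (sketch). Needs the OpenTelemetry.Extensions.Hosting,
// OpenTelemetry.Instrumentation.AspNetCore and OpenTelemetry.Exporter.OpenTelemetryProtocol packages.
using OpenTelemetry;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddSource("ProjectionService.MessageConsumer")   // the ActivitySource used by the consumer
        .AddAspNetCoreInstrumentation()
        // The collector endpoint is an assumption; point it at whatever receives OTLP in your cluster.
        .AddOtlpExporter(o => o.Endpoint = new Uri("http://jaeger-collector:4317")));

builder.Services.AddHostedService<RabbitMqConsumerService>();

var app = builder.Build();
app.Run();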

With this infrastructure in place, the gap was bridged. A trace initiated by the frontend request now flowed seamlessly across the HTTP boundary into the Command Service, was manually propagated via RabbitMQ headers, and was reconstituted in the Projection Service. A query in our observability tool (like Jaeger) for the original traceId now showed the full, end-to-end flow.

Closing the Loop: Pinia and Client-Side Correlation

The final piece was ensuring the frontend experience was coherent. When a user clicks “Save,” we can’t just block the UI until the entire asynchronous flow completes. We needed optimistic updates and a way to correlate the final WebSocket push notification with the action that initiated it.

The Pinia store was designed for this.

// stores/product.ts
import { defineStore } from 'pinia';
import { api } from '../api';
import { useNotificationSocket } from '../socket';
// Product and ProductUpdate are assumed to live in a shared types module.
import type { Product, ProductUpdate } from '../types';

interface ProductState {
  products: Map<string, Product>;
  pendingChanges: Map<string, string>; // correlationId -> originalName
}

export const useProductStore = defineStore('product', {
  state: (): ProductState => ({
    products: new Map(),
    pendingChanges: new Map(),
  }),
  actions: {
    // This is the core action triggered by the UI.
    async updateProductName(productId: string, newName: string) {
      const product = this.products.get(productId);
      if (!product) return;

      const originalName = product.name;
      
      // A simple UUID can serve as a correlation ID.
      const correlationId = crypto.randomUUID();
      this.pendingChanges.set(correlationId, originalName);

      // 1. Optimistic Update: Change the state immediately.
      product.name = newName;

      try {
        // 2. Send the command to the backend, passing the correlation ID.
        // The backend should be designed to accept and persist this metadata.
        await api.post(`/products/${productId}/changename`, { 
            newName,
            metadata: { correlationId } // Pass it in the request body
        });
      } catch (error) {
        // 3. Rollback on failure.
        console.error("Failed to update product name:", error);
        const productToRollback = this.products.get(productId);
        const originalNameToRestore = this.pendingChanges.get(correlationId);
        if (productToRollback && originalNameToRestore) {
          productToRollback.name = originalNameToRestore;
        }
        this.pendingChanges.delete(correlationId);
      }
    },

    // Called by the WebSocket connection handler.
    _handleProductUpdateNotification(update: ProductUpdate) {
      const { correlationId, productData } = update;
      
      // 4. On successful notification, confirm the change.
      if (correlationId && this.pendingChanges.has(correlationId)) {
        this.pendingChanges.delete(correlationId);
      }
      
      // Reconcile the state with the authoritative version from the backend.
      const product = this.products.get(productData.id);
      if (product) {
        Object.assign(product, productData);
      }
    },
  },
  // In a real app, you'd initialize the WebSocket listener here
  // and have it call _handleProductUpdateNotification.
});

The key was the correlationId. We generated it client-side and passed it through the entire backend flow. The ChangeProductNameCommand and the ProductNameChanged event were modified to carry this metadata. When the Projection Service updated the read model, it also sent the correlationId to the Notification Service, which then included it in the WebSocket payload. This allowed the Pinia store to definitively link the incoming update to the specific action that started the process, enabling robust error handling and rollback logic.
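
Concretely, carrying the correlationId through the backend mostly meant widening a few types. A sketch of the shapes involved (the notification and read-model types here are assumptions, named for illustration):

using MediatR;

// The command gains an optional correlation ID (so the earlier two-argument
// construction in the endpoint still compiles until every call site is updated).
public record ChangeProductNameCommand(Guid Id, string NewName, string? CorrelationId = null) : IRequest;

// The persisted and published event carries it forward...
public record ProductNameChanged(Guid ProductId, string NewName, string? CorrelationId);

// ...and the Notification Service echoes it back over the WebSocket, which is
// what the Pinia store's _handleProductUpdateNotification matches on.
public record ProductUpdateNotification(string? CorrelationId, ProductReadModel ProductData);

public record ProductReadModel(Guid Id, string Name);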

The complete, observable, and correlated flow now looked like this:

graph TD
    subgraph Browser
        A[Pinia Action: updateProductName] -->|Generates correlationId| B(Optimistic UI Update)
        A -->|axios.post with trace headers| C{Linkerd Proxy}
    end
    subgraph Kubernetes Cluster
        subgraph API Gateway Pod
            C -->|Injects trace headers| D[Gateway Service]
        end
        subgraph Command Service Pod
            D --> E{Linkerd Proxy} --> F[C# Command Handler]
            F -->|Persist Event| G[Event Store]
            F -->|`TracingMessageBusDecorator` reads Activity.Current| H(Publish to RabbitMQ with trace headers)
        end
        subgraph Projection Service Pod
            I[RabbitMQ Message] -->|Contains trace headers| J{Linkerd Proxy}
            J --> K[C# Consumer Middleware]
            K -->|Reconstitutes Activity.Current| L[Event Handler]
            L --> M[Update Read Model]
            L -->|Sends notification with correlationId| N[Notification Service]
        end
    end
    subgraph Browser
        N -->|WebSocket message with correlationId| O[Pinia Store]
        O -->|Reconcile State & clear pending flag| P[Final UI State]
    end

    style C fill:#87CEFA,stroke:#333,stroke-width:2px
    style E fill:#87CEFA,stroke:#333,stroke-width:2px
    style J fill:#87CEFA,stroke:#333,stroke-width:2px

The manual work of propagating context over the message bus, while a departure from the “zero-code” dream, was the pragmatic choice. It required disciplined engineering and a solid middleware implementation, but the payoff was enormous. Debugging sessions that once took hours of guesswork were reduced to a single query in our tracing platform. We could finally see inside the black box.

This solution, however, is not without its own set of trade-offs and future considerations. The manual context propagation is a potential point of failure; a new developer adding a different communication channel (e.g., gRPC streaming) might forget to implement the same pattern. The evolution of the OpenTelemetry standard and its client libraries for message brokers aims to automate this, potentially making our custom decorator obsolete in the future. Furthermore, tracing over WebSockets remains a complex area; our correlationId approach is a practical workaround, not a fully compliant distributed trace. Finally, the resource overhead of the Linkerd sidecar proxies is non-zero. In a high-throughput, low-latency environment, the CPU and memory footprint of every sidecar must be carefully monitored and accounted for during capacity planning. The observability it provides is not free.

