Implementing a Zero-Trust Observability Pipeline with Vector and Consul Connect


The mandate was clear: all internal network traffic must be encrypted, authenticated, and authorized. This wasn’t a suggestion; it was a hard requirement for passing our next security audit. Our application services were already being onboarded to a Consul Connect service mesh, enforcing mutual TLS (mTLS) for all service-to-service communication. The gaping hole in this strategy, however, was our observability pipeline. We were running Vector agents as DaemonSets, scraping logs from nodes and forwarding them over plain TCP to a central cluster of Vector aggregators. This unencrypted firehose of potentially sensitive log data traversing the cluster network was an unacceptable risk.

Our initial concept was to treat the logging infrastructure as a first-class citizen within the service mesh. This meant every component—from the log-shipping agent to the central aggregator—needed a service identity managed by Consul and had to communicate exclusively over mTLS-encrypted connections brokered by Connect proxies. The plan was to shift from a node-based DaemonSet model to a sidecar pattern, embedding a Vector agent directly into each application pod. This would provide better resource isolation and a cleaner path for integrating with the service mesh’s identity model.

The proposed architecture involved three key components:

  1. OCI-packaged Vector Agent: A minimal, custom Vector OCI image to be run as a sidecar container in application pods. Its sole responsibility is to collect logs from the application container(s) within the same pod and forward them securely.
  2. Vector Aggregator Tier: A stateful deployment of Vector responsible for receiving data from all agents, performing centralized processing (like parsing, enrichment, and sampling), and sinking the data to long-term storage like an S3 bucket or Loki.
  3. Consul Connect Integration: Both the agents and aggregators would be registered as services in Consul and configured to use Connect proxies (Envoy) to handle all network communication between them. Access control would be managed via Consul intentions.

This approach transforms the logging pipeline from a simple, trusted internal network utility into a zero-trust distributed system. No agent can speak to an aggregator without a valid, short-lived certificate issued by Consul, and no traffic travels in cleartext.

Crafting the Core Component: The Vector OCI Image

Before deploying anything, we needed a standardized, minimal OCI image for Vector. Using the official timberio/vector image directly was not ideal for a sidecar deployment due to its size and inclusion of components unnecessary for our use case. A production-grade sidecar should be as lean as possible. We opted for a multi-stage Dockerfile that repackages the official release tarball into a distroless image.

# Dockerfile for a minimal Vector agent image

# ---- Builder Stage ----
FROM debian:bookworm-slim AS builder

# ARG must be declared (or re-declared) after FROM to be
# visible inside this build stage.
ARG VECTOR_VERSION=0.35.0

# Install build dependencies
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    curl \
    tar \
    gzip \
    && rm -rf /var/lib/apt/lists/*

# Download and extract the specific Vector version
WORKDIR /tmp
RUN curl -fsSL -o vector.tar.gz https://github.com/vectordotdev/vector/releases/download/v${VECTOR_VERSION}/vector-${VECTOR_VERSION}-x86_64-unknown-linux-musl.tar.gz && \
    tar -xzf vector.tar.gz && \
    rm vector.tar.gz

# ---- Final Stage ----
FROM gcr.io/distroless/static-debian12

LABEL org.opencontainers.image.source="https://github.com/your-repo/vector-image"
LABEL org.opencontainers.image.description="Minimal Vector OCI image for sidecar deployment"
LABEL org.opencontainers.image.licenses="Apache-2.0"

WORKDIR /vector
COPY --from=builder /tmp/vector-x86_64-unknown-linux-musl/bin/vector .
COPY --from=builder /tmp/vector-x86_64-unknown-linux-musl/etc/vector/ /etc/vector/

# Vector will run as a non-root user.
# The user ID must match the securityContext in the Kubernetes deployment.
USER 10001

# Entrypoint to run vector. Config will be mounted via a ConfigMap.
ENTRYPOINT ["/vector/vector"]
CMD ["--config", "/etc/vector/vector.toml"]

This build process results in a significantly smaller image. Using a distroless base removes the shell and other utilities, reducing the attack surface—a critical consideration for a component that will be deployed alongside every application. The built image is then pushed to our internal OCI-compliant registry.
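Building and publishing the image is then a two-step process. The commands below are a sketch; `your-registry` is a placeholder for the internal registry, matching the image references used in the manifests later in this post.

```shell
# Build with a pinned Vector version and push to the internal registry.
# 'your-registry' is a placeholder, not a real endpoint.
VECTOR_VERSION=0.35.0
docker build \
  --build-arg VECTOR_VERSION="${VECTOR_VERSION}" \
  -t "your-registry/vector-minimal:${VECTOR_VERSION}" .
docker push "your-registry/vector-minimal:${VECTOR_VERSION}"
```

Pinning the version in both the build arg and the tag keeps the image immutable and makes rollbacks a one-line change in the deployment manifests.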

Integrating the Agent Sidecar with Consul Connect

The next step was to inject this Vector image as a sidecar into an application pod and register both the application and the Vector agent with Consul. In a real-world project, this is managed via mutating webhooks or standardized Helm charts, but for clarity, here is the raw Kubernetes manifest for a single pod.
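As a sketch of that standardization, the consul-k8s Helm chart can be configured so sidecar injection is opt-in per pod via the annotation shown in the manifest below. The fragment assumes the official hashicorp/consul chart; values are abbreviated.

```yaml
# values.yaml fragment for the consul-k8s Helm chart (illustrative)
connectInject:
  enabled: true
  default: false  # pods must opt in with the connect-inject annotation
```

Keeping injection opt-in avoids surprising workloads that were never designed to sit behind an Envoy proxy.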

The pitfall here is managing the various annotations required by Consul Connect. A single typo can lead to silent connection failures that are difficult to debug.

# deployment.yaml for an application with a vector-agent sidecar

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
      annotations:
        # Enable Consul Connect injection for this pod.
        # This tells the Consul injector to add the envoy sidecar.
        "consul.hashicorp.com/connect-inject": "true"
        
        # Define the service that the vector-agent will connect to.
        # This configures the local envoy proxy to route traffic
        # intended for `vector-aggregator.service.consul:9000`
        # to the actual aggregator service through the mesh.
        "consul.hashicorp.com/connect-service-upstreams": "vector-aggregator:9000"

        # Register this pod's Connect identity as 'vector-agent'.
        # Note: consul-k8s assigns a single service identity per pod,
        # so this identity covers all containers in it.
        "consul.hashicorp.com/connect-service": "vector-agent"
    spec:
      # Use a shared volume for logs so the sidecar can read them.
      volumes:
      - name: shared-logs
        emptyDir: {}
      - name: vector-agent-config
        configMap:
          name: vector-agent-cm

      # Security context to run containers as non-root users.
      securityContext:
        runAsUser: 10001
        runAsGroup: 10001
        fsGroup: 10001

      containers:
      # The main application container
      - name: sample-app
        image: "busybox:latest"
        command: ["/bin/sh", "-c"]
        args:
        - |
          # Append JSON logs to the shared volume so the Vector
          # sidecar's file source can tail them. (busybox date has
          # no %N, so timestamps are second-granularity.)
          i=0
          while true; do
            echo "{\"level\":\"info\",\"timestamp\":\"$(date -u +%Y-%m-%dT%H:%M:%SZ)\",\"message\":\"Application log message ${i}\",\"app\":\"sample-app\"}" >> /var/log/app.log
            i=$((i+1))
            sleep 2
          done
        volumeMounts:
        - name: shared-logs
          mountPath: /var/log

      # The Vector agent sidecar
      - name: vector-agent
        image: "your-registry/vector-minimal:0.35.0"
        args: ["--config", "/etc/vector/config.toml"]
        volumeMounts:
        - name: shared-logs
          mountPath: /var/log/app
        - name: vector-agent-config
          mountPath: /etc/vector
        resources:
          requests:
            cpu: 50m
            memory: 64Mi
          limits:
            cpu: 200m
            memory: 128Mi

The configuration for the Vector agent itself is where the magic happens. Instead of connecting directly to the aggregator’s address, the sink points to the local upstream listener created by the Envoy proxy.

# configmap-vector-agent.toml - Configuration for the sidecar agent

# Ingest logs from a file written by the main application container.
[sources.app_logs]
  type = "file"
  include = ["/var/log/app/*.log"]
  read_from = "beginning"

# Use Vector Remap Language (VRL) to parse and structure the logs.
[transforms.parsed_logs]
  type = "remap"
  inputs = ["app_logs"]
  source = '''
  . = parse_json!(.message)
  .service = "sample-app"
  .pod_name = get_env_var!("HOSTNAME")
  '''

# Sink data to the vector aggregator via the local Consul Connect proxy.
# The address '127.0.0.1:9000' is the local listener port configured
# by the 'connect-service-upstreams' annotation. Envoy handles the mTLS
# connection to the actual aggregator service from here.
[sinks.vector_aggregator]
  type = "vector"
  inputs = ["parsed_logs"]
  address = "127.0.0.1:9000"
  
  # Crucial for resilience: configure buffering and backpressure.
  # If the aggregator is down, logs will be buffered on disk.
  [sinks.vector_aggregator.buffer]
    type = "disk"
    max_size = 1073741824 # 1 GB
    when_full = "block"

A common mistake is to point the sink’s address to the remote service (vector-aggregator.service.consul). This bypasses the Envoy proxy, defeating the entire purpose of using the service mesh. The traffic must be directed to the localhost port that Envoy is listening on for that specific upstream.
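When debugging this, it helps to confirm that the local upstream listener actually exists. The injected Envoy exposes an admin endpoint (port 19000 by default in consul-k8s deployments; the exact listener output format varies by Envoy version), which can be reached with a port-forward:

```shell
# Forward the Envoy admin port from the pod, then inspect listeners.
kubectl port-forward deploy/sample-app 19000:19000 &
curl -s localhost:19000/listeners
# A healthy setup shows a listener bound to 127.0.0.1:9000
# for the vector-aggregator upstream.
```

If no listener appears on 127.0.0.1:9000, the upstream annotation was not picked up at injection time, which is exactly the class of silent typo failure mentioned above.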

The Aggregator Tier and Service Authorization

The aggregator tier is a StatefulSet to provide stable network identifiers. Its configuration is simpler, as it only needs to receive data.

# statefulset-vector-aggregator.yaml

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vector-aggregator
spec:
  serviceName: "vector-aggregator"
  replicas: 2
  selector:
    matchLabels:
      app: vector-aggregator
  template:
    metadata:
      labels:
        app: vector-aggregator
      annotations:
        "consul.hashicorp.com/connect-inject": "true"
        # The aggregator identifies itself as the 'vector-aggregator' service.
        "consul.hashicorp.com/connect-service": "vector-aggregator"
        # Expose port 8000 internally for the vector source.
        "consul.hashicorp.com/connect-service-port": "8000"
    spec:
      containers:
      - name: vector-aggregator
        image: "your-registry/vector-minimal:0.35.0"
        args: ["--config", "/etc/vector/config.toml"]
        ports:
        - containerPort: 8000
          name: vector-source
        # ... volume mounts for config and buffer disk ...
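
The serviceName above refers to a headless Service that must exist for the StatefulSet's stable network identities; a minimal sketch:

```yaml
# service-vector-aggregator.yaml - headless Service for the StatefulSet
apiVersion: v1
kind: Service
metadata:
  name: vector-aggregator
spec:
  clusterIP: None  # headless: gives each pod a stable DNS name
  selector:
    app: vector-aggregator
  ports:
  - name: vector-source
    port: 8000
    targetPort: 8000
```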

The aggregator’s Vector configuration listens on a port that its local Envoy proxy will receive traffic on.

# configmap-vector-aggregator.toml

# Source to receive logs from agents.
# It listens on 0.0.0.0:8000, and the pod's Envoy proxy
# will forward incoming mTLS traffic from agents to this port.
[sources.vector_agents]
  type = "vector"
  address = "0.0.0.0:8000"
  
# Example transform: Sample high-volume info logs to reduce cost.
[transforms.sampler]
  type = "sample"
  inputs = ["vector_agents"]
  rate = 10           # Keep roughly 1 in 10 events
  key_field = "level" # Hash this field so sampling is consistent per level
  
# A realistic sink for long-term storage.
[sinks.loki]
  type = "loki"
  inputs = ["sampler"]
  endpoint = "http://loki-stack.loki.svc.cluster.local:3100"
  # In a production setup, labels are critical for query performance.
  labels = { service = "{{ service }}", pod = "{{ pod_name }}" }
  encoding.codec = "json"

With both services deployed, traffic would be blocked by default due to Consul’s “default deny” posture. We needed to create a Consul Intention to explicitly permit communication.

# intentions.hcl - To be applied with `consul config write`

Kind      = "service-intentions"
Name      = "vector-aggregator"
Sources = [
  {
    Name   = "vector-agent"
    Action = "allow"
  }
]

This configuration explicitly states that services with the identity vector-agent are allowed to connect to the vector-aggregator service. Without this, the Envoy proxies would reject all connection attempts, and logs would pile up in the agent’s disk buffer until it filled.
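On Kubernetes, the same intention can also be managed declaratively as a ServiceIntentions custom resource (assuming the consul-k8s CRDs are installed), which keeps authorization in version control alongside the deployment manifests:

```yaml
# serviceintentions-vector-aggregator.yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceIntentions
metadata:
  name: vector-aggregator
spec:
  destination:
    name: vector-aggregator
  sources:
  - name: vector-agent
    action: allow
```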

We hit a problem during the initial rollout: intermittent connection resets. After hours of debugging, we traced it to a mismatch in protocol expectations. Our Vector sink was using TCP, but the default Envoy configuration was treating the traffic as HTTP, leading to protocol errors. The solution was to explicitly tell Consul the upstream protocol is raw TCP.

The upstream annotation format is `service:port`, with an optional datacenter as a third field; it carries no protocol information. The fix therefore belongs on the service itself: a ServiceDefaults Custom Resource that configures the protocol for the vector-aggregator service globally.

apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceDefaults
metadata:
  name: vector-aggregator
spec:
  protocol: "tcp"

This ensures any client connecting to vector-aggregator via the mesh defaults to the correct protocol, a much more maintainable approach.

Visualizing the Secured Data Flow

The final data path is more complex than a direct connection but provides verifiable security at each step. Using Mermaid.js, we can visualize the flow from log generation to the aggregator.

sequenceDiagram
    participant App as Application Container
    participant Agent as Vector Agent Sidecar
    participant LProxy as Local Envoy Proxy
    participant RProxy as Remote Envoy Proxy
    participant Agg as Vector Aggregator
    
    App->>Agent: Writes log to shared volume
    Agent->>LProxy: Forwards log data to 127.0.0.1:9000
    Note over LProxy,RProxy: mTLS Encrypted TCP Tunnel
    LProxy->>RProxy: Establishes secure connection
    RProxy->>Agg: Forwards decrypted data to local port 8000
    Agg-->>Agg: Processes and sinks data

The key insight is that neither the Vector agent nor the aggregator is aware of mTLS, certificates, or the complexities of the service mesh. They communicate over what they perceive as insecure local connections. The Consul-injected Envoy proxies handle the entire security lifecycle: identity bootstrapping, certificate issuance and rotation, traffic encryption, and authorization based on intentions. This separation of concerns is the primary architectural benefit.

This architecture fundamentally changes the security model of our observability pipeline. The internal network is no longer a trusted environment. Every single log batch is sent over a connection that is authenticated with a short-lived, service-specific certificate and encrypted in transit by the mesh’s mutual TLS. This satisfies our zero-trust requirement for observability data in transit.

However, this architecture is not without its trade-offs. The resource overhead of running an Envoy and Vector sidecar in every single application pod is non-trivial. For a cluster with thousands of pods, this adds up to significant CPU and memory consumption that must be factored into capacity planning. The operational complexity is also higher; debugging now involves checking Vector logs, Envoy proxy logs, and Consul intentions. The benefits of provable security must be weighed against these operational and resource costs. For our use case, the security mandate made this trade-off not just acceptable, but necessary. Future work could involve exploring lighter-weight L4 proxies or eBPF-based solutions to reduce the sidecar overhead, but the fundamental pattern of securing the observability plane as a first-class citizen of the service mesh remains a sound and powerful principle.

