Orchestrating a WebRTC Infrastructure with Searchable Session Diagnostics on Nomad and Caddy


Debugging our distributed WebRTC service felt like navigating a maze in the dark. A user reports a failed connection, and the operations team is left to grep through terabytes of unstructured logs scattered across a fleet of stateless signaling servers. Correlating a specific user’s journey—from initial WebSocket connection to ICE candidate exchange and eventual connection failure—was an exercise in futility. The core pain point wasn’t a lack of data, but a lack of immediate, queryable access to the ephemeral metadata of each session. We needed a diagnostics system that was as real-time as the media streams it was supposed to monitor.

Our initial concept was to funnel all critical session events into a centralized, searchable data store. The requirements were strict: the solution had to be lightweight, horizontally scalable, and fit neatly into our existing HashiCorp-based infrastructure. This ruled out heavyweight solutions like the ELK stack; the operational overhead for indexing transient session data felt disproportionate.

The technology selection process followed a principle of minimalism and clear responsibility:

  1. Orchestration: HashiCorp Nomad was our incumbent scheduler. Its simplicity, single-binary deployment, and first-class support for both containerized and non-containerized workloads made it a better fit than Kubernetes for this specific problem domain.
  2. Ingress & TLS: Caddy was the obvious choice. Its automatic HTTPS via Let’s Encrypt is invaluable for WebRTC signaling (WSS), and its configuration is far more approachable than alternatives. We needed it to handle WebSocket connections for signaling and act as a secure public gateway to our internal search API.
  3. Search Index: We opted for Meilisearch. It’s a single, performant binary written in Rust, offering a simple REST API and incredible speed for a relatively small resource footprint. It could easily be managed as a Nomad job.
  4. Signaling Logic: A custom Go service using the Pion WebRTC library would give us the granular control needed to hook into the peer connection lifecycle and extract the precise metadata required for diagnostics.

The goal was to build a cohesive system where Nomad schedules all components, Caddy provides the unified entry point, and our Go application acts as the bridge, piping real-time WebRTC session data into a searchable Meilisearch index.

Phase 1: Orchestrating the Search Backend with Nomad

Before handling any WebRTC traffic, the destination for our diagnostic data needed to be established. We defined a Nomad job to run Meilisearch. In a real-world project, stateful services require careful volume management. Here we use a host volume for simplicity, which requires a matching host_volume "meilisearch-data" stanza in the Nomad client configuration; a production setup would use a CSI plugin for durable storage.

The key aspects of this job are resource allocation, exposing the service to Consul, and ensuring data persistence.

// search.nomad.hcl
job "meilisearch" {
  datacenters = ["dc1"]
  type        = "service"

  group "search-engine" {
    count = 1

    network {
      port "http" {
        to = 7700
      }
    }

    // In production, use a more robust storage solution like a CSI plugin.
    // This host volume is for demonstration and ensures data survives restarts.
    volume "meili_data" {
      type      = "host"
      source    = "meilisearch-data"
      read_only = false
    }

    task "meilisearch-server" {
      driver = "docker"

      config {
        image = "getmeili/meilisearch:v1.3.2"
        ports = ["http"]
      }

      // Meilisearch reads its configuration from environment variables.
      // In production, the master key should be injected securely via Vault.
      env {
        MEILI_MASTER_KEY = "aVerySecureMasterKeyForDev"
        MEILI_HTTP_ADDR  = "0.0.0.0:7700"
        MEILI_ENV        = "development"
      }
      
      volume_mount {
        volume      = "meili_data"
        destination = "/meili_data"
        read_only   = false
      }

      // Pin resources to prevent noisy neighbor problems.
      resources {
        cpu    = 500 # MHz
        memory = 1024 # MB
      }

      // Register the service in Consul for discovery by other services.
      service {
        name = "meilisearch-api"
        port = "http"
        
        // Use tags for versioning and filtering.
        tags = ["v1", "search-engine"]

        // Health check to ensure Nomad only routes to healthy instances.
        check {
          type     = "http"
          path     = "/health"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}

Deploying this with nomad job run search.nomad.hcl gives us a running, discoverable search instance. Any other service in the Nomad cluster can now find it by querying Consul for the meilisearch-api service. The pitfall here is hardcoding the master key. A production system must integrate with HashiCorp Vault using Nomad’s template stanza to inject secrets dynamically and securely.

Phase 2: The Go Signaling Service and Metadata Pipeline

This is the core of the system. The Go application must perform two distinct functions:

  1. Act as a standard WebRTC signaling server, relaying Session Description Protocol (SDP) offers/answers and Interactive Connectivity Establishment (ICE) candidates between peers over WebSockets.
  2. Capture key events during this handshake, package them into a JSON document, and send them to the Meilisearch instance discovered via Consul.

Here is the data structure we’ll use for our searchable session documents. It includes identifiers, network information, and the final state of the connection.

// internal/diagnostics/model.go
package diagnostics

import "time"

// SessionEvent represents a structured log of a WebRTC session for indexing.
type SessionEvent struct {
	SessionID         string    `json:"session_id"`
	ClientA_ID        string    `json:"client_a_id"`
	ClientB_ID        string    `json:"client_b_id,omitempty"`
	ClientA_RemoteAddr string    `json:"client_a_remote_addr"`
	ClientB_RemoteAddr string    `json:"client_b_remote_addr,omitempty"`
	ICEUsername        string    `json:"ice_username"`
	CreatedAt         time.Time `json:"created_at"`
	UpdatedAt         time.Time `json:"updated_at"`
	ConnectionState   string    `json:"connection_state"` // e.g., "new", "connecting", "connected", "failed", "closed"
	FailureReason     string    `json:"failure_reason,omitempty"`
	ICEGaheringState  string    `json:"ice_gathering_state"`
	SignalingState    string    `json:"signaling_state"`
	CandidatePairs    []string  `json:"candidate_pairs,omitempty"` // Store successful candidate pairs
}

The main application logic sets up the HTTP server for WebSocket connections and the Meilisearch client. A common mistake is to instantiate the client on every request. Instead, it should be a long-lived object.

// cmd/signaler/main.go
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"

	"github.com/gorilla/websocket"
	"github.com/hashicorp/consul/api"
	"github.com/meilisearch/meilisearch-go"
	
	"webrtc-search-infra/internal/diagnostics"
	"webrtc-search-infra/internal/signaling"
)

// Global discovery of Meilisearch. In a real app, this would be more robust.
func discoverMeilisearch(consulAddress string) (string, string) {
	config := api.DefaultConfig()
	config.Address = consulAddress
	client, err := api.NewClient(config)
	if err != nil {
		log.Fatalf("Failed to create consul client: %v", err)
	}

	services, _, err := client.Health().Service("meilisearch-api", "v1", true, nil)
	if err != nil || len(services) == 0 {
		log.Fatalf("Meilisearch service not found in Consul: %v", err)
	}

	// In a multi-instance setup, you'd add load balancing logic here.
	addr := services[0].Service.Address
	port := services[0].Service.Port
	return fmt.Sprintf("http://%s:%d", addr, port), os.Getenv("MEILI_MASTER_KEY")
}

func main() {
	consulAddr := os.Getenv("CONSUL_HTTP_ADDR")
	if consulAddr == "" {
		consulAddr = "127.0.0.1:8500" // Default for local dev
	}

	meiliURL, meiliKey := discoverMeilisearch(consulAddr)
	log.Printf("Discovered Meilisearch at %s", meiliURL)

	meiliClient := meilisearch.NewClient(meilisearch.ClientConfig{
		Host:   meiliURL,
		APIKey: meiliKey,
	})

	// Ensure the index exists before starting.
	// This is an idempotent operation.
	_, err := meiliClient.CreateIndex(&meilisearch.IndexConfig{
		Uid:        "sessions",
		PrimaryKey: "session_id",
	})
	if err != nil {
		// This can fail if the index already exists with a different primary key; that is a fatal config error.
		log.Printf("Could not ensure 'sessions' index exists, may already be configured: %v", err)
	}

	// This is our diagnostics writer. It receives events and sends them to Meilisearch.
	diagnosticsWriter := diagnostics.NewMeiliSearchWriter(meiliClient, "sessions")

	hub := signaling.NewHub(diagnosticsWriter)
	go hub.Run()

	upgrader := websocket.Upgrader{
		CheckOrigin: func(r *http.Request) bool {
			// In production, implement a proper origin check.
			return true
		},
	}
	
	http.HandleFunc("/ws", func(w http.ResponseWriter, r *http.Request) {
		signaling.ServeWs(hub, upgrader, w, r)
	})

	srv := &http.Server{Addr: ":8080"}

	go func() {
		log.Println("Signaling server starting on :8080")
		if err := srv.ListenAndServe(); err != http.ErrServerClosed {
			log.Fatalf("ListenAndServe(): %v", err)
		}
	}()
	
	// Graceful shutdown
	quit := make(chan os.Signal, 1)
	signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)
	<-quit
	log.Println("Shutting down server...")

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Fatalf("Server forced to shutdown: %v", err)
	}

	log.Println("Server exiting")
}

The most critical part is integrating the diagnostics reporting directly into the WebRTC state machine callbacks provided by Pion. When a peer connection’s state changes, we generate and dispatch an event.

// internal/signaling/peer.go
package signaling

import (
	"log"
	"time"

	"github.com/pion/webrtc/v3"
	"webrtc-search-infra/internal/diagnostics"
)

// ... (other parts of the peer/client struct) ...

// createPeerConnection initializes a new WebRTC peer connection and sets up event handlers.
func (c *Client) createPeerConnection() (*webrtc.PeerConnection, error) {
	// For simplicity, we don't configure STUN/TURN servers.
	// A production service MUST have STUN servers for NAT traversal.
	config := webrtc.Configuration{}
	
	api := webrtc.NewAPI() // You can customize MediaEngine here if needed
	peerConnection, err := api.NewPeerConnection(config)
	if err != nil {
		return nil, err
	}

	sessionID := c.hub.getSessionIDForClient(c)
	event := &diagnostics.SessionEvent{
		SessionID:          sessionID,
		ClientA_ID:         c.id,
		ClientA_RemoteAddr: c.conn.RemoteAddr().String(),
		CreatedAt:          time.Now().Unix(),
		UpdatedAt:          time.Now().Unix(),
		ConnectionState:    webrtc.PeerConnectionStateNew.String(),
	}

	// This is the diagnostics hook. Every state change triggers an update.
	peerConnection.OnConnectionStateChange(func(s webrtc.PeerConnectionState) {
		log.Printf("Peer Connection State has changed for session %s: %s\n", sessionID, s.String())
		event.ConnectionState = s.String()
		event.UpdatedAt = time.Now().Unix()

		if s == webrtc.PeerConnectionStateFailed {
			event.FailureReason = "Connection failed" // A real system would have more detailed reasons
		}

		c.hub.diagnostics.Write(*event)
	})
	
	peerConnection.OnICEGatheringStateChange(func(s webrtc.ICEGatheringState) {
		event.ICEGatheringState = s.String()
		event.UpdatedAt = time.Now().Unix()
		c.hub.diagnostics.Write(*event)
	})

	peerConnection.OnSignalingStateChange(func(s webrtc.SignalingState) {
		event.SignalingState = s.String()
		event.UpdatedAt = time.Now().Unix()
		c.hub.diagnostics.Write(*event)
	})
	
	// Initial write
	c.hub.diagnostics.Write(*event)

	return peerConnection, nil
}

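One subtlety in the handlers above: Pion fires these callbacks from its own goroutines, so the shared event struct is mutated concurrently. A small, hypothetical recorder type (not part of the code above) shows one way to serialize those writes; each OnXxx callback would call update() instead of touching the struct directly:

// internal/signaling/recorder.go (hypothetical hardening, not shown in peer.go above)
package signaling

import (
	"sync"
	"time"

	"webrtc-search-infra/internal/diagnostics"
)

// eventRecorder serializes updates to a SessionEvent that several Pion
// callbacks may mutate concurrently.
type eventRecorder struct {
	mu    sync.Mutex
	event diagnostics.SessionEvent
	write func(diagnostics.SessionEvent) // e.g. hub.diagnostics.Write
}

// update applies a mutation under the lock, refreshes updated_at, and flushes
// a copy of the current snapshot to the diagnostics writer.
func (r *eventRecorder) update(mutate func(*diagnostics.SessionEvent)) {
	r.mu.Lock()
	defer r.mu.Unlock()
	mutate(&r.event)
	r.event.UpdatedAt = time.Now().Unix()
	r.write(r.event)
}
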
The corresponding Nomad job file for this service runs a pre-built Docker image of the Go binary and configures networking.

// signaler.nomad.hcl
job "signaler" {
  datacenters = ["dc1"]
  type        = "service"

  group "signaling-servers" {
    count = 2 // Run two instances for availability

    network {
      port "http" {
        to = 8080
      }
    }

    task "webrtc-signaler" {
      driver = "docker"

      config {
        image = "my-registry/webrtc-signaler:latest" // Assumes a pre-built image
        ports = ["http"]
      }

      // Pass Consul and Meilisearch details as environment variables.
      // The master key should be from Vault.
      env {
        CONSUL_HTTP_ADDR = "http://${attr.unique.network.ip-address}:8500"
        MEILI_MASTER_KEY = "aVerySecureMasterKeyForDev"
      }

      resources {
        cpu    = 200 # MHz
        memory = 128 # MB
      }

      service {
        name = "webrtc-signaler"
        port = "http"
        tags = ["v1", "webrtc"]

        check {
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}

Phase 3: Caddy as the Unified Ingress

With the backend services running, the final piece is the public-facing entry point. Caddy, run as a Nomad service, will handle three primary tasks:

  1. Terminate TLS automatically.
  2. Proxy WebSocket traffic (/ws) to the available webrtc-signaler instances.
  3. Proxy API traffic (/search/) to the meilisearch-api service, providing a secured, unified endpoint for administrative queries.

The power of Nomad and Consul shines here. We use Nomad’s template stanza to dynamically generate the Caddyfile based on the services registered in Consul. This makes the ingress configuration self-updating as backend services are scaled or rescheduled.

// caddy.nomad.hcl
job "caddy-ingress" {
  datacenters = ["dc1"]
  type        = "service"

  group "ingress" {
    count = 1 // Or more for HA, behind a load balancer

    network {
      port "http" {
        static = 80
      }
      port "https" {
        static = 443
      }
    }

    task "caddy" {
      driver = "docker"

      config {
        image = "caddy:2.7.5"
        ports = ["http", "https"]
        // Mount the generated Caddyfile into the container
        volumes = [
          "local/Caddyfile:/etc/caddy/Caddyfile"
        ]
      }

      // This template block is the key to dynamic configuration.
      // It queries Consul and renders a Caddyfile.
      template {
        data = <<EOF
your.domain.com {
    # WebSocket proxy for WebRTC signaling
    handle /ws* {
        reverse_proxy {
            to {{range service "webrtc-signaler"}}http://{{.Address}}:{{.Port}} {{end}}
            transport http {
                read_buffer 8k
                write_buffer 8k
            }
        }
    }

    # API Gateway for Meilisearch
    handle /search* {
        uri strip_prefix /search
        reverse_proxy {
            to {{range service "meilisearch-api"}}http://{{.Address}}:{{.Port}} {{end}}
            header_up Authorization "Bearer aVerySecureMasterKeyForDev"
        }
    }

    # Optional: Basic health check endpoint
    handle /health {
        respond "OK" 200
    }
}
EOF
        destination = "local/Caddyfile"
        change_mode = "signal"
        change_signal = "SIGHUP" // Reload Caddy on config change
      }

      resources {
        cpu    = 300
        memory = 128
      }
      
      service {
        name = "caddy-ingress"
        port = "https"
      }
    }
  }
}

A significant pitfall to avoid is misconfiguring the reverse proxy for WebSockets: a proxy that does not pass the Upgrade handshake through will silently break signaling. Caddy's reverse_proxy handles the upgrade automatically, but explicitly setting buffer sizes can matter under heavy load. The uri strip_prefix directive for the search endpoint is also crucial; it prevents the /search prefix from being forwarded to the Meilisearch backend, which expects requests at its root. One caveat of this setup is that injecting the master key at the proxy makes the /search path effectively unauthenticated to the outside world; in production it should be gated by an authentication layer in Caddy, or Caddy should forward a restricted, search-only Meilisearch API key instead of the master key.

The Complete Architecture

With all jobs running, the flow of data and requests can be visualized.

graph TD
    subgraph User Space
        ClientA[WebRTC Client A]
        ClientB[WebRTC Client B]
        Admin[Admin User]
    end

    subgraph Nomad Cluster on dc1
        subgraph Caddy Ingress Node
            Caddy(Caddy Service)
        end

        subgraph Worker Node 1
            Signaler1(Signaler Instance 1)
            Consul1[Consul Agent]
            Signaler1 -- discovers --> Consul1
        end
        
        subgraph Worker Node 2
            Signaler2(Signaler Instance 2)
            Consul2[Consul Agent]
            Signaler2 -- discovers --> Consul2
        end

        subgraph Worker Node 3
            Meili(Meilisearch Instance)
            Consul3[Consul Agent]
        end
    end

    ClientA -- WSS Handshake --> Caddy
    ClientB -- WSS Handshake --> Caddy
    Caddy -- Load Balances --> Signaler1
    Caddy -- Load Balances --> Signaler2

    Signaler1 -- Relays SDP/ICE --> Signaler2
    
    Signaler1 -- Publishes Metadata --> Meili
    Signaler2 -- Publishes Metadata --> Meili

    Admin -- HTTPS API Query --> Caddy
    Caddy -- Reverse Proxies --> Meili

When a user reports that their call at 2:30 PM failed, an operator can now execute a simple, targeted query against the search API:

curl -X POST 'https://your.domain.com/search/indexes/sessions/search' \
-H 'Content-Type: application/json' \
--data-binary '{
    "q": "",
    "filter": "connection_state = failed AND created_at > 1698399000"
}'

This returns a structured JSON object for every failed session after the specified Unix timestamp, containing IP addresses, session IDs, and the last known state from the signaling perspective. This transforms debugging from a resource-intensive forensics task into a simple, sub-second API call.
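
There is one prerequisite for that filter: Meilisearch rejects filters on attributes that have not been declared filterable, and nothing is filterable by default. A one-time settings update, sketched here as a hypothetical helper that could be called from cmd/signaler/main.go right after CreateIndex (using meilisearch-go's UpdateFilterableAttributes), declares the fields used above; the Unix-seconds timestamps in SessionEvent are what make the numeric created_at comparison possible:

// ensureFilterableAttributes could be called once at startup, right after CreateIndex.
// Without it, filters on connection_state or created_at return an error from Meilisearch.
func ensureFilterableAttributes(client *meilisearch.Client) error {
	filterable := []string{"connection_state", "created_at", "client_a_id", "client_b_id"}
	_, err := client.Index("sessions").UpdateFilterableAttributes(&filterable)
	return err
}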

This system, however, is not without its limitations. The diagnostic data is currently limited to what can be inferred during the signaling phase. For true end-to-end observability, client-side logs and WebRTC getStats() API results (reporting packet loss, jitter, round-trip time) would need to be collected and correlated with these server-side session documents. Furthermore, the search index could become a bottleneck and a storage burden if not managed correctly. Implementing a Time-To-Live (TTL) policy or an automated index rotation job within Nomad would be a necessary next step for any long-running production deployment. The current signaling hub logic is also naive; in a scaled-out scenario, a Redis pub/sub backplane would be required to route messages between clients connected to different signaler instances, adding another stateful component to manage.
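
As a concrete sketch of that retention step, Meilisearch 1.2 and later can delete documents by filter, so a tiny cleanup binary run as a Nomad periodic batch job would keep the index bounded. The program below is hypothetical (the MEILI_URL environment variable and the seven-day cutoff are assumptions, and it relies on meilisearch-go's DeleteDocumentsByFilter):

// cmd/sessions-gc/main.go (hypothetical cleanup job)
package main

import (
	"fmt"
	"log"
	"os"
	"time"

	"github.com/meilisearch/meilisearch-go"
)

func main() {
	client := meilisearch.NewClient(meilisearch.ClientConfig{
		Host:   os.Getenv("MEILI_URL"),
		APIKey: os.Getenv("MEILI_MASTER_KEY"),
	})

	// Drop every session document older than seven days. Requires Meilisearch >= 1.2
	// and 'created_at' declared as a filterable attribute.
	cutoff := time.Now().AddDate(0, 0, -7).Unix()
	task, err := client.Index("sessions").DeleteDocumentsByFilter(fmt.Sprintf("created_at < %d", cutoff))
	if err != nil {
		log.Fatalf("failed to enqueue session cleanup: %v", err)
	}
	log.Printf("cleanup task %d enqueued", task.TaskUID)
}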

