The initial architecture was a direct cause of operational friction. A single Go monolith served our primary API, and buried deep within it was a call to a Python sub-process that loaded a Scikit-learn model for fraud detection. Deploying an updated model meant redeploying the entire monolith. This tightly coupled the lifecycle of a machine learning model with our core application stack, turning a simple model refresh into a high-risk, multi-team release event. The mandate was clear: decouple ML inference from the core application, and enable the data science team to deploy models independently and dynamically.
Our first attempt was naive. We extracted the Python model-serving logic into a separate Flask application. The Go monolith, now a gateway, would make HTTP requests to it. This worked, but immediately raised a new problem: discovery. The gateway needed to know the IP address and port of the new ML service. Hardcoding it was brittle. We placed it behind a load balancer, but this just shifted the problem. Updating the load balancer’s target group was a manual, slow process involving infrastructure tickets. During a scaling event, new ML instances wouldn’t be automatically added to the pool. This approach failed to meet the “dynamic” requirement and introduced unacceptable operational overhead. The core problem was a lack of a dynamic, application-aware service registry.
This failure forced a complete rethink, leading to an architecture built on a stack that, while seemingly disparate, addresses the core challenges of dynamic infrastructure, high-performance communication, and operational visibility.
- Service Discovery: We needed a mechanism for our ML services to announce their existence and for the gateway to find them automatically. HashiCorp Consul was the clear choice, particularly in our mixed environment of VMs and containers. Its health-checking capabilities are critical; the gateway should never route traffic to a failing instance.
- Communication Protocol: The constant back-and-forth for inference felt heavy with JSON over HTTP. gRPC, with its Protobuf-based schema and performance benefits, was a better fit for this high-throughput, internal service-to-service communication. Go’s excellent gRPC support made it a natural choice for the performance-critical gateway.
- Infrastructure Provisioning: Manually configuring each new ML inference node with Python, dependencies, and the Consul agent was not scalable. We required an Infrastructure as Code solution. Chef was selected due to our team’s existing expertise and its ability to manage the complete state of a server declaratively, from OS packages to application service files.
- Operational UI: To build trust in this new dynamic system, we needed a simple dashboard to visualize the state of the inference fleet. A React frontend was our standard, and Styled-components provided the component-level styling needed to build a clean, maintainable interface without wrestling with global CSS.
The resulting system is a fleet of self-registering, disposable ML inference nodes managed by Chef, discovered in real-time by a Go gateway via Consul, all providing a seamless, high-performance prediction API.
The gRPC Service Contract: Defining Communication
Everything starts with the contract. Protobuf forces clarity on the inputs and outputs of the ML service. We defined a simple `Inference` service with a `Predict` method.
// proto/inference.proto
syntax = "proto3";
package inference;
option go_package = ".;inference";
// The service definition for our ML model.
service Inference {
// Predict takes a set of features and returns a prediction.
rpc Predict(PredictRequest) returns (PredictResponse) {}
}
// A single feature can be a float, string, etc.
// Using a oneof provides flexibility for different model types.
message FeatureValue {
oneof value {
double float_value = 1;
string string_value = 2;
int64 int_value = 3;
}
}
// The request payload contains a map of feature names to their values.
message PredictRequest {
map<string, FeatureValue> features = 1;
}
// The response contains the prediction result.
message PredictResponse {
double prediction = 1;
string model_version = 2; // Useful for tracking which model version served the request.
string node_id = 3; // The ID of the serving node for debugging.
}
This contract is the single source of truth. Compiling it generates client and server stubs for both Go and Python, eliminating boilerplate and ensuring type safety across language boundaries.
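To make that compile step concrete, here is one way to wire it up, sketched as a `go:generate` directive (this assumes `protoc` and the standard `protoc-gen-go` / `protoc-gen-go-grpc` plugins are on the PATH; the Python side uses `grpcio-tools`, typically in CI):

```go
// gateway/generate.go
//
// `go generate ./...` regenerates the Go stubs from the contract.
// The Python stubs are produced in the ml_service CI step with:
//   python -m grpc_tools.protoc -Iproto --python_out=proto --grpc_python_out=proto proto/inference.proto
package main

//go:generate protoc --go_out=. --go-grpc_out=. --go_opt=paths=source_relative --go-grpc_opt=paths=source_relative proto/inference.proto
```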
The Self-Registering Python Inference Node
The core of the dynamic fleet is the Python service. Each instance must not only serve predictions but also register itself with Consul upon startup and de-register gracefully on shutdown.
We use the `grpcio` library for the server and `python-consul` for interacting with the Consul agent, which is assumed to be running locally on the node (installed and configured by Chef).
# ml_service/server.py
import grpc
import consul
import uuid
import socket
import logging
import time
import os
from concurrent import futures
from signal import signal, SIGTERM
# Import generated gRPC files
from proto import inference_pb2
from proto import inference_pb2_grpc
# Health servicer so Consul's gRPC health check has an endpoint to probe
# (provided by the grpcio-health-checking package)
from grpc_health.v1 import health, health_pb2, health_pb2_grpc
# Scikit-learn for model loading and prediction
import joblib
# --- Configuration ---
# In a real-world project, these would come from env vars or a config file.
PORT = 50051
MODEL_PATH = os.getenv("MODEL_PATH", "models/fraud_model.joblib")
MODEL_VERSION = os.getenv("MODEL_VERSION", "v1.0.2")
CONSUL_HOST = os.getenv("CONSUL_HOST", "127.0.0.1")
CONSUL_PORT = int(os.getenv("CONSUL_PORT", "8500"))
SERVICE_NAME = "ml-inference-service"
# Generate a unique ID for this specific instance of the service
SERVICE_ID = f"{SERVICE_NAME}-{uuid.uuid4()}"
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
def get_local_ip():
"""Helper function to get the primary IP of the node."""
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
try:
# doesn't even have to be reachable
s.connect(('10.255.255.255', 1))
IP = s.getsockname()[0]
except Exception:
IP = '127.0.0.1'
finally:
s.close()
return IP
class InferenceServicer(inference_pb2_grpc.InferenceServicer):
"""Implements the gRPC Inference service."""
def __init__(self):
try:
self.model = joblib.load(MODEL_PATH)
self.node_ip = get_local_ip()
logging.info(f"Model {MODEL_PATH} (version: {MODEL_VERSION}) loaded successfully.")
except FileNotFoundError:
logging.error(f"Model file not found at {MODEL_PATH}. Exiting.")
raise
except Exception as e:
logging.error(f"Failed to load model: {e}")
raise
def Predict(self, request, context):
"""Handles prediction requests."""
try:
# The pitfall here is data transformation. The raw features from the map
# must be converted into the exact format (e.g., a NumPy array with a specific column order)
# that the Scikit-learn model expects. This is a common source of runtime errors.
# For simplicity, we assume a simple feature vector. A real implementation
# would have more robust feature engineering logic here.
feature_vector = [
request.features["amount"].float_value,
request.features["age"].int_value,
request.features["login_frequency"].float_value,
]
prediction_result = self.model.predict([feature_vector])
return inference_pb2.PredictResponse(
prediction=prediction_result[0],
model_version=MODEL_VERSION,
node_id=SERVICE_ID
)
except KeyError as e:
# Handle cases where a required feature is missing
context.set_code(grpc.StatusCode.INVALID_ARGUMENT)
context.set_details(f"Missing feature in request: {e}")
logging.warning(f"Prediction failed due to missing feature: {e}")
return inference_pb2.PredictResponse()
except Exception as e:
context.set_code(grpc.StatusCode.INTERNAL)
context.set_details(f"An internal error occurred: {e}")
logging.error(f"Internal error during prediction: {e}", exc_info=True)
return inference_pb2.PredictResponse()
class MLServer:
    def __init__(self):
        self.consul_client = consul.Consul(host=CONSUL_HOST, port=CONSUL_PORT)
        self.server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
        inference_pb2_grpc.add_InferenceServicer_to_server(InferenceServicer(), self.server)
        # Register the standard grpc.health.v1 service and mark ourselves SERVING.
        # Without this, Consul's gRPC check receives UNIMPLEMENTED and the node
        # never becomes healthy.
        health_servicer = health.HealthServicer()
        health_servicer.set(SERVICE_NAME, health_pb2.HealthCheckResponse.SERVING)
        health_pb2_grpc.add_HealthServicer_to_server(health_servicer, self.server)
        self.server.add_insecure_port(f'[::]:{PORT}')
def register_with_consul(self):
"""Registers the service and its health check with Consul."""
ip = get_local_ip()
        # Define a gRPC health check. Consul will periodically call the standard
        # grpc.health.v1 service at this endpoint, which is more reliable than a
        # simple TCP check. python-consul's Check helpers don't include gRPC, so
        # we pass the raw check definition that Consul's agent API expects.
        grpc_health_check = {
            "GRPC": f"{ip}:{PORT}/{SERVICE_NAME}",
            "Interval": "10s",
            "Timeout": "5s",
            # A common mistake is to omit DeregisterCriticalServiceAfter: it
            # ensures that if the node dies ungracefully, Consul eventually
            # removes the stale instance.
            "DeregisterCriticalServiceAfter": "30s",
        }
        tags = [f"version:{MODEL_VERSION}", "grpc", "ml"]
        logging.info(f"Registering with Consul: ID={SERVICE_ID}, Name={SERVICE_NAME}, IP={ip}, Port={PORT}")
        # The registration itself.
self.consul_client.agent.service.register(
name=SERVICE_NAME,
service_id=SERVICE_ID,
address=ip,
port=PORT,
tags=tags,
check=grpc_health_check
)
def deregister_from_consul(self):
"""Deregisters the service from Consul."""
logging.info(f"Deregistering service {SERVICE_ID} from Consul.")
self.consul_client.agent.service.deregister(service_id=SERVICE_ID)
def serve(self):
"""Starts the server and handles graceful shutdown."""
self.register_with_consul()
self.server.start()
logging.info(f"Server started on port {PORT}. Awaiting requests.")
        # Graceful shutdown handler
        def handle_shutdown(signum, frame):
            logging.info("Shutdown signal received. Cleaning up...")
            self.deregister_from_consul()
            # stop(10) allows in-flight requests up to 10s to complete
            self.server.stop(10).wait()
            logging.info("Server stopped gracefully.")
            # Unblock the keep-alive loop below; without this the process
            # would keep sleeping after the server has already stopped.
            raise SystemExit(0)
        signal(SIGTERM, handle_shutdown)
        try:
            # Keep the main thread alive
            while True:
                time.sleep(86400)
        except KeyboardInterrupt:
            handle_shutdown(None, None)
if __name__ == '__main__':
server = MLServer()
server.serve()
This Python service is now a self-contained, responsible member of the distributed system. It knows how to announce its availability and health, and how to cleanly remove itself.
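Before standing up the gateway, it is worth smoke-testing a node directly. Here is a minimal sketch, assuming the generated `gateway/proto/inference` package and a node listening on `localhost:50051` (both assumptions from this article's layout, not code shown above):

```go
// cmd/smoketest/main.go - one-shot Predict call against a single node.
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	"gateway/proto/inference"
)

func main() {
	conn, err := grpc.Dial("localhost:50051", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Feature names mirror what the Python servicer expects.
	req := &inference.PredictRequest{Features: map[string]*inference.FeatureValue{
		"amount":          {Value: &inference.FeatureValue_FloatValue{FloatValue: 1250.0}},
		"age":             {Value: &inference.FeatureValue_IntValue{IntValue: 34}},
		"login_frequency": {Value: &inference.FeatureValue_FloatValue{FloatValue: 0.8}},
	}}

	resp, err := inference.NewInferenceClient(conn).Predict(ctx, req)
	if err != nil {
		log.Fatalf("predict: %v", err)
	}
	log.Printf("prediction=%.4f model=%s node=%s", resp.Prediction, resp.ModelVersion, resp.NodeId)
}
```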
Declarative Node Provisioning with Chef
A self-registering service is useless if we can’t deploy it reliably. This is Chef’s role. The cookbook defines the desired state of an “ML Inference Node.”
Here is a simplified recipe that handles dependencies, application code deployment, and systemd service setup.
# cookbooks/ml_inference_node/recipes/default.rb
# 1. Install system dependencies
apt_update 'update'
package 'python3'
package 'python3-pip'
package 'build-essential' # Needed for some pip package builds
# 2. Install Consul Agent
# In a production setup, you would use a dedicated Consul cookbook.
# This is a simplified example: fetch the release zip, then unpack the binary.
package 'unzip'

remote_file '/tmp/consul_1.13.1_linux_amd64.zip' do
  source 'https://releases.hashicorp.com/consul/1.13.1/consul_1.13.1_linux_amd64.zip'
  action :create
end

execute 'install-consul' do
  command 'unzip -o /tmp/consul_1.13.1_linux_amd64.zip -d /usr/local/bin && chmod 0755 /usr/local/bin/consul'
  not_if { ::File.exist?('/usr/local/bin/consul') }
end
# Create Consul config directory
directory '/etc/consul.d' do
owner 'root'
group 'root'
mode '0755'
action :create
end
# Basic Consul client config
file '/etc/consul.d/consul.json' do
content <<~JSON
{
"server": false,
"datacenter": "dc1",
"data_dir": "/opt/consul",
"retry_join": ["10.0.1.10", "10.0.1.11", "10.0.1.12"]
}
JSON
mode '0644'
end
# Setup Consul systemd service
systemd_unit 'consul.service' do
content <<~UNIT
[Unit]
Description="HashiCorp Consul - A service mesh solution"
[Service]
ExecStart=/usr/local/bin/consul agent -config-dir=/etc/consul.d/
Restart=on-failure
[Install]
WantedBy=multi-user.target
UNIT
action [:create, :enable, :start]
end
# 3. Deploy the ML Service Application
app_dir = '/opt/ml_service'
directory app_dir do
recursive true
action :create
end
# Copy application files from the cookbook
cookbook_file "#{app_dir}/server.py" do
source 'server.py'
mode '0755'
end
# In a real project, protobuf files would be compiled as part of a CI pipeline.
# Here we copy them directly.
directory "#{app_dir}/proto"
directory "#{app_dir}/models"
cookbook_file "#{app_dir}/proto/inference_pb2.py" # ...
cookbook_file "#{app_dir}/proto/inference_pb2_grpc.py" # ...
# This is where we would get the model file, perhaps from S3 or Artifactory.
# A common mistake is bundling large model files directly in the cookbook.
# Note: remote_file only speaks http(s)/ftp, so use a presigned HTTPS URL
# (or the aws_s3_file resource from the aws cookbook for native S3 access).
remote_file "#{app_dir}/models/fraud_model.joblib" do
  source "https://my-model-bucket.s3.amazonaws.com/fraud_model-#{node['ml_app']['model_version']}.joblib"
  # ... add credentials and properties ...
  action :create
end
# Install Python dependencies.
# Note: pip_requirements comes from the poise-python cookbook; with core
# Chef resources, an execute 'pip3 install -r requirements.txt' works too.
pip_requirements "#{app_dir}/requirements.txt"
# 4. Create and enable the systemd service for our ML app
systemd_unit 'ml_service.service' do
content({
Unit: {
Description: 'ML Inference gRPC Service',
After: 'network.target consul.service',
},
Service: {
Type: 'simple',
User: 'nobody',
ExecStart: "/usr/bin/python3 #{app_dir}/server.py",
Restart: 'on-failure',
# Pass configuration via environment variables
Environment: [
"MODEL_PATH=#{app_dir}/models/fraud_model.joblib",
"MODEL_VERSION=#{node['ml_app']['model_version']}",
"CONSUL_HOST=127.0.0.1" # The service talks to its local Consul agent
],
},
Install: {
WantedBy: 'multi-user.target',
},
})
action [:create, :enable, :start]
end
With this Chef cookbook, provisioning a new, fully functional inference node is reduced to a single command: `knife bootstrap <IP> -r 'recipe[ml_inference_node]'`. The node comes online, installs its dependencies, starts Consul, starts the ML service, and the service automatically registers itself.
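To confirm a bootstrap worked end to end, a throwaway Go program can list the healthy instances via the same Health API the gateway relies on later (a sketch, assuming the default local Consul agent):

```go
// cmd/verify/main.go - confirm a freshly bootstrapped node registered itself.
package main

import (
	"log"

	consulapi "github.com/hashicorp/consul/api"
)

func main() {
	client, err := consulapi.NewClient(consulapi.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	// The `true` argument filters to instances whose health checks are passing.
	entries, _, err := client.Health().Service("ml-inference-service", "", true, nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, e := range entries {
		log.Printf("healthy: %s at %s:%d tags=%v",
			e.Service.ID, e.Service.Address, e.Service.Port, e.Service.Tags)
	}
}
```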
The Dynamic Go Gateway: Discovery and Load Balancing
The gateway is the consumer of this dynamic fleet. It cannot have a static list of backends. It must query Consul to get a fresh list of healthy inference nodes and route requests to them. A simple query isn’t enough; the gateway needs to react to changes in the fleet in real-time. This is achieved using Consul’s blocking query mechanism.
// gateway/main.go
package main
import (
	"context"
	"fmt"
	"log"
	"math/rand"
	"net/http"
	"sync"
	"time"

	"github.com/gin-gonic/gin"
	consulapi "github.com/hashicorp/consul/api"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	"gateway/proto/inference" // Your generated Go protobuf package
)
const (
consulAddress = "http://127.0.0.1:8500"
serviceName = "ml-inference-service"
)
// ServicePool manages the connections to the discovered ML services.
type ServicePool struct {
sync.RWMutex
clients []inference.InferenceClient // Pool of gRPC clients
conns []*grpc.ClientConn // Underlying connections to close later
}
// Get returns a random healthy client from the pool.
// A production implementation should use a better strategy (e.g., round-robin, least connections).
func (p *ServicePool) Get() inference.InferenceClient {
p.RLock()
defer p.RUnlock()
if len(p.clients) == 0 {
return nil
}
return p.clients[rand.Intn(len(p.clients))]
}
// Update rebuilds the connection pool based on a new list of service instances.
func (p *ServicePool) Update(instances []*consulapi.ServiceEntry) {
p.Lock()
defer p.Unlock()
log.Printf("Updating service pool with %d instances", len(instances))
// Close all old connections. The pitfall here is not cleaning up
// old connections, leading to resource leaks.
for _, conn := range p.conns {
conn.Close()
}
p.clients = []inference.InferenceClient{}
p.conns = []*grpc.ClientConn{}
for _, inst := range instances {
// The address for gRPC connection is taken directly from Consul's service entry.
addr := fmt.Sprintf("%s:%d", inst.Service.Address, inst.Service.Port)
// In a real-world project, you'd configure retry policies, timeouts, and credentials here.
conn, err := grpc.Dial(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
if err != nil {
log.Printf("Failed to connect to service at %s: %v", addr, err)
continue
}
client := inference.NewInferenceClient(conn)
p.clients = append(p.clients, client)
p.conns = append(p.conns, conn)
}
log.Printf("Service pool updated. Active clients: %d", len(p.clients))
}
// watchConsulServices monitors Consul for changes to the ML service fleet.
func watchConsulServices(pool *ServicePool) {
	config := consulapi.DefaultConfig()
	config.Address = consulAddress // point the client at our Consul agent
	client, err := consulapi.NewClient(config)
	if err != nil {
		log.Fatalf("Failed to create consul client: %v", err)
	}
var lastIndex uint64 = 0
for {
// This is the core of dynamic discovery: the blocking query.
// The query will wait until there's a change after `lastIndex` or a timeout occurs.
// A common mistake is to implement a simple polling loop, which is inefficient.
opts := &consulapi.QueryOptions{
WaitIndex: lastIndex,
WaitTime: 5 * time.Minute,
}
// We query for "passing" health checks only.
instances, meta, err := client.Health().Service(serviceName, "", true, opts)
if err != nil {
log.Printf("Error watching consul service: %v", err)
time.Sleep(5 * time.Second) // Backoff on error
continue
}
		// If the index hasn't changed, there's no update.
		if meta.LastIndex == lastIndex {
			continue
		}
		// Per Consul's blocking-query guidance, reset if the index goes backwards.
		if meta.LastIndex < lastIndex {
			lastIndex = 0
			continue
		}
		lastIndex = meta.LastIndex
pool.Update(instances)
}
}
func main() {
pool := &ServicePool{}
// Run the Consul watcher in the background.
go watchConsulServices(pool)
router := gin.Default()
router.POST("/predict", func(c *gin.Context) {
// In a real API, you would have proper request validation.
var jsonRequest map[string]interface{}
if err := c.ShouldBindJSON(&jsonRequest); err != nil {
c.JSON(http.StatusBadRequest, gin.H{"error": "Invalid request body"})
return
}
// Get a client from our dynamic pool.
client := pool.Get()
if client == nil {
c.JSON(http.StatusServiceUnavailable, gin.H{"error": "No healthy ML inference services available"})
return
}
// Convert the JSON request to the Protobuf format.
// This mapping logic can be complex and is a critical part of the gateway's responsibility.
features := make(map[string]*inference.FeatureValue)
// ... logic to populate `features` from `jsonRequest` ...
grpcRequest := &inference.PredictRequest{Features: features}
// Make the gRPC call.
ctx, cancel := context.WithTimeout(c.Request.Context(), 2*time.Second)
defer cancel()
resp, err := client.Predict(ctx, grpcRequest)
if err != nil {
c.JSON(http.StatusInternalServerError, gin.H{"error": fmt.Sprintf("Failed to get prediction: %v", err)})
return
}
c.JSON(http.StatusOK, gin.H{
"prediction": resp.Prediction,
"model_version": resp.ModelVersion,
"serving_node": resp.NodeId,
})
})
log.Println("Gateway server starting on port 8080")
router.Run(":8080")
}
This gateway is now resilient. It doesn’t care about specific IPs. It only cares about the service name registered in Consul. We can add or remove ML nodes, and the gateway will adapt within seconds without a restart or configuration change.
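The one piece deliberately elided above is the JSON-to-Protobuf mapping. A hedged sketch of a `buildFeatures` helper (a hypothetical name, not part of the gateway as shown) illustrates the shape of that logic; note that `encoding/json` decodes every JSON number as `float64`, so integer features like `age` need explicit agreement between the gateway and the Python servicer:

```go
// buildFeatures converts a decoded JSON body into the Protobuf feature map.
// Illustrative only; a real gateway would validate against a feature schema.
func buildFeatures(jsonRequest map[string]interface{}) map[string]*inference.FeatureValue {
	features := make(map[string]*inference.FeatureValue, len(jsonRequest))
	for name, raw := range jsonRequest {
		switch v := raw.(type) {
		case float64: // encoding/json decodes all JSON numbers as float64
			features[name] = &inference.FeatureValue{
				Value: &inference.FeatureValue_FloatValue{FloatValue: v},
			}
		case string:
			features[name] = &inference.FeatureValue{
				Value: &inference.FeatureValue_StringValue{StringValue: v},
			}
		case bool:
			// The contract has no bool field; encode as int64 0/1.
			var i int64
			if v {
				i = 1
			}
			features[name] = &inference.FeatureValue{
				Value: &inference.FeatureValue_IntValue{IntValue: i},
			}
		default:
			// Skip nested objects/arrays; a production gateway should reject them.
		}
	}
	return features
}
```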
UI for Visibility: Styled-components in Action
To close the loop, a simple dashboard provides visibility into the fleet's health. We expose an endpoint in the Go gateway (`/services`) that simply returns the list of healthy instances it knows about from Consul. A React app consumes this endpoint.
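Here is a sketch of that endpoint, assuming `ServicePool` also keeps the raw `[]*consulapi.ServiceEntry` from `Update` in an `instances` field (an addition to the pool shown earlier, plus a `strings` import):

```go
// servicesHandler returns the fleet view the dashboard polls.
func servicesHandler(pool *ServicePool) gin.HandlerFunc {
	type serviceStatus struct {
		ID      string `json:"ID"`
		Name    string `json:"Name"`
		Address string `json:"Address"`
		Port    int    `json:"Port"`
		Status  string `json:"Status"`
		Version string `json:"Version"`
	}
	return func(c *gin.Context) {
		pool.RLock()
		defer pool.RUnlock()
		out := make([]serviceStatus, 0, len(pool.instances))
		for _, inst := range pool.instances {
			version := ""
			for _, tag := range inst.Service.Tags {
				if strings.HasPrefix(tag, "version:") {
					version = strings.TrimPrefix(tag, "version:")
				}
			}
			out = append(out, serviceStatus{
				ID:      inst.Service.ID,
				Name:    inst.Service.Service,
				Address: inst.Service.Address,
				Port:    inst.Service.Port,
				Status:  inst.Checks.AggregatedStatus(),
				Version: version,
			})
		}
		c.JSON(http.StatusOK, out)
	}
}

// Wired up in main(): router.GET("/services", servicesHandler(pool))
```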
Styled-components are used here to create encapsulated, reusable UI elements.
// components/FleetStatusDashboard.js
import React, { useState, useEffect } from 'react';
import styled, { keyframes } from 'styled-components';
// --- Styled Components Definition ---
const DashboardWrapper = styled.div`
padding: 2rem;
font-family: sans-serif;
background-color: #f4f7f9;
`;
const Title = styled.h1`
color: #2c3e50;
border-bottom: 2px solid #3498db;
padding-bottom: 0.5rem;
`;
const Grid = styled.div`
display: grid;
grid-template-columns: repeat(auto-fill, minmax(300px, 1fr));
gap: 1.5rem;
margin-top: 1.5rem;
`;
const pulse = keyframes`
0% { background-color: #2ecc71; }
50% { background-color: #25a25a; }
100% { background-color: #2ecc71; }
`;
const StatusIndicator = styled.span`
display: inline-block;
width: 12px;
height: 12px;
border-radius: 50%;
margin-right: 0.75rem;
background-color: ${props => (props.status === 'passing' ? '#2ecc71' : '#e74c3c')};
animation: ${props => (props.status === 'passing' ? pulse : 'none')} 2s infinite;
`;
const Card = styled.div`
background: white;
border-radius: 8px;
box-shadow: 0 4px 12px rgba(0,0,0,0.08);
padding: 1.5rem;
transition: transform 0.2s ease-in-out;
&:hover {
transform: translateY(-5px);
}
`;
const CardHeader = styled.div`
display: flex;
align-items: center;
font-size: 1.1rem;
font-weight: 600;
color: #34495e;
`;
const InfoList = styled.ul`
list-style: none;
padding: 0;
margin-top: 1rem;
color: #7f8c8d;
`;
const InfoItem = styled.li`
display: flex;
justify-content: space-between;
padding: 0.5rem 0;
border-bottom: 1px solid #ecf0f1;
&:last-child {
border-bottom: none;
}
strong {
color: #555;
}
`;
// --- The React Component ---
const FleetStatusDashboard = () => {
const [services, setServices] = useState([]);
const [error, setError] = useState(null);
useEffect(() => {
const fetchServices = async () => {
try {
// This endpoint is provided by our Go gateway.
const response = await fetch('/api/gateway/services');
if (!response.ok) {
throw new Error('Failed to fetch service status');
}
const data = await response.json();
setServices(data);
} catch (err) {
setError(err.message);
}
};
fetchServices();
const intervalId = setInterval(fetchServices, 5000); // Poll every 5 seconds
return () => clearInterval(intervalId);
}, []);
return (
<DashboardWrapper>
<Title>ML Inference Fleet Status</Title>
{error && <p>Error: {error}</p>}
<Grid>
{services.map(service => (
<Card key={service.ID}>
<CardHeader>
<StatusIndicator status={service.Status} />
{service.Name}
</CardHeader>
<InfoList>
<InfoItem><strong>Node ID:</strong> <span>{service.ID.slice(-12)}</span></InfoItem>
<InfoItem><strong>Address:</strong> <span>{service.Address}:{service.Port}</span></InfoItem>
<InfoItem><strong>Version:</strong> <span>{service.Version}</span></InfoItem>
</InfoList>
</Card>
))}
</Grid>
</DashboardWrapper>
);
};
export default FleetStatusDashboard;
This dashboard, though simple, provides immense value. It makes the abstract concept of a “dynamic fleet” tangible. Operators can now visually confirm that a new node, once provisioned by Chef, appears on the dashboard with a “passing” status within seconds.
The final architecture can be visualized as follows:
flowchart TD
    subgraph "Chef Provisioning"
        direction LR
        chef_server[Chef Server] -- Cookbook --> new_vm[New VM]
    end
    subgraph "Runtime Fleet"
        new_vm -- Installs & Starts --> python_node[Python/gRPC/scikit-learn Node]
        python_node -- 1. Register & Health Check --> consul[Consul Cluster]
        python_node2[Python/gRPC/scikit-learn Node] -- Register --> consul
        python_node3[Python/gRPC/scikit-learn Node] -- Register --> consul
    end
    subgraph "Gateway & UI"
        user[User] --> ui[React UI w/ Styled-components]
        ui -- API Call --> go_gateway[Go Gateway API]
        go_gateway -- HTTP Request --> go_grpc[gRPC Client Pool]
        go_grpc -- gRPC Call --> python_node
    end
    go_grpc -- 2. Watch for Services --> consul
The system we built achieves the initial goal of decoupling, but it’s not without its own set of trade-offs and future considerations. The client-side, random load-balancing in the Go gateway is rudimentary; a production system would benefit from a more intelligent strategy like round-robin or least-connection, or even offloading this to a service mesh like Consul Connect. Model versioning is still managed by environment variables in the Chef recipe, which is not ideal. A proper solution would involve a dedicated model registry that the inference nodes could query for the correct model to load, perhaps using Consul’s KV store as a starting point. Finally, all communication is currently unencrypted. The natural next step is to leverage Consul Connect to enforce mutual TLS between the gateway and the inference fleet, securing all traffic without changing a single line of application code.
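For reference, the round-robin variant mentioned above is a small change to the pool; here is a sketch under the same assumptions as the gateway code (it additionally needs `sync/atomic`):

```go
// rrPool embeds ServicePool and replaces random selection with a rotating cursor.
type rrPool struct {
	ServicePool
	next uint64 // atomic cursor
}

func (p *rrPool) Get() inference.InferenceClient {
	p.RLock()
	defer p.RUnlock()
	n := len(p.clients)
	if n == 0 {
		return nil
	}
	// The atomic increment is safe under the shared read lock; several
	// readers may rotate the cursor concurrently.
	idx := atomic.AddUint64(&p.next, 1)
	return p.clients[int(idx%uint64(n))]
}
```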