Building a GitOps-Managed TensorFlow Serving Platform with Dynamic Configuration via Consul

MLOps

The initial state of our model deployment process was, to put it mildly, chaotic. Each data science team had its own collection of shell scripts, scp commands, and manually versioned model files on a shared network drive. A “deployment” involved a senior engineer SSH’ing into a production server, backing up the old model directory, and copying over the new one. Rollbacks were a frantic search for the last known good “version,” and configuration changes—like adjusting thread pools for TensorFlow Serving—required a manual restart and a prayer. This approach was not just inefficient; it was dangerously fragile and completely opaque. It was clear that treating machine learning models as second-class citizens, exempt from the rigor of modern software delivery practices, was an accumulating source of technical debt and operational risk.

Our first principle for the redesign was simple: Model deployment is a software deployment. This meant we needed a structured, automated, and auditable process. The initial concept was to build a CI/CD pipeline that would terminate in a declarative deployment, which naturally led us to GitOps. The source of truth for what model version should be running in any given environment would be a Git repository. This solved the auditability and rollback problem almost immediately. If a model performs poorly, git revert is the emergency brake.

This led to the technology selection phase, where every choice was a trade-off.

Model Serving: TensorFlow Serving was the obvious choice. It’s a high-performance, production-ready system specifically designed for our model format. We containerized it to run on our existing Kubernetes platform. The challenge wasn’t running it, but managing it at scale.
Model Lifecycle & Metadata: Simply storing model files in Git LFS or an object store wasn’t enough. We needed metadata: which dataset was it trained on? What were its validation metrics (accuracy, loss)? Who trained it? A simple object store provides none of this context. A relational database, specifically PostgreSQL, was chosen to act as our Model Registry. It offered structured, transactional storage for all metadata, ensuring that a registered model version was an immutable, well-defined entity. In a real-world project, using a filesystem for metadata is a recipe for inconsistency.
CI/CD & GitOps: We used GitLab CI for our continuous integration pipeline. It handles the testing, training, and registration of models. For the continuous delivery part, we adopted ArgoCD. It continuously reconciles the state of our Kubernetes cluster with the desired state defined in a Git repository containing our Kubernetes manifests (specifically, Helm charts).
Dynamic Configuration: Here was the crux of the problem. A pure GitOps workflow is powerful but can be slow. If we needed to shift 10% of traffic to a new canary model, creating a pull request, getting it approved, and waiting for the ArgoCD sync cycle could take minutes. For operational toggles or rapid A/B test adjustments, this latency was unacceptable. This is where we decided to introduce HashiCorp Consul. Consul’s Key-Value (KV) store would serve as a real-time configuration backbone. The GitOps process would set the default or base configuration, but a small, privileged set of operators could modify Consul KV entries for immediate effect, bypassing the Git-to-cluster latency. The TensorFlow Serving pods would need to react to these changes without restarting.
Management UI: The entire system, while powerful, was a collection of disparate parts. We needed a single pane of glass for data scientists and ML engineers to view the model registry, track deployment status, and observe real-time configuration values. We chose Solid.js for this internal dashboard due to its fine-grained reactivity and performance. It felt like the right tool for a UI that needed to display potentially fast-updating streams of data from multiple sources without the overhead of a virtual DOM.

The final architecture can be visualized as a flow of artifacts and information.

flowchart TD
    subgraph "CI: Model Build & Registration"
        A[Developer pushes training code] --> B(GitLab CI Pipeline);
        B -- 1. Train Model --> C{TensorFlow};
        C -- 2. Produces model.pb --> D[S3 Artifact Storage];
        B -- 3. Records Metadata --> E[(PostgreSQL Model Registry)];
        D -- Artifact URL --> E;
    end

    subgraph "CD: GitOps Deployment"
        F[ML Engineer updates Git Repo] -- specify model version from Registry --> G{ArgoCD};
        G -- syncs --> H[Kubernetes Cluster];
        H -- deploys/updates --> I(TF Serving Pod);
    end

    subgraph "Dynamic Configuration & Serving"
        J[Operator/Automation] -- updates KV for A/B testing --> K(Consul KV);
        I -- watches for changes --> K;
        L[User Request] --> I;
        I -- serves prediction --> L;
    end
    
    subgraph "Observability UI"
        M[Solid.js Dashboard] -- reads data --> E;
        M -- reads config --> K;
        N[ML Engineer/Data Scientist] -- views --> M;
    end

    E -- provides model version info --> F;

Part 1: The CI Pipeline - From Code to Registered Model

The CI pipeline is the entry point. Its responsibility is to produce an immutable, versioned, and validated model artifact along with its associated metadata. A common mistake is to only version the code; the model artifact, the data it was trained on, and its performance metrics must all be versioned together.

Here is the SQL schema for our model_registry table in PostgreSQL. It’s designed to be the single source of truth for model existence and performance.

-- models.sql
CREATE TABLE model_registry (
    id SERIAL PRIMARY KEY,
    model_name VARCHAR(255) NOT NULL,
    model_version VARCHAR(50) NOT NULL,
    git_commit_hash CHAR(40) NOT NULL,
    artifact_url TEXT NOT NULL,
    created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
    -- Core training metadata
    training_data_source TEXT,
    -- Example performance metrics
    validation_accuracy DOUBLE PRECISION,
    validation_loss DOUBLE PRECISION,
    -- Deployment status
    status VARCHAR(20) DEFAULT 'PENDING' NOT NULL, -- e.g., PENDING, APPROVED, DEPRECATED

    CONSTRAINT unique_model_version UNIQUE (model_name, model_version)
);

CREATE INDEX idx_model_name ON model_registry (model_name);

The GitLab CI pipeline executes the training script, and upon successful validation, it runs a Python script to populate this table.
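The "successful validation" check mentioned above is not shown in the pipeline itself. One way such a gate could look is sketched below. The thresholds and the names check_metrics and gate are hypothetical; the only thing taken from the article is that the metrics file carries accuracy and loss values.

```python
import json
import sys

# Hypothetical thresholds; real values depend on the model and dataset.
MIN_ACCURACY = 0.90
MAX_LOSS = 0.35


def check_metrics(metrics, min_accuracy=MIN_ACCURACY, max_loss=MAX_LOSS):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    accuracy = metrics.get("accuracy")
    loss = metrics.get("loss")
    # A missing metric is treated as a failure rather than silently accepted.
    if accuracy is None or accuracy < min_accuracy:
        failures.append(f"accuracy {accuracy} is below the minimum {min_accuracy}")
    if loss is None or loss > max_loss:
        failures.append(f"loss {loss} is above the maximum {max_loss}")
    return failures


def gate(path):
    """Read a metrics JSON file and return a process exit code (0 = pass)."""
    with open(path) as f:
        failures = check_metrics(json.load(f))
    for failure in failures:
        print(f"validation gate: {failure}", file=sys.stderr)
    return 1 if failures else 0
```

A CI job could end with sys.exit(gate("model_metrics.json")), so that a non-zero exit code stops the pipeline before the register stage runs.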

Here’s a simplified .gitlab-ci.yml showing the key stages.

# .gitlab-ci.yml
stages:
  - build
  - test
  - train
  - register

variables:
  MODEL_NAME: "fraud-detector"
  # Version is dynamically generated, e.g., from a tag or timestamp
  MODEL_VERSION: "v1.${CI_PIPELINE_IID}"
  
build_image:
  stage: build
  script:
    - docker build -t my-training-env:latest .
    - docker push my-training-env:latest

unit_tests:
  stage: test
  script:
    - echo "Running unit tests for data processing and model architecture..."
    - python -m pytest tests/

train_model:
  stage: train
  script:
    - echo "Starting model training..."
    - python train.py --model-name ${MODEL_NAME} --model-version ${MODEL_VERSION}
    - echo "Training complete. Model artifact saved to ./models/${MODEL_NAME}/${MODEL_VERSION}/"
    - echo "Uploading artifact to S3..."
    # Assume awscli is configured
    - aws s3 sync ./models/${MODEL_NAME}/${MODEL_VERSION}/ s3://my-model-artifacts/${MODEL_NAME}/${MODEL_VERSION}/
  artifacts:
    paths:
      - ./model_metrics.json
    expire_in: 1 day

register_model:
  stage: register
  needs: ["train_model"]
  script:
    - echo "Registering model in PostgreSQL..."
    # The Python script reads DB credentials from CI/CD variables
    - pip install -r requirements_registry.txt
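    # A literal block scalar keeps the backslash line continuations intact;
    # a plain multi-line scalar would fold them into escaped spaces.
    - |
      python scripts/register_model.py \
        --model-name "${MODEL_NAME}" \
        --model-version "${MODEL_VERSION}" \
        --commit-hash "${CI_COMMIT_SHA}" \
        --artifact-url "s3://my-model-artifacts/${MODEL_NAME}/${MODEL_VERSION}/" \
        --metrics-file ./model_metrics.json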

The register_model.py script is critical. It’s a transactional piece of code that ensures the model is only registered if all information is present.

# scripts/register_model.py
import os
import json
import argparse
import psycopg2
import sys
import logging

# Basic logging setup
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def get_db_connection():
    """Establishes a connection to the PostgreSQL database."""
    try:
        conn = psycopg2.connect(
            dbname=os.environ.get("DB_NAME"),
            user=os.environ.get("DB_USER"),
            password=os.environ.get("DB_PASSWORD"),
            host=os.environ.get("DB_HOST"),
            port=os.environ.get("DB_PORT", "5432")
        )
        return conn
    except psycopg2.OperationalError as e:
        logging.error(f"Database connection failed: {e}")
        sys.exit(1)

def register_model_in_db(args, metrics):
    """Inserts a new model version into the registry."""
    sql = """
        INSERT INTO model_registry (
            model_name, model_version, git_commit_hash, artifact_url,
            validation_accuracy, validation_loss, status
        ) VALUES (%s, %s, %s, %s, %s, %s, %s);
    """
    conn = None
    try:
        conn = get_db_connection()
        cur = conn.cursor()
        cur.execute(sql, (
            args.model_name,
            args.model_version,
            args.commit_hash,
            args.artifact_url,
            metrics.get('accuracy'),
            metrics.get('loss'),
            'PENDING'  # New models require approval before deployment
        ))
        conn.commit()
        cur.close()
        logging.info(f"Successfully registered model {args.model_name}:{args.model_version}")
    except Exception as error:  # psycopg2.DatabaseError is already a subclass of Exception
        logging.error(f"Error during model registration: {error}")
        if conn:
            conn.rollback()
        sys.exit(1)
    finally:
        if conn:
            conn.close()

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Register a new model in the database.")
    parser.add_argument("--model-name", required=True)
    parser.add_argument("--model-version", required=True)
    parser.add_argument("--commit-hash", required=True)
    parser.add_argument("--artifact-url", required=True)
    parser.add_argument("--metrics-file", required=True)
    
    args = parser.parse_args()

    try:
        with open(args.metrics_file, 'r') as f:
            metrics = json.load(f)
    except FileNotFoundError:
        logging.error(f"Metrics file not found: {args.metrics_file}")
        sys.exit(1)
    except json.JSONDecodeError:
        logging.error(f"Error decoding JSON from {args.metrics_file}")
        sys.exit(1)

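    # Refuse to register a model whose metrics are incomplete.
    missing = [key for key in ("accuracy", "loss") if metrics.get(key) is None]
    if missing:
        logging.error(f"Metrics file is missing required keys: {missing}")
        sys.exit(1)

    register_model_in_db(args, metrics)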

This disciplined CI process ensures that by the time a model is in the registry, it’s a well-defined, traceable asset.
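To illustrate how a deployment tool might consult the registry, the sketch below looks up the newest APPROVED version of a model. It is not the pipeline's actual code. It uses Python's built-in sqlite3 with a trimmed-down copy of the model_registry schema so that it runs without a PostgreSQL server, and the function names are hypothetical.

```python
import sqlite3

# Trimmed-down, SQLite-compatible stand-in for the model_registry table.
SCHEMA = """
CREATE TABLE model_registry (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    model_name TEXT NOT NULL,
    model_version TEXT NOT NULL,
    artifact_url TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'PENDING',
    UNIQUE (model_name, model_version)
);
"""


def latest_approved_version(conn, model_name):
    """Return (model_version, artifact_url) of the newest APPROVED row, or None."""
    return conn.execute(
        """
        SELECT model_version, artifact_url
        FROM model_registry
        WHERE model_name = ? AND status = 'APPROVED'
        ORDER BY id DESC
        LIMIT 1
        """,
        (model_name,),
    ).fetchone()


def build_demo_registry():
    """Create an in-memory registry with three illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    base = "s3://my-model-artifacts/fraud-detector"
    conn.executemany(
        "INSERT INTO model_registry (model_name, model_version, artifact_url, status) "
        "VALUES (?, ?, ?, ?)",
        [
            ("fraud-detector", "v1.233", f"{base}/v1.233/", "DEPRECATED"),
            ("fraud-detector", "v1.234", f"{base}/v1.234/", "APPROVED"),
            ("fraud-detector", "v1.235", f"{base}/v1.235/", "PENDING"),
        ],
    )
    return conn
```

Against PostgreSQL with psycopg2 the same query would use %s placeholders instead of ?. Ordering by id assumes rows are inserted in the order the models were built.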

Part 2: The GitOps Loop - Declarative Deployment

With the model registered, deployment becomes a declarative action. Our deployment configuration is managed in a separate Git repository, structured with Helm. To deploy a new model version, an engineer opens a pull request to change one line in a values.yaml file.

# deployment-repo/models/fraud-detector/values.yaml
replicaCount: 3

# Referenced by the deployment template to locate the artifact in S3
modelName: "fraud-detector"

# This is the key that gets updated to trigger a new deployment
modelVersion: "v1.234"

image:
  repository: tensorflow/serving
  tag: "latest"

resources:
  requests:
    cpu: "1"
    memory: "2Gi"
  limits:
    cpu: "2"
    memory: "4Gi"

ArgoCD monitors this repository. When the PR is merged, ArgoCD detects the change in modelVersion and triggers a rolling update of the TensorFlow Serving Deployment in Kubernetes.
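The one-line change to values.yaml can also be scripted, for example by a bot that opens the pull request. The helper below is a hypothetical sketch, not part of the original setup. It uses a regular expression rather than a YAML parser so that comments and formatting in the file survive the edit, and it assumes the version is written as a double-quoted string on a top-level modelVersion line.

```python
import re

# Matches a top-level `modelVersion: "..."` line and captures the key part.
_VERSION_LINE = re.compile(r'^(modelVersion:\s*)"[^"]*"', re.MULTILINE)


def set_model_version(values_yaml, new_version):
    """Return values_yaml with modelVersion replaced; raise if the key is not unique."""
    updated, count = _VERSION_LINE.subn(
        lambda match: f'{match.group(1)}"{new_version}"', values_yaml
    )
    if count != 1:
        raise ValueError(f"expected exactly one modelVersion line, found {count}")
    return updated
```

Failing loudly when the key is missing or duplicated keeps the script from producing a commit that silently changes nothing.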

The Helm template for the deployment is where we inject the configuration and introduce the Consul integration.

# deployment-repo/models/fraud-detector/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-tf-serving
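spec:
  replicas: {{ .Values.replicaCount }}
  # apps/v1 requires a selector that matches the pod template labels.
  selector:
    matchLabels:
      app: {{ .Release.Name }}-tf-serving
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}-tf-serving
    spec: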
      containers:
        - name: tensorflow-serving
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          # TF Serving reads a config file that our sidecar keeps up to date.
          command: ["/usr/bin/tensorflow_model_server"]
          args:
            - --port=8500
            - --rest_api_port=8501
            - --model_config_file=/models/models.config
            # The sidecar rewrites the file within seconds of a Consul change;
            # TF Serving re-reads it at this interval, so lower it for faster reloads.
            - --model_config_file_poll_wait_seconds=60
          volumeMounts:
            - name: model-storage
              mountPath: /models
        
        # This is the critical sidecar for dynamic configuration
        - name: consul-config-watcher
          image: "my-org/consul-watcher:latest"
          env:
            - name: CONSUL_HTTP_ADDR
              value: "consul-server.consul.svc:8500"
            - name: CONSUL_KV_PATH
              # This path is specific to this model deployment
              value: "tf-serving/config/{{ .Release.Name }}/models.config"
            - name: OUTPUT_CONFIG_PATH
              value: "/models/models.config"
          volumeMounts:
            - name: model-storage
              mountPath: /models
      
      initContainers:
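        # This init container downloads the correct model version from S3
        # before the main containers start. Note that TF Serving only loads
        # versions from integer-named subdirectories of base_path (e.g. 234),
        # so a label such as "v1.234" has to be mapped to a number on disk.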
        - name: model-downloader
          image: "amazon/aws-cli"
          command: ["/bin/sh", "-c"]
          args:
            - >
              aws s3 sync s3://my-model-artifacts/{{ .Values.modelName }}/{{ .Values.modelVersion }}/ /models/{{ .Values.modelName }}/{{ .Values.modelVersion }}/
          volumeMounts:
            - name: model-storage
              mountPath: /models
      volumes:
        - name: model-storage
          emptyDir: {}

The key takeaway here is the consul-config-watcher sidecar. The GitOps loop deploys the correct model binary, but the sidecar is responsible for managing the serving configuration in real-time.

Part 3: The Real-Time Layer - Dynamic Configuration with Consul

The problem with configuration stored solely in a Kubernetes ConfigMap and managed by GitOps is the reconciliation latency. We use Consul to manage configurations that need to change faster than a PR review cycle, such as routing for A/B testing.

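The models.config file used by TensorFlow Serving allows several models, or several versions of one model, to be served simultaneously. TensorFlow Serving does not split traffic on its own; which version answers a request is decided by the client or by a routing layer in front of it.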

# Example models.config content
model_config_list: {
  config: {
    name: "fraud-detector",
    base_path: "/models/fraud-detector/",
    model_platform: "tensorflow",
    model_version_policy: {
      specific: {
        versions: 123
        versions: 124
      }
    }
  }
}
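A config like the one above can be generated rather than typed by hand, which helps when the value is pushed to Consul by automation. The function below is a hypothetical sketch that renders the same structure for a list of versions. It only covers the fields shown in the example.

```python
def render_model_config(name, base_path, versions):
    """Render a TF Serving model_config_list that pins the given versions."""
    if not versions:
        raise ValueError("at least one version is required")
    # int() rejects labels such as "v1.234": TF Serving versions are integers.
    version_lines = "\n".join(f"        versions: {int(v)}" for v in versions)
    return (
        "model_config_list: {\n"
        "  config: {\n"
        f'    name: "{name}",\n'
        f'    base_path: "{base_path}",\n'
        '    model_platform: "tensorflow",\n'
        "    model_version_policy: {\n"
        "      specific: {\n"
        f"{version_lines}\n"
        "      }\n"
        "    }\n"
        "  }\n"
        "}\n"
    )
```

Calling render_model_config("fraud-detector", "/models/fraud-detector/", [123, 124]) yields text equivalent to the example config.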

Our consul-config-watcher is a simple Go application that uses the Consul API to watch a key and write its content to a file.

// main.go for the consul-config-watcher
package main

import (
	"log"
	"os"
	"time"

	"github.com/hashicorp/consul/api"
)

// writeFileAtomic writes to a temp file in the same directory and renames it
// into place, so TF Serving never reads a half-written config.
func writeFileAtomic(path string, data []byte) error {
	tmpPath := path + ".tmp"
	if err := os.WriteFile(tmpPath, data, 0644); err != nil {
		return err
	}
	return os.Rename(tmpPath, path)
}

func main() {
	consulAddr := os.Getenv("CONSUL_HTTP_ADDR")
	kvPath := os.Getenv("CONSUL_KV_PATH")
	outputPath := os.Getenv("OUTPUT_CONFIG_PATH")

	if consulAddr == "" || kvPath == "" || outputPath == "" {
		log.Fatal("CONSUL_HTTP_ADDR, CONSUL_KV_PATH, and OUTPUT_CONFIG_PATH must be set")
	}

	client, err := api.NewClient(&api.Config{Address: consulAddr})
	if err != nil {
		log.Fatalf("Error creating Consul client: %v", err)
	}

	var lastIndex uint64
	for {
		// Blocking query: returns when the key changes or the wait time elapses.
		kvPair, meta, err := client.KV().Get(kvPath, &api.QueryOptions{WaitIndex: lastIndex})
		if err != nil {
			log.Printf("Error watching key %s: %v. Retrying in 5s.", kvPath, err)
			time.Sleep(5 * time.Second)
			continue
		}

		// An unchanged index means the blocking query timed out with no changes.
		if meta.LastIndex == lastIndex {
			continue
		}
		// If the index ever goes backwards, restart the watch from scratch.
		if meta.LastIndex < lastIndex {
			lastIndex = 0
			continue
		}

		lastIndex = meta.LastIndex

		if kvPair == nil || kvPair.Value == nil {
			log.Printf("Key %s not found or value is nil. Ensuring target file is absent.", kvPath)
			// Handle deletion if necessary
			_ = os.Remove(outputPath)
			continue
		}

		log.Printf("Detected change in %s at index %d. Writing to %s", kvPath, lastIndex, outputPath)

		if err := writeFileAtomic(outputPath, kvPair.Value); err != nil {
			log.Printf("Error writing config file: %v", err)
		}
	}
}

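Now, an ML engineer can perform a canary release. They would first deploy v1.235 alongside v1.234 via GitOps, which assumes the init container is extended to download both versions, since the chart shown above fetches only the single modelVersion. Both model versions would then be on the pod’s filesystem, and an operator can update the Consul key to introduce the new version to the serving configuration.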

# Initial state in Consul (set by GitOps or bootstrap script)
consul kv put tf-serving/config/fraud-detector/models.config \
'model_config_list: { config: { name: "fraud-detector", base_path: "/models/fraud-detector/", model_version_policy: { specific: { versions: 234 } } } }'

# Operator performs a live update to start a canary for version 235
# This change is picked up by the sidecar within seconds.
consul kv put tf-serving/config/fraud-detector/models.config \
'model_config_list: { config: { name: "fraud-detector-v234", base_path: "/models/fraud-detector/234", ... }, config: { name: "fraud-detector-v235", base_path: "/models/fraud-detector/235", ...} }'

# Note: More advanced setups could use TF Serving's version labels (e.g. "stable"
# and "canary") and let callers select by label, but the principle of dynamic
# config from Consul remains the same.

This hybrid GitOps+Consul approach gives us auditable, declarative deployments for the model artifacts and fast, dynamic control over serving behavior.
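Because several operators or automation jobs may touch the same key, a live update is safer as a check-and-set write, which Consul's KV HTTP API supports through the cas query parameter. The sketch below shows the idea. The function names are hypothetical, and it assumes plain HTTP with no ACL token, which a production cluster would likely not allow.

```python
import json
import urllib.parse
import urllib.request


def cas_put_url(consul_addr, key, modify_index):
    """Build the Consul KV URL for a check-and-set write against modify_index."""
    quoted_key = urllib.parse.quote(key, safe="/")
    return f"http://{consul_addr}/v1/kv/{quoted_key}?cas={int(modify_index)}"


def cas_put(consul_addr, key, value, modify_index, timeout=5.0):
    """PUT value only if the key is unchanged since modify_index; True if applied."""
    request = urllib.request.Request(
        cas_put_url(consul_addr, key, modify_index),
        data=value.encode("utf-8"),
        method="PUT",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Consul answers with a bare JSON boolean: true if the write was applied.
        return json.loads(response.read().decode("utf-8")) is True
```

If cas_put returns False, someone else changed the key in the meantime, and the caller should re-read the current value before deciding whether to retry. The CLI offers the same behavior through consul kv put -cas -modify-index=<index>.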

Part 4: The UI - A Reactive Dashboard with Solid.js

The final piece is a dashboard to make this system usable for the entire team. It needs to display the contents of the model registry and the live configuration from Consul. Solid.js’s reactivity is perfect for this.

We’d have a backend API (e.g., in Go or Python) that exposes two endpoints:

/api/models/:model_name - Queries the PostgreSQL registry.
/api/models/:model_name/live_config - Queries the Consul KV store.
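For the live_config endpoint, the backend has to unwrap what Consul returns: a GET on /v1/kv/<key> yields a JSON array whose entries carry the value base64-encoded. A minimal sketch of that decoding step follows. The function name and the shape of the returned dictionary are assumptions, not the article's actual API.

```python
import base64
import json


def decode_kv_response(body):
    """Extract the UTF-8 value and ModifyIndex from a Consul KV GET response body."""
    # Consul answers 404 with an empty body when the key does not exist.
    if not body.strip():
        return None
    entries = json.loads(body)
    if not entries:
        return None
    entry = entries[0]
    raw = entry.get("Value")
    # A key can exist with a null value; treat that as an empty config.
    value = base64.b64decode(raw).decode("utf-8") if raw is not None else ""
    return {"config": value, "modify_index": entry["ModifyIndex"]}
```

Returning the ModifyIndex alongside the text lets the dashboard show when the live configuration last changed and gives any later check-and-set write the index it needs.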

A Solid.js component might look like this:

// src/components/ModelDashboard.jsx
import { createResource, For, Show } from "solid-js";

// Assume fetchers are defined to call our backend API
import { fetchModelVersions, fetchLiveConfig } from "../api";

function ModelDashboard(props) {
  // createResource automatically fetches and re-fetches when the source signal changes
  const [versions] = createResource(() => props.modelName, fetchModelVersions);
  const [config] = createResource(() => props.modelName, fetchLiveConfig, { initialValue: "Loading..." });

  return (
    <div class="model-dashboard">
      <h2>Model: {props.modelName}</h2>
      
      <div class="live-config">
        <h3>Live Serving Configuration</h3>
        <pre>
          <code>{config()}</code>
        </pre>
      </div>

      <div class="version-history">
        <h3>Registered Versions (from PostgreSQL)</h3>
        <Show when={!versions.loading} fallback={<div>Loading versions...</div>}>
          <table>
            <thead>
              <tr>
                <th>Version</th>
                <th>Commit</th>
                <th>Accuracy</th>
                <th>Status</th>
                <th>Created At</th>
              </tr>
            </thead>
            <tbody>
              <For each={versions()}>
                {(version) => (
                  <tr classList={{ 'approved-version': version.status === 'APPROVED' }}>
                    <td>{version.model_version}</td>
                    <td>{version.git_commit_hash.substring(0, 7)}</td>
                    <td>{version.validation_accuracy?.toFixed(4) ?? "n/a"}</td>
                    <td>{version.status}</td>
                    <td>{new Date(version.created_at).toLocaleString()}</td>
                  </tr>
                )}
              </For>
            </tbody>
          </table>
        </Show>
      </div>
    </div>
  );
}

export default ModelDashboard;

This UI ties everything together, providing crucial visibility that was completely absent in our original manual process. It transforms the system from a series of command-line tools into a cohesive platform.

The system isn’t without its complexities and remaining challenges. The consul-config-watcher sidecar adds another moving part to every pod, increasing the resource footprint slightly. Furthermore, Consul now becomes a Tier-0 critical component for our ML platform; its availability and performance must be rigorously monitored. The interaction between the slow GitOps loop and the fast Consul updates also requires clear operational guidelines: Git should always represent the desired steady-state, while Consul is for transient adjustments. Future work could involve building more sophisticated automation that programmatically updates Consul based on real-time performance metrics, effectively creating a closed-loop system for model canarying and rollback. The SQL registry could also be evolved to store more complex lineage, linking models directly to the specific version of the dataset used for training.
