Our machine learning model deployment pipeline was brittle. Every time a data scientist trained a superior model, promoting it to production involved a manual checklist: a pull request to update a gateway configuration file, a scheduled deployment window, and a coordinated effort to monitor the new endpoint. A/B testing was even worse, often requiring duplicated infrastructure and complex, static routing rules that quickly became technical debt. This process was not only slow but also fraught with risk; a bad model could slip through and degrade the user experience for hours before we could roll it back. The core pain point was the disconnect between the model registry, which served as our source of truth for model quality, and the infrastructure that served the model.
We needed a system where the act of promoting a model within our ML platform, MLflow, would automatically and safely orchestrate the traffic shift in production. The goal was a zero-touch, registry-driven canary deployment system. A model version transition in the MLflow Model Registry from `Staging` to `Production` should trigger a controlled, observable, and reversible shift of live traffic.
This led to an architectural design centered on a dynamic control plane that observes the model registry and reconfigures the API gateway in real time. The core components selected were Apache APISIX as the gateway, `etcd` as its dynamic configuration store, Google Cloud Functions for serverless model serving, and a NestJS application acting as the orchestrating controller.
The Architectural Blueprint
In a real-world project, connecting disparate systems requires a clear data flow. The architecture we settled on is fundamentally event-driven, even if our initial implementation uses polling as a stand-in for true eventing.
graph TD
    subgraph MLflow Platform
        A[Data Scientist promotes Model v2 to 'Production'] --> B{MLflow Model Registry};
    end
    subgraph Control Plane [NestJS Controller]
        C(Scheduler) -- polls every 30s --> D{MLflow Service};
        D -- GET /api/2.0/mlflow/model-versions/search --> B;
        D -- detects change --> E{Orchestration Logic};
        E -- computes new traffic split --> F{etcd Service};
        F -- writes /apisix/routes/model_X_route --> G[etcd Cluster];
    end
    subgraph Data Plane
        H[End User Request] --> I(APISIX Gateway);
        I -- watches for key changes --> G;
        I -- hot-reloads route --> I;
        I -- applies traffic-split plugin --> J{Upstreams};
        J -- 90% traffic --> K[GCF: Model v1 Endpoint];
        J -- 10% traffic --> L[GCF: Model v2 Endpoint];
    end
    style Control Plane fill:#f9f,stroke:#333,stroke-width:2px
    style Data Plane fill:#ccf,stroke:#333,stroke-width:2px
The flow is as follows:
- MLflow Registry as Source of Truth: A data scientist validates a model and transitions its stage in the MLflow UI. This is the sole manual trigger for the entire process.
- NestJS Controller: A background service continuously polls the MLflow API. It maintains a state of the current production models and their versions. Upon detecting a change (e.g., a new version is tagged ‘Production’ for a registered model), it initiates the deployment logic.
- Dynamic Configuration Generation: The controller calculates the new routing strategy. For a new canary release, it might decide on a 90/10 split between the old stable version and the new candidate version. It then constructs the precise JSON configuration that the APISIX `traffic-split` plugin requires.
- `etcd` as the State Store: The controller writes this new routing configuration to a specific key in `etcd`. This is the critical handoff point.
- APISIX Real-time Updates: The APISIX gateway instances are configured to use `etcd` for service discovery and configuration. They maintain a persistent watch on the relevant key prefixes. When the controller writes the new route, APISIX detects the change almost instantly and hot-reloads its routing table in memory without dropping connections or requiring a restart.
- Serverless Model Endpoints: Traffic is routed to the appropriate Google Cloud Function URLs, which host the ML models. Using serverless functions was a deliberate choice to optimize costs for models with intermittent traffic patterns, accepting the trade-off of potential cold starts.
Setting the Foundation: APISIX and etcd
The entire system hinges on APISIX's ability to be reconfigured without downtime. This is achieved by using `etcd` as its configuration backend instead of the default `yaml` file.
A production-grade APISIX deployment requires a dedicated `etcd` cluster. For the gateway configuration, `config.yaml`, the key change is in the `etcd` section:
# apisix/conf/config.yaml
apisix:
  node_listen: 9080
  enable_ipv6: false
  enable_admin: true
  admin_key:
    - name: "admin"
      key: "super_secret_admin_key"
      role: admin
etcd:
  host:
    - "http://etcd-node1:2379"
    - "http://etcd-node2:2379"
    - "http://etcd-node3:2379"
  prefix: "/apisix"
  timeout: 30
plugins:
  - traffic-split
  - prometheus
  # ...plus whatever other plugins your deployment needs
With this configuration, APISIX will ignore local YAML files for routes and upstreams and instead read them from the `/apisix` prefix in the `etcd` cluster. Any change made under this prefix is immediately reflected in the gateway's behavior.
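A quick way to sanity-check this wiring is to write a throwaway route under the watched prefix and confirm the gateway serves it without any reload command. A minimal sketch using the `etcd3` client (the endpoint, route, and upstream values here are illustrative only):
// sanity-check.ts (illustrative; point it at a non-production cluster)
import { Etcd3 } from 'etcd3';
async function main() {
  const client = new Etcd3({ hosts: ['http://etcd-node1:2379'] });
  // Write a minimal route under the prefix APISIX watches...
  await client.put('/apisix/routes/smoke_test').value(JSON.stringify({
    id: 'smoke_test',
    uri: '/smoke-test',
    upstream: { type: 'roundrobin', nodes: { 'example.com:443': 1 }, scheme: 'https' },
  }));
  // ...and read it back. APISIX should pick up /smoke-test almost immediately,
  // with no restart or reload involved.
  console.log(await client.get('/apisix/routes/smoke_test').json());
  client.close();
}
main().catch(console.error);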
The Brains of the Operation: The NestJS Controller
A simple Python script could have worked, but in a production environment, we need structure, testability, and maintainability. NestJS, a TypeScript framework, provides these. It forces a clean separation of concerns, making the controller service robust.
Project Structure
.
├── src
│   ├── app.module.ts
│   ├── main.ts
│   ├── config
│   │   └── configuration.ts
│   ├── etcd
│   │   ├── etcd.module.ts
│   │   └── etcd.service.ts
│   ├── mlflow
│   │   ├── mlflow.module.ts
│   │   └── mlflow.service.ts
│   └── scheduler
│       ├── scheduler.module.ts
│       └── scheduler.service.ts
└── test
    ├── ...
Configuration Management
We use `@nestjs/config` to manage environment variables securely. A common pitfall is hardcoding URLs or credentials.
// src/config/configuration.ts
import { registerAs } from '@nestjs/config';

export default registerAs('app', () => ({
  mlflowTrackingUri: process.env.MLFLOW_TRACKING_URI,
  etcdEndpoints: process.env.ETCD_ENDPOINTS?.split(','),
  pollInterval: parseInt(process.env.POLL_INTERVAL_MS, 10) || 30000,
  canaryTrafficWeight: parseInt(process.env.CANARY_TRAFFIC_WEIGHT, 10) || 10,
}));
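For completeness, here is a plausible sketch of the root module implied by the project tree, wiring in this configuration and the scheduler (the module contents are inferred, not taken from the original project):
// src/app.module.ts (inferred wiring)
import { Module } from '@nestjs/common';
import { ConfigModule } from '@nestjs/config';
import { ScheduleModule } from '@nestjs/schedule';
import configuration from './config/configuration';
import { EtcdModule } from './etcd/etcd.module';
import { MlflowModule } from './mlflow/mlflow.module';
import { SchedulerModule } from './scheduler/scheduler.module';

@Module({
  imports: [
    // Load the namespaced 'app' config and expose it application-wide.
    ConfigModule.forRoot({ isGlobal: true, load: [configuration] }),
    // Enables the @Interval() decorator used by the scheduler service.
    ScheduleModule.forRoot(),
    EtcdModule,
    MlflowModule,
    SchedulerModule,
  ],
})
export class AppModule {}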
Interacting with etcd
The `etcd.service.ts` encapsulates all logic for communicating with `etcd`. We use the `etcd3` npm package.
// src/etcd/etcd.service.ts
import { Injectable, Logger, OnModuleInit, OnModuleDestroy } from '@nestjs/common';
import { ConfigService } from '@nestjs/config';
import { Etcd3 } from 'etcd3';

@Injectable()
export class EtcdService implements OnModuleInit, OnModuleDestroy {
  private readonly logger = new Logger(EtcdService.name);
  private client: Etcd3;

  constructor(private configService: ConfigService) {}

  onModuleInit() {
    const endpoints = this.configService.get<string[]>('app.etcdEndpoints');
    if (!endpoints) {
      throw new Error('ETCD_ENDPOINTS environment variable not set.');
    }
    this.client = new Etcd3({ hosts: endpoints });
    this.logger.log(`Connected to etcd at ${endpoints.join(',')}`);
  }

  onModuleDestroy() {
    this.client.close();
  }

  public async setKey(key: string, value: object): Promise<void> {
    try {
      const jsonValue = JSON.stringify(value);
      await this.client.put(key).value(jsonValue);
      this.logger.log(`Successfully set key: ${key}`);
    } catch (error) {
      this.logger.error(`Failed to set key ${key}`, error.stack);
      throw error;
    }
  }

  public async getKey(key: string): Promise<any | null> {
    try {
      // etcd3 resolves to null for a missing key, which is normal initially.
      return await this.client.get(key).json();
    } catch (error) {
      this.logger.error(`Failed to get key ${key}`, error.stack);
      throw error;
    }
  }

  // A lock is crucial in a distributed system to prevent multiple controller
  // instances from updating the same route simultaneously.
  public async withLock<T>(lockName: string, operation: () => Promise<T>): Promise<T> {
    const lockKey = `locks/controller/${lockName}`;
    // The lock is backed by a 60-second lease: if the controller crashes,
    // the lease expires and etcd releases the lock automatically.
    const lock = this.client.lock(lockKey).ttl(60);
    await lock.acquire();
    this.logger.debug(`Acquired lock: ${lockName}`);
    try {
      return await operation();
    } catch (error) {
      this.logger.error(`Operation failed with lock ${lockName}: ${error.message}`);
      throw error;
    } finally {
      await lock.release();
      this.logger.debug(`Released lock: ${lockName}`);
    }
  }
}
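The scheduler below also depends on an MlflowService that the project tree lists but this post does not reproduce. A minimal sketch of what it might look like, using the same model-versions search endpoint shown in the architecture diagram and assuming Node 18+ for the built-in fetch (field names follow the MLflow REST API; treat this as an approximation, not the original implementation):
// src/mlflow/mlflow.service.ts (a sketch; not the original implementation)
import { Injectable } from '@nestjs/common';
import { ConfigService } from '@nestjs/config';

interface MlflowModelVersion {
  name: string;
  version: string;
  current_stage: string;
}

export interface ModelVersionInfo {
  version: string;
}

export interface ProductionModelInfo {
  production: ModelVersionInfo;
  staging?: ModelVersionInfo;
}

@Injectable()
export class MlflowService {
  constructor(private configService: ConfigService) {}

  // Returns a map of registered model name -> newest Production/Staging versions.
  async getProductionModels(): Promise<Record<string, ProductionModelInfo>> {
    const baseUri = this.configService.get<string>('app.mlflowTrackingUri');
    const res = await fetch(`${baseUri}/api/2.0/mlflow/model-versions/search`);
    if (!res.ok) throw new Error(`MLflow API returned ${res.status}`);
    const body = (await res.json()) as { model_versions?: MlflowModelVersion[] };

    // Group versions by model, keeping the highest version number per stage.
    const byName: Record<string, { production?: ModelVersionInfo; staging?: ModelVersionInfo }> = {};
    for (const mv of body.model_versions ?? []) {
      const entry = (byName[mv.name] ??= {});
      const stage = mv.current_stage === 'Production' ? 'production'
                  : mv.current_stage === 'Staging' ? 'staging' : null;
      if (!stage) continue;
      const existing = entry[stage];
      if (!existing || Number(mv.version) > Number(existing.version)) {
        entry[stage] = { version: mv.version };
      }
    }

    // Only models that actually have a Production version are of interest here.
    const result: Record<string, ProductionModelInfo> = {};
    for (const [name, entry] of Object.entries(byName)) {
      if (entry.production) result[name] = { production: entry.production, staging: entry.staging };
    }
    return result;
  }
}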
The Core Orchestration Logic
The `scheduler.service.ts` drives the process. It polls MLflow, compares the state, and triggers updates.
// src/scheduler/scheduler.service.ts
import { Injectable, Logger } from '@nestjs/common';
import { Interval } from '@nestjs/schedule';
import { ConfigService } from '@nestjs/config';
import { MlflowService, ProductionModelInfo } from '../mlflow/mlflow.service';
import { EtcdService } from '../etcd/etcd.service';

interface ModelRouteState {
  modelName: string;
  stableVersion: string;
  canaryVersion?: string;
}

@Injectable()
export class SchedulerService {
  private readonly logger = new Logger(SchedulerService.name);
  // In-memory state of what we believe is deployed.
  // In a real multi-instance setup, this state should live in etcd/Redis.
  private modelStates: Map<string, ModelRouteState> = new Map();

  constructor(
    private configService: ConfigService,
    private mlflowService: MlflowService,
    private etcdService: EtcdService,
  ) {}

  // Note: @Interval() needs a static value; honoring POLL_INTERVAL_MS would
  // require registering the interval dynamically via SchedulerRegistry.
  @Interval(30000) // Poll every 30 seconds
  async handleModelUpdates() {
    this.logger.log('Checking for model updates...');
    try {
      const productionModels = await this.mlflowService.getProductionModels();
      for (const [modelName, modelInfo] of Object.entries(productionModels)) {
        await this.etcdService.withLock(modelName, async () => {
          await this.reconcileModelState(modelName, modelInfo);
        });
      }
    } catch (error) {
      this.logger.error('Failed to handle model updates', error.stack);
    }
  }

  private async reconcileModelState(modelName: string, modelInfo: ProductionModelInfo) {
    const currentState = this.modelStates.get(modelName);
    const productionVersion = modelInfo.production;
    const stagingVersion = modelInfo.staging;

    if (!currentState) {
      // First time seeing this model, deploy production version directly.
      this.logger.log(`New model found: ${modelName}. Deploying version ${productionVersion.version}.`);
      await this.updateApisixRoute(modelName, productionVersion.version);
      this.modelStates.set(modelName, { modelName, stableVersion: productionVersion.version });
      return;
    }

    // Logic for canary deployment
    if (stagingVersion && productionVersion.version !== currentState.stableVersion) {
      // A new production version has been promoted, AND the old production version is now 'staging'.
      // This is our signal to start a canary deployment.
      const newProductionVersion = productionVersion.version;
      const oldProductionVersion = currentState.stableVersion;
      if (currentState.canaryVersion !== newProductionVersion) {
        this.logger.log(`Starting canary for ${modelName}: stable=${oldProductionVersion}, canary=${newProductionVersion}`);
        await this.updateApisixRoute(modelName, oldProductionVersion, newProductionVersion);
        this.modelStates.set(modelName, { ...currentState, canaryVersion: newProductionVersion });
      }
    } else if (currentState.canaryVersion && !stagingVersion) {
      // The old stable version was archived while a canary was active. This is
      // the signal to complete the rollout and shift 100% of traffic.
      this.logger.log(`Promoting ${modelName} to version ${productionVersion.version} fully.`);
      await this.updateApisixRoute(modelName, productionVersion.version);
      this.modelStates.set(modelName, { modelName, stableVersion: productionVersion.version });
    } else if (productionVersion.version !== currentState.stableVersion && !currentState.canaryVersion) {
      // A new production version is live with no active canary: a direct full promotion.
      this.logger.log(`Promoting ${modelName} to version ${productionVersion.version} fully.`);
      await this.updateApisixRoute(modelName, productionVersion.version);
      this.modelStates.set(modelName, { modelName, stableVersion: productionVersion.version });
    }
  }

  private async updateApisixRoute(
    modelName: string,
    stableVersion: string,
    canaryVersion?: string,
  ) {
    const routeId = `model_${modelName.replace(/[^a-zA-Z0-9]/g, '_')}`;
    const routeKey = `/apisix/routes/${routeId}`;
    const upstreamId = `model_upstream_${modelName.replace(/[^a-zA-Z0-9]/g, '_')}`;
    const upstreamKey = `/apisix/upstreams/${upstreamId}`;
    const canaryWeight = this.configService.get<number>('app.canaryTrafficWeight');
    // URLs for GCF functions are predictable or can be fetched via an API.
    // Caveat: splitting by hostname assumes each deployed version has its own
    // host (true for 2nd-gen functions, which get distinct run.app URLs);
    // 1st-gen functions share one cloudfunctions.net host per project, which
    // would also require per-upstream path rewriting at the gateway.
    const stableUrl = `https://<region>-<project>.cloudfunctions.net/${modelName}-${stableVersion}`;

    let upstreamConfig: any;
    let routeConfig: any;

    // A common mistake is not defining the upstream before the route depends on it.
    // APISIX can handle this gracefully, but it's better to be explicit.
    if (canaryVersion) {
      const canaryUrl = `https://<region>-<project>.cloudfunctions.net/${modelName}-${canaryVersion}`;
      upstreamConfig = {
        id: upstreamId,
        type: "roundrobin",
        nodes: [ // APISIX accepts nodes as a host/port/weight array
          { host: new URL(stableUrl).hostname, port: 443, weight: 1 },
          { host: new URL(canaryUrl).hostname, port: 443, weight: 1 },
        ],
        scheme: "https"
      };
      routeConfig = {
        id: routeId,
        uri: `/predict/${modelName}`,
        methods: ["POST"],
        upstream_id: upstreamId,
        plugins: {
          "traffic-split": {
            rules: [{
              "weighted_upstreams": [
                {
                  "upstream": {
                    "nodes": { [new URL(stableUrl).hostname + ':443']: 1 },
                    "type": "roundrobin",
                    "scheme": "https"
                  },
                  "weight": 100 - canaryWeight
                },
                {
                  "upstream": {
                    "nodes": { [new URL(canaryUrl).hostname + ':443']: 1 },
                    "type": "roundrobin",
                    "scheme": "https"
                  },
                  "weight": canaryWeight
                }
              ]
            }]
          }
        }
      };
    } else {
      // Single, stable version
      upstreamConfig = {
        id: upstreamId,
        type: "roundrobin",
        nodes: [{ host: new URL(stableUrl).hostname, port: 443, weight: 1 }],
        scheme: "https"
      };
      routeConfig = {
        id: routeId,
        uri: `/predict/${modelName}`,
        methods: ["POST"],
        upstream_id: upstreamId,
        plugins: {} // No traffic split
      };
    }

    await this.etcdService.setKey(upstreamKey, upstreamConfig);
    await this.etcdService.setKey(routeKey, routeConfig);
    this.logger.log(`Updated APISIX config in etcd for model: ${modelName}`);
  }
}
The JSON payload written to `etcd` for the `traffic-split` plugin is the heart of the mechanism. The controller constructs this dynamically based on the state detected in MLflow. The structure must be exact for APISIX to parse it correctly. A mistake here, like a typo in `weighted_upstreams`, can break routing for that endpoint. Rigorous unit testing of this configuration generation logic is non-negotiable.
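As one example of such a test, a Jest sketch that stubs the collaborators and pins down the generated weights (it reaches into the private method via a cast, and assumes the `<region>-<project>` placeholders in the URL template have been filled in with real values so that new URL() can parse them):
// src/scheduler/scheduler.service.spec.ts (a sketch of one such test)
import { Test } from '@nestjs/testing';
import { ConfigService } from '@nestjs/config';
import { SchedulerService } from './scheduler.service';
import { MlflowService } from '../mlflow/mlflow.service';
import { EtcdService } from '../etcd/etcd.service';

describe('SchedulerService route generation', () => {
  it('writes a 90/10 weighted_upstreams split for a canary', async () => {
    const setKey = jest.fn();
    const moduleRef = await Test.createTestingModule({
      providers: [
        SchedulerService,
        { provide: ConfigService, useValue: { get: () => 10 } }, // canary weight
        { provide: MlflowService, useValue: {} },
        { provide: EtcdService, useValue: { setKey } },
      ],
    }).compile();

    const service = moduleRef.get(SchedulerService);
    // updateApisixRoute is private, so the test goes through `any`.
    await (service as any).updateApisixRoute('iris-classifier', '1', '2');

    const [, routeConfig] = setKey.mock.calls.find(([key]) =>
      key.startsWith('/apisix/routes/'),
    )!;
    const [stable, canary] =
      routeConfig.plugins['traffic-split'].rules[0].weighted_upstreams;
    expect(stable.weight).toBe(90);
    expect(canary.weight).toBe(10);
  });
});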
The Workhorses: Serverless Model Servers
For the models themselves, we package them into simple Python Google Cloud Functions. The function’s responsibility is minimal: load a specific model version from the MLflow registry and serve predictions.
# main.py for a Google Cloud Function
import os
import json
import mlflow
import pandas as pd
from flask import jsonify

# These would be set as environment variables during deployment
MLFLOW_TRACKING_URI = os.environ.get("MLFLOW_TRACKING_URI")
MODEL_NAME = os.environ.get("MODEL_NAME")
MODEL_VERSION = os.environ.get("MODEL_VERSION")

mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
model_uri = f"models:/{MODEL_NAME}/{MODEL_VERSION}"

# Load the model into memory. This happens during the cold start.
try:
    model = mlflow.pyfunc.load_model(model_uri)
    print(f"Successfully loaded model '{MODEL_NAME}' version '{MODEL_VERSION}'")
except Exception as e:
    model = None
    print(f"FATAL: Could not load model. Error: {e}")


def predict_handler(request):
    """
    HTTP Cloud Function to serve predictions.
    """
    if model is None:
        return "Model not loaded", 500
    if request.method != 'POST':
        return 'Only POST requests are accepted', 405
    try:
        request_json = request.get_json(silent=True)
        if request_json and 'instances' in request_json:
            data = pd.DataFrame(request_json['instances'])
            prediction = model.predict(data)
            return jsonify({"predictions": prediction.tolist()})
        else:
            return "Invalid JSON format: 'instances' key not found.", 400
    except Exception as e:
        print(f"Prediction error: {e}")
        return "Error during prediction", 500
Each model version is deployed as a separate Cloud Function. A CI/CD pipeline automates this: on a git tag like `models/iris-classifier/v3`, it triggers a `gcloud functions deploy` command.
gcloud functions deploy iris-classifier-v3 \
  --region=us-central1 \
  --project=my-gcp-project \
  --entry-point=predict_handler \
  --runtime=python39 \
  --trigger-http \
  --allow-unauthenticated \
  --set-env-vars MLFLOW_TRACKING_URI=http://...,MODEL_NAME=iris-classifier,MODEL_VERSION=3
The resulting function URL is what the NestJS controller uses to configure the APISIX upstream.
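From the client's perspective, requests go to the gateway route rather than to any function URL directly. A quick illustrative sketch (the gateway host and feature payload are made up; the path and 'instances' shape match the route and handler above):
// call-model.ts (illustrative client call through the gateway)
async function predict() {
  // APISIX listens on 9080 (per config.yaml) and matches /predict/<model-name>.
  const res = await fetch('http://apisix-gateway:9080/predict/iris-classifier', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    // The GCF handler expects an 'instances' array it can turn into a DataFrame.
    body: JSON.stringify({
      instances: [{ sepal_length: 5.1, sepal_width: 3.5, petal_length: 1.4, petal_width: 0.2 }],
    }),
  });
  console.log(await res.json()); // { predictions: [...] }
}
predict().catch(console.error);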
Final Result and Lingering Questions
The final system works as designed. A data scientist promotes a model in MLflow. Within a minute, the controller detects the change, calculates a 90/10 traffic split, and updates the configuration in `etcd`. APISIX immediately begins routing 10% of production traffic to the new serverless model endpoint. We monitor Prometheus metrics exported from APISIX (latency, error rates per upstream) to validate the canary's performance. If it's healthy, a subsequent promotion (e.g., archiving the old version) triggers the controller to shift 100% of traffic.
However, this architecture is not without its trade-offs and remaining challenges. The choice of Google Cloud Functions introduces significant cold-start latency for the first request after a period of inactivity. For our use case with non-real-time predictions, this was an acceptable cost saving. For a latency-critical application, this architecture would need to target provisioned GKE pods or Cloud Run instances with minimum instances set to 1.
Our controller's polling mechanism is simple but inefficient. The proper solution is to leverage MLflow's webhooks. When a model stage transition occurs, MLflow should actively `POST` to an endpoint on our controller. This would make the system reactive and reduce the detection-to-deployment latency from ~30 seconds to near-instantaneous.
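The receiving side on our controller could look roughly like this (a sketch: the payload fields are assumptions about the webhook body, and authentication of the caller is omitted):
// src/webhooks/mlflow-webhook.controller.ts (a sketch of the reactive path)
import { Body, Controller, HttpCode, Post } from '@nestjs/common';
import { SchedulerService } from '../scheduler/scheduler.service';

// Hypothetical payload shape for a model-stage-transition event.
interface MlflowWebhookPayload {
  model_name: string;
  version: string;
  to_stage: string;
}

@Controller('webhooks')
export class MlflowWebhookController {
  constructor(private schedulerService: SchedulerService) {}

  @Post('mlflow')
  @HttpCode(202)
  async onStageTransition(@Body() payload: MlflowWebhookPayload) {
    // Instead of waiting for the next poll, reconcile immediately.
    if (payload.to_stage === 'Production') {
      await this.schedulerService.handleModelUpdates();
    }
    return { status: 'accepted' };
  }
}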
Furthermore, the rollout strategy is a fixed percentage. A more advanced implementation would incorporate automated analysis of the canary’s performance metrics. If the error rate of the canary upstream surpasses a defined threshold or its p99 latency degrades significantly, the controller should automatically initiate a rollback by rewriting the APISIX configuration to send 100% of traffic back to the stable version. This would close the loop on a truly automated, safe deployment system.
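A sketch of what that analysis could look like, polling Prometheus's query API for the canary route's error rate (the metric names and labels are assumptions that depend on how the APISIX prometheus plugin is configured in your deployment):
// src/canary/canary-analysis.ts (a sketch of automated rollback input)
// Assumes PROMETHEUS_URL points at the Prometheus instance scraping APISIX.
const PROMETHEUS_URL = process.env.PROMETHEUS_URL ?? 'http://prometheus:9090';

// Illustrative PromQL; real label names (route, node, code) depend on setup.
const CANARY_ERROR_RATE_QUERY = `
  sum(rate(apisix_http_status{code=~"5..", route="model_iris_classifier"}[5m]))
  / sum(rate(apisix_http_status{route="model_iris_classifier"}[5m]))`;

export async function canaryIsHealthy(threshold = 0.05): Promise<boolean> {
  const url = new URL('/api/v1/query', PROMETHEUS_URL);
  url.searchParams.set('query', CANARY_ERROR_RATE_QUERY);
  const body = await (await fetch(url)).json();
  const sample = body.data.result[0]; // may be undefined if there is no traffic yet
  const errorRate = sample ? parseFloat(sample.value[1]) : 0;
  return errorRate <= threshold;
}

// The scheduler could call this periodically during a canary and, on failure,
// roll back by calling updateApisixRoute(modelName, stableVersion) alone.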