Implementing an SLO-Driven Canary Deployment Pipeline with Terraform, Nacos, and Grafana


Manual canary analysis is a bottleneck that introduces human error and subjectivity into the release process. A typical scenario involves a DevOps engineer deploying a new version, manually shifting a small percentage of traffic, and then spending the next 30 minutes staring at dashboards, trying to decide if “things look okay.” This process is neither scalable nor reliable. Our team faced this exact problem, where release velocity was being throttled by the fear of production incidents and the tediousness of manual verification.

We needed a system that could make promotion or rollback decisions autonomously, based on predefined, objective criteria—Service Level Objectives (SLOs). The goal was to build a closed-loop, fully automated canary release platform where the only manual step is triggering the deployment.

Our architecture is centered around a feedback loop:

sequenceDiagram
    participant CI/CD Pipeline
    participant Canary Controller
    participant Terraform
    participant Nacos
    participant Application Fleet
    participant Prometheus
    participant Grafana

    CI/CD Pipeline->>+Terraform: Apply new version (inactive)
    Terraform-->>-CI/CD Pipeline: Deployment complete
    CI/CD Pipeline->>+Canary Controller: POST /api/v1/canary/start (service: payment-service, version: v1.5.1)
    Canary Controller->>+Nacos: Update config for 'payment-service': { "canaryVersion": "v1.5.1", "canaryWeight": 5 }
    Nacos-->>-Application Fleet: Push config update
    Application Fleet->>Application Fleet: Route 5% traffic to v1.5.1
    Prometheus->>Application Fleet: Scrape metrics (stable vs canary)
    Grafana->>Prometheus: Query metrics
    Grafana->>Grafana: Evaluate SLOs (e.g., error rate, latency)
    alt SLO Breached
        Grafana->>+Canary Controller: Fire 'Rollback' alert webhook
        Canary Controller->>+Nacos: Update config: { "canaryWeight": 0 }
        Nacos-->>-Canary Controller: ACK
        Canary Controller-->>-Grafana: ACK
    else SLO Met for duration
        Grafana->>+Canary Controller: Fire 'Promote' alert webhook
        Canary Controller->>+Nacos: Update config: { "canaryWeight": 100 }
        Nacos-->>-Canary Controller: ACK
        Canary Controller-->>-Grafana: ACK
    end

The choice of tooling was deliberate. We needed a solution that was infrastructure-agnostic, avoiding a hard dependency on Kubernetes-native tools like Flagger, as parts of our legacy fleet still run on VMs.

  • Terraform: Provides the declarative foundation to provision everything consistently—the Nacos cluster, the monitoring stack (Prometheus, Grafana), and the application instances themselves.
  • Nacos: Acts as the dynamic control plane. Its configuration management feature is leveraged to dynamically adjust traffic weights for our services without requiring restarts or re-deployments.
  • Grafana & Prometheus: Form the observability core. Prometheus scrapes metrics labeled by application version, and Grafana’s alerting engine evaluates them against our SLOs to trigger the feedback loop.

Part 1: Declarative Infrastructure with Terraform

A common mistake in such projects is setting up infrastructure manually. For our system to be reliable, its foundation must be repeatable. We defined the entire stack in Terraform modules.

Here’s a simplified structure for managing the infrastructure:

terraform/
├── environments/
│   ├── staging/
│   │   ├── main.tf
│   │   ├── outputs.tf
│   │   └── variables.tf
│   └── production/
│       ├── ...
├── modules/
│   ├── nacos/
│   │   ├── main.tf
│   │   └── ...
│   ├── monitoring/
│   │   ├── main.tf
│   │   └── ...
│   └── app_service/
│       ├── main.tf
│       └── ...
└── main.tf

The monitoring module sets up Prometheus and Grafana. For this demonstration, we’ll use Docker, but in a real-world project, this would provision cloud resources such as Amazon Managed Service for Prometheus (AMP) and an Amazon Managed Grafana workspace.

modules/monitoring/main.tf

# This is a simplified example using Docker for local demonstration.
# In production, this would use official Helm charts or cloud provider resources.

resource "docker_network" "monitoring_net" {
  name = "monitoring-net"
}

resource "docker_container" "prometheus" {
  name    = "prometheus"
  image   = "prom/prometheus:v2.47.0"
  network_mode = "host" # Simplified for service discovery on localhost

  volumes {
    host_path      = abspath("${path.module}/prometheus.yml")
    container_path = "/etc/prometheus/prometheus.yml"
    read_only      = true
  }
}

resource "docker_container" "grafana" {
  name  = "grafana"
  image = "grafana/grafana:10.1.5"
  ports {
    internal = 3000
    external = 3000
  }
  network_mode = "host"
}

output "grafana_url" {
  value = "http://localhost:3000"
}

output "prometheus_url" {
  value = "http://localhost:9090"
}

The corresponding prometheus.yml needs to be configured to scrape both the stable and canary instances; the version label that the PromQL queries rely on is attached by the application’s own instrumentation (see Part 3).

modules/monitoring/prometheus.yml

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'payment-service'
    static_configs:
      - targets: ['localhost:8080', 'localhost:8081'] # Stable instances
      - targets: ['localhost:8082', 'localhost:8083'] # Canary instances (9090 is taken by Prometheus itself)

In a dynamic environment, these targets would be discovered via service discovery integration (e.g., from Nacos itself or from cloud provider tags).
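
One lightweight way to wire that up is to expose a small endpoint in the JSON shape that Prometheus’s http_sd_configs mechanism expects, and point the scrape job at it instead of a static target list. The sketch below is a minimal illustration in Go; lookupInstances and the addresses it returns are hypothetical placeholders for a real lookup against Nacos, cloud tags, or whatever registry knows about running instances.

package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// targetGroup mirrors the JSON objects Prometheus expects from an
// http_sd_configs endpoint: a list of targets plus optional labels.
type targetGroup struct {
	Targets []string          `json:"targets"`
	Labels  map[string]string `json:"labels"`
}

// lookupInstances is a hypothetical stand-in for a real registry query
// (Nacos naming service, cloud provider tags, etc.).
func lookupInstances(service string) []targetGroup {
	return []targetGroup{
		{Targets: []string{"10.0.1.10:8080", "10.0.1.11:8080"}, Labels: map[string]string{"service": service}},
		{Targets: []string{"10.0.2.10:8080"}, Labels: map[string]string{"service": service}},
	}
}

func main() {
	// Prometheus polls this endpoint and rebuilds its target list on every refresh.
	http.HandleFunc("/sd/payment-service", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		if err := json.NewEncoder(w).Encode(lookupInstances("payment-service")); err != nil {
			log.Printf("failed to encode targets: %v", err)
		}
	})
	log.Fatal(http.ListenAndServe(":8070", nil))
}

The scrape job would then swap its static_configs for an http_sd_configs entry pointing at this endpoint; everything downstream (the version label, the PromQL comparisons) stays unchanged.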

Similarly, the nacos module provisions a Nacos cluster. Again, using Docker for this example:

modules/nacos/main.tf

# Using a standalone Nacos server for simplicity.
# Production setups should use a cluster with an external database.
resource "docker_container" "nacos" {
  name  = "nacos-server"
  image = "nacos/nacos-server:v2.2.3"
  
  ports {
    internal = 8848
    external = 8848
  }

  # Nacos 2.x clients (including nacos-sdk-go v2) also connect over gRPC on
  # the offset port (main port + 1000).
  ports {
    internal = 9848
    external = 9848
  }

  # The Docker provider expects "env" (a list of KEY=value strings).
  env = [
    "PREFER_HOST_MODE=hostname",
    "MODE=standalone"
  ]
}

output "nacos_endpoint" {
  value = "http://${docker_container.nacos.name}:8848"
}

Using Terraform to manage these components ensures that a developer can spin up an identical environment for testing or that we can recover from a disaster by simply running terraform apply.

Part 2: Dynamic Traffic Control with Nacos

Nacos is the heart of our traffic shifting mechanism. We don’t use its service discovery feature here, but rather its powerful configuration management. For each service undergoing a canary release, we create a configuration entry.

  • Data ID: canary-rules.{{service_name}}.json (e.g., canary-rules.payment-service.json)
  • Group: CANARY_GROUP
  • Format: JSON

The content of this configuration file dictates the traffic routing logic.

Example canary-rules.payment-service.json:

{
  "serviceName": "payment-service",
  "rules": [
    {
      "type": "weight",
      "condition": {
        "canaryVersion": "v1.5.1",
        "canaryWeight": 5,
        "stableVersion": "v1.5.0"
      }
    }
  ],
  "enabled": true,
  "lastModified": "2023-10-27T10:00:00Z"
}

This tells our routing layer (e.g., an API gateway or a smart client) to send 5% of traffic to instances of version v1.5.1 and the remaining 95% to v1.5.0.

The application must be architected to fetch and react to these configuration changes in real time. Here’s a conceptual example in Go using the official Nacos SDK. A common pitfall is to fetch configuration only at startup; this defeats the purpose of dynamic control. The key is to use a listener.

package main

import (
	"encoding/json"
	"log"
	"sync"

	"github.com/nacos-group/nacos-sdk-go/v2/clients"
	"github.com/nacos-group/nacos-sdk-go/v2/common/constant"
	"github.com/nacos-group/nacos-sdk-go/v2/vo"
)

// CanaryConfig represents the structure of our routing rules.
type CanaryConfig struct {
	ServiceName string `json:"serviceName"`
	Rules       []struct {
		Type      string `json:"type"`
		Condition struct {
			CanaryVersion string `json:"canaryVersion"`
			CanaryWeight  int    `json:"canaryWeight"`
			StableVersion string `json:"stableVersion"`
		} `json:"condition"`
	} `json:"rules"`
	Enabled bool `json:"enabled"`
}

// Global variable to hold the current configuration, protected by a mutex.
var (
	currentConfig = &CanaryConfig{}
	configLock    sync.RWMutex
)

func main() {
	// Nacos client configuration
	clientConfig := *constant.NewClientConfig(
		constant.WithNamespaceId(""), // namespace
		constant.WithTimeoutMs(5000),
		constant.WithNotLoadCacheAtStart(true),
		constant.WithLogDir("/tmp/nacos/log"),
		constant.WithCacheDir("/tmp/nacos/cache"),
		constant.WithLogLevel("debug"),
	)
	serverConfigs := []constant.ServerConfig{
		*constant.NewServerConfig("127.0.0.1", 8848, constant.WithContextPath("/nacos")),
	}

	configClient, err := clients.NewConfigClient(
		vo.NacosClientParam{
			ClientConfig:  &clientConfig,
			ServerConfigs: serverConfigs,
		},
	)
	if err != nil {
		log.Fatalf("Failed to create Nacos Config Client: %v", err)
	}

	// Data ID and Group for our service
	dataID := "canary-rules.payment-service.json"
	group := "CANARY_GROUP"

	// Initial fetch of the configuration
	content, err := configClient.GetConfig(vo.ConfigParam{
		DataId: dataID,
		Group:  group,
	})
	if err != nil {
		log.Printf("Warning: Failed to get initial config: %v", err)
	} else {
		updateConfig(content)
	}

	// Listen for configuration changes. This is the critical part.
	err = configClient.ListenConfig(vo.ConfigParam{
		DataId: dataID,
		Group:  group,
		OnChange: func(namespace, group, dataId, data string) {
			log.Println("Configuration changed!")
			log.Printf("Data: %s", data)
			updateConfig(data)
		},
	})
	if err != nil {
		log.Fatalf("Failed to listen for config changes: %v", err)
	}
    
	// The application would continue running its main logic here.
	// For demonstration, we just wait.
	select {}
}

func updateConfig(data string) {
	var newConfig CanaryConfig
	err := json.Unmarshal([]byte(data), &newConfig)
	if err != nil {
		log.Printf("Error unmarshalling config: %v", err)
		return
	}

	configLock.Lock()
	defer configLock.Unlock()
	*currentConfig = newConfig
	if len(currentConfig.Rules) > 0 {
		log.Printf("Successfully updated canary config: weight %d%% for version %s",
			currentConfig.Rules[0].Condition.CanaryWeight, currentConfig.Rules[0].Condition.CanaryVersion)
	} else {
		log.Printf("Canary config for %s updated with no rules; treating as 0%% canary traffic", currentConfig.ServiceName)
	}
}

// This function would be called by the routing logic of the application
func getCanaryWeight() int {
	configLock.RLock()
	defer configLock.RUnlock()
	if currentConfig.Enabled && len(currentConfig.Rules) > 0 {
		return currentConfig.Rules[0].Condition.CanaryWeight
	}
	return 0 // Default to no canary traffic
}

This ensures any change made in the Nacos console or via its API is reflected in the application’s routing logic within seconds.
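
To make the consumption side concrete, here is a minimal routing sketch that continues the Go client above (same package; add "math/rand" to its import block). The upstream address lists are hypothetical placeholders for whatever the gateway or smart client actually tracks per version.

// Hypothetical upstream pools; a real routing layer would maintain these
// from its own registry rather than hard-coding them.
var (
	stableUpstreams = []string{"10.0.1.10:8080", "10.0.1.11:8080"}
	canaryUpstreams = []string{"10.0.2.10:8080"}
)

// pickUpstream makes a per-request weighted decision based on the canary
// weight currently held in the Nacos-managed configuration.
func pickUpstream() string {
	weight := getCanaryWeight() // 0-100
	if len(canaryUpstreams) > 0 && rand.Intn(100) < weight {
		return canaryUpstreams[rand.Intn(len(canaryUpstreams))]
	}
	return stableUpstreams[rand.Intn(len(stableUpstreams))]
}

Because the weight is re-read on every request, a rollback that sets canaryWeight to 0 takes effect as soon as the Nacos listener fires, with no restart or redeploy.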

Part 3: Quantifying “Good” with SLOs in Grafana

“Things look okay” is not a metric. An SLO is. For our payment-service, we defined two key Service Level Indicators (SLIs):

  1. Availability: The percentage of successful requests (non-5xx HTTP status codes).
  2. Latency: The 99th-percentile (p99) request latency, with a target of 300ms.

Our SLOs state:

  • The canary version’s error rate must not exceed the stable version’s error rate by more than 0.5% absolute.
  • The canary version’s p99 latency must not exceed the stable version’s p99 latency by more than 50ms.

To measure this, our application’s metrics must be labeled with the version. Using a Prometheus client library:

// In the HTTP handler of the Go application
httpRequestsTotal.WithLabelValues(method, path, statusCode, os.Getenv("APP_VERSION")).Inc()
requestLatency.WithLabelValues(method, path, os.Getenv("APP_VERSION")).Observe(duration.Seconds())

The APP_VERSION environment variable is set during deployment by Terraform.
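
That snippet assumes the two metrics were registered with matching label names. A minimal sketch of those definitions with the client_golang library is shown below; the label names (method, path, code, version) and the bucket boundaries are assumptions chosen to line up with the PromQL queries that follow.

package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// Request counter, partitioned so stable and canary traffic can be
	// compared via the "version" label.
	httpRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total HTTP requests by method, path, status code and app version.",
		},
		[]string{"method", "path", "code", "version"},
	)

	// Latency histogram in seconds; the buckets need enough resolution
	// around the 300ms SLO boundary for a usable p99 estimate.
	requestLatency = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "request_latency",
			Help:    "Request latency in seconds by method, path and app version.",
			Buckets: []float64{0.05, 0.1, 0.2, 0.3, 0.5, 1, 2.5},
		},
		[]string{"method", "path", "version"},
	)
)

The process also has to expose these on a /metrics endpoint (e.g., via promhttp.Handler()) so Prometheus can scrape them from the targets configured earlier.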

Now we can write the PromQL queries to power our Grafana alerts. This is where the real intelligence of the system lies. A common mistake is to set a static threshold for the canary. A much better approach is to compare the canary’s performance against the stable version’s baseline in real-time.

PromQL for Canary Error Rate Delta:

# Canary Error Rate
(
  sum(rate(http_requests_total{job="payment-service", version="v1.5.1", code=~"5.."}[5m]))
  /
  sum(rate(http_requests_total{job="payment-service", version="v1.5.1"}[5m]))
)
-
# Stable Error Rate
(
  sum(rate(http_requests_total{job="payment-service", version="v1.5.0", code=~"5.."}[5m]))
  /
  sum(rate(http_requests_total{job="payment-service", version="v1.5.0"}[5m]))
)

PromQL for Canary p99 Latency Delta:

histogram_quantile(0.99, sum(rate(request_latency_bucket{job="payment-service", version="v1.5.1"}[5m])) by (le))
-
histogram_quantile(0.99, sum(rate(request_latency_bucket{job="payment-service", version="v1.5.0"}[5m])) by (le))

In Grafana, we create an alert rule that combines these. The alert will fire if the error rate delta exceeds 0.005 (0.5%) OR the latency delta exceeds 0.05 (50ms). Crucially, we use the FOR clause to prevent firing on transient spikes. The condition must hold for a sustained period, for instance, 5 minutes, before we declare the canary unhealthy.

The alert notification channel is configured to send a webhook to our Canary Controller.

Part 4: The Canary Controller: Closing the Loop

This component is the custom “glue” that automates the decision-making process. It’s a simple microservice (written in Python with Flask for this example) that exposes endpoints for the CI/CD pipeline and listens for webhooks from Grafana.

controller.py

import os
import json
import logging

import requests
from flask import Flask, request, jsonify

app = Flask(__name__)

# --- Configuration ---
# In a real app, use a proper config management system.
NACOS_API_URL = os.environ.get("NACOS_API_URL", "http://localhost:8848/nacos/v1/cs/configs")
NACOS_NAMESPACE = os.environ.get("NACOS_NAMESPACE", "")
CANARY_GROUP = "CANARY_GROUP"
CANARY_ANALYSIS_DURATION_MINS = 10 # How long to wait for a success signal

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# --- State Management (In-memory, use Redis/DB for production) ---
canary_state = {} # e.g., {"payment-service": {"version": "v1.5.1", "status": "ANALYZING"}}


def update_nacos_config(service_name, version, weight, stable_version="v1.5.0"):
    """Publishes the canary routing rules for a service to Nacos."""
    data_id = f"canary-rules.{service_name}.json"

    # Build the payload as a dict and serialize it instead of templating raw
    # JSON strings. The stable version would normally be looked up per service;
    # the default here is a simplification.
    content = json.dumps({
        "serviceName": service_name,
        "rules": [
            {
                "type": "weight",
                "condition": {
                    "canaryVersion": version,
                    "canaryWeight": weight,
                    "stableVersion": stable_version
                }
            }
        ],
        "enabled": True
    })

    params = {
        "dataId": data_id,
        "group": CANARY_GROUP,
        "content": content,
        "tenant": NACOS_NAMESPACE  # Nacos's v1 config API calls the namespace parameter "tenant"
    }

    try:
        response = requests.post(NACOS_API_URL, data=params)
        response.raise_for_status()
        logging.info(f"Successfully updated Nacos for {service_name} to weight {weight}")
        return True
    except requests.exceptions.RequestException as e:
        logging.error(f"Failed to update Nacos config for {service_name}: {e}")
        return False

@app.route('/api/v1/canary/start', methods=['POST'])
def start_canary():
    """Endpoint for CI/CD to start a canary analysis."""
    data = request.json
    service_name = data.get('service_name')
    canary_version = data.get('canary_version')

    if not service_name or not canary_version:
        return jsonify({"error": "service_name and canary_version are required"}), 400

    logging.info(f"Starting canary analysis for {service_name}, version {canary_version}")
    
    # A pitfall: race conditions. A simple state check helps.
    if canary_state.get(service_name, {}).get("status") == "ANALYZING":
        return jsonify({"error": f"Canary for {service_name} is already in progress"}), 409

    canary_state[service_name] = {"version": canary_version, "status": "ANALYZING"}
    
    if update_nacos_config(service_name, canary_version, 5): # Start with 5% traffic
        return jsonify({"status": "Canary analysis started"}), 202
    else:
        canary_state.pop(service_name, None)
        return jsonify({"error": "Failed to start canary analysis"}), 500

@app.route('/webhooks/grafana', methods=['POST'])
def grafana_webhook():
    """Receives alerts from Grafana to drive decisions."""
    alert = request.json
    logging.info(f"Received webhook from Grafana: {alert.get('title')} - {alert.get('state')}")
    
    # In a real system, you'd parse the service name from the alert labels
    # (Grafana's unified-alerting webhook exposes them under 'commonLabels').
    service_name = alert.get('commonLabels', {}).get('service', 'payment-service') # Simplified
    
    if service_name not in canary_state or canary_state[service_name]["status"] != "ANALYZING":
        logging.warning(f"Received alert for service '{service_name}' which is not in analysis state. Ignoring.")
        return jsonify({"status": "ignored"}), 200

    canary_version = canary_state[service_name]["version"]

    if alert.get('state') == 'alerting': # This means an SLO was breached
        logging.warning(f"SLO breach detected for {service_name} v{canary_version}. Rolling back.")
        update_nacos_config(service_name, canary_version, 0)
        canary_state[service_name]["status"] = "ROLLED_BACK"
    
    elif alert.get('state') == 'ok': # This means the SLO is no longer breached
        # We need a dedicated 'promotion' signal, not just 'ok'.
        # A good practice is to have a separate alert rule that fires on 'success'.
        # e.g., "Alert if canary has been running for 10 minutes with no SLO breaches".
        logging.info(f"Canary analysis for {service_name} v{canary_version} successful. Promoting.")
        update_nacos_config(service_name, canary_version, 100)
        canary_state[service_name]["status"] = "PROMOTED"
        
    return jsonify({"status": "processed"}), 200

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5001)

This controller is the final piece. It subscribes to Grafana’s judgment and executes the necessary actions via Nacos’s API, creating the autonomous feedback loop we set out to build.
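
From the pipeline’s point of view, handing off to the controller is a single HTTP call. The sketch below shows that final pipeline step in Go purely to stay consistent with the other examples (in practice it is usually just a curl step); the controller address is an assumption.

package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
)

func main() {
	// Hypothetical last step of the CI/CD pipeline: the new version is already
	// deployed (inactive), so we ask the controller to start the analysis.
	body, _ := json.Marshal(map[string]string{
		"service_name":   "payment-service",
		"canary_version": "v1.5.1",
	})
	resp, err := http.Post("http://canary-controller:5001/api/v1/canary/start",
		"application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatalf("failed to start canary analysis: %v", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusAccepted {
		log.Fatalf("controller rejected canary start: HTTP %d", resp.StatusCode)
	}
	log.Println("canary analysis started; promotion or rollback is now automatic")
}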

This system is not without its limitations. The controller’s state management is naive; for a production system, a persistent store like Redis would be required to survive restarts. The logic doesn’t yet support progressive rollouts (e.g., 5% -> 25% -> 50% -> 100%), which would require a more sophisticated state machine. Furthermore, this approach is best suited for stateless services where traffic can be shifted granularly. It does not solve challenges related to stateful services or database schema migrations, which require different canary strategies.

Future iterations could involve replacing the simple webhook controller with a more robust workflow engine or integrating a statistical analysis tool to move beyond simple threshold-based SLOs. However, this architecture provides a powerful, open-source, and cloud-agnostic foundation for automating one of the most critical and risk-prone phases of the software delivery lifecycle.

