Manual canary analysis is a bottleneck that introduces human error and subjectivity into the release process. A typical scenario involves a DevOps engineer deploying a new version, manually shifting a small percentage of traffic, and then spending the next 30 minutes staring at dashboards, trying to decide if “things look okay.” This process is neither scalable nor reliable. Our team faced this exact problem, where release velocity was being throttled by the fear of production incidents and the tediousness of manual verification.
We needed a system that could make promotion or rollback decisions autonomously, based on predefined, objective criteria—Service Level Objectives (SLOs). The goal was to build a closed-loop, fully automated canary release platform where the only manual step is triggering the deployment.
Our architecture is centered around a feedback loop:
sequenceDiagram
    participant CI as CI/CD Pipeline
    participant Controller as Canary Controller
    participant Terraform
    participant Nacos
    participant Fleet as Application Fleet
    participant Prometheus
    participant Grafana
    CI->>+Terraform: Apply new version (inactive)
    Terraform-->>-CI: Deployment complete
    CI->>+Controller: POST /api/v1/canary/start (service: checkout, version: v1.2.1)
    Controller->>+Nacos: Update config for 'checkout': { "canaryVersion": "v1.2.1", "canaryWeight": 5 }
    Nacos-->>-Fleet: Push config update
    Fleet->>Fleet: Route 5% traffic to v1.2.1
    Fleet->>+Prometheus: Scrape metrics (stable vs canary)
    Prometheus-->>-Grafana: Provide metrics
    Grafana->>Grafana: Evaluate SLOs (e.g., error rate, latency)
    alt SLO Breached
        Grafana->>+Controller: Fire 'Rollback' alert webhook
        Controller->>+Nacos: Update config: { "canaryWeight": 0 }
        Nacos-->>-Controller: ACK
        Controller-->>-Grafana: ACK
    else SLO Met for duration
        Grafana->>+Controller: Fire 'Promote' alert webhook
        Controller->>+Nacos: Update config: { "canaryWeight": 100 }
        Nacos-->>-Controller: ACK
        Controller-->>-Grafana: ACK
    end
The choice of tooling was deliberate. We needed a solution that was infrastructure-agnostic, avoiding a hard dependency on Kubernetes-native tools like Flagger, as parts of our legacy fleet still run on VMs.
- Terraform: Provides the declarative foundation to provision everything consistently—the Nacos cluster, the monitoring stack (Prometheus, Grafana), and the application instances themselves.
- Nacos: Acts as the dynamic control plane. Its configuration management feature is leveraged to dynamically adjust traffic weights for our services without requiring restarts or re-deployments.
- Grafana & Prometheus: Form the observability core. Prometheus scrapes high-cardinality metrics labeled by application version, and Grafana’s alerting engine evaluates these metrics against our SLOs to trigger the feedback loop.
Part 1: Declarative Infrastructure with Terraform
A common mistake in such projects is setting up infrastructure manually. For our system to be reliable, its foundation must be repeatable. We defined the entire stack in Terraform modules.
Here’s a simplified structure for managing the infrastructure:
terraform/
├── environments/
│ ├── staging/
│ │ ├── main.tf
│ │ ├── outputs.tf
│ │ └── variables.tf
│ └── production/
│ ├── ...
├── modules/
│ ├── nacos/
│ │ ├── main.tf
│ │ └── ...
│ ├── monitoring/
│ │ ├── main.tf
│ │ └── ...
│ └── app_service/
│ ├── main.tf
│ └── ...
└── main.tf
The monitoring module sets up Prometheus and Grafana. For this demonstration, we’ll use Docker, but in a real-world project this would provision cloud resources such as an Amazon Managed Service for Prometheus (AMP) workspace and an Amazon Managed Grafana workspace.
modules/monitoring/main.tf
# This is a simplified example using Docker for local demonstration.
# In production, this would use official Helm charts or cloud provider resources.
resource "docker_network" "monitoring_net" {
name = "monitoring-net"
}
resource "docker_container" "prometheus" {
name = "prometheus"
image = "prom/prometheus:v2.47.0"
network_mode = "host" # Simplified for service discovery on localhost
volumes {
host_path = abspath("${path.module}/prometheus.yml")
container_path = "/etc/prometheus/prometheus.yml"
read_only = true
}
}
resource "docker_container" "grafana" {
name = "grafana"
image = "grafana/grafana:10.1.5"
ports {
internal = 3000
external = 3000
}
network_mode = "host"
}
output "grafana_url" {
value = "http://localhost:3000"
}
output "prometheus_url" {
value = "http://localhost:9090"
}
The corresponding prometheus.yml
needs to be configured to scrape our applications. We’ll label deployments by their version.
modules/monitoring/prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'payment-service'
    static_configs:
      - targets: ['localhost:8080', 'localhost:8081'] # Stable instances
      - targets: ['localhost:8082', 'localhost:8083'] # Canary instances (avoid 9090, which Prometheus itself uses)
In a dynamic environment, these targets would be discovered via service discovery integration (e.g., from Nacos itself or from cloud provider tags).
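To make that concrete, here is a hedged sketch of one possible bridge, written against the same nacos-sdk-go client used later in this post. It assumes the services also register themselves with Nacos’s naming service and publish their version in instance metadata (an assumption for this example; our traffic shifting itself only uses configuration management). The sketch periodically renders a Prometheus file_sd target file so the static target list above would not need to be hand-maintained; the service name, output path, and metadata key are placeholders.

// file_sd_bridge.go - a minimal sketch, not our production discovery mechanism.
// Assumption: application instances register with Nacos's naming service and
// set a "version" key in their instance metadata.
package main

import (
	"encoding/json"
	"log"
	"os"
	"strconv"
	"time"

	"github.com/nacos-group/nacos-sdk-go/v2/clients"
	"github.com/nacos-group/nacos-sdk-go/v2/common/constant"
	"github.com/nacos-group/nacos-sdk-go/v2/vo"
)

// targetGroup matches Prometheus's file_sd JSON format.
type targetGroup struct {
	Targets []string          `json:"targets"`
	Labels  map[string]string `json:"labels"`
}

func main() {
	clientConfig := *constant.NewClientConfig(constant.WithTimeoutMs(5000))
	serverConfigs := []constant.ServerConfig{
		*constant.NewServerConfig("127.0.0.1", 8848, constant.WithContextPath("/nacos")),
	}
	namingClient, err := clients.NewNamingClient(vo.NacosClientParam{
		ClientConfig:  &clientConfig,
		ServerConfigs: serverConfigs,
	})
	if err != nil {
		log.Fatalf("Failed to create Nacos naming client: %v", err)
	}

	for {
		instances, err := namingClient.SelectInstances(vo.SelectInstancesParam{
			ServiceName: "payment-service", // placeholder
			GroupName:   "DEFAULT_GROUP",
			HealthyOnly: true,
		})
		if err != nil {
			log.Printf("Failed to list instances: %v", err)
		} else {
			groups := make([]targetGroup, 0, len(instances))
			for _, inst := range instances {
				groups = append(groups, targetGroup{
					Targets: []string{inst.Ip + ":" + strconv.FormatUint(inst.Port, 10)},
					// "version" is assumed to be set in instance metadata at registration time.
					Labels: map[string]string{"job": "payment-service", "version": inst.Metadata["version"]},
				})
			}
			data, _ := json.MarshalIndent(groups, "", "  ")
			// Placeholder path; point file_sd_configs in prometheus.yml at this file.
			if err := os.WriteFile("/etc/prometheus/file_sd/payment-service.json", data, 0o644); err != nil {
				log.Printf("Failed to write file_sd target file: %v", err)
			}
		}
		time.Sleep(30 * time.Second)
	}
}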
Similarly, the nacos module provisions a Nacos server. Again, using Docker for this example:
modules/nacos/main.tf
# Using a standalone Nacos server for simplicity.
# Production setups should use a cluster with an external database.
resource "docker_container" "nacos" {
name = "nacos-server"
image = "nacos/nacos-server:v2.2.3"
ports {
internal = 8848
external = 8848
}
  env = [
    "PREFER_HOST_MODE=hostname",
    "MODE=standalone"
  ]
}
output "nacos_endpoint" {
value = "http://${docker_container.nacos.name}:8848"
}
Using Terraform to manage these components ensures that a developer can spin up an identical environment for testing, and that we can recover from a disaster by simply running terraform apply.
Part 2: Dynamic Traffic Control with Nacos
Nacos is the heart of our traffic shifting mechanism. We don’t use its service discovery feature here, but rather its powerful configuration management. For each service undergoing a canary release, we create a configuration entry.
- Data ID: canary-rules.{{service_name}}.json (e.g., canary-rules.payment-service.json)
- Group: CANARY_GROUP
- Format: JSON
The content of this configuration file dictates the traffic routing logic.
Example canary-rules.payment-service.json:
{
"serviceName": "payment-service",
"rules": [
{
"type": "weight",
"condition": {
"canaryVersion": "v1.5.1",
"canaryWeight": 5,
"stableVersion": "v1.5.0"
}
}
],
"enabled": true,
"lastModified": "2023-10-27T10:00:00Z"
}
This tells our routing layer (e.g., an API gateway or a smart client) to send 5% of traffic to instances of version v1.5.1 and the remaining 95% to v1.5.0.
The application must be architected to fetch and react to these configuration changes in real time. Here’s a conceptual example in Go using the official Nacos SDK. A common pitfall is to fetch configuration only at startup; this defeats the purpose of dynamic control. The key is to use a listener.
package main
import (
	"encoding/json"
	"log"
	"sync"

	"github.com/nacos-group/nacos-sdk-go/v2/clients"
	"github.com/nacos-group/nacos-sdk-go/v2/common/constant"
	"github.com/nacos-group/nacos-sdk-go/v2/vo"
)
// CanaryConfig represents the structure of our routing rules.
type CanaryConfig struct {
ServiceName string `json:"serviceName"`
Rules []struct {
Type string `json:"type"`
Condition struct {
CanaryVersion string `json:"canaryVersion"`
CanaryWeight int `json:"canaryWeight"`
StableVersion string `json:"stableVersion"`
} `json:"condition"`
} `json:"rules"`
Enabled bool `json:"enabled"`
}
// Global variable to hold the current configuration, protected by a mutex.
var (
currentConfig = &CanaryConfig{}
configLock sync.RWMutex
)
func main() {
// Nacos client configuration
clientConfig := *constant.NewClientConfig(
constant.WithNamespaceId(""), // namespace
constant.WithTimeoutMs(5000),
constant.WithNotLoadCacheAtStart(true),
constant.WithLogDir("/tmp/nacos/log"),
constant.WithCacheDir("/tmp/nacos/cache"),
constant.WithLogLevel("debug"),
)
serverConfigs := []constant.ServerConfig{
*constant.NewServerConfig("127.0.0.1", 8848, constant.WithContextPath("/nacos")),
}
configClient, err := clients.NewConfigClient(
vo.NacosClientParam{
ClientConfig: &clientConfig,
ServerConfigs: serverConfigs,
},
)
if err != nil {
log.Fatalf("Failed to create Nacos Config Client: %v", err)
}
// Data ID and Group for our service
dataID := "canary-rules.payment-service.json"
group := "CANARY_GROUP"
// Initial fetch of the configuration
content, err := configClient.GetConfig(vo.ConfigParam{
DataId: dataID,
Group: group,
})
if err != nil {
log.Printf("Warning: Failed to get initial config: %v", err)
} else {
updateConfig(content)
}
// Listen for configuration changes. This is the critical part.
err = configClient.ListenConfig(vo.ConfigParam{
DataId: dataID,
Group: group,
OnChange: func(namespace, group, dataId, data string) {
log.Println("Configuration changed!")
log.Printf("Data: %s", data)
updateConfig(data)
},
})
if err != nil {
log.Fatalf("Failed to listen for config changes: %v", err)
}
// The application would continue running its main logic here.
// For demonstration, we just wait.
select {}
}
func updateConfig(data string) {
var newConfig CanaryConfig
err := json.Unmarshal([]byte(data), &newConfig)
if err != nil {
log.Printf("Error unmarshalling config: %v", err)
return
}
	configLock.Lock()
	defer configLock.Unlock()
	*currentConfig = newConfig
	// Guard against an empty rule set to avoid an index-out-of-range panic.
	if len(newConfig.Rules) > 0 {
		log.Printf("Successfully updated canary config: weight %d%% for version %s",
			newConfig.Rules[0].Condition.CanaryWeight, newConfig.Rules[0].Condition.CanaryVersion)
	} else {
		log.Printf("Successfully updated canary config for %s (no rules defined)", newConfig.ServiceName)
	}
}
// This function would be called by the routing logic of the application
func getCanaryWeight() int {
configLock.RLock()
defer configLock.RUnlock()
if len(currentConfig.Rules) > 0 {
return currentConfig.Rules[0].Condition.CanaryWeight
}
return 0 // Default to no canary traffic
}
This ensures any change made in the Nacos console or via its API is reflected in the application’s routing logic within seconds.
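To illustrate how the weight actually drives routing, here is a minimal, self-contained sketch of a weighted version picker. The pickVersion helper and the hard-coded versions are illustrative only; a production routing layer would also honor the enabled flag and read both versions from the rules fetched above.

package main

import (
	"fmt"
	"math/rand"
)

// pickVersion returns the version a request should be routed to, given the
// current canary weight (0-100) from the Nacos-backed configuration.
func pickVersion(canaryWeight int, stableVersion, canaryVersion string) string {
	if canaryWeight <= 0 {
		return stableVersion // rolled back: all traffic stays on stable
	}
	if canaryWeight >= 100 {
		return canaryVersion // promoted: all traffic moves to the canary
	}
	if rand.Intn(100) < canaryWeight {
		return canaryVersion
	}
	return stableVersion
}

func main() {
	// With a weight of 5, roughly 5% of calls land on the canary version.
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		counts[pickVersion(5, "v1.5.0", "v1.5.1")]++
	}
	fmt.Println(counts) // e.g. map[v1.5.0:9507 v1.5.1:493]
}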
Part 3: Quantifying “Good” with SLOs in Grafana
“Things look okay” is not a metric. An SLO is. For our payment-service, we defined two key Service Level Indicators (SLIs):
- Availability: The percentage of successful requests (non-5xx HTTP status codes).
- Latency: The 99th-percentile (p99) request latency, with a target of under 300ms.
Our SLOs state:
- The canary version’s error rate must not exceed the stable version’s error rate by more than 0.5% absolute.
- The canary version’s p99 latency must not exceed the stable version’s p99 latency by more than 50ms.
To measure this, our application’s metrics must be labeled with the version. Using a Prometheus client library:
// In the HTTP handler of the Go application
httpRequestsTotal.WithLabelValues(method, path, statusCode, os.Getenv("APP_VERSION")).Inc()
requestLatency.WithLabelValues(method, path, os.Getenv("APP_VERSION")).Observe(duration.Seconds())
The APP_VERSION environment variable is set during deployment by Terraform.
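For completeness, here is a self-contained sketch of that instrumentation using the Prometheus Go client (promauto/promhttp). The /pay route and port 8080 are illustrative stand-ins, but the metric and label names (http_requests_total, request_latency, code, version) line up with the PromQL queries below.

package main

import (
	"log"
	"net/http"
	"os"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	appVersion = os.Getenv("APP_VERSION") // injected at deploy time by Terraform

	httpRequestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Total HTTP requests, labeled by version for stable-vs-canary comparison.",
	}, []string{"method", "path", "code", "version"})

	requestLatency = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "request_latency",
		Help:    "Request latency in seconds, labeled by version.",
		Buckets: prometheus.DefBuckets,
	}, []string{"method", "path", "version"})
)

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// instrument wraps a handler and records request count and latency per version.
func instrument(path string, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next(rec, r)
		httpRequestsTotal.WithLabelValues(r.Method, path, strconv.Itoa(rec.status), appVersion).Inc()
		requestLatency.WithLabelValues(r.Method, path, appVersion).Observe(time.Since(start).Seconds())
	}
}

func main() {
	http.Handle("/metrics", promhttp.Handler())
	// "/pay" and port 8080 are illustrative; real routes come from the service itself.
	http.HandleFunc("/pay", instrument("/pay", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}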
Now we can write the PromQL queries to power our Grafana alerts. This is where the real intelligence of the system lies. A common mistake is to set a static threshold for the canary. A much better approach is to compare the canary’s performance against the stable version’s baseline in real-time.
PromQL for Canary Error Rate Delta:
# Canary Error Rate
(
sum(rate(http_requests_total{job="payment-service", version="v1.5.1", code=~"5.."}[5m]))
/
sum(rate(http_requests_total{job="payment-service", version="v1.5.1"}[5m]))
)
-
# Stable Error Rate
(
sum(rate(http_requests_total{job="payment-service", version="v1.5.0", code=~"5.."}[5m]))
/
sum(rate(http_requests_total{job="payment-service", version="v1.5.0"}[5m]))
)
PromQL for Canary p99 Latency Delta:
histogram_quantile(0.99, sum(rate(request_latency_bucket{job="payment-service", version="v1.5.1"}[5m])) by (le))
-
histogram_quantile(0.99, sum(rate(request_latency_bucket{job="payment-service", version="v1.5.0"}[5m])) by (le))
In Grafana, we create an alert rule that combines these conditions. The alert fires if the error rate delta exceeds 0.005 (0.5%) OR the latency delta exceeds 0.05 (50ms, since the latency histogram is recorded in seconds). Crucially, we use the FOR clause to prevent firing on transient spikes: the condition must hold for a sustained period, for instance 5 minutes, before we declare the canary unhealthy.
The alert notification channel is configured to send a webhook to our Canary Controller.
Part 4: The Canary Controller: Closing the Loop
This component is the custom “glue” that automates the decision-making process. It’s a simple microservice (written in Python with Flask for this example) that exposes endpoints for the CI/CD pipeline and listens for webhooks from Grafana.
controller.py
import os
import requests
import logging
from flask import Flask, request, jsonify
app = Flask(__name__)
# --- Configuration ---
# In a real app, use a proper config management system.
NACOS_API_URL = os.environ.get("NACOS_API_URL", "http://localhost:8848/nacos/v1/cs/configs")
NACOS_NAMESPACE = os.environ.get("NACOS_NAMESPACE", "")
CANARY_GROUP = "CANARY_GROUP"
CANARY_ANALYSIS_DURATION_MINS = 10 # How long to wait for a success signal
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
# --- State Management (In-memory, use Redis/DB for production) ---
canary_state = {} # e.g., {"payment-service": {"version": "v1.5.1", "status": "ANALYZING"}}
def update_nacos_config(service_name, version, weight):
"""Updates the canary configuration in Nacos."""
data_id = f"canary-rules.{service_name}.json"
# This payload should be a template for robustness
content = f"""
{{
"serviceName": "{service_name}",
"rules": [
{{
"type": "weight",
"condition": {{
"canaryVersion": "{version}",
"canaryWeight": {weight},
"stableVersion": "v1.5.0"
}}
}}
],
"enabled": true
}}
"""
    params = {
        "dataId": data_id,
        "group": CANARY_GROUP,
        "content": content,
        # The Nacos v1 config API expects the namespace as the 'tenant' parameter.
        "tenant": NACOS_NAMESPACE
    }
try:
response = requests.post(NACOS_API_URL, data=params)
response.raise_for_status()
logging.info(f"Successfully updated Nacos for {service_name} to weight {weight}")
return True
except requests.exceptions.RequestException as e:
logging.error(f"Failed to update Nacos config for {service_name}: {e}")
return False
@app.route('/api/v1/canary/start', methods=['POST'])
def start_canary():
"""Endpoint for CI/CD to start a canary analysis."""
data = request.json
service_name = data.get('service_name')
canary_version = data.get('canary_version')
if not service_name or not canary_version:
return jsonify({"error": "service_name and canary_version are required"}), 400
logging.info(f"Starting canary analysis for {service_name}, version {canary_version}")
# A pitfall: race conditions. A simple state check helps.
if canary_state.get(service_name, {}).get("status") == "ANALYZING":
return jsonify({"error": f"Canary for {service_name} is already in progress"}), 409
canary_state[service_name] = {"version": canary_version, "status": "ANALYZING"}
if update_nacos_config(service_name, canary_version, 5): # Start with 5% traffic
return jsonify({"status": "Canary analysis started"}), 202
else:
canary_state.pop(service_name, None)
return jsonify({"error": "Failed to start canary analysis"}), 500
@app.route('/webhooks/grafana', methods=['POST'])
def grafana_webhook():
"""Receives alerts from Grafana to drive decisions."""
alert = request.json
logging.info(f"Received webhook from Grafana: {alert.get('title')} - {alert.get('state')}")
# In a real system, you'd parse labels from the alert to get the service_name
service_name = alert.get('tags', {}).get('service', 'payment-service') # Simplified
if service_name not in canary_state or canary_state[service_name]["status"] != "ANALYZING":
logging.warning(f"Received alert for service '{service_name}' which is not in analysis state. Ignoring.")
return jsonify({"status": "ignored"}), 200
canary_version = canary_state[service_name]["version"]
if alert.get('state') == 'alerting': # This means an SLO was breached
logging.warning(f"SLO breach detected for {service_name} v{canary_version}. Rolling back.")
update_nacos_config(service_name, canary_version, 0)
canary_state[service_name]["status"] = "ROLLED_BACK"
elif alert.get('state') == 'ok': # This means the SLO is no longer breached
# We need a dedicated 'promotion' signal, not just 'ok'.
# A good practice is to have a separate alert rule that fires on 'success'.
# e.g., "Alert if canary has been running for 10 minutes with no SLO breaches".
logging.info(f"Canary analysis for {service_name} v{canary_version} successful. Promoting.")
update_nacos_config(service_name, canary_version, 100)
canary_state[service_name]["status"] = "PROMOTED"
return jsonify({"status": "processed"}), 200
if __name__ == '__main__':
app.run(host='0.0.0.0', port=5001)
This controller is the final piece. It subscribes to Grafana’s judgment and executes the necessary actions via Nacos’s API, creating the autonomous feedback loop we set out to build.
This system is not without its limitations. The controller’s state management is naive; for a production system, a persistent store like Redis would be required to survive restarts. The logic doesn’t yet support progressive rollouts (e.g., 5% -> 25% -> 50% -> 100%), which would require a more sophisticated state machine. Furthermore, this approach is best suited for stateless services where traffic can be shifted granularly. It does not solve challenges related to stateful services or database schema migrations, which require different canary strategies.
Future iterations could involve replacing the simple webhook controller with a more robust workflow engine or integrating a statistical analysis tool to move beyond simple threshold-based SLOs. However, this architecture provides a powerful, open-source, and cloud-agnostic foundation for automating one of the most critical and risk-prone phases of the software delivery lifecycle.