Protecting a Saga Orchestrator with APISIX Circuit Breaking and Consul Health Checks

Distributed Systems

Word Count: 2.7k

Read Times: 17 Min

Our order processing pipeline went down for 45 minutes last Tuesday. The root cause was not a database failure or a network partition, but a subtle bug in a single compensating transaction endpoint within our Saga-based ordering system. An inventory-service, deployed independently from its own repository, had a flawed POST /compensate handler that consistently returned 500 Internal Server Error. Our central order-orchestrator, doing its job, dutifully retried the compensation, creating a feedback loop that saturated its connection pool and exhausted its CPU. The orchestrator, the heart of our transaction system, became the bottleneck and eventually fell over, causing a complete halt to new orders.

The initial post-mortem pointed to adding more sophisticated backoff and retry logic within the orchestrator. This is a common but flawed reaction. It bloats the orchestrator with infrastructure concerns, violates the single responsibility principle, and still doesn’t prevent a determined retry storm. The core problem is that the orchestrator shouldn’t need to know about the transient (or not-so-transient) health of downstream services. Its role is to define business logic flow, not to be a network resilience framework. The failure needs to be isolated at the infrastructure layer, preventing the pathological traffic from ever reaching the orchestrator.

This led to a redesign of our service-to-service communication fabric. We decided to leverage our existing API Gateway, Apache APISIX, not just for ingress traffic, but as a smart proxy for critical internal service calls. By combining APISIX’s api-breaker plugin with Consul’s robust health-checking capabilities, we can create a defensive perimeter around the orchestrator, allowing it to fail a saga gracefully without compromising its own stability.

The architecture is straightforward: all calls from the orchestrator to participating services are routed through a local APISIX instance. APISIX, in turn, uses Consul as a service discovery mechanism. This is where the resilience pattern comes into play. When Consul’s health checks detect a failing service instance, APISIX’s circuit breaker trips, immediately failing fast on subsequent calls to that specific endpoint, giving the faulty service time to recover and, more importantly, protecting the orchestrator from self-induced DDoS.

Here’s the visual representation of the interaction:

sequenceDiagram
    participant Client
    participant APISIX
    participant OrderOrchestrator as Orchestrator
    participant PaymentService as Payment Svc
    participant InventoryService as Inventory Svc
    participant Consul

    Note over Inventory Svc, Consul: Inventory Svc health check starts failing.
    Consul->>APISIX: Notifies APISIX of unhealthy inventory-service instance.

    Client->>APISIX: POST /orders/create
    APISIX->>Orchestrator: Forward Request
    Orchestrator->>APISIX: 1. call /execute (Payment)
    APISIX->>Payment Svc: Forward Request
    Payment Svc-->>APISIX: 200 OK
    APISIX-->>Orchestrator: 200 OK

    Orchestrator->>APISIX: 2. call /execute (Inventory) - This call fails
    APISIX->>Inventory Svc: Forward Request
    Inventory Svc-->>APISIX: 500 Internal Error
    APISIX-->>Orchestrator: 500 Internal Error

    Note over Orchestrator: Execution fails, start compensation.

    Orchestrator->>APISIX: 3. call /compensate (Payment)
    APISIX->>Payment Svc: Forward Request
    Payment Svc-->>APISIX: 200 OK
    APISIX-->>Orchestrator: 200 OK

    Note right of APISIX: Circuit breaker on inventory/compensate is now active.

    Orchestrator->>APISIX: 4. call /compensate (Inventory) - First attempt
    APISIX-->>Inventory Svc: Forward Request
    Inventory Svc-->>APISIX: 500 Internal Error
    APISIX-->>Orchestrator: 500 Internal Error (Breaker counts failure)

    Orchestrator->>APISIX: 5. call /compensate (Inventory) - Retry
    APISIX-->>Inventory Svc: Forward Request
    Inventory Svc-->>APISIX: 500 Internal Error
    APISIX-->>Orchestrator: 500 Internal Error (Breaker trips!)

    Note right of APISIX: Breaker is OPEN.

    Orchestrator->>APISIX: 6. call /compensate (Inventory) - Subsequent Retries
    APISIX-->>Orchestrator: 503 Service Unavailable (Immediately, no backend call)
    APISIX-->>Orchestrator: 503 Service Unavailable (Immediately)

    Note over Orchestrator: Orchestrator stops retrying, marks Saga as FAILED_COMPENSATION, remains healthy.

Let’s build this entire stack to demonstrate the principle in a production-grade context. The setup will involve a Docker Compose environment, Go-based microservices, Consul for service discovery, and a fully configured APISIX instance.

The Full Stack Configuration: Docker Compose

A polyrepo environment implies that each service is managed and deployed independently. For this demonstration, we’ll bring them together using docker-compose.yml, which also serves as a clear definition of the infrastructure.

# docker-compose.yml
version: "3.8"

services:
  consul:
    image: "consul:1.15"
    container_name: consul-server
    ports:
      - "8500:8500"
    volumes:
      - ./consul/config:/consul/config
    command: "agent -server -bootstrap-expect=1 -ui -data-dir=/consul/data -client=0.0.0.0"

  apisix:
    image: apache/apisix:3.5.0-debian
    container_name: apisix-gateway
    restart: always
    volumes:
      - ./apisix/config.yaml:/usr/local/apisix/conf/config.yaml:ro
      - ./apisix/apisix.yaml:/usr/local/apisix/conf/apisix.yaml:ro
    depends_on:
      - consul
    ports:
      - "9080:9080"
      - "9180:9180"
    environment:
      - APISIX_STAND_ALONE=true

  order-orchestrator:
    build:
      context: ./order-orchestrator
    container_name: order-orchestrator-svc
    depends_on:
      - consul
      - apisix
    environment:
      - SERVICE_NAME=order-orchestrator
      - SERVICE_PORT=8080
      - CONSUL_HTTP_ADDR=consul:8500
      - APISIX_URL=http://apisix-gateway:9080
    expose:
      - "8080"

  payment-service:
    build:
      context: ./payment-service
    container_name: payment-service-svc
    depends_on:
      - consul
    environment:
      - SERVICE_NAME=payment-service
      - SERVICE_PORT=8081
      - CONSUL_HTTP_ADDR=consul:8500
    expose:
      - "8081"

  inventory-service:
    build:
      context: ./inventory-service
    container_name: inventory-service-svc
    depends_on:
      - consul
    environment:
      - SERVICE_NAME=inventory-service
      - SERVICE_PORT=8082
      - CONSUL_HTTP_ADDR=consul:8500
      - FAIL_COMPENSATION=true # Deliberate failure flag
    expose:
      - "8082"

networks:
  default:
    name: saga-resilience-net

The Core Logic: Saga Orchestrator in Go

The orchestrator’s code is notable for its simplicity. All resilience logic is externalized. It only knows the steps of the transaction and their corresponding compensations. It communicates via the APISIX URL, not directly with services.

// order-orchestrator/main.go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"time"

	consul "github.com/hashicorp/consul/api"
)

// SagaStep defines a single step in the transaction, including its compensation.
type SagaStep struct {
	Name        string
	ExecuteURL  string
	CompensateURL string
}

type OrderRequest struct {
	ItemID string `json:"item_id"`
	UserID string `json:"user_id"`
}

var apisixURL string

func main() {
	// Standard service setup: get config from environment, register with Consul.
	serviceName := os.Getenv("SERVICE_NAME")
	servicePort := os.Getenv("SERVICE_PORT")
	consulAddr := os.Getenv("CONSUL_HTTP_ADDR")
	apisixURL = os.Getenv("APISIX_URL")
	if apisixURL == "" {
		log.Fatal("APISIX_URL environment variable not set")
	}

	registerWithConsul(serviceName, servicePort, consulAddr)

	http.HandleFunc("/create-order", createOrderHandler)
	log.Printf("Order Orchestrator started on port %s", servicePort)
	log.Fatal(http.ListenAndServe(":"+servicePort, nil))
}

func createOrderHandler(w http.ResponseWriter, r *http.Request) {
	log.Println("Received new order request")
	var req OrderRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, "Invalid request body", http.StatusBadRequest)
		return
	}

	// Define the saga steps. Note the URLs point to APISIX routes, not direct services.
	saga := []SagaStep{
		{
			Name:        "Payment",
			ExecuteURL:  fmt.Sprintf("%s/payment/execute", apisixURL),
			CompensateURL: fmt.Sprintf("%s/payment/compensate", apisixURL),
		},
		{
			Name:        "Inventory",
			ExecuteURL:  fmt.Sprintf("%s/inventory/execute", apisixURL),
			CompensateURL: fmt.Sprintf("%s/inventory/compensate", apisixURL),
		},
	}

	// Execute the saga
	executedSteps := []SagaStep{}
	for _, step := range saga {
		log.Printf("Executing step: %s", step.Name)
		err := executeStep(step.ExecuteURL, req)
		if err != nil {
			log.Printf("Execution failed for step %s: %v. Starting compensation.", step.Name, err)
			compensate(executedSteps)
			http.Error(w, fmt.Sprintf("Order failed at step %s and could not be compensated fully.", step.Name), http.StatusInternalServerError)
			return
		}
		executedSteps = append(executedSteps, step)
	}

	w.WriteHeader(http.StatusOK)
	w.Write([]byte("Order created successfully"))
	log.Println("Saga completed successfully")
}

func executeStep(url string, payload interface{}) error {
	jsonData, _ := json.Marshal(payload)
	resp, err := http.Post(url, "application/json", bytes.NewBuffer(jsonData))
	if err != nil {
		return fmt.Errorf("HTTP request failed: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode >= 400 {
		body, _ := io.ReadAll(resp.Body)
		return fmt.Errorf("service returned error: %s - %s", resp.Status, string(body))
	}
	return nil
}

func compensate(steps []SagaStep) {
	// Compensate in reverse order
	for i := len(steps) - 1; i >= 0; i-- {
		step := steps[i]
		log.Printf("Compensating step: %s", step.Name)
		
		// This is the critical part. If the compensation call fails repeatedly,
		// without a circuit breaker, this loop would hammer the failing service.
		err := executeStep(step.CompensateURL, OrderRequest{}) // Dummy payload for simplicity
		if err != nil {
			// In a real system, you'd flag this transaction for manual review.
			// The key is that the orchestrator itself doesn't retry indefinitely.
			log.Printf("FATAL: Compensation failed for step %s: %v. Saga is in an inconsistent state.", step.Name, err)
		}
	}
}

// Helper function to register service with Consul
func registerWithConsul(serviceName, servicePort, consulAddr string) {
	config := consul.DefaultConfig()
	config.Address = consulAddr
	client, err := consul.NewClient(config)
	if err != nil {
		log.Fatalf("Failed to create Consul client: %v", err)
	}

	agent := client.Agent()
	serviceDef := &consul.AgentServiceRegistration{
		ID:      serviceName,
		Name:    serviceName,
		Port:    8080,
		Address: "order-orchestrator-svc", // Docker DNS name
		Check: &consul.AgentServiceCheck{
			HTTP:     fmt.Sprintf("http://order-orchestrator-svc:%s/health", servicePort),
			Interval: "10s",
			Timeout:  "1s",
		},
	}
	if err := agent.ServiceRegister(serviceDef); err != nil {
		log.Fatalf("Failed to register service with Consul: %v", err)
	}
	log.Printf("Successfully registered '%s' with Consul", serviceName)
}

The Failing Participant: Inventory Service

To simulate our production failure, the inventory-service is programmed to fail its compensation endpoint based on an environment variable. This mimics a bad deployment.

// inventory-service/main.go
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
	
	consul "github.com/hashicorp/consul/api"
)

var failCompensation bool

func main() {
	if os.Getenv("FAIL_COMPENSATION") == "true" {
		failCompensation = true
		log.Println("WARNING: Inventory service is configured to FAIL compensation requests.")
	}

	serviceName := os.Getenv("SERVICE_NAME")
	servicePort := os.Getenv("SERVICE_PORT")
	consulAddr := os.Getenv("CONSUL_HTTP_ADDR")

	registerWithConsul(serviceName, servicePort, consulAddr)

	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) { w.WriteHeader(http.StatusOK) })
	http.HandleFunc("/execute", handleExecute)
	http.HandleFunc("/compensate", handleCompensate)
	
	log.Printf("Inventory Service started on port %s", servicePort)
	log.Fatal(http.ListenAndServe(":"+servicePort, nil))
}

func handleExecute(w http.ResponseWriter, r *http.Request) {
	log.Println("Handling execute request...")
	// In a real-world scenario, this would fail and trigger the saga compensation.
	// For this test, we make it fail to start the rollback process.
	http.Error(w, "Failed to reserve inventory", http.StatusInternalServerError)
}

func handleCompensate(w http.ResponseWriter, r *http.Request) {
	if failCompensation {
		log.Println("CRITICAL: Handling compensate request and failing intentionally.")
		http.Error(w, "Database connection failed during compensation", http.StatusInternalServerError)
		return
	}
	log.Println("Handling compensate request successfully.")
	w.WriteHeader(http.StatusOK)
}

// Consul registration is similar to the orchestrator's
func registerWithConsul(serviceName, servicePort, consulAddr string) {
	// ... implementation omitted for brevity, it's identical to the orchestrator's
}

The payment-service is implemented similarly but with handlers that always return 200 OK.

The Glue and The Shield: APISIX and Consul Configuration

This is where the pattern is actually implemented. We define APISIX routes and upstreams declaratively.

First, the apisix/config.yaml tells APISIX to use Consul for service discovery.

# apisix/config.yaml
apisix:
  node_listen: 9080
  enable_admin: true
  admin_key:
    - name: "admin"
      key: "edd1c9f034335f136f87ad84b625c8f1"
      role: "admin"

etcd:
  host: "http://etcd:2379" # This is unused in standalone mode
  prefix: "/apisix"

discovery:
  consul:
    servers:
      - "http://consul-server:8500"
    wait_time: 5

Next, the core configuration in apisix/apisix.yaml defines our services and the crucial circuit breaker plugin.

# apisix/apisix.yaml
# This file is used when APISIX starts in standalone mode
# It defines all routes, services, and plugins.

routes:
  # Route for the order orchestrator
  - id: "order-orchestrator-route"
    uri: "/orders/*"
    upstream_id: "order-orchestrator-upstream"

  # Route for the payment service
  - id: "payment-service-route"
    uri: "/payment/*"
    upstream_id: "payment-service-upstream"
  
  # Route for the inventory service.
  # THIS IS WHERE THE CIRCUIT BREAKER IS APPLIED.
  - id: "inventory-service-route"
    uri: "/inventory/*"
    plugins:
      api-breaker:
        # We only apply the breaker to the compensation endpoint.
        # This is a critical detail: don't break the main execution path unless necessary.
        break_on_uri: "/inventory/compensate"
        unhealthy:
          http_statuses: [500, 503] # Trip on these status codes
          failures: 2 # Trip the breaker after 2 consecutive failures
        healthy:
          http_statuses: [200]
          successes: 1
        max_breaker_sec: 30 # Keep the breaker open for 30 seconds
    upstream_id: "inventory-service-upstream"

upstreams:
  # The upstreams use Consul for dynamic node discovery.
  - id: "order-orchestrator-upstream"
    type: roundrobin
    discovery_type: consul
    service_name: order-orchestrator # This name must match the service name in Consul

  - id: "payment-service-upstream"
    type: roundrobin
    discovery_type: consul
    service_name: payment-service

  - id: "inventory-service-upstream"
    type: roundrobin
    discovery_type: consul
    service_name: inventory-service

A common mistake is to apply a circuit breaker to an entire service. Here, we apply it only to the break_on_uri: "/inventory/compensate" endpoint. This is a surgical application of the pattern. The execution path might be healthy while the compensation path is not; we don’t want to block new transactions unnecessarily. The failures: 2 setting means after just two failed retries from the orchestrator, APISIX will trip the breaker and start returning 503 Service Unavailable immediately, without forwarding the request.

Demonstrating the Resilience

After running docker-compose up -d --build, we can trigger the flow.

1. Initial Request to Trigger the Saga

We send a request to the orchestrator via APISIX. The inventory-service‘s /execute endpoint will fail, triggering the compensation flow.

curl -i -X POST http://localhost:9080/orders/create -d '{"item_id": "item-123", "user_id": "user-456"}'

2. Observing the Logs

Let’s trace the logs from docker-compose logs -f.

Order Orchestrator Logs:

order-orchestrator-svc | 2023/10/27 11:50:00 Received new order request
order-orchestrator-svc | 2023/10/27 11:50:00 Executing step: Payment
order-orchestrator-svc | 2023/10/27 11:50:00 Executing step: Inventory
order-orchestrator-svc | 2023/10/27 11:50:00 Execution failed for step Inventory: service returned error: 500 Internal Server Error - Failed to reserve inventory. Starting compensation.
order-orchestrator-svc | 2023/10/27 11:50:00 Compensating step: Payment
order-orchestrator-svc | 2023/10/27 11:50:00 Compensating step: Inventory
# First attempt to compensate Inventory...
order-orchestrator-svc | 2023/10/27 11:50:01 FATAL: Compensation failed for step Inventory: service returned error: 500 Internal Server Error - Database connection failed during compensation
# Second attempt...
order-orchestrator-svc | 2023/10/27 11:50:02 FATAL: Compensation failed for step Inventory: service returned error: 500 Internal Server Error - Database connection failed during compensation
# Breaker trips, subsequent calls fail fast
order-orchestrator-svc | 2023/10/27 11:50:03 FATAL: Compensation failed for step Inventory: service returned error: 503 Service Unavailable -

The orchestrator logs show it tried to compensate, received two 500 errors, and then immediately received a 503. It gives up and logs the fatal error, but remains operational for new requests.

APISIX Logs (Error Log):
We need to check the error.log inside the APISIX container.

docker exec apisix-gateway tail -f /usr/local/apisix/logs/error.log

You will see messages like this:

2023/10/27 11:50:02 [warn] 25#25: *12 api-breaker: health check for ("/inventory/compensate") is unhealthy, reason: consecutive failures 2 >= 2, client: 172.24.0.4, server: _, request: "POST /inventory/compensate HTTP/1.1", host: "localhost:9080"
2023/10/27 11:50:03 [error] 25#25: *14 api-breaker: circuit is open for uri /inventory/compensate, client: 172.24.0.4, server: _, request: "POST /inventory/compensate HTTP/1.1", host: "localhost:9080"

This is a clear confirmation. APISIX identified the consecutive failures and opened the circuit, protecting the entire system. The orchestrator is no longer able to hammer the failing inventory-service.

This pattern successfully offloaded the responsibility of handling a misbehaving downstream service from the application layer to the infrastructure layer. The orchestrator remains simple and focused on business logic. The polyrepo structure, which can exacerbate such issues by creating deployment information silos, is now manageable because the infrastructure provides a safety net.

The solution is not without its own set of considerations. The primary limitation is that the business transaction is now left in an inconsistent state, which is logged as FAILED_COMPENSATION. This necessitates a separate process, either automated or manual, for reconciliation. This is an accepted trade-off in Saga patterns; our architecture simply ensures the system survives to allow for that reconciliation. Furthermore, our current breaker is reactive, based on HTTP status codes. A more proactive approach could integrate metrics from a system like Prometheus, allowing the breaker to trip based on elevated latency or error rates before they become critical failures. This would be a logical next step in maturing the resilience of the platform.

Consul APISIX Saga Pattern Resilience Engineering Polyrepo

Building a Resilient Client-Side Logging Pipeline with Service Workers Laravel and a Fluentd Lua Filter

2023-10-27 Observability

Fluentd Lua Service Workers Observability Laravel

Automating a Resilient Serverless Visualization Pipeline with Ansible OpenFaaS and SQS on EKS

2023-10-27 Cloud Native

AWS EKS Matplotlib AWS SQS OpenFaaS Ansible