Our order processing pipeline went down for 45 minutes last Tuesday. The root cause was not a database failure or a network partition, but a subtle bug in a single compensating transaction endpoint within our Saga-based ordering system. An inventory-service, deployed independently from its own repository, had a flawed POST /compensate handler that consistently returned 500 Internal Server Error. Our central order-orchestrator, doing its job, dutifully retried the compensation, creating a feedback loop that saturated its connection pool and exhausted its CPU. The orchestrator, the heart of our transaction system, became the bottleneck and eventually fell over, causing a complete halt to new orders.
The initial post-mortem pointed to adding more sophisticated backoff and retry logic within the orchestrator. This is a common but flawed reaction. It bloats the orchestrator with infrastructure concerns, violates the single responsibility principle, and still doesn’t prevent a determined retry storm. The core problem is that the orchestrator shouldn’t need to know about the transient (or not-so-transient) health of downstream services. Its role is to define business logic flow, not to be a network resilience framework. The failure needs to be isolated at the infrastructure layer, preventing the pathological traffic from ever reaching the orchestrator.
This led to a redesign of our service-to-service communication fabric. We decided to leverage our existing API Gateway, Apache APISIX, not just for ingress traffic, but as a smart proxy for critical internal service calls. By combining APISIX’s api-breaker plugin with Consul’s health-checking capabilities, we can create a defensive perimeter around the orchestrator, allowing it to fail a saga gracefully without compromising its own stability.
The architecture is straightforward: all calls from the orchestrator to participating services are routed through a local APISIX instance, and APISIX uses Consul for service discovery. The two layers are complementary. Consul’s health checks keep the upstream pools accurate, removing instances that fail their checks outright. APISIX’s circuit breaker handles the subtler case we hit in production: a service whose health check still passes while one specific endpoint keeps returning errors. When that happens, the breaker trips and immediately fails fast on subsequent calls to that endpoint, giving the faulty service time to recover and, more importantly, protecting the orchestrator from a self-induced denial of service.
Here’s the visual representation of the interaction:
sequenceDiagram
    participant Client
    participant APISIX
    participant OrderOrchestrator as Orchestrator
    participant PaymentService as Payment Svc
    participant InventoryService as Inventory Svc
    participant Consul

    Note over InventoryService, Consul: A bad deploy breaks the /compensate handler. The /health check still passes, so Consul keeps the instance in APISIX's upstream pool.
    Client->>APISIX: POST /orders/create
    APISIX->>OrderOrchestrator: Forward request
    OrderOrchestrator->>APISIX: 1. call /payment/execute
    APISIX->>PaymentService: Forward request
    PaymentService-->>APISIX: 200 OK
    APISIX-->>OrderOrchestrator: 200 OK
    OrderOrchestrator->>APISIX: 2. call /inventory/execute (this call fails)
    APISIX->>InventoryService: Forward request
    InventoryService-->>APISIX: 500 Internal Error
    APISIX-->>OrderOrchestrator: 500 Internal Error
    Note over OrderOrchestrator: Execution fails, start compensation in reverse order, including the failed step.
    OrderOrchestrator->>APISIX: 3. call /inventory/compensate (first attempt)
    APISIX->>InventoryService: Forward request
    InventoryService-->>APISIX: 500 Internal Error
    APISIX-->>OrderOrchestrator: 500 Internal Error (breaker counts failure 1)
    OrderOrchestrator->>APISIX: 4. call /inventory/compensate (retry)
    APISIX->>InventoryService: Forward request
    InventoryService-->>APISIX: 500 Internal Error
    APISIX-->>OrderOrchestrator: 500 Internal Error (breaker trips!)
    Note right of APISIX: Breaker on /inventory/compensate is now OPEN.
    OrderOrchestrator->>APISIX: 5. call /inventory/compensate (final retry)
    APISIX-->>OrderOrchestrator: 503 Service Unavailable (immediately, no backend call)
    Note over OrderOrchestrator: Inventory compensation abandoned, saga marked FAILED_COMPENSATION.
    OrderOrchestrator->>APISIX: 6. call /payment/compensate
    APISIX->>PaymentService: Forward request
    PaymentService-->>APISIX: 200 OK
    APISIX-->>OrderOrchestrator: 200 OK
    Note over OrderOrchestrator: Orchestrator finishes compensation attempts and remains healthy.
Let’s build this entire stack to demonstrate the principle in a production-grade context. The setup will involve a Docker Compose environment, Go-based microservices, Consul for service discovery, and a fully configured APISIX instance.
The Full Stack Configuration: Docker Compose
A polyrepo environment implies that each service is managed and deployed independently. For this demonstration, we’ll bring them together using docker-compose.yml, which also serves as a clear definition of the infrastructure.
# docker-compose.yml
version: "3.8"
services:
consul:
image: "consul:1.15"
container_name: consul-server
ports:
- "8500:8500"
volumes:
- ./consul/config:/consul/config
command: "agent -server -bootstrap-expect=1 -ui -data-dir=/consul/data -client=0.0.0.0"
apisix:
image: apache/apisix:3.5.0-debian
container_name: apisix-gateway
restart: always
volumes:
- ./apisix/config.yaml:/usr/local/apisix/conf/config.yaml:ro
- ./apisix/apisix.yaml:/usr/local/apisix/conf/apisix.yaml:ro
depends_on:
- consul
ports:
- "9080:9080"
- "9180:9180"
environment:
- APISIX_STAND_ALONE=true
order-orchestrator:
build:
context: ./order-orchestrator
container_name: order-orchestrator-svc
depends_on:
- consul
- apisix
environment:
- SERVICE_NAME=order-orchestrator
- SERVICE_PORT=8080
- CONSUL_HTTP_ADDR=consul:8500
- APISIX_URL=http://apisix-gateway:9080
expose:
- "8080"
payment-service:
build:
context: ./payment-service
container_name: payment-service-svc
depends_on:
- consul
environment:
- SERVICE_NAME=payment-service
- SERVICE_PORT=8081
- CONSUL_HTTP_ADDR=consul:8500
expose:
- "8081"
inventory-service:
build:
context: ./inventory-service
container_name: inventory-service-svc
depends_on:
- consul
environment:
- SERVICE_NAME=inventory-service
- SERVICE_PORT=8082
- CONSUL_HTTP_ADDR=consul:8500
- FAIL_COMPENSATION=true # Deliberate failure flag
expose:
- "8082"
networks:
default:
name: saga-resilience-net
The Core Logic: Saga Orchestrator in Go
The orchestrator’s code is notable for its simplicity. All resilience logic is externalized. It only knows the steps of the transaction and their corresponding compensations. It communicates via the APISIX URL, not directly with services.
// order-orchestrator/main.go
package main
import (
"bytes"
"encoding/json"
"fmt"
"io"
"log"
"net/http"
"os"
"time"
consul "github.com/hashicorp/consul/api"
)
// SagaStep defines a single step in the transaction, including its compensation.
type SagaStep struct {
Name string
ExecuteURL string
CompensateURL string
}
type OrderRequest struct {
ItemID string `json:"item_id"`
UserID string `json:"user_id"`
}
var apisixURL string
func main() {
// Standard service setup: get config from environment, register with Consul.
serviceName := os.Getenv("SERVICE_NAME")
servicePort := os.Getenv("SERVICE_PORT")
consulAddr := os.Getenv("CONSUL_HTTP_ADDR")
apisixURL = os.Getenv("APISIX_URL")
if apisixURL == "" {
log.Fatal("APISIX_URL environment variable not set")
}
registerWithConsul(serviceName, servicePort, consulAddr)
http.HandleFunc("/create-order", createOrderHandler)
log.Printf("Order Orchestrator started on port %s", servicePort)
log.Fatal(http.ListenAndServe(":"+servicePort, nil))
}
func createOrderHandler(w http.ResponseWriter, r *http.Request) {
log.Println("Received new order request")
var req OrderRequest
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
http.Error(w, "Invalid request body", http.StatusBadRequest)
return
}
// Define the saga steps. Note the URLs point to APISIX routes, not direct services.
saga := []SagaStep{
{
Name: "Payment",
ExecuteURL: fmt.Sprintf("%s/payment/execute", apisixURL),
CompensateURL: fmt.Sprintf("%s/payment/compensate", apisixURL),
},
{
Name: "Inventory",
ExecuteURL: fmt.Sprintf("%s/inventory/execute", apisixURL),
CompensateURL: fmt.Sprintf("%s/inventory/compensate", apisixURL),
},
}
// Execute the saga
executedSteps := []SagaStep{}
for _, step := range saga {
log.Printf("Executing step: %s", step.Name)
err := executeStep(step.ExecuteURL, req)
if err != nil {
log.Printf("Execution failed for step %s: %v. Starting compensation.", step.Name, err)
			// The failed step may have partially applied its changes, so compensate it as well.
			compensate(append(executedSteps, step))
			http.Error(w, fmt.Sprintf("Order failed at step %s; compensation was attempted.", step.Name), http.StatusInternalServerError)
return
}
executedSteps = append(executedSteps, step)
}
w.WriteHeader(http.StatusOK)
w.Write([]byte("Order created successfully"))
log.Println("Saga completed successfully")
}
func executeStep(url string, payload interface{}) error {
jsonData, _ := json.Marshal(payload)
resp, err := http.Post(url, "application/json", bytes.NewBuffer(jsonData))
if err != nil {
return fmt.Errorf("HTTP request failed: %w", err)
}
defer resp.Body.Close()
if resp.StatusCode >= 400 {
body, _ := io.ReadAll(resp.Body)
return fmt.Errorf("service returned error: %s - %s", resp.Status, string(body))
}
return nil
}
func compensate(steps []SagaStep) {
	// Bounded retries: anything beyond this is the circuit breaker's job, not the orchestrator's.
	const maxAttempts = 3
	// Compensate in reverse order, starting with the most recently attempted step.
	for i := len(steps) - 1; i >= 0; i-- {
		step := steps[i]
		log.Printf("Compensating step: %s", step.Name)
		// This is the critical part. If the compensation call fails repeatedly,
		// without a circuit breaker this loop would hammer the failing service.
		var err error
		for attempt := 1; attempt <= maxAttempts; attempt++ {
			err = executeStep(step.CompensateURL, OrderRequest{}) // Dummy payload for simplicity
			if err == nil {
				break
			}
			log.Printf("Compensation attempt %d/%d failed for step %s: %v", attempt, maxAttempts, step.Name, err)
			if attempt < maxAttempts {
				time.Sleep(1 * time.Second)
			}
		}
		if err != nil {
			// In a real system, you'd flag this transaction for manual review.
			// The key is that the orchestrator itself doesn't retry indefinitely.
			log.Printf("FATAL: Compensation failed for step %s: %v. Saga is in FAILED_COMPENSATION state.", step.Name, err)
		}
	}
}
// Helper function to register service with Consul
func registerWithConsul(serviceName, servicePort, consulAddr string) {
config := consul.DefaultConfig()
config.Address = consulAddr
client, err := consul.NewClient(config)
if err != nil {
log.Fatalf("Failed to create Consul client: %v", err)
}
agent := client.Agent()
	port, err := strconv.Atoi(servicePort)
	if err != nil {
		log.Fatalf("Invalid SERVICE_PORT %q: %v", servicePort, err)
	}
	address := serviceName + "-svc" // Docker DNS name, matches container_name in docker-compose.yml
	serviceDef := &consul.AgentServiceRegistration{
		ID:      serviceName,
		Name:    serviceName,
		Port:    port,
		Address: address,
		Check: &consul.AgentServiceCheck{
			HTTP:     fmt.Sprintf("http://%s:%s/health", address, servicePort),
			Interval: "10s",
			Timeout:  "1s",
		},
	}
if err := agent.ServiceRegister(serviceDef); err != nil {
log.Fatalf("Failed to register service with Consul: %v", err)
}
log.Printf("Successfully registered '%s' with Consul", serviceName)
}
The Failing Participant: Inventory Service
To simulate our production failure, the inventory-service is programmed to fail its compensation endpoint based on an environment variable. This mimics a bad deployment.
// inventory-service/main.go
package main
import (
"fmt"
"log"
"net/http"
"os"
consul "github.com/hashicorp/consul/api"
)
var failCompensation bool
func main() {
if os.Getenv("FAIL_COMPENSATION") == "true" {
failCompensation = true
log.Println("WARNING: Inventory service is configured to FAIL compensation requests.")
}
serviceName := os.Getenv("SERVICE_NAME")
servicePort := os.Getenv("SERVICE_PORT")
consulAddr := os.Getenv("CONSUL_HTTP_ADDR")
registerWithConsul(serviceName, servicePort, consulAddr)
http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) { w.WriteHeader(http.StatusOK) })
http.HandleFunc("/execute", handleExecute)
http.HandleFunc("/compensate", handleCompensate)
log.Printf("Inventory Service started on port %s", servicePort)
log.Fatal(http.ListenAndServe(":"+servicePort, nil))
}
func handleExecute(w http.ResponseWriter, r *http.Request) {
log.Println("Handling execute request...")
	// In a real deployment this would reserve stock; for this demo it always fails,
	// which is what kicks off the saga's compensation flow.
http.Error(w, "Failed to reserve inventory", http.StatusInternalServerError)
}
func handleCompensate(w http.ResponseWriter, r *http.Request) {
if failCompensation {
log.Println("CRITICAL: Handling compensate request and failing intentionally.")
http.Error(w, "Database connection failed during compensation", http.StatusInternalServerError)
return
}
log.Println("Handling compensate request successfully.")
w.WriteHeader(http.StatusOK)
}
// Consul registration is similar to the orchestrator's
func registerWithConsul(serviceName, servicePort, consulAddr string) {
// ... implementation omitted for brevity, it's identical to the orchestrator's
}
The payment-service is implemented similarly, but its /payment/execute and /payment/compensate handlers always return 200 OK.
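For completeness, here is a minimal sketch of what that service might look like. It is an assumption, not code from the actual repository: the Consul registration call is elided because it reuses the same helper as the other services, and the handler paths follow the same gateway-prefix convention.

// payment-service/main.go (illustrative sketch)
package main

import (
	"log"
	"net/http"
	"os"
)

func main() {
	servicePort := os.Getenv("SERVICE_PORT")
	// registerWithConsul(os.Getenv("SERVICE_NAME"), servicePort, os.Getenv("CONSUL_HTTP_ADDR")) // same helper as the other services

	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) { w.WriteHeader(http.StatusOK) })
	// Both saga endpoints succeed unconditionally; payment is the "well-behaved" participant in this demo.
	http.HandleFunc("/payment/execute", func(w http.ResponseWriter, r *http.Request) {
		log.Println("Payment executed")
		w.WriteHeader(http.StatusOK)
	})
	http.HandleFunc("/payment/compensate", func(w http.ResponseWriter, r *http.Request) {
		log.Println("Payment compensated (refund issued)")
		w.WriteHeader(http.StatusOK)
	})

	log.Printf("Payment Service started on port %s", servicePort)
	log.Fatal(http.ListenAndServe(":"+servicePort, nil))
}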
The Glue and The Shield: APISIX and Consul Configuration
This is where the pattern is actually implemented. We define APISIX routes and upstreams declaratively.
First, the apisix/config.yaml tells APISIX to use Consul for service discovery.
# apisix/config.yaml
apisix:
  node_listen: 9080

# Standalone mode: APISIX reads routes and upstreams from apisix.yaml instead of etcd,
# and the Admin API is not used.
deployment:
  role: data_plane
  role_data_plane:
    config_provider: yaml

discovery:
  consul:
    servers:
      - "http://consul-server:8500"
Next, the core configuration in apisix/apisix.yaml defines our routes, upstreams, and the crucial circuit breaker plugin.
# apisix/apisix.yaml
# This file is used when APISIX starts in standalone mode
# It defines all routes, services, and plugins.
routes:
  # Route for the order orchestrator
  - id: "order-orchestrator-route"
    uri: "/orders/*"
    upstream_id: "order-orchestrator-upstream"
  # Route for the payment service
  - id: "payment-service-route"
    uri: "/payment/*"
    upstream_id: "payment-service-upstream"
  # Catch-all route for the inventory service; the execute path is NOT behind a breaker.
  - id: "inventory-service-route"
    uri: "/inventory/*"
    upstream_id: "inventory-service-upstream"
  # Dedicated route for the compensation endpoint. The exact-match URI takes
  # precedence over the /inventory/* catch-all above.
  # THIS IS WHERE THE CIRCUIT BREAKER IS APPLIED.
  - id: "inventory-compensate-route"
    uri: "/inventory/compensate"
    plugins:
      api-breaker:
        break_response_code: 503      # Returned immediately while the breaker is open
        unhealthy:
          http_statuses: [500, 503]   # Trip on these status codes
          failures: 2                 # Trip the breaker after 2 consecutive failures
        healthy:
          http_statuses: [200]
          successes: 1
        max_breaker_sec: 30           # Keep the breaker open for up to 30 seconds
    upstream_id: "inventory-service-upstream"

upstreams:
  # The upstreams use Consul for dynamic node discovery.
  - id: "order-orchestrator-upstream"
    type: roundrobin
    discovery_type: consul
    service_name: order-orchestrator   # Must match the service name registered in Consul
  - id: "payment-service-upstream"
    type: roundrobin
    discovery_type: consul
    service_name: payment-service
  - id: "inventory-service-upstream"
    type: roundrobin
    discovery_type: consul
    service_name: inventory-service
#END
A common mistake is to apply a circuit breaker to an entire service. Here, the /inventory/compensate endpoint gets its own route, and the api-breaker plugin is attached only there; the exact-match route takes precedence over the /inventory/* catch-all. This is a surgical application of the pattern: the execution path might be healthy while the compensation path is not, and we don’t want to block new transactions unnecessarily. The failures: 2 setting means that after just two failed retries from the orchestrator, APISIX trips the breaker and starts returning 503 Service Unavailable (the configured break_response_code) immediately, without forwarding the request.
Demonstrating the Resilience
After running docker-compose up -d --build, we can trigger the flow.
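Before sending any traffic, it is worth confirming that all three services have registered and are passing their health checks. Consul’s standard health API (or the UI on http://localhost:8500) shows this; each command below should return a non-empty JSON array of passing instances.
curl -s http://localhost:8500/v1/health/service/order-orchestrator?passing
curl -s http://localhost:8500/v1/health/service/payment-service?passing
curl -s http://localhost:8500/v1/health/service/inventory-service?passing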
1. Initial Request to Trigger the Saga
We send a request to the orchestrator via APISIX. The inventory-service’s /inventory/execute endpoint will fail, triggering the compensation flow.
curl -i -X POST http://localhost:9080/orders/create -d '{"item_id": "item-123", "user_id": "user-456"}'
2. Observing the Logs
Let’s trace the logs from docker-compose logs -f.
Order Orchestrator Logs:
order-orchestrator-svc | 2023/10/27 11:50:00 Received new order request
order-orchestrator-svc | 2023/10/27 11:50:00 Executing step: Payment
order-orchestrator-svc | 2023/10/27 11:50:00 Executing step: Inventory
order-orchestrator-svc | 2023/10/27 11:50:00 Execution failed for step Inventory: service returned error: 500 Internal Server Error - Failed to reserve inventory. Starting compensation.
order-orchestrator-svc | 2023/10/27 11:50:00 Compensating step: Inventory
# First attempt to compensate Inventory...
order-orchestrator-svc | 2023/10/27 11:50:01 Compensation attempt 1/3 failed for step Inventory: service returned error: 500 Internal Server Error - Database connection failed during compensation
# Second attempt... two consecutive failures, the breaker trips
order-orchestrator-svc | 2023/10/27 11:50:02 Compensation attempt 2/3 failed for step Inventory: service returned error: 500 Internal Server Error - Database connection failed during compensation
# Breaker is open, the final attempt fails fast without reaching the service
order-orchestrator-svc | 2023/10/27 11:50:03 Compensation attempt 3/3 failed for step Inventory: service returned error: 503 Service Unavailable -
order-orchestrator-svc | 2023/10/27 11:50:03 FATAL: Compensation failed for step Inventory: service returned error: 503 Service Unavailable - . Saga is in FAILED_COMPENSATION state.
order-orchestrator-svc | 2023/10/27 11:50:03 Compensating step: Payment
The orchestrator logs show it tried to compensate the inventory step, received two 500 errors, and then immediately received a 503. It gives up after its bounded retries, logs the fatal error, still compensates the payment step, and remains operational for new requests.
APISIX Logs (Error Log):
We need to check the error.log inside the APISIX container.
docker exec apisix-gateway tail -f /usr/local/apisix/logs/error.log
You will see messages like this:
2023/10/27 11:50:02 [warn] 25#25: *12 api-breaker: health check for ("/inventory/compensate") is unhealthy, reason: consecutive failures 2 >= 2, client: 172.24.0.4, server: _, request: "POST /inventory/compensate HTTP/1.1", host: "localhost:9080"
2023/10/27 11:50:03 [error] 25#25: *14 api-breaker: circuit is open for uri /inventory/compensate, client: 172.24.0.4, server: _, request: "POST /inventory/compensate HTTP/1.1", host: "localhost:9080"
This is a clear confirmation: APISIX identified the consecutive failures and opened the circuit, protecting the entire system. The orchestrator is no longer able to hammer the failing inventory-service.
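While the breaker is still open (within the max_breaker_sec window), you can probe the route yourself and watch it fail fast. The request below should come back with a 503 almost instantly, and no corresponding log line appears in the inventory-service output because the call never reaches the backend.
curl -i -X POST http://localhost:9080/inventory/compensate -d '{}'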
This pattern successfully offloaded the responsibility of handling a misbehaving downstream service from the application layer to the infrastructure layer. The orchestrator remains simple and focused on business logic. The polyrepo structure, which can exacerbate such issues by creating deployment information silos, is now manageable because the infrastructure provides a safety net.
The solution is not without its own set of considerations. The primary limitation is that the business transaction is now left in an inconsistent state, which is logged as FAILED_COMPENSATION. This necessitates a separate process, either automated or manual, for reconciliation. This is an accepted trade-off in Saga patterns; our architecture simply ensures the system survives to allow for that reconciliation. Furthermore, our current breaker is reactive, based on HTTP status codes. A more proactive approach could integrate metrics from a system like Prometheus, allowing the breaker to trip based on elevated latency or error rates before they become critical failures. This would be a logical next step in maturing the resilience of the platform.
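One way to make that reconciliation concrete is to persist a record whenever compensation is abandoned. The sketch below is a hypothetical addition to order-orchestrator/main.go, not part of the code above: SagaRecord and flagForReconciliation are names introduced here for illustration, and compensate() would call flagForReconciliation instead of only logging the failure. A separate worker, or an operator, would later replay the compensation once the inventory-service is fixed.

// Hypothetical record persisted when compensation is abandoned, so a separate
// reconciliation worker (or a human) can replay it once the service recovers.
type SagaRecord struct {
	OrderID       string    `json:"order_id"`
	FailedStep    string    `json:"failed_step"`
	CompensateURL string    `json:"compensate_url"`
	Status        string    `json:"status"` // e.g. "FAILED_COMPENSATION"
	LastError     string    `json:"last_error"`
	RecordedAt    time.Time `json:"recorded_at"`
}

// flagForReconciliation stands in for a durable write (database, outbox table, queue, ...).
func flagForReconciliation(rec SagaRecord) {
	rec.Status = "FAILED_COMPENSATION"
	rec.RecordedAt = time.Now()
	payload, _ := json.Marshal(rec)
	log.Printf("RECONCILE: %s", payload) // replace with a durable write in a real system
}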