A set of Flask-based microservices was exhibiting classic cascading-failure symptoms. The order-service depended on both a stable user-service and a notoriously fragile, third-party legacy inventory-service. Under even moderate load, the inventory-service would begin to throw 503 Service Unavailable errors, which in turn caused requests to the order-service to hang and eventually time out. This locked up worker threads, leading to a complete system stall. The initial mandate was clear: stabilize the system without undertaking a full rewrite of the legacy inventory service, a task for which we had neither the time nor the resources.
The first iteration was to simply put a proxy in front of everything. This provided a unified entry point but did nothing to solve the underlying failure propagation. The core pain point remained: a fault in one downstream service was crippling the entire request lifecycle. The initial concept, therefore, evolved into deploying an intelligent edge proxy that could act as a control plane, isolating failures and managing traffic flow based on the real-time health of the backend services.
Envoy Proxy was selected over alternatives like Nginx or HAProxy for three primary reasons. First, its dynamic configuration capabilities via xDS APIs are unparalleled, allowing for configuration changes without disruptive reloads. Second, its rich, built-in support for advanced resilience patterns like circuit breaking, outlier detection, and sophisticated rate limiting is production-proven. Finally, its first-class observability, exposing a wealth of statistics, was crucial for diagnosing the very problems we were trying to solve. Implementing this logic in application-level middleware within each Flask service was dismissed as it would lead to code duplication, language-specific implementations, and tighter coupling between business logic and infrastructure concerns.
Baseline Architecture: The Point of Failure
To replicate the failure scenario, we'll define a docker-compose.yml that orchestrates our services and the Envoy proxy.
# docker-compose.yml
version: '3.8'
services:
user-service:
build: ./user-service
ports:
- "5001:5000"
environment:
- FLASK_APP=app.py
inventory-service:
build: ./inventory-service
ports:
- "5002:5000"
environment:
- FLASK_APP=app.py
order-service:
build: ./order-service
ports:
- "5003:5000"
environment:
- FLASK_APP=app.py
- USER_SERVICE_URL=http://user-service:5000
- INVENTORY_SERVICE_URL=http://inventory-service:5000
envoy:
image: envoyproxy/envoy:v1.27.0
ports:
- "10000:10000" # Listener port
- "9901:9901" # Admin port
volumes:
- ./envoy.yaml:/etc/envoy/envoy.yaml:ro
The Flask services are minimal. The user-service is always stable; the inventory-service is designed to be fragile.
user-service/app.py:
# user-service/app.py
import logging
from flask import Flask, jsonify
app = Flask(__name__)
logging.basicConfig(level=logging.INFO)
@app.route('/users/<user_id>')
def get_user(user_id):
"""A consistently reliable service."""
app.logger.info(f"User service queried for user_id: {user_id}")
return jsonify({"user_id": user_id, "name": "John Doe", "status": "active"})
if __name__ == '__main__':
app.run(host='0.0.0.0', port=5000)
# user-service/Dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["flask", "run", "--host=0.0.0.0"]
# user-service/requirements.txt
Flask==2.2.2
inventory-service/app.py:
This service simulates failure by tracking request counts and returning 503 errors after a threshold is breached.
# inventory-service/app.py
import logging
import random
import threading
from flask import Flask, jsonify, abort
app = Flask(__name__)
logging.basicConfig(level=logging.INFO)
# Simulate a fragile service with a request counter
REQUEST_COUNT = 0
FAILURE_THRESHOLD = 5
LOCK = threading.Lock()
@app.route('/inventory/<item_id>')
def get_inventory(item_id):
"""
This service becomes unreliable under load.
It will start returning 503 errors after 5 requests and occasionally fail.
"""
global REQUEST_COUNT
with LOCK:
REQUEST_COUNT += 1
current_count = REQUEST_COUNT
app.logger.info(f"Inventory service queried for item_id: {item_id}. Request count: {current_count}")
if current_count > FAILURE_THRESHOLD:
app.logger.error("Service overloaded! Returning 503.")
# Reset counter to allow recovery after a while
if current_count > FAILURE_THRESHOLD + 10:
with LOCK:
REQUEST_COUNT = 0
abort(503, description="Inventory service is currently overloaded.")
# Simulate random transient failures
if random.random() < 0.1: # 10% chance of random failure
app.logger.warning("Simulating transient failure.")
abort(503, description="Transient database connection error.")
return jsonify({"item_id": item_id, "stock": 100})
if __name__ == '__main__':
app.run(host='0.0.0.0', port=5000)
The Dockerfile and requirements for inventory-service are identical to those of user-service.
order-service/app.py:
This service orchestrates calls to the other two.
# order-service/app.py
import os
import logging
import requests
from flask import Flask, jsonify, abort
app = Flask(__name__)
logging.basicConfig(level=logging.INFO)
USER_SERVICE_URL = os.getenv("USER_SERVICE_URL")
INVENTORY_SERVICE_URL = os.getenv("INVENTORY_SERVICE_URL")
@app.route('/orders/create/<user_id>/<item_id>')
def create_order(user_id, item_id):
"""
Creates an order by coordinating with user and inventory services.
A failure in the inventory service will cause this endpoint to fail.
"""
try:
app.logger.info(f"Fetching user: {user_id}")
user_resp = requests.get(f"{USER_SERVICE_URL}/users/{user_id}", timeout=2)
user_resp.raise_for_status()
user_data = user_resp.json()
app.logger.info(f"Checking inventory for item: {item_id}")
inv_resp = requests.get(f"{INVENTORY_SERVICE_URL}/inventory/{item_id}", timeout=2)
inv_resp.raise_for_status() # This is where the failure will propagate
inv_data = inv_resp.json()
except requests.exceptions.RequestException as e:
app.logger.error(f"Error communicating with downstream service: {e}")
abort(504, description="Gateway timeout communicating with a backend service.")
app.logger.info("Successfully created order.")
return jsonify({
"status": "order_created",
"user": user_data,
"inventory": inv_data
})
if __name__ == '__main__':
app.run(host='0.0.0.0', port=5000)
The Dockerfile and requirements (adding requests) for order-service are straightforward; a sketch of the requirements file follows.
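For reference, a plausible order-service/requirements.txt; the pinned versions here are assumptions rather than values from the original project:
# order-service/requirements.txt (illustrative pins)
Flask==2.2.2
requests==2.28.2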
Our initial envoy.yaml is a simple pass-through router.
# envoy.yaml (v1 - Basic Routing)
admin:
address:
socket_address:
address: 0.0.0.0
port_value: 9901
static_resources:
listeners:
- name: listener_0
address:
socket_address:
address: 0.0.0.0
port_value: 10000
filter_chains:
- filters:
- name: envoy.filters.network.http_connection_manager
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
stat_prefix: ingress_http
route_config:
name: local_route
virtual_hosts:
- name: backend
domains: ["*"]
routes:
- match: { prefix: "/users/" }
route: { cluster: user_service_cluster }
- match: { prefix: "/inventory/" }
route: { cluster: inventory_service_cluster }
- match: { prefix: "/orders/" }
route: { cluster: order_service_cluster }
http_filters:
- name: envoy.filters.http.router
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
clusters:
- name: user_service_cluster
connect_timeout: 0.25s
type: STRICT_DNS
lb_policy: ROUND_ROBIN
load_assignment:
cluster_name: user_service_cluster
endpoints:
- lb_endpoints:
- endpoint:
address:
socket_address:
address: user-service
port_value: 5000
- name: inventory_service_cluster
connect_timeout: 0.25s
type: STRICT_DNS
lb_policy: ROUND_ROBIN
load_assignment:
cluster_name: inventory_service_cluster
endpoints:
- lb_endpoints:
- endpoint:
address:
socket_address:
address: inventory-service
port_value: 5000
- name: order_service_cluster
connect_timeout: 0.25s
type: STRICT_DNS
lb_policy: ROUND_ROBIN
load_assignment:
cluster_name: order_service_cluster
endpoints:
- lb_endpoints:
- endpoint:
address:
socket_address:
address: order-service
port_value: 5000
Running docker-compose up and hitting http://localhost:10000/orders/create/123/abc with a simple load test script (e.g., for i in {1..10}; do curl ...; done) quickly demonstrates the problem; a Python equivalent is sketched below. After five successful requests, the order-service starts returning 504 Gateway Timeout because the inventory-service is returning 503.
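A minimal Python alternative to the shell loop, assuming the requests library is installed:
# load_test.py (illustrative)
import requests

URL = "http://localhost:10000/orders/create/123/abc"

for i in range(1, 11):
    try:
        resp = requests.get(URL, timeout=5)
        print(f"Request {i}: HTTP {resp.status_code}")
    except requests.exceptions.RequestException as exc:
        print(f"Request {i}: failed ({exc})")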
Step 1: Implementing Circuit Breaking with Outlier Detection
The first line of defense is to stop sending traffic to an upstream service that is clearly failing. This prevents the order-service from wasting resources on requests that are doomed to fail. Envoy's outlier_detection feature implements this circuit-breaking pattern. We modify the inventory_service_cluster configuration in envoy.yaml.
# envoy.yaml (v2 - Adding Outlier Detection)
...
clusters:
- name: user_service_cluster
...
- name: inventory_service_cluster
connect_timeout: 0.25s
type: STRICT_DNS
lb_policy: ROUND_ROBIN
load_assignment:
cluster_name: inventory_service_cluster
endpoints:
- lb_endpoints:
- endpoint:
address:
socket_address:
address: inventory-service
port_value: 5000
# Key addition for circuit breaking
outlier_detection:
consecutive_5xx: 3 # Eject after 3 consecutive 5xx responses
interval: 10s # Check health every 10 seconds
base_ejection_time: 30s # Eject for a minimum of 30 seconds
max_ejection_percent: 100 # Allow ejecting all hosts in the cluster
enforcing_consecutive_5xx: 100 # Enforce ejection 100% of the time the consecutive-5xx rule triggers
split_external_local_origin_errors: true
- name: order_service_cluster
...
A breakdown of these parameters is critical for production tuning:
- consecutive_5xx: 3: This is our trigger. If Envoy receives three 5xx responses in a row from an endpoint, it considers that endpoint unhealthy. In a real-world project, this value must be balanced: too low, and you risk ejecting hosts due to transient blips; too high, and you react too slowly.
- interval: 10s: Envoy runs its outlier analysis every 10 seconds. This determines the frequency of the evaluation that can lead to an ejection.
- base_ejection_time: 30s: Once a host is ejected, it is kept out of the load-balancing pool for at least 30 seconds. The actual ejection time increases with subsequent ejections.
- max_ejection_percent: 100: This is vital. It allows Envoy to eject all endpoints of the inventory-service if necessary. Without it, a cluster with a single instance would never have that host ejected, because the default cap is 10%.
- split_external_local_origin_errors: true: This is a subtle but important setting. It tells Envoy to distinguish between errors generated by the upstream service (like our Flask 503) and errors generated by Envoy itself (like a connection timeout). We only want to trigger on genuine upstream failures.
After restarting Envoy with this configuration, re-running the load test yields a different result. The first few requests to /orders/create/123/abc succeed. Then, as the inventory-service begins to fail, the order-service starts receiving immediate 503 responses from Envoy instead of hanging. Envoy's access logs will show the UH response flag (no healthy upstream hosts), and its admin stats (:9901/stats) will show incrementing cluster.inventory_service_cluster.outlier_detection.ejections_enforced_total and ejections_active counters; a small script for checking them follows. The system is now failing fast instead of failing slow, which is a significant improvement.
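A minimal sketch for scraping those counters from the admin endpoint, assuming the requests library and Envoy's standard cluster.<name>.outlier_detection.* stat naming:
# check_outlier_stats.py (illustrative)
import requests

# The admin endpoint returns plain-text stats, one "name: value" pair per line.
stats = requests.get("http://localhost:9901/stats", timeout=2).text

for line in stats.splitlines():
    if "inventory_service_cluster.outlier_detection" in line:
        print(line)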
sequenceDiagram
    participant Client
    participant Envoy
    participant OrderService
    participant InventoryService
    Client->>+Envoy: GET /orders/... (1)
    Envoy->>+OrderService: GET /orders/... (1)
    OrderService->>+InventoryService: GET /inventory/... (1)
    InventoryService-->>-OrderService: 200 OK
    OrderService-->>-Envoy: 200 OK
    Envoy-->>-Client: 200 OK
    Client->>Envoy: GET /orders/... (2-5)
    Note right of InventoryService: Service hits failure threshold
    Client->>+Envoy: GET /orders/... (6)
    Envoy->>+OrderService: GET /orders/... (6)
    OrderService->>+InventoryService: GET /inventory/... (6)
    InventoryService-->>-OrderService: 503 Overloaded
    OrderService-->>-Envoy: 504 Gateway Timeout
    Envoy-->>-Client: 504 Gateway Timeout
    Note over Envoy,InventoryService: After 3 consecutive 5xx, Envoy ejects InventoryService.
    Client->>+Envoy: GET /orders/... (7)
    Envoy-->>-Client: 503 Service Unavailable (UH flag)
    Note right of Envoy: Envoy immediately returns 503 without contacting OrderService, as its upstream dependency is broken.
Step 2: From Blunt Instrument to Adaptive Control
Circuit breaking protects the system when a service is already dead. The next step is to prevent the service from dying in the first place. A simple rate limit is a start. Envoy’s local rate limit filter is easy to configure but has a major drawback: it’s static.
# In the HttpConnectionManager filter chain
- name: envoy.filters.http.local_ratelimit
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
stat_prefix: http_local_rate_limiter
token_bucket:
max_tokens: 5
tokens_per_fill: 5
fill_interval: 10s
filter_enabled:
runtime_key: local_rate_limit_enabled
default_value:
numerator: 100
denominator: HUNDRED
filter_enforced:
runtime_key: local_rate_limit_enforced
default_value:
numerator: 100
denominator: HUNDRED
This configuration would apply a crude limit of 5 requests every 10 seconds to all traffic. This is not what we want. We need to selectively and dynamically rate-limit traffic only to the fragile inventory-service, and ideally adjust that limit based on its health.
This requires moving from local rate limiting to global rate limiting, which involves an external gRPC service that Envoy queries for every request to make a rate limit decision. The powerful part is that we can implement any logic we want inside this Rate Limit Service (RLS). We will build this RLS in Python with gRPC.
First, we define the gRPC service contract. Envoy's rate limit filter calls a specific gRPC method, so the service definition must line up with Envoy's envoy.service.ratelimit.v3 rate limit API; the proto below is a trimmed-down mirror of it.
ratelimit.proto:
syntax = "proto3";
package envoy.service.ratelimit.v3;
import "google/protobuf/struct.proto";
service RateLimitService {
rpc ShouldRateLimit(RateLimitRequest) returns (RateLimitResponse) {}
}
message RateLimitRequest {
string domain = 1;
repeated RateLimitDescriptor descriptors = 2;
uint32 hits_addend = 3;
}
message RateLimitDescriptor {
message Entry {
string key = 1;
string value = 2;
}
repeated Entry entries = 1;
}
message RateLimitResponse {
enum Code {
UNKNOWN = 0;
OK = 1;
OVER_LIMIT = 2;
}
Code overall_code = 1;
repeated RateLimit.Status statuses = 2;
message RateLimit {
message Status {
Code code = 1;
RateLimit current_limit = 2;
uint32 limit_remaining = 3;
}
message Unit {
enum Enum {
UNKNOWN = 0;
SECOND = 1;
MINUTE = 2;
HOUR = 3;
DAY = 4;
}
Enum unit = 1;
}
uint32 requests_per_unit = 1;
Unit unit = 2;
}
}
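The ratelimit_pb2 and ratelimit_pb2_grpc modules imported in the next listing must be generated from this proto. One way to do that, assuming grpcio-tools is installed, is a small helper script:
# generate_stubs.py (illustrative; requires grpcio-tools)
from grpc_tools import protoc

# Writes ratelimit_pb2.py and ratelimit_pb2_grpc.py next to ratelimit.proto.
protoc.main([
    "grpc_tools.protoc",
    "-I.",
    "--python_out=.",
    "--grpc_python_out=.",
    "ratelimit.proto",
])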
Next, we build the gRPC Rate Limit Service in Python. This service will maintain a simple in-memory state of request counts for different services and adjust the rate limit decision accordingly.
rate-limit-service/app.py:
# rate-limit-service/app.py
import grpc
import logging
from concurrent import futures
import time
import threading
# Import generated gRPC code
import ratelimit_pb2
import ratelimit_pb2_grpc
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
# In-memory store for adaptive logic. In production, use Redis or a similar distributed store.
ERROR_RATES = {"inventory_service": 0}
REQUEST_COUNTS = {"inventory_service": 0}
CURRENT_LIMITS = {"inventory_service": 10} # Start with a high limit
LOCK = threading.Lock()
class RateLimitService(ratelimit_pb2_grpc.RateLimitServiceServicer):
def ShouldRateLimit(self, request, context):
"""
Main logic for adaptive rate limiting.
"""
overall_code = ratelimit_pb2.RateLimitResponse.Code.OK
for descriptor in request.descriptors:
service_key = None
for entry in descriptor.entries:
if entry.key == "service":
service_key = entry.value
if service_key == "inventory_service":
with LOCK:
# Increment request count for this service
REQUEST_COUNTS[service_key] += 1
# Get the current dynamic limit
limit = CURRENT_LIMITS.get(service_key, 10)
# Basic time window logic (requests per 10 seconds)
# A real implementation would use a sliding window algorithm.
if REQUEST_COUNTS[service_key] > limit:
logging.warning(f"'{service_key}' is OVER_LIMIT. Current count: {REQUEST_COUNTS[service_key]}, Limit: {limit}")
overall_code = ratelimit_pb2.RateLimitResponse.Code.OVER_LIMIT
else:
logging.info(f"'{service_key}' is OK. Current count: {REQUEST_COUNTS[service_key]}, Limit: {limit}")
overall_code = ratelimit_pb2.RateLimitResponse.Code.OK
return ratelimit_pb2.RateLimitResponse(overall_code=overall_code)
def monitor_and_adjust_limits():
"""
A background thread to simulate monitoring backend health and adjusting limits.
In a real system, this would subscribe to a metrics source like Prometheus.
Here, we'll just periodically reset our request counters to simulate a time window.
"""
global REQUEST_COUNTS
while True:
time.sleep(10) # Our time window is 10 seconds
with LOCK:
logging.info("Resetting request counters for time window.")
# A more sophisticated logic would adjust CURRENT_LIMITS based on observed error rates
# For this example, we'll just reset counts.
for key in REQUEST_COUNTS:
REQUEST_COUNTS[key] = 0
logging.info(f"Current limits: {CURRENT_LIMITS}")
def serve():
server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
ratelimit_pb2_grpc.add_RateLimitServiceServicer_to_server(RateLimitService(), server)
server.add_insecure_port('[::]:50051')
server.start()
logging.info("gRPC Rate Limit Service started on port 50051.")
# Start the monitoring thread
monitor_thread = threading.Thread(target=monitor_and_adjust_limits, daemon=True)
monitor_thread.start()
server.wait_for_termination()
if __name__ == '__main__':
serve()
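Before wiring the RLS into Envoy, it can be smoke-tested with a small gRPC client. This sketch reuses the generated stubs and assumes the service is running locally on port 50051:
# rls_smoke_test.py (illustrative)
import grpc

import ratelimit_pb2
import ratelimit_pb2_grpc

channel = grpc.insecure_channel("localhost:50051")
stub = ratelimit_pb2_grpc.RateLimitServiceStub(channel)

# Mimic the descriptor Envoy sends for the /inventory/ route.
request = ratelimit_pb2.RateLimitRequest(
    domain="app_edge",
    descriptors=[
        ratelimit_pb2.RateLimitDescriptor(
            entries=[
                ratelimit_pb2.RateLimitDescriptor.Entry(
                    key="service", value="inventory_service"
                )
            ]
        )
    ],
)

# With the default in-memory limit of 10 per window, the last calls should flip to OVER_LIMIT.
for i in range(15):
    response = stub.ShouldRateLimit(request)
    print(f"Call {i + 1}: {ratelimit_pb2.RateLimitResponse.Code.Name(response.overall_code)}")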
We also need to update docker-compose.yml to build and run this new service; a sketch of the addition follows.
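An illustrative compose entry, assuming the RLS code and a matching Dockerfile live in ./rate-limit-service. The service name must match the rate-limit-service address used in the Envoy cluster below; the published port is only needed for local smoke tests, since Envoy reaches the RLS over the compose network.
# docker-compose.yml (additional service, illustrative)
rate-limit-service:
  build: ./rate-limit-service
  ports:
    - "50051:50051"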
Finally, we update envoy.yaml to use this global rate limiter.
# envoy.yaml (v3 - Adaptive Global Rate Limiting)
...
static_resources:
listeners:
- name: listener_0
...
filter_chains:
- filters:
- name: envoy.filters.network.http_connection_manager
typed_config:
...
http_filters:
# The rate limit filter must come before the router filter.
- name: envoy.filters.http.ratelimit
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
domain: app_edge # A namespace for rate limit rules
rate_limit_service:
grpc_service:
envoy_grpc:
cluster_name: rate_limit_cluster
transport_api_version: V3
- name: envoy.filters.http.router
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
route_config:
name: local_route
virtual_hosts:
- name: backend
domains: ["*"]
routes:
- match: { prefix: "/users/" }
route: { cluster: user_service_cluster }
- match: { prefix: "/inventory/" }
route:
cluster: inventory_service_cluster
# This section tells Envoy what descriptors to send to the RLS for this route.
rate_limits:
- actions:
- request_headers:
header_name: ":path"
descriptor_key: "path"
- generic_key:
descriptor_value: "inventory_service"
descriptor_key: "service"
- match: { prefix: "/orders/" }
route: { cluster: order_service_cluster }
clusters:
...
# New cluster definition for our gRPC Rate Limit Service
- name: rate_limit_cluster
connect_timeout: 0.25s
type: STRICT_DNS
lb_policy: ROUND_ROBIN
# Important: This must be HTTP/2 for gRPC
typed_extension_protocol_options:
envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
"@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
explicit_http_config:
http2_protocol_options: {}
load_assignment:
cluster_name: rate_limit_cluster
endpoints:
- lb_endpoints:
- endpoint:
address:
socket_address:
address: rate-limit-service
port_value: 50051
The key change is in the route configuration for /inventory/. The rate_limits section defines a set of actions that generate descriptors to be sent to our RLS. Here, we're sending the request path plus a generic key-value pair {"service": "inventory_service"}. Our RLS uses this service key to apply the correct logic.
With this final configuration, the system behavior is now much more robust.
- When traffic to /inventory/ is low, requests pass through Envoy, the RLS returns OK, and the backend service responds.
- As traffic increases, the RLS starts returning OVER_LIMIT once its internal counter exceeds the dynamic threshold. Envoy sees this response and immediately returns a 429 Too Many Requests to the client, protecting the inventory-service from overload.
- If a burst of errors still gets through and the inventory-service starts failing, the outlier_detection circuit breaker kicks in as a final safety net, ejecting the service and preventing cascading failure.
graph TD
    subgraph "Client Request Flow"
        Client -- "GET /inventory/abc" --> Envoy
    end
    subgraph "Envoy Processing"
        Envoy -- "1. Match Route" --> RouteConfig
        RouteConfig -- "2. Generate Descriptors {'service': 'inventory_service'}" --> RateLimitFilter
        RateLimitFilter -- "3. gRPC Request" --> RLS[gRPC Rate Limit Service]
        RLS -- "4. Decision (OK/OVER_LIMIT)" --> RateLimitFilter
        subgraph "Decision Path"
            direction LR
            RateLimitFilter -- "If OK" --> Router
            RateLimitFilter -- "If OVER_LIMIT" --> Return429[Return 429]
        end
        Router -- "5. Forward to Upstream" --> InventoryCluster
    end
    subgraph "Backend"
        InventoryCluster -- "HTTP GET" --> FlaskInventory[Flask Inventory Service]
    end
    subgraph "Circuit Breaker Logic"
        direction TB
        FlaskInventory -- "Monitors 5xx errors" --> OutlierDetection[Outlier Detection]
        OutlierDetection -- "If unhealthy" --> EjectHost{Eject Host}
        EjectHost -- "Blocks traffic" --> InventoryCluster
    end
    style RLS fill:#f9f,stroke:#333,stroke-width:2px
    style OutlierDetection fill:#ccf,stroke:#333,stroke-width:2px
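To observe the rate limiter and the circuit breaker working together, a small script can hammer the inventory route through Envoy and tally the response codes. This is a sketch, assuming the full stack from docker-compose up is running:
# verify_resilience.py (illustrative)
from collections import Counter

import requests

URL = "http://localhost:10000/inventory/abc"
results = Counter()

for _ in range(50):
    try:
        results[requests.get(URL, timeout=5).status_code] += 1
    except requests.exceptions.RequestException:
        results["error"] += 1

# Expect a mix of 200 (allowed), 429 (rejected by the RLS via Envoy),
# and 503 (upstream failures or an ejected host), depending on timing.
print(dict(results))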
The current implementation of the Rate Limit Service is a proof-of-concept. Its state is ephemeral and local to a single instance, making it a single point of failure and unsuitable for a multi-instance deployment. A production-grade RLS would require a distributed key-value store like Redis to maintain counters and rate limit configurations, ensuring consistency across RLS replicas. Furthermore, the logic for adapting the limits is simplistic; a more advanced system would consume metrics from a system like Prometheus, analyzing error rates and latency percentiles of the upstream Flask services to make more intelligent, proactive adjustments to traffic shaping policies. While this architecture successfully decouples resilience logic from the application, the responsibility for maintaining the health of the core services ultimately still lies with the application code itself; Envoy provides a powerful shield, not a cure for underlying instability.