Implementing Adaptive Rate Limiting and Circuit Breaking for Flask Services with Envoy Proxy


A set of Flask-based microservices were exhibiting classic cascading failure symptoms. The order-service depended on both a stable user-service and a notoriously fragile, third-party legacy inventory-service. Under even moderate load, the inventory-service would begin to throw 503 Service Unavailable errors, which in turn caused requests to the order-service to hang and eventually time out. This locked up worker threads, leading to a complete system stall. The initial mandate was clear: stabilize the system without undertaking a full rewrite of the legacy inventory service, a task for which we had neither the time nor the resources.

The first iteration was to simply put a proxy in front of everything. This provided a unified entry point but did nothing to solve the underlying failure propagation. The core pain point remained: a fault in one downstream service was crippling the entire request lifecycle. The initial concept, therefore, evolved into deploying an intelligent edge proxy that could act as a control plane, isolating failures and managing traffic flow based on the real-time health of the backend services.

Envoy Proxy was selected over alternatives like Nginx or HAProxy for three primary reasons. First, its dynamic configuration capabilities via xDS APIs are unparalleled, allowing for configuration changes without disruptive reloads. Second, its rich, built-in support for advanced resilience patterns like circuit breaking, outlier detection, and sophisticated rate limiting is production-proven. Finally, its first-class observability, exposing a wealth of statistics, was crucial for diagnosing the very problems we were trying to solve. Implementing this logic in application-level middleware within each Flask service was dismissed as it would lead to code duplication, language-specific implementations, and tighter coupling between business logic and infrastructure concerns.

Baseline Architecture: The Point of Failure

To replicate the failure scenario, we’ll define a docker-compose.yml that orchestrates our services and the Envoy proxy.

# docker-compose.yml
version: '3.8'

services:
  user-service:
    build: ./user-service
    ports:
      - "5001:5000"
    environment:
      - FLASK_APP=app.py

  inventory-service:
    build: ./inventory-service
    ports:
      - "5002:5000"
    environment:
      - FLASK_APP=app.py

  order-service:
    build: ./order-service
    ports:
      - "5003:5000"
    environment:
      - FLASK_APP=app.py
      # Route downstream calls through Envoy so its resilience policies
      # (outlier detection, rate limiting) apply to them as well.
      - USER_SERVICE_URL=http://envoy:10000
      - INVENTORY_SERVICE_URL=http://envoy:10000

  envoy:
    image: envoyproxy/envoy:v1.27.0
    ports:
      - "10000:10000" # Listener port
      - "9901:9901"   # Admin port
    volumes:
      - ./envoy.yaml:/etc/envoy/envoy.yaml:ro

The Flask services are minimal. The user-service is always stable. The inventory-service is designed to be fragile.

user-service/app.py:

# user-service/app.py
import logging
from flask import Flask, jsonify

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)

@app.route('/users/<user_id>')
def get_user(user_id):
    """A consistently reliable service."""
    app.logger.info(f"User service queried for user_id: {user_id}")
    return jsonify({"user_id": user_id, "name": "John Doe", "status": "active"})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
# user-service/Dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["flask", "run", "--host=0.0.0.0"]
# user-service/requirements.txt
Flask==2.2.2

inventory-service/app.py:
This service simulates failure by tracking request counts and returning 503 errors after a threshold is breached.

# inventory-service/app.py
import logging
import random
import threading
from flask import Flask, jsonify, abort

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)

# Simulate a fragile service with a request counter
REQUEST_COUNT = 0
FAILURE_THRESHOLD = 5
LOCK = threading.Lock()

@app.route('/inventory/<item_id>')
def get_inventory(item_id):
    """
    This service becomes unreliable under load.
    It will start returning 503 errors after 5 requests and occasionally fail.
    """
    global REQUEST_COUNT
    with LOCK:
        REQUEST_COUNT += 1
        current_count = REQUEST_COUNT

    app.logger.info(f"Inventory service queried for item_id: {item_id}. Request count: {current_count}")

    if current_count > FAILURE_THRESHOLD:
        app.logger.error("Service overloaded! Returning 503.")
        # Reset counter to allow recovery after a while
        if current_count > FAILURE_THRESHOLD + 10:
            with LOCK:
                REQUEST_COUNT = 0
        abort(503, description="Inventory service is currently overloaded.")

    # Simulate random transient failures
    if random.random() < 0.1: # 10% chance of random failure
        app.logger.warning("Simulating transient failure.")
        abort(503, description="Transient database connection error.")

    return jsonify({"item_id": item_id, "stock": 100})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

The Dockerfile and requirements for inventory-service are identical to user-service.

order-service/app.py:
This service orchestrates calls to the other two.

# order-service/app.py
import os
import logging
import requests
from flask import Flask, jsonify, abort

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)

USER_SERVICE_URL = os.getenv("USER_SERVICE_URL")
INVENTORY_SERVICE_URL = os.getenv("INVENTORY_SERVICE_URL")

@app.route('/orders/create/<user_id>/<item_id>')
def create_order(user_id, item_id):
    """
    Creates an order by coordinating with user and inventory services.
    A failure in the inventory service will cause this endpoint to fail.
    """
    try:
        app.logger.info(f"Fetching user: {user_id}")
        user_resp = requests.get(f"{USER_SERVICE_URL}/users/{user_id}", timeout=2)
        user_resp.raise_for_status()
        user_data = user_resp.json()

        app.logger.info(f"Checking inventory for item: {item_id}")
        inv_resp = requests.get(f"{INVENTORY_SERVICE_URL}/inventory/{item_id}", timeout=2)
        inv_resp.raise_for_status() # This is where the failure will propagate
        inv_data = inv_resp.json()

    except requests.exceptions.RequestException as e:
        app.logger.error(f"Error communicating with downstream service: {e}")
        abort(504, description="Gateway timeout communicating with a backend service.")

    app.logger.info("Successfully created order.")
    return jsonify({
        "status": "order_created",
        "user": user_data,
        "inventory": inv_data
    })

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

The Dockerfile and requirements (adding requests) for order-service are straightforward.
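
For reference, a plausible order-service/requirements.txt is shown below; the pinned requests version is an assumption, and the Dockerfile is identical to the one above.

# order-service/requirements.txt
Flask==2.2.2
requests==2.28.2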

Our initial envoy.yaml is a simple pass-through router.

# envoy.yaml (v1 - Basic Routing)
admin:
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9901

static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 10000
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: backend
              domains: ["*"]
              routes:
              - match: { prefix: "/users/" }
                route: { cluster: user_service_cluster }
              - match: { prefix: "/inventory/" }
                route: { cluster: inventory_service_cluster }
              - match: { prefix: "/orders/" }
                route: { cluster: order_service_cluster }
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router

  clusters:
  - name: user_service_cluster
    connect_timeout: 0.25s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: user_service_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: user-service
                port_value: 5000
  - name: inventory_service_cluster
    connect_timeout: 0.25s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: inventory_service_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: inventory-service
                port_value: 5000
  - name: order_service_cluster
    connect_timeout: 0.25s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: order_service_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: order-service
                port_value: 5000

Running docker-compose up and hitting http://localhost:10000/orders/create/123/abc with a simple load test script (e.g., for i in {1..10}; do curl ...; done) quickly demonstrates the problem. After five successful requests, the order-service starts returning 504 Gateway Timeout because the inventory-service is returning 503.
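
A small Python equivalent of that curl loop is useful for re-running the same check after each configuration change (the script name is arbitrary):

# loadtest.py - hypothetical helper: fire sequential requests at the order endpoint
# through Envoy and print each status code. Assumes the docker-compose stack is up.
import requests

URL = "http://localhost:10000/orders/create/123/abc"

for i in range(1, 11):
    try:
        resp = requests.get(URL, timeout=5)
        print(f"request {i}: {resp.status_code}")
    except requests.exceptions.RequestException as exc:
        print(f"request {i}: failed ({exc})")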

Step 1: Implementing Circuit Breaking with Outlier Detection

The first line of defense is to stop sending traffic to an upstream service that is clearly failing. This prevents the order-service from wasting resources on requests that are doomed to fail. Envoy’s outlier_detection feature implements this circuit-breaking pattern. We modify the inventory_service_cluster configuration in envoy.yaml.

# envoy.yaml (v2 - Adding Outlier Detection)
...
  clusters:
  - name: user_service_cluster
    ...
  - name: inventory_service_cluster
    connect_timeout: 0.25s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: inventory_service_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: inventory-service
                port_value: 5000
    # Key addition for circuit breaking
    outlier_detection:
      consecutive_5xx: 3 # Eject after 3 consecutive 5xx responses
      interval: 10s # Check health every 10 seconds
      base_ejection_time: 30s # Eject for a minimum of 30 seconds
      max_ejection_percent: 100 # Allow ejecting all hosts in the cluster
      enforcing_consecutive_5xx: 100 # Enforce ejection 100% of the time this rule trips
      split_external_local_origin_errors: true

  - name: order_service_cluster
    ...

A breakdown of these parameters is critical for production tuning:

  • consecutive_5xx: 3: This is our trigger. If Envoy receives three 5xx responses in a row from an endpoint, it considers it unhealthy. In a real-world project, this value must be balanced. Too low, and you risk ejecting hosts due to transient blips; too high, and you react too slowly.
  • interval: 10s: Envoy checks the health of hosts every 10 seconds. This determines the frequency of the analysis that can lead to an ejection.
  • base_ejection_time: 30s: Once a host is ejected, it will be kept out of the load-balancing pool for at least 30 seconds. The actual ejection time increases with subsequent ejections.
  • max_ejection_percent: 100: This is vital. It allows Envoy to eject every endpoint in the inventory-service cluster if necessary. The default cap is 10% of the cluster; raising it to 100 makes explicit that the entire (here, single-host) cluster may be taken out of rotation.
  • split_external_local_origin_errors: true: This is a subtle but important configuration. It tells Envoy to distinguish between errors generated by the upstream service (like our Flask 503) and errors generated by Envoy itself (like a connection timeout). We only want to trigger on genuine upstream failures.

After restarting Envoy with this configuration, re-running the load test yields a different result. The first few requests to /orders/create/123/abc succeed. Then, as the inventory-service begins to fail, Envoy ejects it, and the order-service's calls to /inventory/ (which are routed through Envoy) are rejected immediately instead of tying up worker threads waiting on a failing backend. Envoy's access logs (if enabled) show the UH response flag (no healthy upstream hosts), and its admin stats (:9901/stats) show counters such as cluster.inventory_service_cluster.outlier_detection.ejections_enforced_total incrementing. The system is now failing fast instead of failing slow, which is a significant improvement.
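
One quick way to confirm the ejections is to scrape the admin endpoint and filter for the outlier detection counters; a minimal sketch (the file name is arbitrary):

# check_ejections.py - hypothetical helper: print outlier detection stats for the
# inventory cluster from Envoy's admin endpoint (port 9901 as configured above).
import requests

stats = requests.get("http://localhost:9901/stats", timeout=5).text

for line in stats.splitlines():
    if "inventory_service_cluster.outlier_detection" in line:
        print(line)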

sequenceDiagram
    participant Client
    participant Envoy
    participant OrderService
    participant InventoryService

    Client->>+Envoy: GET /orders/create/... (1)
    Envoy->>+OrderService: GET /orders/create/... (1)
    OrderService->>+Envoy: GET /inventory/... (1)
    Envoy->>+InventoryService: GET /inventory/... (1)
    InventoryService-->>-Envoy: 200 OK
    Envoy-->>-OrderService: 200 OK
    OrderService-->>-Envoy: 200 OK
    Envoy-->>-Client: 200 OK

    Client->>Envoy: GET /orders/create/... (2-5)
    Note right of InventoryService: Service hits its failure threshold

    Client->>+Envoy: GET /orders/create/... (6)
    Envoy->>+OrderService: GET /orders/create/... (6)
    OrderService->>+Envoy: GET /inventory/... (6)
    Envoy->>+InventoryService: GET /inventory/... (6)
    InventoryService-->>-Envoy: 503 Overloaded
    Envoy-->>-OrderService: 503 Overloaded
    OrderService-->>-Envoy: 504 Gateway Timeout
    Envoy-->>-Client: 504 Gateway Timeout

    Note over Envoy,InventoryService: After 3 consecutive 5xx, Envoy ejects the inventory-service host.

    Client->>+Envoy: GET /orders/create/... (7)
    Envoy->>+OrderService: GET /orders/create/... (7)
    OrderService->>+Envoy: GET /inventory/... (7)
    Envoy-->>-OrderService: 503 no healthy upstream (UH flag)
    OrderService-->>-Envoy: 504 Gateway Timeout
    Envoy-->>-Client: 504 Gateway Timeout
    Note right of Envoy: Envoy rejects the inventory call immediately,<br/>so the order-service fails fast instead of hanging.
Step 2: From Blunt Instrument to Adaptive Control

Circuit breaking protects the system when a service is already dead. The next step is to prevent the service from dying in the first place. A simple rate limit is a start. Envoy’s local rate limit filter is easy to configure but has a major drawback: it’s static.

# In the HttpConnectionManager filter chain
- name: envoy.filters.http.local_ratelimit
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
    stat_prefix: http_local_rate_limiter
    token_bucket:
      max_tokens: 5
      tokens_per_fill: 5
      fill_interval: 10s
    filter_enabled:
      runtime_key: local_rate_limit_enabled
      default_value:
        numerator: 100
        denominator: HUNDRED
    filter_enforced:
      runtime_key: local_rate_limit_enforced
      default_value:
        numerator: 100
        denominator: HUNDRED

This configuration would apply a crude limit of 5 requests every 10 seconds to all traffic. This is not what we want. We need to selectively and dynamically rate-limit traffic only to the fragile inventory-service, and ideally, adjust that limit based on its health.

This requires moving from local rate limiting to global rate limiting, which involves an external gRPC service that Envoy queries for every request to make a rate limit decision. The powerful part is that we can implement any logic we want inside this Rate Limit Service (RLS). We will build this RLS as a small standalone Python gRPC service that sits alongside the Flask applications.

First, we define the gRPC service contract. Envoy calls the ShouldRateLimit method of envoy.service.ratelimit.v3.RateLimitService, so our server must expose exactly that fully qualified service. The file below is a trimmed-down, wire-compatible copy of Envoy's rls.proto (same package, method, and field numbers); for production use you would generate bindings from the official data-plane-api protos instead.

ratelimit.proto:

syntax = "proto3";

package pb;

import "google/protobuf/struct.proto";

service RateLimitService {
  rpc ShouldRateLimit(RateLimitRequest) returns (RateLimitResponse) {}
}

message RateLimitRequest {
  string domain = 1;
  repeated RateLimitDescriptor descriptors = 2;
  uint32 hits_addend = 3;
}

message RateLimitDescriptor {
  message Entry {
    string key = 1;
    string value = 2;
  }
  repeated Entry entries = 1;
}

message RateLimitResponse {
  enum Code {
    UNKNOWN = 0;
    OK = 1;
    OVER_LIMIT = 2;
  }
  Code overall_code = 1;
  repeated RateLimit.Status statuses = 2;

  message RateLimit {
    message Status {
      Code code = 1;
      RateLimit current_limit = 2;
      uint32 limit_remaining = 3;
    }

    message Unit {
      enum Enum {
        UNKNOWN = 0;
        SECOND = 1;
        MINUTE = 2;
        HOUR = 3;
        DAY = 4;
      }
      Enum unit = 1;
    }
    uint32 requests_per_unit = 1;
    Unit unit = 2;
  }
}
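
The ratelimit_pb2 and ratelimit_pb2_grpc modules imported in the next file are generated from this definition with grpcio-tools. One way to run the code generation from Python (equivalent to invoking python -m grpc_tools.protoc), assuming grpcio-tools is installed, is:

# rate-limit-service/generate_protos.py
# Generates ratelimit_pb2.py and ratelimit_pb2_grpc.py next to ratelimit.proto.
from grpc_tools import protoc

protoc.main([
    "grpc_tools.protoc",
    "-I.",
    "--python_out=.",
    "--grpc_python_out=.",
    "ratelimit.proto",
])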

Next, we build the Python gRPC service. It maintains simple in-memory state (request counts and per-service limits) and decides whether each request is within its limit; the hooks for adjusting limits based on observed error rates are left as stubs.

rate-limit-service/app.py:

# rate-limit-service/app.py
import grpc
import logging
from concurrent import futures
import time
import threading

# Import generated gRPC code
import ratelimit_pb2
import ratelimit_pb2_grpc

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# In-memory store for adaptive logic. In production, use Redis or a similar distributed store.
ERROR_RATES = {"inventory_service": 0}
REQUEST_COUNTS = {"inventory_service": 0}
CURRENT_LIMITS = {"inventory_service": 10} # Start with a high limit
LOCK = threading.Lock()

class RateLimitService(ratelimit_pb2_grpc.RateLimitServiceServicer):
    def ShouldRateLimit(self, request, context):
        """
        Main logic for adaptive rate limiting.
        """
        overall_code = ratelimit_pb2.RateLimitResponse.Code.OK

        for descriptor in request.descriptors:
            service_key = None
            for entry in descriptor.entries:
                if entry.key == "service":
                    service_key = entry.value

            if service_key == "inventory_service":
                with LOCK:
                    # Increment request count for this service
                    REQUEST_COUNTS[service_key] += 1
                    
                    # Get the current dynamic limit
                    limit = CURRENT_LIMITS.get(service_key, 10)
                    
                    # Basic time window logic (requests per 10 seconds)
                    # A real implementation would use a sliding window algorithm.
                    if REQUEST_COUNTS[service_key] > limit:
                        logging.warning(f"'{service_key}' is OVER_LIMIT. Current count: {REQUEST_COUNTS[service_key]}, Limit: {limit}")
                        overall_code = ratelimit_pb2.RateLimitResponse.Code.OVER_LIMIT
                    else:
                        logging.info(f"'{service_key}' is OK. Current count: {REQUEST_COUNTS[service_key]}, Limit: {limit}")
                        overall_code = ratelimit_pb2.RateLimitResponse.Code.OK

        return ratelimit_pb2.RateLimitResponse(overall_code=overall_code)

def monitor_and_adjust_limits():
    """
    A background thread to simulate monitoring backend health and adjusting limits.
    In a real system, this would subscribe to a metrics source like Prometheus.
    Here, we'll just periodically reset our request counters to simulate a time window.
    """
    global REQUEST_COUNTS
    while True:
        time.sleep(10) # Our time window is 10 seconds
        with LOCK:
            logging.info("Resetting request counters for time window.")
            # A more sophisticated logic would adjust CURRENT_LIMITS based on observed error rates
            # For this example, we'll just reset counts.
            for key in REQUEST_COUNTS:
                REQUEST_COUNTS[key] = 0
        logging.info(f"Current limits: {CURRENT_LIMITS}")


def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    ratelimit_pb2_grpc.add_RateLimitServiceServicer_to_server(RateLimitService(), server)
    server.add_insecure_port('[::]:50051')
    server.start()
    logging.info("gRPC Rate Limit Service started on port 50051.")
    
    # Start the monitoring thread
    monitor_thread = threading.Thread(target=monitor_and_adjust_limits, daemon=True)
    monitor_thread.start()
    
    server.wait_for_termination()

if __name__ == '__main__':
    serve()

We also need to update docker-compose.yml to build and run this new service.
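
A minimal sketch of that addition follows; the service name must match the address used in the rate_limit_cluster defined below.

# docker-compose.yml (excerpt - add under services:)
  rate-limit-service:
    build: ./rate-limit-service
    # No host port mapping is needed; Envoy reaches the container at
    # rate-limit-service:50051 on the compose network.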

Finally, we update envoy.yaml to use this global rate limiter.

# envoy.yaml (v3 - Adaptive Global Rate Limiting)
...
static_resources:
  listeners:
  - name: listener_0
    ...
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          ...
          http_filters:
          # The rate limit filter must come before the router filter.
          - name: envoy.filters.http.ratelimit
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
              domain: app_edge # A namespace for rate limit rules
              rate_limit_service:
                grpc_service:
                  envoy_grpc:
                    cluster_name: rate_limit_cluster
                transport_api_version: V3
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router

          route_config:
            name: local_route
            virtual_hosts:
            - name: backend
              domains: ["*"]
              routes:
              - match: { prefix: "/users/" }
                route: { cluster: user_service_cluster }
              
              - match: { prefix: "/inventory/" }
                route: 
                  cluster: inventory_service_cluster
                  # This section tells Envoy what descriptors to send to the RLS for this route.
                  rate_limits:
                  - actions:
                    - request_headers:
                        header_name: ":path"
                        descriptor_key: "path"
                    - generic_key:
                        descriptor_value: "inventory_service"
                        descriptor_key: "service"

              - match: { prefix: "/orders/" }
                route: { cluster: order_service_cluster }

  clusters:
    ...
    # New cluster definition for our gRPC Rate Limit Service
    - name: rate_limit_cluster
      connect_timeout: 0.25s
      type: STRICT_DNS
      lb_policy: ROUND_ROBIN
      # Important: This must be HTTP/2 for gRPC
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
          explicit_http_config:
            http2_protocol_options: {}
      load_assignment:
        cluster_name: rate_limit_cluster
        endpoints:
        - lb_endpoints:
          - endpoint:
              address:
                socket_address:
                  address: rate-limit-service
                  port_value: 50051

The key change is in the route configuration for /inventory/. The rate_limits section defines a set of actions that generate descriptors to be sent to our RLS. Here, we’re sending a generic key-value pair {"service": "inventory_service"}. Our RLS uses this key to apply the correct logic.
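
To make the contract concrete, the request our RLS receives for a call to /inventory/abc looks roughly like the message constructed by hand below (illustration only; Envoy builds this for us from the two actions on the route):

# illustration_only.py - approximate shape of the RateLimitRequest Envoy sends for
# GET /inventory/abc: one descriptor with two entries, in the order of the actions.
import ratelimit_pb2

example = ratelimit_pb2.RateLimitRequest(
    domain="app_edge",
    descriptors=[
        ratelimit_pb2.RateLimitDescriptor(entries=[
            ratelimit_pb2.RateLimitDescriptor.Entry(key="path", value="/inventory/abc"),
            ratelimit_pb2.RateLimitDescriptor.Entry(key="service", value="inventory_service"),
        ])
    ],
)
print(example)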

With this final configuration, the system behavior is now much more robust.

  1. When traffic to /inventory/ is low, requests pass through Envoy, the RLS returns OK, and the backend service responds.
  2. As traffic increases, the RLS starts returning OVER_LIMIT once its internal counter exceeds the dynamic threshold. Envoy sees this response and immediately returns a 429 Too Many Requests to the caller, protecting the inventory-service from overload (a short verification script follows the diagram below).
  3. If a burst of errors still gets through and the inventory-service starts failing, the outlier_detection circuit breaker will kick in as a final safety net, ejecting the service and preventing cascading failure.

graph TD
    subgraph Client Request Flow
        Client -- "GET /inventory/abc" --> Envoy;
    end

    subgraph Envoy Processing
        Envoy -- "1. Match Route" --> RouteConfig;
        RouteConfig -- "2. Generate Descriptors<br/>{'service': 'inventory_service'}" --> RateLimitFilter;
        RateLimitFilter -- "3. gRPC Request" --> RLS[Rate Limit Service];
        RLS -- "4. Decision (OK/OVER_LIMIT)" --> RateLimitFilter;
        subgraph "Decision Path"
            direction LR
            RateLimitFilter -- "If OK" --> Router;
            RateLimitFilter -- "If OVER_LIMIT" --> Return429[Return 429];
        end
        Router -- "5. Forward to Upstream" --> InventoryCluster;
    end

    subgraph Backend
        InventoryCluster -- "HTTP GET" --> FlaskInventory[Flask Inventory Service];
    end

    subgraph Circuit Breaker Logic
        direction TB
        FlaskInventory -- "Monitors 5xx errors" --> OutlierDetection[Outlier Detection];
        OutlierDetection -- "If unhealthy" --> EjectHost{Eject Host};
        EjectHost -- "Blocks traffic" --> InventoryCluster;
    end

    style RLS fill:#f9f,stroke:#333,stroke-width:2px
    style OutlierDetection fill:#ccf,stroke:#333,stroke-width:2px
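
To see the rate limiter working end to end, a short, hypothetical script can hammer the inventory route through Envoy and print the status codes. With the RLS's default limit of 10 per 10-second window, later requests should come back as 429; 503s may also appear once the backend's own failure threshold trips.

# verify_ratelimit.py - hypothetical check: send 15 requests to /inventory/ via Envoy
# and tally the responses (200 allowed, 429 rate limited, 503 backend overloaded).
from collections import Counter
import requests

results = Counter()
for i in range(15):
    resp = requests.get("http://localhost:10000/inventory/abc", timeout=5)
    print(f"request {i + 1}: {resp.status_code}")
    results[resp.status_code] += 1

print(dict(results))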

The current implementation of the Rate Limit Service is a proof-of-concept. Its state is ephemeral and local to a single instance, making it a single point of failure and unsuitable for a multi-instance deployment. A production-grade RLS would require a distributed key-value store like Redis to maintain counters and rate limit configurations, ensuring consistency across RLS replicas. Furthermore, the logic for adapting the limits is simplistic; a more advanced system would consume metrics from a system like Prometheus, analyzing error rates and latency percentiles of the upstream Flask services to make more intelligent, proactive adjustments to traffic shaping policies. While this architecture successfully decouples resilience logic from the application, the responsibility for maintaining the health of the core services ultimately still lies with the application code itself; Envoy provides a powerful shield, not a cure for underlying instability.
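
As a pointer in that direction, a shared fixed-window counter backed by Redis takes only a few lines. The sketch below assumes a reachable Redis instance and illustrative key names, and deliberately ignores the sliding-window and limit-adaptation concerns discussed above:

# redis_window.py - sketch of a shared, fixed-window counter for a multi-instance RLS.
import time
import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)

def over_limit(service_key: str, limit: int, window_seconds: int = 10) -> bool:
    """Increment the counter for the current window and report whether it exceeds the limit."""
    window = int(time.time()) // window_seconds
    key = f"ratelimit:{service_key}:{window}"
    count = r.incr(key)                        # atomic across RLS replicas
    if count == 1:
        r.expire(key, window_seconds * 2)      # let stale windows age out
    return count > limit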

