A critical downstream payment gateway API began exhibiting intermittent failures, with latency spiking from 50ms to over 5 seconds before timing out. Our horizontally scaled C++ fleet, designed for high throughput, suffered almost immediately: each request thread blocked waiting for the failing API, quickly exhausting the thread pool. The result was a cascading failure that brought down our entire checkout service. The immediate fix was a manual scale-up and frantic restarts; the long-term solution required something more robust than simple timeouts.
A circuit breaker was the obvious pattern, but a simple in-memory implementation wouldn’t work. With a dozen instances of our service running, one instance wouldn’t know that another had already detected the downstream failure. This would lead to a “thundering herd” problem where each instance would independently batter the failing service before tripping its local breaker. We needed a distributed circuit breaker with shared state.
Our initial concept was to build this logic directly into our C++ service. The key requirements were:
- Low Latency: The check before each downstream call must add negligible overhead.
- Shared State: All service instances must agree on the state of the circuit (CLOSED, OPEN, HALF_OPEN) for a given downstream dependency.
- Atomic Operations: State transitions must be atomic to prevent race conditions between instances.
- Operational Visibility: The operations team needs a real-time view of which circuits are open without tailing logs across a dozen machines.
This led to the technology selection. The core logic had to be in C++ for performance, using Boost.Asio for non-blocking I/O. For the shared state, we debated between Redis and Memcached. While Redis offers more features, Memcached’s raw speed for simple GET/SET/INCR operations was a better fit. Our circuit breaker state is ephemeral and doesn’t require persistence, making Memcached’s volatility acceptable and its simplicity an advantage. For operational visibility, a web dashboard was the clear choice. We chose Ant Design with React for its rich set of data-display components, allowing us to build a professional-looking internal tool quickly. Finally, Caddy was selected as the reverse proxy to terminate TLS and route traffic to the C++ service instances, thanks to its dead-simple configuration and automatic HTTPS.
The overall architecture looks like this:
graph TD
    subgraph User Traffic
        A[Client]
    end
    subgraph Infrastructure
        B[Caddy Reverse Proxy]
    end
    subgraph Our Application Layer
        C1[C++ Service Instance 1]
        C2[C++ Service Instance 2]
        C3[C++ Service Instance N]
    end
    subgraph Shared State & Downstream
        D[Memcached Cluster]
        E[Failing Downstream API]
    end
    subgraph Monitoring
        F[Ant Design Dashboard]
    end
    A --> B
    B --> C1
    B --> C2
    B --> C3
    C1 <--> D
    C2 <--> D
    C3 <--> D
    C1 --> E
    C2 --> E
    C3 --> E
    F -- Polls for status --> B
The first step was to implement the core state machine in C++. We defined the three states and the configuration for a given circuit.
circuit_breaker.hpp:
#pragma once
#include <string>
#include <chrono>
#include <atomic>
#include <memory>
#include <mutex>
#include <functional>
#include <libmemcached/memcached.h>
namespace resilience {
enum class CircuitState {
CLOSED,
OPEN,
HALF_OPEN
};
// Convert a CircuitState to its string form. Declared here because the
// HTTP server's /status endpoint also needs it.
const char* state_to_string(CircuitState state);
// Configuration for a single circuit breaker
struct CircuitBreakerConfig {
std::string service_name;
// Number of failures to trip the circuit
uint32_t failure_threshold;
// Time in seconds to stay open before moving to half-open
std::chrono::seconds open_state_duration;
// Number of successful requests in half-open to close the circuit
uint32_t half_open_success_threshold;
};
class DistributedCircuitBreaker {
public:
DistributedCircuitBreaker(CircuitBreakerConfig config, memcached_st* memcached_client);
~DistributedCircuitBreaker();
// The main function to execute a protected call
bool execute(std::function<bool()> operation);
// For monitoring endpoint
CircuitState get_current_state();
uint64_t get_current_failures();
private:
// Key generation for memcached
std::string get_state_key() const;
std::string get_failure_count_key() const;
// State transition logic
void trip_circuit();
void attempt_reset();
void close_circuit();
// Internal state fetch
CircuitState fetch_remote_state();
CircuitBreakerConfig config_;
memcached_st* memc_client_; // Not owned by this class
std::string service_key_prefix_;
};
} // namespace resilience
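Before diving into the implementation, here is a minimal usage sketch of how a call site is expected to wrap a downstream call. This is not code from the service; the fetch_payment_status stub and the single-server connection are placeholders:
#include "circuit_breaker.hpp"
#include <chrono>
#include <iostream>
// Hypothetical downstream call -- stands in for the real payment client.
bool fetch_payment_status() { return true; }
int main() {
    memcached_st* memc = memcached_create(nullptr);
    memcached_server_add(memc, "127.0.0.1", 11211);
    resilience::CircuitBreakerConfig cfg{"payment_gateway", 5, std::chrono::seconds(10), 2};
    resilience::DistributedCircuitBreaker breaker(cfg, memc);
    // execute() runs the lambda only if the circuit allows it; the lambda's
    // boolean result is what counts as success or failure.
    bool ok = breaker.execute([] { return fetch_payment_status(); });
    std::cout << (ok ? "call allowed and succeeded" : "call failed or circuit open") << std::endl;
    memcached_free(memc);
    return 0;
}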
The implementation required careful handling of interactions with Memcached. A naive get followed by a set is not atomic and would create severe race conditions under load. The key was to leverage Memcached’s memcached_increment_with_initial for failure counts and memcached_cas (check-and-set) for state transitions.
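To make the race concrete, here is a standalone sketch (not part of the service code) contrasting the two approaches; the key name and the 30-second TTL are illustrative, and it assumes an already-connected memcached_st*:
#include <libmemcached/memcached.h>
#include <cstdint>
#include <cstdlib>
#include <string>
// RACY: two instances can read the same count, both write count+1,
// and one failure is silently lost -- the circuit trips late or never.
void record_failure_racy(memcached_st* memc, const std::string& key) {
    size_t len = 0;
    uint32_t flags = 0;
    memcached_return_t rc;
    char* raw = memcached_get(memc, key.c_str(), key.length(), &len, &flags, &rc);
    uint64_t count = (rc == MEMCACHED_SUCCESS && raw) ? std::strtoull(raw, nullptr, 10) : 0;
    if (raw) free(raw);
    std::string next = std::to_string(count + 1);  // read-modify-write gap lives here
    memcached_set(memc, key.c_str(), key.length(), next.c_str(), next.length(), 30, 0);
}
// ATOMIC: the server performs the increment, so concurrent instances can never
// overwrite each other's updates (libmemcached issues this over the binary protocol).
uint64_t record_failure_atomic(memcached_st* memc, const std::string& key) {
    uint64_t count = 0;
    memcached_increment_with_initial(memc, key.c_str(), key.length(),
                                     1 /*offset*/, 1 /*initial*/, 30 /*ttl*/, &count);
    return count;
}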
circuit_breaker.cpp:
#include "circuit_breaker.hpp"
#include <iostream>
#include <vector>
namespace resilience {
// Helper to convert enum to string for storage
const char* state_to_string(CircuitState state) {
switch (state) {
case CircuitState::OPEN: return "OPEN";
case CircuitState::HALF_OPEN: return "HALF_OPEN";
case CircuitState::CLOSED: return "CLOSED";
}
return "CLOSED";
}
// Helper to convert string from storage to enum
CircuitState string_to_state(const char* str) {
if (strcmp(str, "OPEN") == 0) return CircuitState::OPEN;
if (strcmp(str, "HALF_OPEN") == 0) return CircuitState::HALF_OPEN;
return CircuitState::CLOSED;
}
DistributedCircuitBreaker::DistributedCircuitBreaker(CircuitBreakerConfig config, memcached_st* memcached_client)
: config_(std::move(config)), memc_client_(memcached_client) {
service_key_prefix_ = "cb:" + config_.service_name;
}
DistributedCircuitBreaker::~DistributedCircuitBreaker() = default;
std::string DistributedCircuitBreaker::get_state_key() const {
return service_key_prefix_ + ":state";
}
std::string DistributedCircuitBreaker::get_failure_count_key() const {
return service_key_prefix_ + ":failures";
}
bool DistributedCircuitBreaker::execute(std::function<bool()> operation) {
CircuitState current_state = fetch_remote_state();
if (current_state == CircuitState::OPEN) {
// Circuit is open, check if it's time to move to half-open
attempt_reset();
return false; // Fail fast
}
bool success = operation();
if (success) {
if (current_state == CircuitState::HALF_OPEN) {
uint64_t success_count = 0;
// In half-open, we increment a success counter instead.
// Using the same key as failures but a different interpretation.
memcached_return_t rc = memcached_increment_with_initial(
memc_client_,
get_failure_count_key().c_str(),
get_failure_count_key().length(),
1, 1, 0, &success_count);
if (rc != MEMCACHED_SUCCESS) {
std::cerr << "Memcached increment failed in HALF_OPEN: " << memcached_strerror(memc_client_, rc) << std::endl;
} else if (success_count >= config_.half_open_success_threshold) {
close_circuit();
}
} else {
// In closed state, a success doesn't need to change anything unless
// we were previously failing. We could reset the counter, but letting it
// expire via TTL is simpler for now.
}
return true;
} else {
// Operation failed
uint64_t new_failure_count = 0;
memcached_return_t rc = memcached_increment_with_initial(
memc_client_,
get_failure_count_key().c_str(),
get_failure_count_key().length(),
1, 1, config_.open_state_duration.count(), &new_failure_count);
if (rc != MEMCACHED_SUCCESS) {
std::cerr << "Memcached increment failed in CLOSED: " << memcached_strerror(memc_client_, rc) << std::endl;
} else if (new_failure_count >= config_.failure_threshold) {
trip_circuit();
}
return false;
}
}
CircuitState DistributedCircuitBreaker::get_current_state() {
return fetch_remote_state();
}
uint64_t DistributedCircuitBreaker::get_current_failures() {
uint64_t value = 0;
size_t value_length = 0; // memcached_get writes the value length here; it must not be null
uint32_t flags = 0;
memcached_return_t rc;
char* result = memcached_get(memc_client_,
get_failure_count_key().c_str(),
get_failure_count_key().length(),
&value_length, &flags, &rc);
if (rc == MEMCACHED_SUCCESS && result) {
value = std::stoull(std::string(result, value_length));
free(result);
}
return value;
}
CircuitState DistributedCircuitBreaker::fetch_remote_state() {
size_t value_length;
uint32_t flags;
memcached_return_t error;
char* value = memcached_get(memc_client_, get_state_key().c_str(), get_state_key().length(), &value_length, &flags, &error);
if (error == MEMCACHED_SUCCESS && value) {
CircuitState state = string_to_state(value);
free(value);
return state;
}
// Default to CLOSED if key doesn't exist or on error
return CircuitState::CLOSED;
}
void DistributedCircuitBreaker::trip_circuit() {
// Attempt to atomically set state to OPEN
// This is a critical section. A simple SET could cause a race condition
// where multiple instances trip the breaker. CAS is safer.
// We get the current value to obtain its CAS value.
std::string state_key = get_state_key();
const char* keys[1] = { state_key.c_str() };
size_t key_lengths[1] = { state_key.length() };
memcached_result_st result;
memcached_result_create(memc_client_, &result);
memcached_return_t rc = memcached_mget(memc_client_, keys, key_lengths, 1);
// memcached_fetch_result returns the populated result (or nullptr); the status lands in rc.
memcached_fetch_result(memc_client_, &result, &rc);
const char* new_state = state_to_string(CircuitState::OPEN);
if (rc == MEMCACHED_SUCCESS) {
uint64_t cas = memcached_result_cas(&result);
// The key here is using the CAS value. This SET will only succeed if the value
// on the server has not been changed by another process since our fetch.
rc = memcached_cas(memc_client_,
state_key.c_str(), state_key.length(),
new_state, strlen(new_state),
config_.open_state_duration.count(), 0, cas);
} else if (rc == MEMCACHED_NOTFOUND || rc == MEMCACHED_END) {
// No state key exists yet (first trip). 'add' only succeeds for one instance,
// which gives us the same "exactly one winner" guarantee as CAS.
rc = memcached_add(memc_client_,
state_key.c_str(), state_key.length(),
new_state, strlen(new_state),
config_.open_state_duration.count(), 0);
} else {
std::cerr << "CAS: Failed to fetch state key: " << memcached_strerror(memc_client_, rc) << std::endl;
memcached_result_free(&result);
return;
}
if (rc == MEMCACHED_SUCCESS) {
std::cout << "Circuit for " << config_.service_name << " has been TRIPPED to OPEN.\n";
} else if (rc != MEMCACHED_DATA_EXISTS && rc != MEMCACHED_NOTSTORED) {
// DATA_EXISTS / NOTSTORED mean another instance beat us to it, which is fine.
std::cerr << "CAS: Failed to set state to OPEN: " << memcached_strerror(memc_client_, rc) << std::endl;
}
memcached_result_free(&result);
}
void DistributedCircuitBreaker::attempt_reset() {
// attempt_reset() is called while the last fetch still read OPEN. The 'add' below
// only succeeds if the state key no longer exists, i.e. the OPEN key's TTL expired
// between that fetch and this call. Because 'add' fails while the key is present,
// only the first instance to land in that window can move the circuit to HALF_OPEN;
// every other instance keeps failing fast until it observes the new state.
const char* new_state = state_to_string(CircuitState::HALF_OPEN);
memcached_return_t rc = memcached_add(memc_client_,
get_state_key().c_str(), get_state_key().length(),
new_state, strlen(new_state),
0, 0);
if (rc == MEMCACHED_SUCCESS) {
std::cout << "Circuit for " << config_.service_name << " moved to HALF_OPEN.\n";
// Reset the counter, which is now used for successes
memcached_delete(memc_client_, get_failure_count_key().c_str(), get_failure_count_key().length(), 0);
}
}
void DistributedCircuitBreaker::close_circuit() {
const char* new_state = state_to_string(CircuitState::CLOSED);
// Unconditionally set back to closed.
memcached_return_t rc = memcached_set(memc_client_,
get_state_key().c_str(), get_state_key().length(),
new_state, strlen(new_state),
0, 0);
if (rc == MEMCACHED_SUCCESS) {
std::cout << "Circuit for " << config_.service_name << " has been CLOSED.\n";
// Also remove the counter key to clean up.
memcached_delete(memc_client_, get_failure_count_key().c_str(), get_failure_count_key().length(), 0);
}
}
} // namespace resilience
A pitfall here is managing the memcached_st client connection. A memcached_st handle is not thread-safe, and cpp-httplib serves requests from a pool of worker threads, so a real-world project would use a connection pool (libmemcached ships one in libmemcachedutil) rather than a single shared handle. For this demonstration, a single client passed into the class suffices.
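A rough sketch of what that pooling could look like, assuming libmemcachedutil is available; the option string, pool usage, and the exact header path are assumptions that vary slightly between libmemcached versions:
#include <libmemcached/memcached.h>
#include <libmemcached/memcached_pool.h>  // may live under libmemcachedutil-1.0/ on some installs
#include <cstring>
#include <iostream>
int main() {
    // Build a pool from a libmemcached option string; the options here are illustrative.
    const char* options = "--SERVER=127.0.0.1:11211 --BINARY-PROTOCOL";
    memcached_pool_st* pool = memcached_pool(options, strlen(options));
    if (!pool) {
        std::cerr << "Failed to create memcached pool" << std::endl;
        return 1;
    }
    memcached_return_t rc;
    // A worker thread borrows a connection for the duration of one request...
    memcached_st* memc = memcached_pool_pop(pool, true /* block until one is free */, &rc);
    if (memc != nullptr) {
        // ...hands it to the DistributedCircuitBreaker for that request...
        memcached_pool_push(pool, memc);  // ...and returns it when the request completes.
    }
    memcached_pool_destroy(pool);
    return 0;
}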
Next, we exposed this logic through a simple C++ HTTP server, using cpp-httplib for simplicity. It provides two endpoints: /proxy/<service_name>, which applies the breaker and proxies the request, and /status, which returns a JSON summary of every monitored circuit’s state.
main_server.cpp:
#include "httplib.h"
#include "circuit_breaker.hpp"
#include <iostream>
#include <map>
#include <memory>
#include <nlohmann/json.hpp>
// A mock downstream service client
bool call_downstream_service(const std::string& service_name) {
// Simulate a flaky service
if (service_name == "payment_gateway") {
int random_num = rand() % 10;
// 70% chance of failure
if (random_num < 7) {
std::cout << "Mock call to " << service_name << " FAILED\n";
return false;
}
}
std::cout << "Mock call to " << service_name << " SUCCEEDED\n";
return true;
}
int main() {
srand(time(0));
// ---- Memcached Setup ----
memcached_server_st *servers = NULL;
memcached_st *memc;
memcached_return rc;
memc = memcached_create(NULL);
servers = memcached_server_list_append(servers, "127.0.0.1", 11211, &rc);
rc = memcached_server_push(memc, servers);
memcached_server_list_free(servers);
if (rc != MEMCACHED_SUCCESS) {
std::cerr << "Couldn't add memcached server: " << memcached_strerror(memc, rc) << std::endl;
return 1;
}
// The binary protocol is required for memcached_increment_with_initial,
// and CAS values are only returned when SUPPORT_CAS is enabled.
memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_BINARY_PROTOCOL, 1);
memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_SUPPORT_CAS, 1);
// ---- Circuit Breaker Configuration ----
std::map<std::string, std::unique_ptr<resilience::DistributedCircuitBreaker>> breakers;
resilience::CircuitBreakerConfig payment_config;
payment_config.service_name = "payment_gateway";
payment_config.failure_threshold = 5;
payment_config.open_state_duration = std::chrono::seconds(10);
payment_config.half_open_success_threshold = 2;
breakers["payment_gateway"] = std::make_unique<resilience::DistributedCircuitBreaker>(payment_config, memc);
resilience::CircuitBreakerConfig inventory_config;
inventory_config.service_name = "inventory_service";
inventory_config.failure_threshold = 10;
inventory_config.open_state_duration = std::chrono::seconds(15);
inventory_config.half_open_success_threshold = 5;
breakers["inventory_service"] = std::make_unique<resilience::DistributedCircuitBreaker>(inventory_config, memc);
// ---- HTTP Server Setup ----
httplib::Server svr;
svr.Get("/proxy/(payment_gateway|inventory_service)", [&](const httplib::Request& req, httplib::Response& res) {
auto service_name = req.matches[1].str();
auto& breaker = breakers.at(service_name);
bool was_successful = breaker->execute([&]() {
return call_downstream_service(service_name);
});
if (was_successful) {
res.set_content("Request to " + service_name + " was successful.", "text/plain");
} else {
res.status = 503; // Service Unavailable
res.set_content("Service " + service_name + " is unavailable (circuit open).", "text/plain");
}
});
svr.Get("/status", [&](const httplib::Request& req, httplib::Response& res) {
nlohmann::json status_json;
for (const auto& pair : breakers) {
nlohmann::json service_status;
auto state = pair.second->get_current_state();
service_status["state"] = resilience::state_to_string(state);
service_status["failures_or_successes"] = pair.second->get_current_failures();
status_json[pair.first] = service_status;
}
res.set_content(status_json.dump(4), "application/json");
res.set_header("Access-Control-Allow-Origin", "*"); // For dashboard dev
});
std::cout << "Server listening on port 8080..." << std::endl;
svr.listen("0.0.0.0", 8080);
memcached_free(memc);
return 0;
}
Now for Caddy. The configuration is minimal. It handles TLS and load balances between two instances of our C++ service running on ports 8080 and 8081.
Caddyfile:
example.com {
# Automatic HTTPS for our domain
reverse_proxy /status* localhost:8080
# Load balance proxy requests across all backend instances
reverse_proxy /proxy* {
to localhost:8080 localhost:8081
lb_policy round_robin
}
# Handle CORS preflight requests for the dashboard
@cors {
method OPTIONS
header Origin *
header Access-Control-Request-Method *
header Access-Control-Request-Headers *
}
respond @cors 204
}
The final piece was the Ant Design dashboard. We used Create React App to bootstrap the project. The core component fetches data from the /status endpoint every two seconds and renders it using Ant Design’s Card, Statistic, and Tag components.
Dashboard.js:
import React, { useState, useEffect } from 'react';
import { Row, Col, Card, Statistic, Tag, Spin, Alert } from 'antd';
import axios from 'axios';
const StateTag = ({ state }) => {
let color;
switch (state) {
case 'OPEN':
color = 'volcano';
break;
case 'HALF_OPEN':
color = 'gold';
break;
case 'CLOSED':
color = 'green';
break;
default:
color = 'default';
}
return <Tag color={color}>{state}</Tag>;
};
const Dashboard = () => {
const [status, setStatus] = useState(null);
const [error, setError] = useState(null);
const [loading, setLoading] = useState(true);
const API_ENDPOINT = 'http://localhost:8080/status'; // Or your Caddy endpoint
useEffect(() => {
const fetchData = async () => {
try {
const response = await axios.get(API_ENDPOINT);
setStatus(response.data);
setError(null);
} catch (err) {
console.error("Failed to fetch status:", err);
setError("Failed to connect to the backend service. Is it running?");
} finally {
setLoading(false);
}
};
fetchData(); // Initial fetch
const interval = setInterval(fetchData, 2000); // Poll every 2 seconds
return () => clearInterval(interval); // Cleanup on unmount
}, []);
if (loading) {
return <Spin tip="Loading status..." size="large"><div style={{ height: '200px' }} /></Spin>;
}
if (error) {
return <Alert message="Connection Error" description={error} type="error" showIcon />;
}
return (
<div style={{ padding: '30px', background: '#f0f2f5' }}>
<Row gutter={[16, 16]}>
{status && Object.entries(status).map(([serviceName, serviceData]) => (
<Col xs={24} sm={12} md={8} key={serviceName}>
<Card title={serviceName} bordered={false}>
<Statistic
title="Current State"
valueRender={() => <StateTag state={serviceData.state} />}
/>
<Statistic
title={serviceData.state === 'HALF_OPEN' ? 'Consecutive Successes' : 'Failures (in window)'}
value={serviceData.failures_or_successes}
valueStyle={{ color: serviceData.state === 'CLOSED' ? '#3f8600' : '#cf1322' }}
/>
</Card>
</Col>
))}
</Row>
</div>
);
};
export default Dashboard;
With all the pieces in place, the system worked as a cohesive whole. Firing up two instances of the C++ server and hitting the payment_gateway endpoint with a load testing tool quickly tripped the breaker. The failures_or_successes count on the dashboard climbed to 5, and the state tag flipped from green CLOSED to red OPEN. All subsequent requests to both C++ instances were immediately rejected with a 503 error, protecting the downstream service. After 10 seconds, the tag changed to yellow HALF_OPEN. A few successful manual requests then incremented the counter, and upon reaching 2, the state flipped back to green CLOSED. The system had successfully coordinated state across instances via Memcached and provided clear, real-time visibility.
This architecture, while effective, has its own trade-offs. The reliance on Memcached introduces a critical dependency: if the Memcached cluster goes down, fetch_remote_state treats every error as CLOSED, so the breakers silently lose their protection and every instance goes back to hammering the downstream service. A production deployment would require a highly available Memcached setup, or an explicit fallback policy for when the shared state is unreachable. Furthermore, the polling mechanism of the dashboard does not scale to hundreds of microservices; a more advanced solution would have the C++ service push state changes via WebSockets or a message queue. The current failure counting is also basic: a sliding window would be more accurate than relying on key TTLs for the window, but it would add significant complexity to the Memcached interactions.
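To make that last trade-off concrete, here is a rough sketch (not part of the service) of a time-bucketed approximation of a sliding window built on the same atomic increments; the bucket width, window length, and key naming are assumptions:
#include <libmemcached/memcached.h>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <string>
namespace {
constexpr int BUCKET_SECONDS = 10;  // assumption: 10-second buckets
constexpr int WINDOW_BUCKETS = 6;   // assumption: 6 buckets = 60-second window
}
// Record a failure in the current bucket and return the total across the window.
// Each bucket key carries its own TTL, so old failures age out automatically.
uint64_t record_failure_sliding_window(memcached_st* memc, const std::string& service) {
    using namespace std::chrono;
    const int64_t now = duration_cast<seconds>(system_clock::now().time_since_epoch()).count();
    const int64_t current_bucket = now / BUCKET_SECONDS;
    // Atomically bump the current bucket; TTL covers the whole window plus one bucket.
    std::string key = "cb:" + service + ":failures:" + std::to_string(current_bucket);
    uint64_t ignored = 0;
    memcached_increment_with_initial(memc, key.c_str(), key.length(),
                                     1, 1, BUCKET_SECONDS * (WINDOW_BUCKETS + 1), &ignored);
    // Sum the buckets that fall inside the window.
    uint64_t total = 0;
    for (int i = 0; i < WINDOW_BUCKETS; ++i) {
        std::string bucket_key = "cb:" + service + ":failures:" + std::to_string(current_bucket - i);
        size_t len = 0;
        uint32_t flags = 0;
        memcached_return_t rc;
        char* raw = memcached_get(memc, bucket_key.c_str(), bucket_key.length(), &len, &flags, &rc);
        if (rc == MEMCACHED_SUCCESS && raw) {
            total += std::strtoull(raw, nullptr, 10);
            free(raw);
        }
    }
    return total;  // compare against failure_threshold to decide whether to trip
}
The cost is one extra Memcached read per bucket on every decision, which is exactly the kind of added chatter alluded to above.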