Implementing a Service Worker Circuit Breaker for a Ktor Web API with a MyBatis Persistence Layer


A critical internal dashboard, responsible for displaying real-time operational metrics, began exhibiting catastrophic failures under load. The frontend, a vanilla JavaScript application, would become completely unresponsive during peak hours. Investigation revealed that the backing Ktor Web API was buckling due to contention in the underlying PostgreSQL database, managed via MyBatis. Each slow database response caused API endpoints to time out. The frontend, unaware of the systemic issue, would relentlessly retry failed requests, initiating a vicious cycle that escalated a localized database slowdown into a full-blown service outage. The initial fix, a simple exponential backoff on the client, merely delayed the inevitable collapse. A more robust, client-aware resilience mechanism was required.

The core of the problem was a cascading failure amplified by client-side retries. The solution had to be implemented on the client to prevent it from overwhelming an already struggling server. This led to the decision to implement a client-side Circuit Breaker pattern. By placing this logic inside a Service Worker, we could intercept all outgoing API requests transparently, without refactoring hundreds of fetch calls scattered across the application codebase. The Service Worker would act as an intelligent proxy, monitoring API health and cutting off traffic when a pattern of failures was detected.

The Ailing Backend: Simulating Failure in Ktor and MyBatis

To reliably develop and test the client-side solution, we first needed a Ktor backend that could predictably reproduce the failure conditions observed in production. The key ingredient is an API endpoint whose response time is artificially inflated to mimic database contention.

The Ktor application is configured for Netty, with essential plugins for content negotiation (using Jackson for JSON) and call logging to monitor incoming requests.

// build.gradle.kts
plugins {
    kotlin("jvm") version "1.9.20"
    id("io.ktor.plugin") version "2.3.6"
    id("org.jetbrains.kotlin.plugin.serialization") version "1.9.20"
}

// ... dependencies for ktor-server-netty, jackson, mybatis, postgresql, logback

The main application entry point sets up the server and its modules.

// src/main/kotlin/com/example/Application.kt
package com.example

import com.example.data.DatabaseFactory
import com.example.data.OperationalDataService
import io.ktor.serialization.jackson.*
import io.ktor.server.application.*
import io.ktor.server.engine.*
import io.ktor.server.netty.*
import io.ktor.server.plugins.callloging.*
import io.ktor.server.plugins.contentnegotiation.*
import io.ktor.server.request.*
import io.ktor.server.response.*
import io.ktor.server.routing.*
import org.slf4j.event.Level

fun main() {
    embeddedServer(Netty, port = 8080, host = "0.0.0.0", module = Application::module)
        .start(wait = true)
}

fun Application.module() {
    // Configure logging to see each request hit the server.
    install(CallLogging) {
        level = Level.INFO
        filter { call -> call.request.path().startsWith("/") }
    }
    // Configure JSON serialization.
    install(ContentNegotiation) {
        jackson()
    }
    
    // Initialize the database connection and MyBatis.
    DatabaseFactory.init()
    val dataService = OperationalDataService()

    routing {
        get("/api/metrics") {
            try {
                // This call is designed to be slow under certain conditions.
                val metrics = dataService.getLatestMetrics()
                call.respond(metrics)
            } catch (e: Exception) {
                // In a real app, map exceptions to proper status codes.
                call.respond(io.ktor.http.HttpStatusCode.InternalServerError, mapOf("error" to "Failed to fetch metrics"))
            }
        }
    }
}

The persistence layer uses MyBatis to query the database. The DatabaseFactory class initializes the SqlSessionFactory. A critical detail here is the mybatis-config.xml, which defines data sources and mappers. For this scenario, we use a simple HikariCP connection pool.

<!-- src/main/resources/mybatis-config.xml -->
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE configuration
        PUBLIC "-//mybatis.org//DTD Config 3.0//EN"
        "http://mybatis.org/dtd/mybatis-3-config.dtd">
<configuration>
    <environments default="development">
        <environment id="development">
            <transactionManager type="JDBC"/>
            <dataSource type="com.example.data.HikariDataSourceFactory">
                <property name="driverClassName" value="org.postgresql.Driver"/>
                <property name="jdbcUrl" value="jdbc:postgresql://localhost:5432/dashboard_db"/>
                <property name="username" value="user"/>
                <property name="password" value="password"/>
                <property name="maximumPoolSize" value="10"/>
            </dataSource>
        </environment>
    </environments>
    <mappers>
        <mapper resource="mappers/OperationalDataMapper.xml"/>
    </mappers>
</configuration>
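
The dataSource type above points at a custom factory class rather than one of MyBatis's built-in POOLED/UNPOOLED types. That class is not shown in the original code; a minimal sketch of what it could look like, assuming HikariCP is on the classpath and the factory simply maps the <property> entries onto a HikariConfig, is the following:

// src/main/kotlin/com/example/data/HikariDataSourceFactory.kt (sketch)
package com.example.data

import com.zaxxer.hikari.HikariConfig
import com.zaxxer.hikari.HikariDataSource
import org.apache.ibatis.datasource.DataSourceFactory
import java.util.Properties
import javax.sql.DataSource

// Custom MyBatis DataSourceFactory: MyBatis instantiates it, passes in the
// <property> values from mybatis-config.xml, then asks it for the DataSource.
class HikariDataSourceFactory : DataSourceFactory {
    private lateinit var dataSource: HikariDataSource

    override fun setProperties(props: Properties) {
        val config = HikariConfig().apply {
            driverClassName = props.getProperty("driverClassName")
            jdbcUrl = props.getProperty("jdbcUrl")
            username = props.getProperty("username")
            password = props.getProperty("password")
            maximumPoolSize = props.getProperty("maximumPoolSize", "10").toInt()
        }
        dataSource = HikariDataSource(config)
    }

    override fun getDataSource(): DataSource = dataSource
}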

The core of the failure simulation lies within the MyBatis mapper. We use PostgreSQL’s pg_sleep() function to introduce a deliberate delay. In the version shown here the delay is unconditional for simplicity; in a real test harness it could be toggled between healthy and degraded states via a request parameter or a system property.

<!-- src/main/resources/mappers/OperationalDataMapper.xml -->
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE mapper
        PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN"
        "http://mybatis.org/dtd/mybatis-3-config.dtd">
<mapper namespace="com.example.data.OperationalDataMapper">
    <select id="getLatestMetrics" resultType="com.example.data.Metric">
        -- Simulate a slow query that takes 5 seconds, mimicking database lock contention.
        SELECT 1 as id, 'cpu_load' as name, 0.85 as value, now() as timestamp, pg_sleep(5);
    </select>
</mapper>

With this setup, any call to /api/metrics will now take over 5 seconds to respond. This is the predictable failure condition we will build the Service Worker against.
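
For completeness, the DatabaseFactory and OperationalDataService classes imported in Application.kt are not shown in the original code either. A minimal sketch of what they could look like follows; the mapper interface name matching the XML namespace, and the exact Metric field types, are assumptions made for illustration.

// src/main/kotlin/com/example/data/DatabaseFactory.kt (sketch)
package com.example.data

import org.apache.ibatis.io.Resources
import org.apache.ibatis.session.SqlSessionFactory
import org.apache.ibatis.session.SqlSessionFactoryBuilder

object DatabaseFactory {
    lateinit var sqlSessionFactory: SqlSessionFactory
        private set

    fun init() {
        // Build the SqlSessionFactory from mybatis-config.xml on the classpath.
        Resources.getResourceAsStream("mybatis-config.xml").use { stream ->
            sqlSessionFactory = SqlSessionFactoryBuilder().build(stream)
        }
    }
}

// Result type referenced by resultType in the mapper XML. The timestamp is
// kept as a String here to avoid requiring extra Jackson/MyBatis date modules.
data class Metric(
    var id: Int = 0,
    var name: String = "",
    var value: Double = 0.0,
    var timestamp: String = "",
)

// Mapper interface; its fully qualified name must match the namespace
// declared in OperationalDataMapper.xml.
interface OperationalDataMapper {
    fun getLatestMetrics(): List<Metric>
}

class OperationalDataService {
    fun getLatestMetrics(): List<Metric> =
        DatabaseFactory.sqlSessionFactory.openSession().use { session ->
            session.getMapper(OperationalDataMapper::class.java).getLatestMetrics()
        }
}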

The Service Worker as an Interception Layer

A Service Worker runs in a separate background thread, independent of the web page, and can intercept and handle network requests. This makes it the ideal location for our circuit breaker logic.

First, the application must register the Service Worker. This is done once in the main JavaScript file.

// public/js/main.js
if ('serviceWorker' in navigator) {
    window.addEventListener('load', () => {
        navigator.serviceWorker.register('/sw.js')
            .then(registration => {
                console.log('Service Worker registered with scope:', registration.scope);
            })
            .catch(error => {
                console.error('Service Worker registration failed:', error);
            });
    });
}

// Example function to trigger API calls
function fetchData() {
    console.log('Attempting to fetch metrics...');
    fetch('/api/metrics')
        .then(response => {
            if (!response.ok) {
                throw new Error(`HTTP error! status: ${response.status}`);
            }
            return response.json();
        })
        .then(data => console.log('Data received:', data))
        .catch(error => console.error('Fetch failed:', error.message));
}

// Set up a button to trigger fetches
document.getElementById('fetchButton').addEventListener('click', fetchData);
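
One caveat: a plain fetch will happily wait out a 5-second response, so a slow-but-successful request never registers as a failure. The walkthrough later in this article assumes the page aborts requests that exceed a time budget. A minimal sketch using AbortSignal.timeout() is shown below; the 3000 ms budget is an arbitrary choice, and the API requires a reasonably modern browser.

// public/js/main.js (variant) - abort any metrics request that exceeds a
// 3-second budget so that slow responses surface as failures to the
// circuit breaker in the Service Worker.
function fetchDataWithTimeout() {
    console.log('Attempting to fetch metrics (3s budget)...');
    fetch('/api/metrics', { signal: AbortSignal.timeout(3000) })
        .then(response => {
            if (!response.ok) {
                throw new Error(`HTTP error! status: ${response.status}`);
            }
            return response.json();
        })
        .then(data => console.log('Data received:', data))
        .catch(error => console.error('Fetch failed:', error.message));
}

The same budget could instead be enforced inside the Service Worker's fetch handler, which would keep the policy in one place for every caller.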

The heart of the solution is the Service Worker file, sw.js. We will build the circuit breaker state machine inside its fetch event listener. The logic must be robust and self-contained. A common mistake is to store state in global variables within the Service Worker, but this state is lost whenever the worker is terminated by the browser. For persistence, we must use IndexedDB or a similar storage mechanism.

Building the Circuit Breaker State Machine

The circuit breaker has three states:

  1. CLOSED: The default state. Network requests are allowed to pass through. Failures are tracked. If the failure count exceeds a threshold, the breaker trips and moves to the OPEN state.
  2. OPEN: Network requests are immediately rejected without being sent. This prevents the application from hammering the failing server. After a timeout period, the breaker moves to the HALF_OPEN state.
  3. HALF_OPEN: A single “probe” request is allowed to pass through. If it succeeds, the breaker is considered healthy again and moves to CLOSED. If it fails, it trips back to OPEN, restarting the timeout.

The transitions are summarized in the following Mermaid diagram.

graph TD
    subgraph Circuit Breaker States
        CLOSED -- Failure Threshold Reached --> OPEN;
        OPEN -- Timeout Expires --> HALF_OPEN;
        HALF_OPEN -- Probe Request Fails --> OPEN;
        HALF_OPEN -- Probe Request Succeeds --> CLOSED;
        CLOSED -- Success --> CLOSED;
    end

The implementation requires careful management of this state. We’ll use a simple key-value store abstraction over IndexedDB for persistence.

// public/sw.js - Formatted with Prettier for clarity

const CIRCUIT_BREAKER_DB = 'CircuitBreakerDB';
const STATE_STORE = 'stateStore';

// Configuration for the circuit breaker
const CONFIG = {
    failureThreshold: 3, // Trip after 3 consecutive failures
    openStateTimeout: 15000, // Stay open for 15 seconds
    probeRequestUrl: '/api/metrics', // The API endpoint to monitor
};

// A simple key-value wrapper around IndexedDB
const idb = {
    get(key) {
        return new Promise((resolve) => {
            const request = indexedDB.open(CIRCUIT_BREAKER_DB, 1);
            request.onupgradeneeded = () => request.result.createObjectStore(STATE_STORE);
            request.onsuccess = () => {
                const db = request.result;
                const tx = db.transaction(STATE_STORE, 'readonly');
                const store = tx.objectStore(STATE_STORE);
                const getReq = store.get(key);
                getReq.onsuccess = () => resolve(getReq.result);
                getReq.onerror = () => resolve(undefined); // Treat a read error as missing state
            };
            request.onerror = () => resolve(undefined); // DB error
        });
    },
    set(key, value) {
        return new Promise((resolve, reject) => {
            const request = indexedDB.open(CIRCUIT_BREAKER_DB, 1);
            request.onupgradeneeded = () => request.result.createObjectStore(STATE_STORE);
            request.onsuccess = () => {
                const db = request.result;
                const tx = db.transaction(STATE_STORE, 'readwrite');
                const store = tx.objectStore(STATE_STORE);
                const setReq = store.put(value, key);
                tx.oncomplete = () => resolve(true);
                tx.onerror = () => reject(tx.error);
            };
            request.onerror = () => reject(request.error);
        });
    },
};

async function getCircuitState() {
    const defaultState = {
        state: 'CLOSED', // CLOSED, OPEN, HALF_OPEN
        failureCount: 0,
        lastFailureTime: 0,
    };
    const storedState = await idb.get('circuitState');
    return { ...defaultState, ...(storedState || {}) };
}

async function setCircuitState(newState) {
    return idb.set('circuitState', newState);
}

self.addEventListener('install', (event) => {
    // Ensure the new service worker activates immediately
    event.waitUntil(self.skipWaiting());
});

self.addEventListener('activate', (event) => {
    console.log('Service Worker activated. Initializing circuit breaker state.');
    // Take control of all pages under scope and reset the breaker state.
    // Both are awaited via waitUntil so the worker is not suspended mid-write.
    event.waitUntil(
        Promise.all([
            self.clients.claim(),
            setCircuitState({
                state: 'CLOSED',
                failureCount: 0,
                lastFailureTime: 0,
            }),
        ])
    );
});

The core logic resides in the fetch event handler. It intercepts requests, checks the circuit breaker’s state, and decides whether to proceed with the network request, fail fast, or send a probe. The code must be defensive, handling various failure modes like network errors or non-2xx HTTP responses.

// public/sw.js (continued)

self.addEventListener('fetch', (event) => {
    // Only apply the circuit breaker to our target API
    if (event.request.url.includes(CONFIG.probeRequestUrl)) {
        event.respondWith(handleApiFetch(event.request));
    }
});

async function handleApiFetch(request) {
    const currentState = await getCircuitState();
    console.log(`[Circuit Breaker] Current State: ${currentState.state}, Failures: ${currentState.failureCount}`);

    // --- OPEN State Logic ---
    if (currentState.state === 'OPEN') {
        const timeSinceFailure = Date.now() - currentState.lastFailureTime;
        if (timeSinceFailure > CONFIG.openStateTimeout) {
            // Timeout has expired, move to HALF_OPEN
            console.log('[Circuit Breaker] Timeout expired. Moving to HALF_OPEN.');
            await setCircuitState({ ...currentState, state: 'HALF_OPEN' });
            // Let this request proceed as a probe
            return performFetchAndMonitor(request, { ...currentState, state: 'HALF_OPEN' });
        } else {
            // Still in the cooling-off period, fail fast.
            console.warn('[Circuit Breaker] Circuit is OPEN. Failing fast.');
            return new Response(
                JSON.stringify({ error: 'Circuit Breaker is open. Request blocked.' }), {
                    status: 503,
                    statusText: 'Service Unavailable',
                    headers: { 'Content-Type': 'application/json' },
                }
            );
        }
    }

    // --- CLOSED or HALF_OPEN State Logic ---
    // In both these states, we let the request through and monitor the result.
    return performFetchAndMonitor(request, currentState);
}

async function performFetchAndMonitor(request, currentState) {
    try {
        const response = await fetch(request);

        // A common pitfall is only checking for network errors.
        // We must also treat non-successful HTTP status codes as failures.
        if (!response.ok) {
            // Throw an error to be caught by the catch block
            throw new Error(`Server responded with status: ${response.status}`);
        }

        // --- Success Handling ---
        console.log('[Circuit Breaker] Request successful.');
        if (currentState.state === 'HALF_OPEN') {
            console.log('[Circuit Breaker] Probe successful. Moving to CLOSED.');
            await setCircuitState({
                state: 'CLOSED',
                failureCount: 0,
                lastFailureTime: 0,
            });
        } else if (currentState.failureCount > 0) {
            // If we were in CLOSED but had some failures, a success resets the count.
            await setCircuitState({ ...currentState, failureCount: 0 });
        }
        return response;

    } catch (error) {
        // --- Failure Handling ---
        console.error('[Circuit Breaker] Request failed:', error.message);
        await handleFailure(currentState);

        // Re-throw the error so the original fetch promise on the page rejects.
        // A real implementation might return a custom error response instead.
        throw error;
    }
}

async function handleFailure(currentState) {
    const newFailureCount = currentState.failureCount + 1;

    if (currentState.state === 'HALF_OPEN') {
        // The probe failed. Re-open the circuit and reset the timer.
        console.error('[Circuit Breaker] Probe failed. Moving back to OPEN.');
        await setCircuitState({
            ...currentState,
            state: 'OPEN',
            failureCount: 0, // Reset count as we are now fully open
            lastFailureTime: Date.now(),
        });
    } else if (newFailureCount >= CONFIG.failureThreshold) {
        // Failure threshold reached in CLOSED state. Trip the circuit.
        console.error(`[Circuit Breaker] Failure threshold reached (${newFailureCount}). Tripping circuit to OPEN.`);
        await setCircuitState({
            ...currentState,
            state: 'OPEN',
            failureCount: 0,
            lastFailureTime: Date.now(),
        });
    } else {
        // Still in CLOSED state, just increment the failure count.
        console.log(`[Circuit Breaker] Incrementing failure count to ${newFailureCount}.`);
        await setCircuitState({ ...currentState, failureCount: newFailureCount });
    }
}

The use of Prettier is not a trivial point; it enforces a consistent style on this complex, asynchronous state-management code, which is critical for long-term maintainability by the team.

Observing the System in Action

With the slow Ktor backend running and the Service Worker active in the browser, the complete behavior can be observed.

  1. Initial Requests (Circuit CLOSED):

    • The user clicks the “Fetch Data” button.
    • Browser Console: “Attempting to fetch metrics…”.
    • Service Worker Log: “[Circuit Breaker] Current State: CLOSED, Failures: 0”.
    • Ktor Server Log: INFO Application - 200 OK: GET - /api/metrics.
    • The request takes 5 seconds. Assuming the page aborts requests that exceed its time budget (as in the AbortSignal.timeout sketch earlier), the fetch fails on the client; a plain fetch with no timeout would simply wait out the slow response.
    • Service Worker Log: “[Circuit Breaker] Request failed: The user aborted a request.”, “[Circuit Breaker] Incrementing failure count to 1.”
  2. Tripping the Circuit:

    • The user clicks two more times. Each request fails.
    • After the third failure:
    • Service Worker Log: “[Circuit Breaker] Request failed…”, “[Circuit Breaker] Failure threshold reached (3). Tripping circuit to OPEN.”
    • The state in IndexedDB is now { "state": "OPEN", ... }.
  3. Circuit OPEN:

    • The user clicks the button again.
    • Service Worker Log: “[Circuit Breaker] Current State: OPEN…”, “[Circuit Breaker] Circuit is OPEN. Failing fast.”
    • Browser Console: “Fetch failed: HTTP error! status: 503”.
    • The response is immediate. Crucially, no new entry appears in the Ktor server log. The server is protected.
  4. Circuit HALF_OPEN and Recovery:

    • After 15 seconds, the user clicks again.
    • Service Worker Log: “[Circuit Breaker] Current State: OPEN…”, “[Circuit Breaker] Timeout expired. Moving to HALF_OPEN.”
    • A single probe request is sent to the server. Let’s assume the backend has recovered in this time (for the simulation, we would remove the pg_sleep(5) call from the mapper).
    • Ktor Server Log: INFO Application - 200 OK: GET - /api/metrics.
    • The request succeeds quickly.
    • Service Worker Log: “[Circuit Breaker] Request successful.”, “[Circuit Breaker] Probe successful. Moving to CLOSED.”
    • The system is back to normal. Subsequent requests go through directly.

This client-side pattern effectively quarantines the application from a failing downstream service, improving user experience by providing immediate feedback and protecting the backend infrastructure from being overwhelmed by retries.

The solution is not without its limitations. The configuration for the breaker (thresholds, timeouts) is hardcoded in the Service Worker file. A production-grade implementation would fetch this configuration from a central endpoint, allowing for dynamic tuning without requiring a new Service Worker deployment. Furthermore, this breaker operates on a per-client basis. While it prevents one user’s browser from DDoS-ing the server, it doesn’t solve the underlying problem for all users. It’s a client-side containment strategy that should be paired with server-side resilience patterns like rate limiting, bulkheading, and server-side circuit breakers for a truly robust architecture. Finally, the fallback strategy here is to fail fast; a more sophisticated version could serve stale data from the Cache API when the circuit is open, providing a degraded but still functional experience.
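
As a sketch of that last idea: if performFetchAndMonitor also wrote successful responses into a cache (the 'metrics-cache' name below is hypothetical), the fail-fast branch could prefer stale data over an error.

// public/sw.js (sketch) - optional stale-data fallback for the OPEN state.
// Assumes performFetchAndMonitor also does, after a successful fetch:
//   const cache = await caches.open('metrics-cache');
//   await cache.put(request, response.clone());
async function respondWhileOpen(request) {
    const cached = await caches.match(request, { cacheName: 'metrics-cache' });
    if (cached) {
        // Serve the last known good payload, flagged so the UI can mark it stale.
        const headers = new Headers(cached.headers);
        headers.set('X-Circuit-Breaker', 'stale');
        return new Response(await cached.blob(), { status: 200, headers });
    }
    // Nothing cached yet: fall back to the immediate 503 used above.
    return new Response(
        JSON.stringify({ error: 'Circuit Breaker is open. Request blocked.' }),
        { status: 503, headers: { 'Content-Type': 'application/json' } },
    );
}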

