Implementing a Choreography-Based Saga for Distributed Transactions with React Native and Sanic


A common failure point in client-server architectures arises when a single user action requires multiple, independent backend service calls to reach a consistent state. A naive implementation that chains fetch requests directly from the client, such as a React Native application, is brittle and all but guarantees eventual data inconsistency. If the second or third call in a three-step process fails, the system is left in a partially complete state with no straightforward recovery path.

// A typical, but deeply flawed, approach in a React Native component.
const handleProvisionResource = async () => {
  try {
    setLoading(true);
    // Step 1: Create a virtual machine
    const vmResponse = await api.createVm({ size: 'medium' });
    if (!vmResponse.ok) throw new Error('VM creation failed');
    const { vmId } = await vmResponse.json();
    setStatus('VM Created...');

    // Step 2: Attach storage to the new VM
    const storageResponse = await api.attachStorage({ vmId, diskSize: '100GB' });
    if (!storageResponse.ok) {
      // THE CRITICAL FLAW: How do we reliably clean up the created VM?
      // A cleanup call here might also fail. The client is now offline. What happens?
      throw new Error('Storage attachment failed');
    }
    const { storageId } = await storageResponse.json();
    setStatus('Storage Attached...');

    // Step 3: Configure networking
    const networkResponse = await api.configureNetwork({ vmId, ip: 'dhcp' });
    if (!networkResponse.ok) {
      // ANOTHER FLAW: Now we have a VM and attached storage to clean up.
      // The complexity of rollback logic on the client is untenable.
      throw new Error('Network configuration failed');
    }
    
    setStatus('Provisioning Complete!');
  } catch (error) {
    setError(error.message);
    setStatus('Provisioning Failed.');
  } finally {
    setLoading(false);
  }
};

This client-side orchestration is an anti-pattern. It couples the client to the backend service topology and places the burden of transactional integrity on the least reliable part of the system. The Saga pattern provides a robust alternative for managing data consistency across distributed services without relying on locking mechanisms or traditional two-phase commits (2PC), which are unsuitable for long-running, high-latency operations typical of microservices.

This article details the implementation of a choreography-based Saga. In this model, services don’t call each other directly. Instead, they react to events published on a shared message bus. Each step of the transaction emits an event, triggering the next service in the chain. If a step fails, it emits a failure event, triggering compensating actions in preceding services to roll back the changes. Our architecture will use a Sanic backend as the API gateway and initial event publisher, with Redis Pub/Sub as the message bus, and a React Native client to initiate the process and observe its state.

System Architecture and Event Flow

The system consists of several components communicating asynchronously. The React Native client’s role is simply to initiate the saga and poll for its final status. The core logic resides in the backend services, which are decoupled through an event bus.

  1. React Native Client: Initiates the resource provisioning request to the Sanic API Gateway. It receives a sagaId and periodically polls a status endpoint.
  2. Sanic API Gateway: Exposes a REST endpoint for the client. Upon request, it creates a saga state record, generates a sagaId, and publishes the first event (e.g., CREATE_VM_REQUESTED) to the event bus. It also subscribes to all events to track the saga’s overall progress.
  3. Redis Pub/Sub: The message bus. Services publish and subscribe to channels to communicate without direct knowledge of one another.
  4. Microservices (VM, Storage, Network): Independent Python processes that perform the actual work. Each service listens for specific request events and, upon completion (or failure), publishes a corresponding success or failure event. They also listen for compensation events to undo their operations.

The flow can be visualized as follows:

sequenceDiagram
    participant RN as React Native Client
    participant SG as Sanic Gateway
    participant EB as Redis Event Bus
    participant VMS as VM Service
    participant STOS as Storage Service
    participant NETS as Network Service

    RN->>+SG: POST /provision
    SG->>SG: Create Saga State (ID: 123, Status: PENDING)
    SG-->>-RN: { "sagaId": "123" }
    SG->>+EB: PUBLISH (create_vm_req, {sagaId: 123})

    EB->>+VMS: DELIVER (create_vm_req)
    VMS->>VMS: Process VM creation...
    VMS->>+EB: PUBLISH (create_vm_succeeded, {sagaId: 123, vmId: "vm-abc"})

    EB->>+SG: DELIVER (create_vm_succeeded)
    SG->>SG: Update Saga State (Status: VM_CREATED)
    SG->>+EB: PUBLISH (attach_storage_req, {sagaId: 123, vmId: "vm-abc"})

    EB->>+STOS: DELIVER (attach_storage_req)
    STOS->>STOS: Process storage attachment... (FAILS)
    STOS->>+EB: PUBLISH (attach_storage_failed, {sagaId: 123, error: "No space"})

    EB->>+SG: DELIVER (attach_storage_failed)
    SG->>SG: Update Saga State (Status: COMPENSATING)
    SG->>+EB: PUBLISH (compensate_create_vm, {sagaId: 123, vmId: "vm-abc"})

    EB->>+VMS: DELIVER (compensate_create_vm)
    VMS->>VMS: Delete VM "vm-abc"...
    VMS->>+EB: PUBLISH (compensation_step_finished, {sagaId: 123})

    EB->>+SG: DELIVER (compensation_step_finished)
    SG->>SG: Update Saga State (Status: ROLLED_BACK)

    loop Poll for status
        RN->>SG: GET /provision/status/123
        SG-->>RN: { "status": "ROLLED_BACK", "reason": "No space" }
    end
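
The event names in the diagram map directly onto Redis Pub/Sub channels, and every forward step has a compensating counterpart. The following sketch collects that vocabulary in one place; it is illustrative only (the services below publish and subscribe with the string literals directly, and the network step is listed for completeness even though this example's gateway stops after storage):

# Illustrative event vocabulary for the saga; not a module used by the code below.
SAGA_EVENTS = [
    # (request,               success,                       failure,                     compensation)
    ('create_vm_req',         'create_vm_succeeded',         'create_vm_failed',          'compensate_create_vm'),
    ('attach_storage_req',    'attach_storage_succeeded',    'attach_storage_failed',     'compensate_attach_storage'),
    ('configure_network_req', 'configure_network_succeeded', 'configure_network_failed',  'compensate_configure_network'),
]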

Backend: Services and Event Bus

First, we establish the communication backbone. A simple Python module using aioredis provides a clean interface for publishing and subscribing to events.

shared/event_bus.py

import asyncio
import json
import logging
from typing import Callable, Awaitable

import aioredis

# Basic logging setup for all services
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

class EventBus:
    def __init__(self, redis_url: str):
        self.redis_url = redis_url
        self.pub_client = None
        self.sub_client = None
        self.pubsub = None
        self.handlers = {}          # channel name -> registered async handler
        self.listener_task = None   # single task that dispatches all incoming messages

    async def connect(self):
        """Establishes connection to Redis for both publishing and subscribing."""
        self.pub_client = aioredis.from_url(self.redis_url, decode_responses=True)
        self.sub_client = aioredis.from_url(self.redis_url, decode_responses=True)
        self.pubsub = self.sub_client.pubsub()
        logging.info("Event Bus connected to Redis.")

    async def publish(self, channel: str, message: dict):
        """Publishes a message to a given Redis channel."""
        if not self.pub_client:
            raise ConnectionError("Publisher not connected. Call connect() first.")
        await self.pub_client.publish(channel, json.dumps(message))
        logging.info(f"Published to '{channel}': {message}")

    async def subscribe(self, channel: str, handler: Callable[[dict], Awaitable[None]]):
        """Subscribes to a channel and registers an async handler for it."""
        if not self.pubsub:
            raise ConnectionError("Subscriber not connected. Call connect() first.")

        self.handlers[channel] = handler
        await self.pubsub.subscribe(channel)
        logging.info(f"Subscribed to channel '{channel}'")

        # A single listener task dispatches every message to the handler registered
        # for its channel. Spawning one listener per subscription would split the
        # message stream between them and route events to the wrong handlers.
        if self.listener_task is None:
            self.listener_task = asyncio.create_task(self._message_listener())

    async def _message_listener(self):
        async for message in self.pubsub.listen():
            if message['type'] != 'message':
                continue
            channel = message['channel']
            payload = json.loads(message['data'])
            logging.info(f"Received from '{channel}': {payload}")
            handler = self.handlers.get(channel)
            if handler is None:
                continue
            # Run the handler in its own task so a slow handler does not block
            # the listener loop; surface any exception it raises in the logs.
            task = asyncio.create_task(handler(payload))
            task.add_done_callback(self._log_handler_failure)

    @staticmethod
    def _log_handler_failure(task: asyncio.Task):
        if not task.cancelled() and task.exception():
            logging.error("Error in event handler", exc_info=task.exception())

    async def close(self):
        """Closes all Redis connections."""
        if self.pubsub:
            await self.pubsub.close()
        if self.pub_client:
            await self.pub_client.close()
        if self.sub_client:
            await self.sub_client.close()
        logging.info("Event Bus connections closed.")

# A singleton instance for reuse across modules
event_bus = EventBus("redis://localhost:6379")

Next, the individual microservices. They are simple, standalone asyncio applications. Note the explicit handling of both primary actions and their compensating counterparts. In a real-world project, these would be separate, containerized services.

services/vm_service.py

import asyncio
import random
import uuid
from shared.event_bus import event_bus, logging

# In-memory "database" to track created VMs for this service instance.
# In production, this would be a proper persistent database.
created_vms = set()

async def handle_create_vm_request(payload: dict):
    saga_id = payload['sagaId']
    await asyncio.sleep(random.uniform(1, 3)) # Simulate work

    if random.random() < 0.1: # 10% chance of failure
        logging.error(f"Saga {saga_id}: Failed to create VM due to a simulated error.")
        await event_bus.publish('create_vm_failed', {
            'sagaId': saga_id,
            'reason': 'Insufficient capacity in selected region.'
        })
    else:
        vm_id = f"vm-{uuid.uuid4().hex[:8]}"
        created_vms.add(vm_id)
        logging.info(f"Saga {saga_id}: Successfully created VM {vm_id}.")
        await event_bus.publish('create_vm_succeeded', {
            'sagaId': saga_id,
            'vmId': vm_id
        })

async def handle_compensate_create_vm(payload: dict):
    """
    This is the compensating transaction. It must be idempotent.
    If the VM is already deleted or never existed, it should not fail.
    """
    saga_id = payload['sagaId']
    vm_id = payload['vmId']
    
    if vm_id in created_vms:
        logging.warning(f"Saga {saga_id}: Compensating. Deleting VM {vm_id}.")
        await asyncio.sleep(1) # Simulate deletion work
        created_vms.remove(vm_id)
        logging.info(f"Saga {saga_id}: Successfully deleted VM {vm_id}.")
    else:
        # This is not an error. The compensation might be retried,
        # or the initial creation might have failed before persisting state.
        logging.warning(f"Saga {saga_id}: Compensation for VM {vm_id} requested, but it does not exist. Acknowledging as complete.")

    await event_bus.publish('compensation_step_finished', {'sagaId': saga_id, 'step': 'create_vm'})

async def main():
    await event_bus.connect()
    await event_bus.subscribe('create_vm_req', handle_create_vm_request)
    await event_bus.subscribe('compensate_create_vm', handle_compensate_create_vm)
    
    # Keep the service running
    while True:
        await asyncio.sleep(3600)

if __name__ == '__main__':
    try:
        asyncio.run(main())
    except KeyboardInterrupt:
        logging.info("VM Service shutting down.")

The storage_service.py and network_service.py follow a similar pattern, listening for their respective request and compensation events. For brevity, only the storage service is shown, which we will configure to fail more frequently for testing the rollback path.

services/storage_service.py

import asyncio
import random
import uuid
from shared.event_bus import event_bus, logging

attached_storage = {}

async def handle_attach_storage_request(payload: dict):
    saga_id = payload['sagaId']
    vm_id = payload['vmId']
    await asyncio.sleep(random.uniform(1, 2))

    # Higher failure rate to test compensation
    if random.random() < 0.5:
        logging.error(f"Saga {saga_id}: Failed to attach storage to {vm_id}.")
        await event_bus.publish('attach_storage_failed', {
            'sagaId': saga_id,
            'vmId': vm_id,
            'reason': 'Disk quota exceeded.'
        })
    else:
        storage_id = f"vol-{uuid.uuid4().hex[:8]}"
        attached_storage[vm_id] = storage_id
        logging.info(f"Saga {saga_id}: Attached storage {storage_id} to {vm_id}.")
        await event_bus.publish('attach_storage_succeeded', {
            'sagaId': saga_id,
            'vmId': vm_id,
            'storageId': storage_id
        })

async def handle_compensate_attach_storage(payload: dict):
    # Idempotent compensation logic for storage
    saga_id = payload['sagaId']
    vm_id = payload['vmId']
    if vm_id in attached_storage:
        logging.warning(f"Saga {saga_id}: Compensating. Detaching storage from {vm_id}.")
        del attached_storage[vm_id]
    else:
        logging.warning(f"Saga {saga_id}: Compensation for storage on {vm_id} requested, but none attached.")

    await event_bus.publish('compensation_step_finished', {'sagaId': saga_id, 'step': 'attach_storage'})

async def main():
    await event_bus.connect()
    await event_bus.subscribe('attach_storage_req', handle_attach_storage_request)
    await event_bus.subscribe('compensate_attach_storage', handle_compensate_attach_storage)
    # Keep the service running
    while True:
        await asyncio.sleep(3600)

if __name__ == '__main__':
    asyncio.run(main())
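
For completeness, here is a sketch of what the third service could look like, following the same request/compensation pattern. It is not wired into this example's gateway, which treats storage attachment as the final step, so the channel names (configure_network_req and friends) are assumptions rather than part of the code above.

services/network_service.py (sketch)

import asyncio
import random

from shared.event_bus import event_bus, logging

# In-memory record of configured networks; a real service would persist this.
configured_networks = {}

async def handle_configure_network_request(payload: dict):
    saga_id, vm_id = payload['sagaId'], payload['vmId']
    await asyncio.sleep(random.uniform(1, 2))  # Simulate work

    if random.random() < 0.2:  # Simulated failure rate
        await event_bus.publish('configure_network_failed', {
            'sagaId': saga_id, 'vmId': vm_id, 'reason': 'No free IP addresses.'
        })
    else:
        configured_networks[vm_id] = 'dhcp'
        await event_bus.publish('configure_network_succeeded', {
            'sagaId': saga_id, 'vmId': vm_id
        })

async def handle_compensate_configure_network(payload: dict):
    # Idempotent: removing a configuration that does not exist is not an error.
    configured_networks.pop(payload['vmId'], None)
    await event_bus.publish('compensation_step_finished', {
        'sagaId': payload['sagaId'], 'step': 'configure_network'
    })

async def main():
    await event_bus.connect()
    await event_bus.subscribe('configure_network_req', handle_configure_network_request)
    await event_bus.subscribe('compensate_configure_network', handle_compensate_configure_network)
    while True:
        await asyncio.sleep(3600)

if __name__ == '__main__':
    asyncio.run(main())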

Backend: The Sanic Gateway and Saga Coordinator

The Sanic application serves as the entry point and the central nervous system for tracking saga states. It doesn’t execute business logic but rather translates events into state transitions and initiates subsequent steps or compensations. Saga state is managed using Redis Hashes for persistence.

gateway/saga_manager.py

import json
import uuid
from typing import Optional

import aioredis

# This class manages the lifecycle and state of each saga instance.
# In a production system, this would use a more robust database like PostgreSQL
# with transactional guarantees for state updates. Redis is used here for simplicity.
class SagaManager:
    def __init__(self, redis_url: str):
        self.redis_client = aioredis.from_url(redis_url, decode_responses=True)
        self.prefix = "saga:"

    async def create_saga(self, name: str, payload: dict) -> str:
        saga_id = uuid.uuid4().hex
        key = f"{self.prefix}{saga_id}"
        initial_state = {
            "id": saga_id,
            "name": name,
            "status": "PENDING",
            "payload": json.dumps(payload),
            "context": json.dumps({}),
            "history": json.dumps([{"status": "PENDING"}])
        }
        await self.redis_client.hset(key, mapping=initial_state)
        return saga_id

    async def get_saga_state(self, saga_id: str) -> Optional[dict]:
        key = f"{self.prefix}{saga_id}"
        state = await self.redis_client.hgetall(key)
        if not state:
            return None
        # Deserialize JSON fields
        state['payload'] = json.loads(state['payload'])
        state['context'] = json.loads(state['context'])
        state['history'] = json.loads(state['history'])
        return state

    async def update_saga_state(self, saga_id: str, status: str, context_update: Optional[dict] = None, reason: Optional[str] = None):
        """
        Read-modify-write of the saga hash. Note that this is not atomic: a
        concurrent update could be lost. Production code should use optimistic
        locking (WATCH/MULTI), a Lua script, or a database transaction.
        """
        key = f"{self.prefix}{saga_id}"
        current_state = await self.redis_client.hgetall(key)
        if not current_state:
            return

        new_history = json.loads(current_state.get('history', '[]'))
        history_entry = {"status": status}
        if reason:
            history_entry['reason'] = reason
        new_history.append(history_entry)

        updates = {
            "status": status,
            "history": json.dumps(new_history)
        }

        if context_update:
            current_context = json.loads(current_state.get('context', '{}'))
            current_context.update(context_update)
            updates["context"] = json.dumps(current_context)

        await self.redis_client.hset(key, mapping=updates)

The main Sanic app ties everything together: it exposes the client-facing endpoints and hosts the central event handler that drives the saga forward or backward.

gateway/app.py

import asyncio
from sanic import Sanic, response
from shared.event_bus import event_bus, logging
from gateway.saga_manager import SagaManager

app = Sanic("SagaGateway")
saga_manager = SagaManager("redis://localhost:6379")

@app.before_server_start
async def setup_event_listeners(app, loop):
    """Connect to event bus and subscribe to all relevant events."""
    await event_bus.connect()
    # This central handler processes outcomes from all services.
    await event_bus.subscribe('create_vm_succeeded', handle_saga_event)
    await event_bus.subscribe('create_vm_failed', handle_saga_event)
    await event_bus.subscribe('attach_storage_succeeded', handle_saga_event)
    await event_bus.subscribe('attach_storage_failed', handle_saga_event)
    # Add more subscriptions for network service, etc.
    await event_bus.subscribe('compensation_step_finished', handle_saga_event)


async def handle_saga_event(payload: dict):
    """
    The core of the choreography logic.
    It inspects incoming events, updates the saga state, and publishes the next command or compensation.
    """
    saga_id = payload.get('sagaId')
    if not saga_id:
        return

    state = await saga_manager.get_saga_state(saga_id)
    if not state or state['status'] in ['COMPLETED', 'ROLLED_BACK']:
        return # Saga already finished

    # --- Success Path ---
    if payload.get('__event_type__') == 'create_vm_succeeded':
        await saga_manager.update_saga_state(saga_id, 'VM_CREATED', context_update={'vmId': payload['vmId']})
        await event_bus.publish('attach_storage_req', {'sagaId': saga_id, 'vmId': payload['vmId']})
    
    elif payload.get('__event_type__') == 'attach_storage_succeeded':
        await saga_manager.update_saga_state(saga_id, 'STORAGE_ATTACHED', context_update={'storageId': payload['storageId']})
        # Next step would be publishing 'configure_network_req', etc.
        # For this example, we'll consider this the final step.
        await saga_manager.update_saga_state(saga_id, 'COMPLETED')
        logging.info(f"Saga {saga_id} completed successfully.")

    # --- Failure and Compensation Path ---
    elif payload.get('__event_type__') in ['create_vm_failed', 'attach_storage_failed']:
        reason = payload.get('reason', 'Unknown failure')
        await saga_manager.update_saga_state(saga_id, 'COMPENSATING', reason=reason)
        
        # Start rollback from the last successful step
        if state['status'] == 'VM_CREATED' or state['status'] == 'STORAGE_ATTACHED':
            await event_bus.publish('compensate_create_vm', {'sagaId': saga_id, 'vmId': state['context']['vmId']})
        else:
            # First step failed, nothing to compensate
            await saga_manager.update_saga_state(saga_id, 'ROLLED_BACK', reason="Initial step failed")

    elif payload.get('__event_type__') == 'compensation_step_finished':
        # Logic to check if all compensations are done.
        # For a simple linear saga, we can track the steps.
        # In this example, once the first compensation is done, the saga is considered rolled back.
        await saga_manager.update_saga_state(saga_id, 'ROLLED_BACK')
        logging.info(f"Saga {saga_id} has been fully rolled back.")


@app.post("/provision")
async def start_provisioning(request):
    """Client-facing endpoint to initiate the saga."""
    # In a real app, validate request.json body
    saga_id = await saga_manager.create_saga("provision-resource", request.json or {})
    # Publish the first event to kick off the saga
    await event_bus.publish('create_vm_req', {'sagaId': saga_id, 'params': request.json})
    return response.json({"sagaId": saga_id}, status=202) # Accepted


@app.get("/provision/status/<saga_id:str>")
async def get_provisioning_status(request, saga_id):
    """Client-facing endpoint to poll for saga status."""
    state = await saga_manager.get_saga_state(saga_id)
    if not state:
        return response.json({"error": "Saga not found"}, status=404)
    return response.json({
        "sagaId": state['id'],
        "status": state['status'],
        "context": state['context'],
        "history": state['history']
    })

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)

A note on the handle_saga_event routing: the EventBus delivers only the message payload to a handler, not the channel name. A real implementation would register a distinct handler per channel or use a routing key. To make the single handler above work, each publishing service must add an identifier to its payload, e.g. payload['__event_type__'] = 'create_vm_succeeded'. The handler assumes this field is present, so the service code shown earlier would need to be updated to include it.
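
One lightweight way to satisfy that assumption without editing every payload by hand is a thin wrapper that stamps the channel name into the message before publishing. This helper is a sketch and is not part of the EventBus shown earlier:

from shared.event_bus import event_bus

async def publish_event(channel: str, payload: dict) -> None:
    """Publish `payload` on `channel`, tagging it so a shared handler can route it."""
    await event_bus.publish(channel, {**payload, '__event_type__': channel})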

React Native Client Implementation

The client is kept deliberately simple. Its job is to trigger the process and reflect the state managed by the backend. We use Zustand for minimal, hook-based state management.

src/services/api.js

const API_BASE_URL = 'http://<your-local-ip>:8000'; // Replace with your machine's IP

export const api = {
  startProvisioning: async (params) => {
    try {
      const response = await fetch(`${API_BASE_URL}/provision`, {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
        },
        body: JSON.stringify(params),
      });
      if (response.status !== 202) {
        throw new Error('Failed to start provisioning process.');
      }
      return response.json(); // Returns { sagaId: "..." }
    } catch (error) {
      console.error('API Error startProvisioning:', error);
      throw error;
    }
  },

  getProvisioningStatus: async (sagaId) => {
    try {
      const response = await fetch(`${API_BASE_URL}/provision/status/${sagaId}`);
      if (!response.ok) {
        throw new Error('Failed to fetch provisioning status.');
      }
      return response.json();
    } catch (error) {
      console.error('API Error getProvisioningStatus:', error);
      throw error;
    }
  },
};

src/store/provisionStore.js

import create from 'zustand';
import { api } from '../services/api';

const useProvisionStore = create((set, get) => ({
  sagaId: null,
  status: 'IDLE', // IDLE, PENDING, POLLING, COMPLETED, ROLLED_BACK
  history: [],
  error: null,
  pollingInterval: null,

  startProvisioning: async () => {
    if (get().status === 'PENDING' || get().status === 'POLLING') return;

    set({ status: 'PENDING', error: null, history: [], sagaId: null });
    try {
      const { sagaId } = await api.startProvisioning({ ram: '8GB' });
      set({ sagaId, status: 'POLLING' });
      get().pollStatus(); // Start polling immediately
    } catch (e) {
      set({ status: 'IDLE', error: e.message });
    }
  },

  pollStatus: () => {
    const pollingInterval = setInterval(async () => {
      const { sagaId, status } = get();
      if (!sagaId || (status !== 'POLLING' && status !== 'PENDING')) {
        clearInterval(pollingInterval);
        set({ pollingInterval: null });
        return;
      }
      try {
        const data = await api.getProvisioningStatus(sagaId);
        set({ history: data.history || [] });

        if (data.status === 'COMPLETED' || data.status === 'ROLLED_BACK') {
          set({ status: data.status });
          clearInterval(pollingInterval);
          set({ pollingInterval: null });
        }
      } catch (e) {
        set({ error: e.message });
        // Optional: Implement a backoff strategy or stop polling after N failures.
      }
    }, 3000); // Poll every 3 seconds

    set({ pollingInterval });
  },

  reset: () => {
    const { pollingInterval } = get();
    if (pollingInterval) {
      clearInterval(pollingInterval);
    }
    set({
      sagaId: null,
      status: 'IDLE',
      history: [],
      error: null,
      pollingInterval: null,
    });
  },
}));

export default useProvisionStore;

src/components/Provisioner.js

import React from 'react';
import { View, Text, Button, StyleSheet, ActivityIndicator, ScrollView } from 'react-native';
import useProvisionStore from '../store/provisionStore';

const Provisioner = () => {
  const { status, history, error, startProvisioning, reset } = useProvisionStore();

  const isWorking = status === 'PENDING' || status === 'POLLING';

  const getStatusColor = (status) => {
    if (status.includes('succeeded') || status.includes('CREATED') || status === 'COMPLETED') return '#2ecc71';
    if (status.includes('failed') || status === 'ROLLED_BACK') return '#e74c3c';
    if (status.includes('COMPENSATING')) return '#f39c12';
    return '#3498db';
  };

  return (
    <View style={styles.container}>
      <Text style={styles.title}>Resource Provisioner</Text>
      <View style={styles.buttonContainer}>
        <Button
          title={isWorking ? 'Provisioning...' : 'Start Provisioning'}
          onPress={startProvisioning}
          disabled={isWorking}
        />
        <Button title="Reset" onPress={reset} color="#95a5a6" />
      </View>

      {isWorking && <ActivityIndicator size="large" style={styles.loader} />}

      <Text style={[styles.statusText, { color: getStatusColor(status) }]}>
        Overall Status: {status}
      </Text>

      {error && <Text style={styles.errorText}>Error: {error}</Text>}

      <ScrollView style={styles.historyContainer}>
        {history.map((entry, index) => (
          <View key={index} style={styles.historyItem}>
            <Text style={[styles.historyStatus, { color: getStatusColor(entry.status) }]}>
              {entry.status}
            </Text>
            {entry.reason && <Text style={styles.historyReason}>Reason: {entry.reason}</Text>}
          </View>
        ))}
      </ScrollView>
    </View>
  );
};

// Styles remain largely unchanged from a standard React Native setup
const styles = StyleSheet.create({
  container: { flex: 1, padding: 20, backgroundColor: '#fff' },
  title: { fontSize: 24, fontWeight: 'bold', textAlign: 'center', marginBottom: 20 },
  buttonContainer: { flexDirection: 'row', justifyContent: 'space-around', marginBottom: 20 },
  loader: { marginVertical: 15 },
  statusText: { fontSize: 18, fontWeight: 'bold', textAlign: 'center', marginBottom: 20 },
  errorText: { color: '#e74c3c', textAlign: 'center', marginBottom: 10 },
  historyContainer: { flex: 1, borderWidth: 1, borderColor: '#ddd', borderRadius: 5, padding: 10 },
  historyItem: { borderBottomWidth: 1, borderBottomColor: '#eee', paddingVertical: 8 },
  historyStatus: { fontSize: 16, fontWeight: '500' },
  historyReason: { fontSize: 14, color: '#7f8c8d', marginTop: 4 },
});

export default Provisioner;

This implementation achieves complete decoupling. The client knows nothing of the VM or storage services, only the gateway. The backend services operate independently, reacting to a stream of events. The saga’s state is centrally tracked and persisted, ensuring that even if the gateway restarts, it could theoretically resume monitoring ongoing sagas (with additional implementation).
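
A rough sketch of what such a recovery pass could look like on gateway startup, assuming the saga: key prefix used by SagaManager; the policy shown here (log and surface in-flight sagas rather than automatically retrying them) is an illustrative choice, not part of the code above:

from gateway.saga_manager import SagaManager
from shared.event_bus import logging

async def recover_incomplete_sagas(manager: SagaManager):
    """Scan persisted sagas after a restart and flag any that never reached a terminal state."""
    async for key in manager.redis_client.scan_iter(match="saga:*"):
        state = await manager.redis_client.hgetall(key)
        if state.get("status") in ("COMPLETED", "ROLLED_BACK"):
            continue
        logging.warning(
            f"Saga {state.get('id')} was left in status {state.get('status')}; "
            "re-publish its last command, compensate, or escalate for manual review."
        )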

Boundaries and Next Steps

The presented solution is a functional skeleton that demonstrates the choreography-based saga pattern. In a production environment, several aspects require significant hardening. The client’s polling mechanism is inefficient and should be replaced with WebSockets or Server-Sent Events for real-time updates. The saga manager’s use of Redis for state is pragmatic for a prototype, but a relational database with transactional guarantees (e.g., updating the state and publishing the next event within a single DB transaction) would prevent state inconsistencies if the gateway crashes at a critical moment.
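
As one example of the first point, Sanic supports WebSocket routes, so the gateway could push saga state instead of being polled. A minimal sketch, assuming it is added to gateway/app.py and reuses the app and saga_manager objects defined there; the one-second check interval is arbitrary, and a production version would subscribe to saga events rather than re-reading Redis in a loop:

import asyncio
import json

@app.websocket("/provision/ws/<saga_id:str>")
async def provision_updates(request, ws, saga_id):
    """Push the saga's status to the client until it reaches a terminal state."""
    last_status = None
    while True:
        state = await saga_manager.get_saga_state(saga_id)
        if state is None:
            await ws.send(json.dumps({"error": "Saga not found"}))
            return
        if state["status"] != last_status:
            last_status = state["status"]
            await ws.send(json.dumps({"status": last_status, "history": state["history"]}))
        if last_status in ("COMPLETED", "ROLLED_BACK"):
            return
        await asyncio.sleep(1)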

Furthermore, choreography can become difficult to debug and visualize as the number of steps and services grows. The “what-triggers-what” logic is spread across the system. For more complex workflows, an Orchestration-based Saga, where a central orchestrator class explicitly calls services and manages the state machine, can offer better maintainability and observability, albeit at the cost of tighter coupling to the orchestrator. The choice between choreography and orchestration is a critical architectural trade-off, balancing service autonomy against process visibility.
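
For contrast, the core of an orchestration-based saga can be captured in a few lines: a single class owns the step order and runs compensations in reverse when a step fails. This is a conceptual sketch; the action and compensation callables are placeholders, not the services from this article:

from typing import Awaitable, Callable, Dict, List, Tuple

Step = Tuple[Callable[[Dict], Awaitable[None]], Callable[[Dict], Awaitable[None]]]

class SagaOrchestrator:
    def __init__(self, steps: List[Step]):
        # Each step is an (action, compensation) pair, in execution order.
        self.steps = steps

    async def run(self, context: Dict) -> Dict:
        completed: List[Callable[[Dict], Awaitable[None]]] = []
        for action, compensation in self.steps:
            try:
                await action(context)
                completed.append(compensation)
            except Exception:
                # Roll back every completed step in reverse order.
                for compensate in reversed(completed):
                    await compensate(context)
                context["status"] = "ROLLED_BACK"
                return context
        context["status"] = "COMPLETED"
        return context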

