Deploying a stateful WebSocket service on a container orchestrator like Docker Swarm presents an immediate and critical architectural conflict. The orchestrator’s primary strength is treating containers as ephemeral, cattle-not-pets entities, capable of being destroyed and recreated at any moment. A WebSocket connection, however, is inherently stateful; its existence is tied to the memory and lifecycle of a specific process within a single container. Scaling such a service by simply increasing the replica count creates a cluster where each instance is an isolated island, unaware of clients connected to its peers. A container restart, a routine event in an orchestrated environment, severs all its active connections, effectively destroying user sessions. This isn’t a minor inconvenience; it’s a fundamental failure of service reliability.
The initial, naive approach is to containerize a standard Swift Vapor WebSocket server and deploy it as a scaled service.
// A non-production, fundamentally flawed starting point.
import Vapor

func routes(_ app: Application) throws {
    var connections: [UUID: WebSocket] = [:]

    app.webSocket("api", "v1", "channel") { req, ws in
        let id = UUID()
        connections[id] = ws
        print("Client \(id) connected.")

        ws.onText { ws, text in
            // Broadcast to all other connections on THIS instance only.
            for (peerId, peerWs) in connections where peerId != id {
                peerWs.send(text)
            }
        }

        ws.onClose.whenComplete { _ in
            connections.removeValue(forKey: id)
            print("Client \(id) disconnected.")
        }
    }
}
Deploying this with `docker service scale myservice=3` instantly creates a broken system. A client connected to replica A cannot communicate with a client connected to replica B. Worse, when replica A is terminated by Swarm during a node drain or a deployment update, its clients are unceremoniously disconnected. The application state, the `connections` dictionary, is lost forever. Common workarounds like sticky sessions are difficult to implement reliably with Swarm's built-in ingress routing mesh and introduce single points of failure at the instance level. The only viable path forward is to completely decouple session state from the application instances themselves.
This requires an external, shared state and messaging layer. For this role, Redis is a pragmatic choice. Its high-performance in-memory data structures can track client presence, and its Publish/Subscribe mechanism provides the real-time messaging bus needed to coordinate communication across all service replicas. The architecture shifts from instance-local state to a shared-nothing application tier that relies on Redis as the source of truth.
Our technical stack becomes:
- Swift (Vapor/NIO): For the high-performance, type-safe WebSocket server logic.
- Redis: As both a presence database (using Sets/Hashes) and a cross-instance message bus (using Pub/Sub).
- Docker Swarm: For container orchestration, service discovery, and automated recovery of failed instances.
The core principle is this: no Swift instance holds authoritative state about any connection other than the raw WebSocket channel itself. All meaningful session information and message routing decisions are delegated to Redis.
Architecting the Resilient Service
The refined architecture involves each Swift container instance performing several key functions upon a new client connection:
- Instance Identification: Each container instance must have a unique identity. We can generate a UUID at startup for this.
- Client Registration: When a client connects via WebSocket, it is assigned a unique client ID. The instance then registers this client in Redis, mapping the `clientID` to its own `instanceID`. A Redis Hash is suitable for this: `HSET client_sessions client_id_1 instance_id_A` (the naming scheme is collected in the sketch after this list).
- Subscription Management: Each instance subscribes to two Redis Pub/Sub channels:
  - A global `broadcast` channel for messages intended for all clients.
  - An instance-specific channel, e.g. `messages:instance_id_A`, for targeted messages.
- Message Forwarding: When instance A needs to send a message to client B, it first looks up client B's location in the `client_sessions` hash. If client B is managed by instance C, instance A publishes the message to the `messages:instance_id_C` channel. Instance C, being subscribed to this channel, receives the message and forwards it to the client through the active WebSocket connection it holds.
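Before diving into the implementation, it helps to pin the naming scheme down in one place. The enum below is purely illustrative (the implementation later in this article simply inlines these strings):

import Foundation

// Hypothetical constants for the Redis naming scheme described above; not used verbatim below.
enum RedisSchema {
    /// Hash mapping clientID -> instanceID, e.g. HSET client_sessions <clientID> <instanceID>.
    static let sessionsHash = "client_sessions"

    /// Global channel every instance subscribes to for messages addressed to all clients.
    static let broadcastChannel = "broadcast"

    /// Instance-specific channel for targeted messages, e.g. "messages:<instanceID>".
    static func instanceChannel(for instanceId: UUID) -> String {
        "messages:\(instanceId.uuidString)"
    }
}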
This flow is visualized below.
graph TD
    subgraph Docker Swarm Cluster
        subgraph Node 1
            C1["Swift Instance A<br/>instance_id_A"]
        end
        subgraph Node 2
            C2["Swift Instance B<br/>instance_id_B"]
        end
        subgraph Node 3
            R[Redis Server]
        end
    end
    ClientA --> Ingress
    ClientB --> Ingress
    Ingress -- "routes to" --> C1
    Ingress -- "routes to" --> C2
    C1 -- "ws.onConnect: HSET client_sessions client_A instance_A" --> R
    C2 -- "ws.onConnect: HSET client_sessions client_B instance_B" --> R
    ClientA -- "Msg for ClientB" --> C1
    C1 -- "HGET client_sessions client_B" --> R
    R -- "instance_B" --> C1
    C1 -- "PUBLISH messages:instance_B" --> R
    R -- "Pub/Sub" --> C2
    C2 -- "ws.send()" --> ClientB
    style R fill:#f9f,stroke:#333,stroke-width:2px
Now, let’s implement this robust architecture.
Core Implementation in Swift
First, we set up the Vapor project and its dependencies. The `Package.swift` must include Vapor and a Redis client; here we use Vapor's Redis package, which is built on RediStack.
Package.swift
// swift-tools-version:5.7
import PackageDescription

let package = Package(
    name: "ResilientSockets",
    platforms: [
        .macOS(.v12)
    ],
    dependencies: [
        .package(url: "https://github.com/vapor/vapor.git", from: "4.0.0"),
        .package(url: "https://github.com/vapor/redis.git", from: "4.0.0"),
    ],
    targets: [
        .executableTarget(
            name: "App",
            dependencies: [
                .product(name: "Vapor", package: "vapor"),
                .product(name: "Redis", package: "redis"),
            ]
        ),
        .testTarget(name: "AppTests", dependencies: [
            .target(name: "App"),
            .product(name: "XCTVapor", package: "vapor"),
        ]),
    ]
)
The application needs a singleton service to manage connections and interact with Redis. This `ConnectionManager` will encapsulate all of the Redis logic.
Sources/App/Services/ConnectionManager.swift
import Vapor
import Redis
import NIOConcurrencyHelpers

final class ConnectionManager {
    // Unique ID for this specific running container instance.
    let instanceId: UUID

    private let redis: RedisClient
    private let logger: Logger

    // In-memory mapping of local connections for this instance only, guarded by a lock
    // because WebSocket callbacks can fire on different event loops.
    private let lock: NIOLockedValueBox<[UUID: WebSocket]>

    init(redis: RedisClient, logger: Logger) {
        self.instanceId = UUID()
        self.redis = redis
        self.logger = logger
        self.lock = .init([:])
        logger.info("ConnectionManager initialized for instance \(self.instanceId.uuidString)")
    }
    func handleConnection(ws: WebSocket) {
        let clientId = UUID()
        self.lock.withLockedValue { $0[clientId] = ws }
        logger.info("Client \(clientId) connected to instance \(self.instanceId)")

        // Register client presence in Redis: client_sessions maps clientID -> instanceID.
        _ = self.redis.hset(clientId.uuidString, to: self.instanceId.uuidString, in: "client_sessions").always { result in
            switch result {
            case .success:
                self.logger.info("Successfully registered client \(clientId) to instance \(self.instanceId)")
            case .failure(let error):
                self.logger.error("Failed to register client \(clientId): \(error.localizedDescription)")
                ws.close(code: .unexpectedServerError, promise: nil)
            }
        }

        ws.onText { [weak self] ws, text in
            self?.handleIncomingMessage(from: clientId, text: text)
        }

        ws.onClose.whenComplete { [weak self] _ in
            self?.handleDisconnection(clientId: clientId)
        }
    }
    private func handleDisconnection(clientId: UUID) {
        logger.info("Client \(clientId) disconnected from instance \(self.instanceId)")
        _ = self.lock.withLockedValue { $0.removeValue(forKey: clientId) }

        // Remove client presence from Redis.
        _ = self.redis.hdel(clientId.uuidString, from: "client_sessions").always { result in
            switch result {
            case .success(let count) where count > 0:
                self.logger.info("Successfully deregistered client \(clientId)")
            case .success:
                self.logger.warning("Attempted to deregister client \(clientId) but it was not found in Redis.")
            case .failure(let error):
                self.logger.error("Failed to deregister client \(clientId): \(error.localizedDescription)")
            }
        }
    }
    private func handleIncomingMessage(from senderId: UUID, text: String) {
        // In a real application, you'd decode this to a proper model.
        // For this example, we assume a simple "targetClientId:message" format.
        let components = text.split(separator: ":", maxSplits: 1)
        guard components.count == 2 else {
            logger.warning("Invalid message format from \(senderId): \(text)")
            return
        }
        guard let targetId = UUID(uuidString: String(components[0])) else {
            logger.warning("Invalid target UUID from \(senderId): \(components[0])")
            return
        }
        let message = String(components[1])
        let payload = "Message from \(senderId): \(message)"

        // Find which instance currently holds the target client.
        self.redis.hget(targetId.uuidString, from: "client_sessions", as: String.self).whenSuccess { instanceIdString in
            guard let instanceIdString = instanceIdString, let targetInstanceId = UUID(uuidString: instanceIdString) else {
                self.logger.warning("Target client \(targetId) not found in any instance.")
                // Optionally, send a "user not found" message back to the sender.
                return
            }

            if targetInstanceId == self.instanceId {
                // Target is on the same instance, send directly.
                self.logger.info("Sending direct message from \(senderId) to \(targetId) on instance \(self.instanceId)")
                if let targetWs = self.lock.withLockedValue({ $0[targetId] }) {
                    targetWs.send(payload)
                }
            } else {
                // Target is on another instance. Publish in the same "targetClientId:message"
                // envelope the instance-channel handler expects, so it can route the payload
                // to the correct local WebSocket.
                self.logger.info("Forwarding message from \(senderId) to \(targetId) via Redis channel messages:\(targetInstanceId)")
                _ = self.redis.publish("\(targetId.uuidString):\(payload)", to: RedisChannelName("messages:\(targetInstanceId.uuidString)"))
            }
        }
    }
    // This must be called at application startup.
    func subscribeToInstanceChannel() {
        let channelName = RedisChannelName("messages:\(self.instanceId.uuidString)")
        _ = self.redis.subscribe(to: channelName) { [weak self] channel, message in
            guard let self = self, channel == channelName, let rawText = message.string else { return }

            // The publisher wraps every payload as "targetClientId:actualMessage" so this
            // instance can route it to the right local WebSocket.
            let components = rawText.split(separator: ":", maxSplits: 1)
            guard components.count == 2, let targetId = UUID(uuidString: String(components[0])) else {
                self.logger.error("Received malformed message on instance channel: \(rawText)")
                return
            }
            let actualMessage = String(components[1])

            if let targetWs = self.lock.withLockedValue({ $0[targetId] }) {
                self.logger.info("Received message for client \(targetId) via Redis. Forwarding...")
                targetWs.send(actualMessage)
            } else {
                // Race condition: the client disconnected after the message was published
                // but before it arrived here.
                self.logger.warning("Received message for client \(targetId) but they are no longer connected to this instance.")
            }
        }
        logger.info("Subscribed to instance-specific Redis channel: \(channelName)")
    }
}
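The architecture also calls for a global broadcast channel, which the manager above never subscribes to. A minimal sketch of such a method, added to ConnectionManager next to subscribeToInstanceChannel() and invoked from the same lifecycle hook shown below, might look like this (the channel name `broadcast` follows the convention described earlier and is an assumption of this sketch):

// Sketch: fan out anything published on the global "broadcast" channel
// to every WebSocket currently held by this instance.
func subscribeToBroadcastChannel() {
    let channel = RedisChannelName("broadcast")
    _ = self.redis.subscribe(to: channel) { [weak self] _, message in
        guard let self = self, let text = message.string else { return }

        // Snapshot the local connections under the lock, then send outside of it.
        let sockets = self.lock.withLockedValue { Array($0.values) }
        self.logger.info("Broadcasting message to \(sockets.count) local client(s)")
        for ws in sockets {
            ws.send(text)
        }
    }
    logger.info("Subscribed to global Redis channel: \(channel)")
}

Wiring it up would be a one-line addition to the InstanceSubscription lifecycle handler introduced below.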
This manager needs to be configured as a service and started when the application boots.
Sources/App/configure.swift
import Vapor
import Redis
public func configure(_ app: Application) throws {
    // Configure Redis from environment variables for production readiness.
    let redisHost = Environment.get("REDIS_HOSTNAME") ?? "127.0.0.1"
    app.redis.configuration = try RedisConfiguration(hostname: redisHost)

    // Set up our ConnectionManager as a singleton service.
    let connectionManager = ConnectionManager(redis: app.redis, logger: app.logger)
    app.storage[ConnectionManagerKey.self] = connectionManager

    // Start listening on the instance-specific Redis channel.
    // This should happen after the application has fully initialized.
    app.lifecycle.use(InstanceSubscription(manager: connectionManager))

    try routes(app)
}

// A helper key for storing the service in Vapor's storage.
// Note: this must be visible to routes.swift, so it cannot be private.
struct ConnectionManagerKey: StorageKey {
    typealias Value = ConnectionManager
}

// A LifecycleHandler to ensure subscription happens at the right time.
private struct InstanceSubscription: LifecycleHandler {
    let manager: ConnectionManager

    func didBoot(_ application: Application) throws {
        manager.subscribeToInstanceChannel()
    }
}
Finally, the routes file becomes a simple entry point that delegates to the manager.
Sources/App/routes.swift
import Vapor
func routes(_ app: Application) throws {
    app.get { req async in
        "Server is running."
    }

    app.webSocket("api", "v1", "channel") { req, ws in
        guard let manager = req.application.storage[ConnectionManagerKey.self] else {
            req.logger.critical("ConnectionManager not configured. Closing WebSocket.")
            _ = ws.close(code: .policyViolation)
            return
        }
        manager.handleConnection(ws: ws)
    }
}
A common mistake here is to perform the Redis subscription inside the `init` of the `ConnectionManager`. This can lead to race conditions where the application isn't fully ready to handle incoming messages. Using a `LifecycleHandler` ensures the subscription is activated only after the application has successfully booted.
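For quick manual testing during development, a throwaway Swift command-line client built on WebSocketKit (already a transitive dependency of Vapor) is enough. This is a sketch: the host, port, and target client ID are placeholders, and the message follows the "targetClientId:message" convention the server expects.

import Foundation
import NIO
import WebSocketKit

let group = MultiThreadedEventLoopGroup(numberOfThreads: 1)
defer { try? group.syncShutdownGracefully() }

// Placeholder: the UUID of the client you want to reach, as logged by the server.
let targetClientId = "REPLACE-WITH-TARGET-CLIENT-UUID"

try WebSocket.connect(to: "ws://127.0.0.1:8080/api/v1/channel", on: group) { ws in
    // Print whatever the server pushes to us, whether direct or relayed via Redis.
    ws.onText { _, text in
        print("Received: \(text)")
    }
    // Send a message in the "targetClientId:message" format the server expects.
    ws.send("\(targetClientId):hello from the test client")
}.wait()

// Keep the process alive long enough to observe a reply.
Thread.sleep(forTimeInterval: 10)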
Containerization and Orchestration
With the application logic prepared, the next step is to containerize it for deployment. A multi-stage Dockerfile is essential for production builds to keep the final image slim and secure.
Dockerfile
# ---- Builder Stage ----
FROM swift:5.7-focal AS builder
WORKDIR /app
# Copy package manifests
COPY ./Package.* ./
# Resolve dependencies
RUN swift package resolve
# Copy source code
COPY ./Sources ./Sources
COPY ./Public ./Public
COPY ./Tests ./Tests
# Build for release
RUN swift build -c release --static-swift-stdlib
# ---- Runtime Stage ----
FROM swift:5.7-focal-slim
WORKDIR /app
# Copy the compiled binary from the builder stage
COPY --from=builder /app/.build/release/App .
# Copy any required runtime assets if necessary
# COPY --from=builder /app/Public ./Public
# Expose the port the server will run on
EXPOSE 8080
# Command to run the application
ENTRYPOINT ["./App"]
CMD ["serve", "--env", "production", "--hostname", "0.0.0.0", "--port", "8080"]
The orchestration is defined in a Docker Compose file, which Docker Swarm uses as a stack definition. This file defines our Swift application service and the Redis service it depends on.
docker-compose.yml
version: '3.8'

services:
  app:
    image: my-resilient-sockets-app:latest # Replace with your image name
    networks:
      - app-net
    ports:
      - "8080:8080"
    environment:
      # This tells our Swift app where to find Redis.
      # 'redis' is the service name, which Docker's DNS resolves.
      - REDIS_HOSTNAME=redis
    deploy:
      replicas: 3
      restart_policy:
        condition: on-failure
      update_config:
        parallelism: 1
        delay: 10s
        order: stop-first

  redis:
    image: redis:6-alpine
    networks:
      - app-net
    deploy:
      placement:
        constraints:
          - node.role == manager # Place Redis on a manager node for stability

networks:
  app-net:
    driver: overlay
To deploy this stack to a Swarm cluster, the command is simple: `docker stack deploy -c docker-compose.yml resilient-app`.
The `deploy` key is critical here. It tells Swarm to run 3 replicas of our `app` service. The `restart_policy` ensures that if a container crashes, Swarm will automatically reschedule it. The `update_config` with `stop-first` ensures rolling updates happen gracefully, one container at a time.
Validating Resilience
To test this architecture, one must simulate a failure.
- Connect two separate WebSocket clients (e.g., using a simple HTML/JS page or a command-line tool).
- Use `docker service ps resilient-app_app` to see which Swarm nodes the replicas are running on. The two clients will likely be connected to different instances.
- Let's say Client A is connected to instance `app.1` and Client B to `app.2`.
- Send a message from Client A targeted to Client B. It should be received successfully, having traversed the Redis Pub/Sub bus.
- Now, find the container ID for `app.2` using `docker ps` on the correct node and forcibly stop it: `docker container stop <container_id_for_app.2>`.
- Client B will experience a disconnection. A well-designed client should have automatic reconnection logic.
- Within seconds, Docker Swarm will detect that `app.2` has failed and will start a replacement container to satisfy the `replicas: 3` constraint.
- Client B's reconnection attempt will be routed by the Swarm ingress mesh to one of the healthy instances, which could be the original `app.1` or the freshly started replacement task.
- Upon reconnection, the new instance holds a brand-new WebSocket connection for Client B, but it registers the session in Redis under the same `clientID`, assuming the client supplies its previous ID when it reconnects (the code above always mints a fresh ID; a resumption-aware variant is sketched after this list). Because the session's existence is tracked in Redis, not in the terminated container's memory, the system as a whole considers the client to still be "present." The state has survived the failure.
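The sample code always mints a fresh clientId, so true session resumption needs the client to hand its previous ID back. One possible variant, assuming the client appends a `clientId` query parameter when it reconnects, is sketched below; routes.swift would then call `manager.handleConnection(req: req, ws: ws)` instead.

// Sketch: resumption-aware variant of handleConnection. The "clientId" query
// parameter is an assumption of this sketch, not part of the implementation above.
func handleConnection(req: Request, ws: WebSocket) {
    // Reuse the caller-supplied ID if it parses as a UUID; otherwise mint a new one.
    let clientId = req.query[UUID.self, at: "clientId"] ?? UUID()
    self.lock.withLockedValue { $0[clientId] = ws }
    logger.info("Client \(clientId) (re)connected to instance \(self.instanceId)")

    // Re-registering overwrites any stale mapping left by a terminated instance,
    // so the session now points at this instance.
    _ = self.redis.hset(clientId.uuidString, to: self.instanceId.uuidString, in: "client_sessions")

    ws.onText { [weak self] _, text in
        self?.handleIncomingMessage(from: clientId, text: text)
    }
    ws.onClose.whenComplete { [weak self] _ in
        self?.handleDisconnection(clientId: clientId)
    }
}

One subtlety to watch: the old instance's handleDisconnection can fire after the client has already re-registered elsewhere, so a production version would make the HDEL conditional on the stored instance ID still being its own (for example via a small Lua script).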
The current implementation uses a Redis Hash for presence, which is sufficient but has limitations. The Redis Pub/Sub mechanism is “fire-and-forget”; if a message is published to an instance’s channel while that instance is briefly down or restarting, the message is lost. For applications requiring guaranteed message delivery, this architecture would need to be evolved to use a more durable messaging system like Redis Streams or a dedicated message broker like RabbitMQ. Furthermore, the single Redis instance is itself a single point of failure. A production deployment would necessitate a high-availability Redis setup using Sentinel for failover or Redis Cluster for sharding and redundancy, which significantly increases operational complexity. The design’s boundary is its reliance on at-most-once message delivery.
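To illustrate the direction such an evolution could take, the sketch below swaps the PUBLISH in handleIncomingMessage for an append to a per-instance Redis Stream, going through RediStack's generic send(command:with:) escape hatch; the stream key naming is an assumption of this sketch. A consuming instance would then read with XREAD or XREADGROUP and acknowledge entries, giving at-least-once rather than at-most-once delivery.

// Sketch only: durable hand-off via a Redis Stream instead of fire-and-forget Pub/Sub.
// The receiving instance would XREAD/XREADGROUP from its own stream and XACK entries.
func appendToInstanceStream(targetInstanceId: UUID, targetClientId: UUID, payload: String) -> EventLoopFuture<RESPValue> {
    let streamKey = "stream:messages:\(targetInstanceId.uuidString)"  // assumed naming
    return redis.send(command: "XADD", with: [
        .init(bulk: streamKey),
        .init(bulk: "*"),                      // let Redis assign the entry ID
        .init(bulk: "target"), .init(bulk: targetClientId.uuidString),
        .init(bulk: "payload"), .init(bulk: payload),
    ])
}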