Deploying a stateful WebSocket service on a container orchestrator like Docker Swarm presents an immediate and critical architectural conflict. The orchestrator’s primary strength is treating containers as ephemeral, cattle-not-pets entities, capable of being destroyed and recreated at any moment. A WebSocket connection, however, is inherently stateful; its existence is tied to the memory and lifecycle of a specific process within a single container. Scaling such a service by simply increasing the replica count creates a cluster where each instance is an isolated island, unaware of clients connected to its peers. A container restart, a routine event in an orchestrated environment, severs all its active connections, effectively destroying user sessions. This isn’t a minor inconvenience; it’s a fundamental failure of service reliability.
The initial, naive approach is to containerize a standard Swift Vapor WebSocket server and deploy it as a scaled service.
// A non-production, fundamentally flawed starting point.
import Vapor

func routes(_ app: Application) throws {
    var connections: [UUID: WebSocket] = [:]

    app.webSocket("api", "v1", "channel") { req, ws in
        let id = UUID()
        connections[id] = ws
        print("Client \(id) connected.")

        ws.onText { ws, text in
            // Broadcast to all other connections on THIS instance only.
            for (peerId, peerWs) in connections where peerId != id {
                peerWs.send(text)
            }
        }

        ws.onClose.whenComplete { _ in
            connections.removeValue(forKey: id)
            print("Client \(id) disconnected.")
        }
    }
}
Deploying this with `docker service scale myservice=3` instantly creates a broken system. A client connected to replica A cannot communicate with a client connected to replica B. Worse, when replica A is terminated by Swarm during a node drain or a deployment update, its clients are unceremoniously disconnected. The application state, the `connections` dictionary, is lost forever. Common workarounds like sticky sessions are difficult to implement reliably with Swarm's built-in ingress routing mesh and introduce single points of failure at the instance level. The only viable path forward is to completely decouple session state from the application instances themselves.
This requires an external, shared state and messaging layer. For this role, Redis is a pragmatic choice. Its high-performance in-memory data structures can track client presence, and its Publish/Subscribe mechanism provides the real-time messaging bus needed to coordinate communication across all service replicas. The architecture shifts from instance-local state to a shared-nothing application tier that relies on Redis as the source of truth.
Our technical stack becomes:
- Swift (Vapor/NIO): For the high-performance, type-safe WebSocket server logic.
- Redis: As both a presence database (using Sets/Hashes) and a cross-instance message bus (using Pub/Sub).
- Docker Swarm: For container orchestration, service discovery, and automated recovery of failed instances.
The core principle is this: no Swift instance holds authoritative state about any connection other than the raw WebSocket channel itself. All meaningful session information and message routing decisions are delegated to Redis.
Architecting the Resilient Service
The refined architecture involves each Swift container instance performing several key functions upon a new client connection:
- Instance Identification: Each container instance must have a unique identity. We can generate a UUID at startup for this.
- Client Registration: When a client connects via WebSocket, it is assigned a unique client ID. The instance then registers this client in Redis, mapping the `clientID` to its own `instanceID`. A Redis Hash is suitable for this: `HSET client_sessions client_id_1 instance_id_A` (the naming scheme is collected in the sketch after this list).
- Subscription Management: Each instance subscribes to two Redis Pub/Sub channels:
  - A global `broadcast` channel for messages intended for all clients.
  - An instance-specific channel, e.g. `messages:instance_id_A`, for targeted messages.
- Message Forwarding: When instance A needs to send a message to client B, it first looks up client B's location in the `client_sessions` hash. If client B is managed by instance C, instance A publishes the message to the `messages:instance_id_C` channel. Instance C, being subscribed to this channel, receives the message and forwards it to the client through the active WebSocket connection it holds.
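Before diving into the implementation, it helps to pin the naming scheme down in one place. The enum below is purely illustrative (the implementation later in this article simply inlines these strings):

import Foundation

// Hypothetical constants for the Redis naming scheme described above; not used verbatim below.
enum RedisSchema {
    /// Hash mapping clientID -> instanceID, e.g. HSET client_sessions <clientID> <instanceID>.
    static let sessionsHash = "client_sessions"

    /// Global channel every instance subscribes to for messages addressed to all clients.
    static let broadcastChannel = "broadcast"

    /// Instance-specific channel for targeted messages, e.g. "messages:<instanceID>".
    static func instanceChannel(for instanceId: UUID) -> String {
        "messages:\(instanceId.uuidString)"
    }
}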
This flow is visualized below.
graph TD
    subgraph Docker Swarm Cluster
        subgraph Node 1
            C1["Swift Instance A<br/>instance_id_A"]
        end
        subgraph Node 2
            C2["Swift Instance B<br/>instance_id_B"]
        end
        subgraph Node 3
            R[Redis Server]
        end
    end
    ClientA --> Ingress
    ClientB --> Ingress
    Ingress -- "routes to" --> C1
    Ingress -- "routes to" --> C2
    C1 -- "ws.onConnect: HSET client_sessions client_A instance_A" --> R
    C2 -- "ws.onConnect: HSET client_sessions client_B instance_B" --> R
    ClientA -- "Msg for ClientB" --> C1
    C1 -- "HGET client_sessions client_B" --> R
    R -- "instance_B" --> C1
    C1 -- "PUBLISH messages:instance_B" --> R
    R -- "Pub/Sub" --> C2
    C2 -- "ws.send()" --> ClientB
    style R fill:#f9f,stroke:#333,stroke-width:2px
Now, let’s implement this robust architecture.
Core Implementation in Swift
First, we set up the Vapor project and its dependencies. The `Package.swift` must include Vapor and a Redis client; here we use Vapor's Redis package, which is built on RediStack.
Package.swift
// swift-tools-version:5.7
import PackageDescription

let package = Package(
    name: "ResilientSockets",
    platforms: [
        .macOS(.v12)
    ],
    dependencies: [
        .package(url: "https://github.com/vapor/vapor.git", from: "4.0.0"),
        .package(url: "https://github.com/vapor/redis.git", from: "4.0.0"),
    ],
    targets: [
        .executableTarget(
            name: "App",
            dependencies: [
                .product(name: "Vapor", package: "vapor"),
                .product(name: "Redis", package: "redis"),
            ]
        ),
        .testTarget(name: "AppTests", dependencies: [
            .target(name: "App"),
            .product(name: "XCTVapor", package: "vapor"),
        ]),
    ]
)
The application needs a singleton service to manage connections and interact with Redis. This `ConnectionManager` will encapsulate all of the Redis logic.
Sources/App/Services/ConnectionManager.swift
import Vapor
import Redis
import NIOConcurrencyHelpers

final class ConnectionManager {
    // Unique ID for this specific running container instance.
    let instanceId: UUID

    private let redis: RedisClient
    private let logger: Logger

    // In-memory mapping of local connections for this instance only, guarded by a lock
    // because WebSocket callbacks can fire on different event loops.
    private let lock: NIOLockedValueBox<[UUID: WebSocket]>

    init(redis: RedisClient, logger: Logger) {
        self.instanceId = UUID()
        self.redis = redis
        self.logger = logger
        self.lock = .init([:])
        logger.info("ConnectionManager initialized for instance \(self.instanceId.uuidString)")
    }
    func handleConnection(ws: WebSocket) {
        let clientId = UUID()
        self.lock.withLockedValue { $0[clientId] = ws }
        logger.info("Client \(clientId) connected to instance \(self.instanceId)")

        // Register client presence in Redis: client_sessions maps clientID -> instanceID.
        _ = self.redis.hset(clientId.uuidString, to: self.instanceId.uuidString, in: "client_sessions").always { result in
            switch result {
            case .success:
                self.logger.info("Successfully registered client \(clientId) to instance \(self.instanceId)")
            case .failure(let error):
                self.logger.error("Failed to register client \(clientId): \(error.localizedDescription)")
                ws.close(code: .unexpectedServerError, promise: nil)
            }
        }

        ws.onText { [weak self] ws, text in
            self?.handleIncomingMessage(from: clientId, text: text)
        }

        ws.onClose.whenComplete { [weak self] _ in
            self?.handleDisconnection(clientId: clientId)
        }
    }
    private func handleDisconnection(clientId: UUID) {
        logger.info("Client \(clientId) disconnected from instance \(self.instanceId)")
        _ = self.lock.withLockedValue { $0.removeValue(forKey: clientId) }

        // Remove client presence from Redis.
        _ = self.redis.hdel(clientId.uuidString, from: "client_sessions").always { result in
            switch result {
            case .success(let count) where count > 0:
                self.logger.info("Successfully deregistered client \(clientId)")
            case .success:
                self.logger.warning("Attempted to deregister client \(clientId) but it was not found in Redis.")
            case .failure(let error):
                self.logger.error("Failed to deregister client \(clientId): \(error.localizedDescription)")
            }
        }
    }
    private func handleIncomingMessage(from senderId: UUID, text: String) {
        // In a real application, you'd decode this to a proper model.
        // For this example, we assume a simple "targetClientId:message" format.
        let components = text.split(separator: ":", maxSplits: 1)
        guard components.count == 2 else {
            logger.warning("Invalid message format from \(senderId): \(text)")
            return
        }
        guard let targetId = UUID(uuidString: String(components[0])) else {
            logger.warning("Invalid target UUID from \(senderId): \(components[0])")
            return
        }
        let message = String(components[1])
        let payload = "Message from \(senderId): \(message)"

        // Find which instance currently holds the target client.
        self.redis.hget(targetId.uuidString, from: "client_sessions", as: String.self).whenSuccess { instanceIdString in
            guard let instanceIdString = instanceIdString, let targetInstanceId = UUID(uuidString: instanceIdString) else {
                self.logger.warning("Target client \(targetId) not found in any instance.")
                // Optionally, send a "user not found" message back to the sender.
                return
            }

            if targetInstanceId == self.instanceId {
                // Target is on the same instance, send directly.
                self.logger.info("Sending direct message from \(senderId) to \(targetId) on instance \(self.instanceId)")
                if let targetWs = self.lock.withLockedValue({ $0[targetId] }) {
                    targetWs.send(payload)
                }
            } else {
                // Target is on another instance. Publish in the same "targetClientId:message"
                // envelope the instance-channel handler expects, so it can route the payload
                // to the correct local WebSocket.
                self.logger.info("Forwarding message from \(senderId) to \(targetId) via Redis channel messages:\(targetInstanceId)")
                _ = self.redis.publish("\(targetId.uuidString):\(payload)", to: RedisChannelName("messages:\(targetInstanceId.uuidString)"))
            }
        }
    }
    // This must be called at application startup.
    func subscribeToInstanceChannel() {
        let channelName = RedisChannelName("messages:\(self.instanceId.uuidString)")
        _ = self.redis.subscribe(to: channelName) { [weak self] channel, message in
            guard let self = self, channel == channelName, let rawText = message.string else { return }

            // The publisher wraps every payload as "targetClientId:actualMessage" so this
            // instance can route it to the right local WebSocket.
            let components = rawText.split(separator: ":", maxSplits: 1)
            guard components.count == 2, let targetId = UUID(uuidString: String(components[0])) else {
                self.logger.error("Received malformed message on instance channel: \(rawText)")
                return
            }
            let actualMessage = String(components[1])

            if let targetWs = self.lock.withLockedValue({ $0[targetId] }) {
                self.logger.info("Received message for client \(targetId) via Redis. Forwarding...")
                targetWs.send(actualMessage)
            } else {
                // Race condition: the client disconnected after the message was published
                // but before it arrived here.
                self.logger.warning("Received message for client \(targetId) but they are no longer connected to this instance.")
            }
        }
        logger.info("Subscribed to instance-specific Redis channel: \(channelName)")
    }
}
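The architecture also calls for a global broadcast channel, which the manager above never subscribes to. A minimal sketch of such a method, added to ConnectionManager next to subscribeToInstanceChannel() and invoked from the same lifecycle hook shown below, might look like this (the channel name `broadcast` follows the convention described earlier and is an assumption of this sketch):

// Sketch: fan out anything published on the global "broadcast" channel
// to every WebSocket currently held by this instance.
func subscribeToBroadcastChannel() {
    let channel = RedisChannelName("broadcast")
    _ = self.redis.subscribe(to: channel) { [weak self] _, message in
        guard let self = self, let text = message.string else { return }

        // Snapshot the local connections under the lock, then send outside of it.
        let sockets = self.lock.withLockedValue { Array($0.values) }
        self.logger.info("Broadcasting message to \(sockets.count) local client(s)")
        for ws in sockets {
            ws.send(text)
        }
    }
    logger.info("Subscribed to global Redis channel: \(channel)")
}

Wiring it up would be a one-line addition to the InstanceSubscription lifecycle handler introduced below.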
This manager needs to be configured as a service and started when the application boots.
Sources/App/configure.swift
import Vapor
import Redis
public func configure(_ app: Application) throws {
    // Configure Redis from environment variables for production readiness.
    let redisHost = Environment.get("REDIS_HOSTNAME") ?? "127.0.0.1"
    app.redis.configuration = try RedisConfiguration(hostname: redisHost)

    // Set up our ConnectionManager as a singleton service.
    let connectionManager = ConnectionManager(redis: app.redis, logger: app.logger)
    app.storage[ConnectionManagerKey.self] = connectionManager

    // Start listening on the instance-specific Redis channel.
    // This should happen after the application has fully initialized.
    app.lifecycle.use(InstanceSubscription(manager: connectionManager))

    try routes(app)
}

// A helper key for storing the service in Vapor's storage.
// Note: this must be visible to routes.swift, so it cannot be private.
struct ConnectionManagerKey: StorageKey {
    typealias Value = ConnectionManager
}

// A LifecycleHandler to ensure subscription happens at the right time.
private struct InstanceSubscription: LifecycleHandler {
    let manager: ConnectionManager

    func didBoot(_ application: Application) throws {
        manager.subscribeToInstanceChannel()
    }
}
Finally, the routes file becomes a simple entry point that delegates to the manager.
Sources/App/routes.swift
import Vapor
func routes(_ app: Application) throws {
    app.get { req async in
        "Server is running."
    }

    app.webSocket("api", "v1", "channel") { req, ws in
        guard let manager = req.application.storage[ConnectionManagerKey.self] else {
            req.logger.critical("ConnectionManager not configured. Closing WebSocket.")
            _ = ws.close(code: .policyViolation)
            return
        }
        manager.handleConnection(ws: ws)
    }
}
A common mistake here is to perform the Redis subscription inside the `init` of the `ConnectionManager`. This can lead to race conditions where the application isn't fully ready to handle incoming messages. Using a `LifecycleHandler` ensures the subscription is activated only after the application has successfully booted.
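For quick manual testing during development, a throwaway Swift command-line client built on WebSocketKit (already a transitive dependency of Vapor) is enough. This is a sketch: the host, port, and target client ID are placeholders, and the message follows the "targetClientId:message" convention the server expects.

import Foundation
import NIO
import WebSocketKit

let group = MultiThreadedEventLoopGroup(numberOfThreads: 1)
defer { try? group.syncShutdownGracefully() }

// Placeholder: the UUID of the client you want to reach, as logged by the server.
let targetClientId = "REPLACE-WITH-TARGET-CLIENT-UUID"

try WebSocket.connect(to: "ws://127.0.0.1:8080/api/v1/channel", on: group) { ws in
    // Print whatever the server pushes to us, whether direct or relayed via Redis.
    ws.onText { _, text in
        print("Received: \(text)")
    }
    // Send a message in the "targetClientId:message" format the server expects.
    ws.send("\(targetClientId):hello from the test client")
}.wait()

// Keep the process alive long enough to observe a reply.
Thread.sleep(forTimeInterval: 10)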
Containerization and Orchestration
With the application logic prepared, the next step is to containerize it for deployment. A multi-stage Dockerfile is essential for production builds to keep the final image slim and secure.
Dockerfile
# ---- Builder Stage ----
FROM swift:5.7-focal AS builder
WORKDIR /app
# Copy package manifests
COPY ./Package.* ./
# Resolve dependencies
RUN swift package resolve
# Copy source code
COPY ./Sources ./Sources
COPY ./Public ./Public
COPY ./Tests ./Tests
# Build for release
RUN swift build -c release --static-swift-stdlib
# ---- Runtime Stage ----
FROM swift:5.7-focal-slim
WORKDIR /app
# Copy the compiled binary from the builder stage
COPY --from=builder /app/.build/release/App .
# Copy any required runtime assets if necessary
# COPY --from=builder /app/Public ./Public
# Expose the port the server will run on
EXPOSE 8080
# Command to run the application
ENTRYPOINT ["./App"]
CMD ["serve", "--env", "production", "--hostname", "0.0.0.0", "--port", "8080"]
The orchestration is defined in a Docker Compose file, which Docker Swarm uses as a stack definition. This file defines our Swift application service and the Redis service it depends on.
docker-compose.yml
version: '3.8'

services:
  app:
    image: my-resilient-sockets-app:latest # Replace with your image name
    networks:
      - app-net
    ports:
      - "8080:8080"
    environment:
      # This tells our Swift app where to find Redis.
      # 'redis' is the service name, which Docker's DNS resolves.
      - REDIS_HOSTNAME=redis
    deploy:
      replicas: 3
      restart_policy:
        condition: on-failure
      update_config:
        parallelism: 1
        delay: 10s
        order: stop-first

  redis:
    image: redis:6-alpine
    networks:
      - app-net
    deploy:
      placement:
        constraints:
          - node.role == manager # Place Redis on a manager node for stability

networks:
  app-net:
    driver: overlay
To deploy this stack to a Swarm cluster, the command is simple: `docker stack deploy -c docker-compose.yml resilient-app`.
The `deploy` key is critical here. It tells Swarm to run 3 replicas of our `app` service. The `restart_policy` ensures that if a container crashes, Swarm will automatically reschedule it. The `update_config` with `stop-first` ensures rolling updates happen gracefully, one container at a time.
Validating Resilience
To test this architecture, one must simulate a failure.
- Connect two separate WebSocket clients (e.g., using a simple HTML/JS page or a command-line tool).
- Use `docker service ps resilient-app_app` to see which Swarm nodes the replicas are running on. The two clients will likely be connected to different instances.
- Let's say Client A is connected to instance `app.1` and Client B to `app.2`.
- Send a message from Client A targeted to Client B. It should be received successfully, having traversed the Redis Pub/Sub bus.
- Now, find the container ID for `app.2` using `docker ps` on the correct node and forcibly stop it: `docker container stop <container_id_for_app.2>`.
- Client B will experience a disconnection. A well-designed client should have automatic reconnection logic.
- Within seconds, Docker Swarm will detect that `app.2` has failed and will start a replacement container to satisfy the `replicas: 3` constraint.
- Client B's reconnection attempt will be routed by the Swarm ingress mesh to one of the healthy instances, which could be the original `app.1` or the freshly started replacement task.
- Upon reconnection, the new instance holds a brand-new WebSocket connection for Client B, but it registers the session in Redis under the same `clientID`, assuming the client supplies its previous ID when it reconnects (the code above always mints a fresh ID; a resumption-aware variant is sketched after this list). Because the session's existence is tracked in Redis, not in the terminated container's memory, the system as a whole considers the client to still be "present." The state has survived the failure.
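The sample code always mints a fresh clientId, so true session resumption needs the client to hand its previous ID back. One possible variant, assuming the client appends a `clientId` query parameter when it reconnects, is sketched below; routes.swift would then call `manager.handleConnection(req: req, ws: ws)` instead.

// Sketch: resumption-aware variant of handleConnection. The "clientId" query
// parameter is an assumption of this sketch, not part of the implementation above.
func handleConnection(req: Request, ws: WebSocket) {
    // Reuse the caller-supplied ID if it parses as a UUID; otherwise mint a new one.
    let clientId = req.query[UUID.self, at: "clientId"] ?? UUID()
    self.lock.withLockedValue { $0[clientId] = ws }
    logger.info("Client \(clientId) (re)connected to instance \(self.instanceId)")

    // Re-registering overwrites any stale mapping left by a terminated instance,
    // so the session now points at this instance.
    _ = self.redis.hset(clientId.uuidString, to: self.instanceId.uuidString, in: "client_sessions")

    ws.onText { [weak self] _, text in
        self?.handleIncomingMessage(from: clientId, text: text)
    }
    ws.onClose.whenComplete { [weak self] _ in
        self?.handleDisconnection(clientId: clientId)
    }
}

One subtlety to watch: the old instance's handleDisconnection can fire after the client has already re-registered elsewhere, so a production version would make the HDEL conditional on the stored instance ID still being its own (for example via a small Lua script).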
The current implementation uses a Redis Hash for presence, which is sufficient but has limitations. The Redis Pub/Sub mechanism is “fire-and-forget”; if a message is published to an instance’s channel while that instance is briefly down or restarting, the message is lost. For applications requiring guaranteed message delivery, this architecture would need to be evolved to use a more durable messaging system like Redis Streams or a dedicated message broker like RabbitMQ. Furthermore, the single Redis instance is itself a single point of failure. A production deployment would necessitate a high-availability Redis setup using Sentinel for failover or Redis Cluster for sharding and redundancy, which significantly increases operational complexity. The design’s boundary is its reliance on at-most-once message delivery.
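To illustrate the direction such an evolution could take, the sketch below swaps the PUBLISH in handleIncomingMessage for an append to a per-instance Redis Stream, going through RediStack's generic send(command:with:) escape hatch; the stream key naming is an assumption of this sketch. A consuming instance would then read with XREAD or XREADGROUP and acknowledge entries, giving at-least-once rather than at-most-once delivery.

// Sketch only: durable hand-off via a Redis Stream instead of fire-and-forget Pub/Sub.
// The receiving instance would XREAD/XREADGROUP from its own stream and XACK entries.
func appendToInstanceStream(targetInstanceId: UUID, targetClientId: UUID, payload: String) -> EventLoopFuture<RESPValue> {
    let streamKey = "stream:messages:\(targetInstanceId.uuidString)"  // assumed naming
    return redis.send(command: "XADD", with: [
        .init(bulk: streamKey),
        .init(bulk: "*"),                      // let Redis assign the entry ID
        .init(bulk: "target"), .init(bulk: targetClientId.uuidString),
        .init(bulk: "payload"), .init(bulk: payload),
    ])
}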