The connection logic in our iOS application, which handles real-time data processing, was becoming a significant bottleneck. Initially, it pointed to a single, statically configured load balancer responsible for distributing traffic across a fleet of backend worker nodes. This architecture, while simple, exposed two critical problems in production. First, the load balancer became a single point of failure and a performance chokepoint. Second, it offered no mechanism for the client to intelligently select a worker, such as connecting to the least-loaded node or to one geographically closer. We needed to push service discovery closer to the client without burdening the iOS application with the complexity and security risks of a full service-mesh client.
Our solution was to develop an intermediary microservice—an “Endpoint Provider”—that acts as a secure bridge. This service queries our Consul catalog for healthy, available worker nodes, caches the results aggressively in Memcached to handle high request volumes from thousands of clients, and presents a simple, digestible list of endpoints to the iOS application. This allows the mobile client to retain control over its connection strategy (e.g., round-robin, random selection with retries) while the backend infrastructure remains dynamic and scalable.
The architecture follows a clear request flow:
sequenceDiagram
    participant iOS App
    participant Endpoint Provider (Go)
    participant Memcached
    participant Consul
    participant Worker Nodes

    iOS App->>+Endpoint Provider (Go): GET /v1/endpoints/worker-service
    Endpoint Provider (Go)->>+Memcached: GET endpoints:worker-service
    Memcached-->>-Endpoint Provider (Go): Cache Miss
    Endpoint Provider (Go)->>+Consul: Query healthy instances of 'worker-service'
    Consul-->>-Endpoint Provider (Go): List of healthy nodes (IP:Port)
    Endpoint Provider (Go)->>+Memcached: SET endpoints:worker-service with TTL
    Memcached-->>-Endpoint Provider (Go): OK
    Endpoint Provider (Go)-->>-iOS App: 200 OK - [{"host":"10.0.1.10", "port":8080}, ...]
    Note over iOS App, Worker Nodes: Client now connects directly to a chosen Worker Node.

    %% Subsequent Request (Cache Hit)
    iOS App->>+Endpoint Provider (Go): GET /v1/endpoints/worker-service
    Endpoint Provider (Go)->>+Memcached: GET endpoints:worker-service
    Memcached-->>-Endpoint Provider (Go): Cached JSON blob
    Endpoint Provider (Go)-->>-iOS App: 200 OK - [{"host":"10.0.1.10", "port":8080}, ...]
Consul Service and Health Check Configuration
The foundation of this system is Consul’s ability to track the health of our worker services. For this to be effective, each worker node must register itself with a meaningful health check. A simple TCP dial check is insufficient; a real-world project requires a check that reflects the actual application’s health, perhaps by measuring queue depth or current processing load.
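As a rough illustration of what such a check could look like, here is a hypothetical Go program that maps a queue-depth reading to Consul’s script-check exit-code convention (0 = passing, 1 = warning, anything else = critical). The WORKER_QUEUE_DEPTH environment variable and the thresholds are placeholders for whatever metric a real worker exposes; this is a sketch, not our production check.
// File: cmd/workercheck/main.go (hypothetical health-check binary, not part of the actual workers)
package main

import (
	"fmt"
	"os"
	"strconv"
)

func main() {
	// In a real worker this value would come from the process itself
	// (e.g., an internal stats endpoint); here we read a placeholder env var.
	raw := os.Getenv("WORKER_QUEUE_DEPTH")
	depth, err := strconv.Atoi(raw)
	if err != nil {
		fmt.Fprintf(os.Stderr, "cannot read queue depth: %v\n", err)
		os.Exit(2) // unknown state is treated as critical
	}

	// Exit codes follow Consul's script-check convention.
	switch {
	case depth < 100:
		fmt.Printf("queue depth %d: ok\n", depth)
		os.Exit(0)
	case depth < 500:
		fmt.Printf("queue depth %d: degraded\n", depth)
		os.Exit(1)
	default:
		fmt.Printf("queue depth %d: overloaded\n", depth)
		os.Exit(2)
	}
}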
Here is a representative service definition for a worker node, worker-service.hcl. We’ll register this with a local Consul agent. The critical part is the check block, which executes a script.
// File: /etc/consul.d/worker-service.hcl
service {
name = "worker-service"
id = "worker-1"
port = 8080
address = "10.0.1.10"
tags = ["realtime", "v1.2"]
check {
id = "worker-load-check"
name = "Worker Process Load Check"
// In a real system, this script would check CPU, memory, or job queue length.
// Exit code 0 = passing, 1 = warning, >1 = critical/failing.
// We simulate a failing state for demonstration.
args = ["/bin/sh", "-c", "exit 2"]
interval = "10s"
timeout = "2s"
}
}
To run a development Consul agent and register this service:
- Save the above HCL configuration.
- Start Consul with local script checks enabled (script checks are disabled by default, so the args-based check above won’t run without this):
consul agent -dev -enable-local-script-checks -config-dir=/etc/consul.d
With this setup, Consul’s health API will now correctly exclude this failing node from any queries for healthy instances of worker-service. This dynamic health status is what our Endpoint Provider will consume.
The Golang Endpoint Provider Implementation
We chose Go for the Endpoint Provider due to its excellent concurrency model, performance, and robust ecosystem for building networked services. The service has three primary responsibilities: handle incoming HTTP requests, communicate with Memcached, and query the Consul API.
The project structure is straightforward:
endpoint-provider/
├── go.mod
├── go.sum
└── main.go
The core logic resides in main.go. We’ll build it piece by piece, focusing on configuration, dependency management, and the HTTP handler itself.
// File: main.go
package main
import (
"context"
"encoding/json"
"fmt"
"log"
"net/http"
"os"
"os/signal"
"strings"
"syscall"
"time"
"github.comcom/bradfitz/gomemcache/memcache"
consulapi "github.com/hashicorp/consul/api"
)
// ServiceConfig holds all external configuration for the application.
// In a real-world project, this would be populated from environment variables or a config file.
type ServiceConfig struct {
ConsulAddress string
MemcachedServers []string
ListenAddress string
CacheTTL time.Duration
}
// Endpoint represents a single, connectable backend service instance.
type Endpoint struct {
Host string `json:"host"`
Port int `json:"port"`
}
// ServiceLocator is the core application struct, holding clients for external services.
type ServiceLocator struct {
consulClient *consulapi.Client
memcacheClient *memcache.Client
config ServiceConfig
}
// NewServiceLocator initializes all clients and configurations.
func NewServiceLocator(config ServiceConfig) (*ServiceLocator, error) {
// Configure and create the Consul client
consulConfig := consulapi.DefaultConfig()
consulConfig.Address = config.ConsulAddress
consul, err := consulapi.NewClient(consulConfig)
if err != nil {
return nil, fmt.Errorf("failed to create consul client: %w", err)
}
// Configure and create the Memcached client
mc := memcache.New(config.MemcachedServers...)
// A quick ping to ensure Memcached is reachable on startup.
if err := mc.Ping(); err != nil {
return nil, fmt.Errorf("failed to ping memcached servers: %w", err)
}
return &ServiceLocator{
consulClient: consul,
memcacheClient: mc,
config: config,
}, nil
}
// getEndpointsHandler is the HTTP handler that serves the list of available endpoints.
func (s *ServiceLocator) getEndpointsHandler(w http.ResponseWriter, r *http.Request) {
serviceName := strings.TrimPrefix(r.URL.Path, "/v1/endpoints/")
if serviceName == "" {
http.Error(w, "Service name is required", http.StatusBadRequest)
return
}
log.Printf("Request received for service: %s", serviceName)
cacheKey := fmt.Sprintf("endpoints:%s", serviceName)
// 1. Attempt to fetch from cache first.
if item, err := s.memcacheClient.Get(cacheKey); err == nil {
log.Printf("Cache hit for key: %s", cacheKey)
w.Header().Set("Content-Type", "application/json")
w.Header().Set("X-Cache-Status", "HIT")
w.Write(item.Value)
return
} else if err != memcache.ErrCacheMiss {
// A common mistake is to ignore errors other than cache miss.
// This could indicate a serious connectivity issue with Memcached.
log.Printf("ERROR: Memcached GET failed for key %s: %v", cacheKey, err)
}
log.Printf("Cache miss for key: %s. Querying Consul.", cacheKey)
// 2. On cache miss, query Consul for healthy services.
// The `PassingOnly` flag is crucial here.
serviceEntries, _, err := s.consulClient.Health().Service(serviceName, "", true, nil)
if err != nil {
log.Printf("ERROR: Failed to query Consul for service %s: %v", serviceName, err)
http.Error(w, "Internal server error: could not query service registry", http.StatusInternalServerError)
return
}
if len(serviceEntries) == 0 {
log.Printf("WARN: No healthy instances found for service %s", serviceName)
http.Error(w, "No healthy service instances available", http.StatusServiceUnavailable)
return
}
// 3. Format the response. We extract only the necessary information for the client.
endpoints := make([]Endpoint, 0, len(serviceEntries))
for _, entry := range serviceEntries {
// The service address can be in Service.Address or Node.Address.
// A robust implementation checks both.
address := entry.Service.Address
if address == "" {
address = entry.Node.Address
}
endpoints = append(endpoints, Endpoint{
Host: address,
Port: entry.Service.Port,
})
}
responseBody, err := json.Marshal(endpoints)
if err != nil {
log.Printf("ERROR: Failed to marshal endpoints to JSON for service %s: %v", serviceName, err)
http.Error(w, "Internal server error: could not format response", http.StatusInternalServerError)
return
}
// 4. Store the result in Memcached before returning to the client.
err = s.memcacheClient.Set(&memcache.Item{
Key: cacheKey,
Value: responseBody,
Expiration: int32(s.config.CacheTTL.Seconds()),
})
if err != nil {
// Failing to set the cache is not a critical error. We should log it
// but still serve the response to the client. The system gracefully degrades.
log.Printf("ERROR: Failed to set cache for key %s: %v", cacheKey, err)
}
w.Header().Set("Content-Type", "application/json")
w.Header().Set("X-Cache-Status", "MISS")
w.Write(responseBody)
}
func main() {
// Production-grade configuration should come from a more robust source.
config := ServiceConfig{
ConsulAddress: "localhost:8500",
MemcachedServers: []string{"localhost:11211"},
ListenAddress: ":9090",
CacheTTL: 10 * time.Second, // A pragmatic TTL value.
}
locator, err := NewServiceLocator(config)
if err != nil {
log.Fatalf("Failed to initialize service locator: %v", err)
}
mux := http.NewServeMux()
mux.HandleFunc("/v1/endpoints/", locator.getEndpointsHandler)
mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(http.StatusOK)
fmt.Fprintln(w, "OK")
})
server := &http.Server{
Addr: config.ListenAddress,
Handler: mux,
}
// Graceful shutdown handling.
go func() {
log.Printf("Endpoint Provider listening on %s", config.ListenAddress)
if err := server.ListenAndServe(); err != http.ErrServerClosed {
log.Fatalf("HTTP server ListenAndServe error: %v", err)
}
}()
quit := make(chan os.Signal, 1)
signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)
<-quit
log.Println("Shutting down server...")
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
if err := server.Shutdown(ctx); err != nil {
log.Fatalf("HTTP server shutdown error: %v", err)
}
log.Println("Server gracefully stopped.")
}
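To exercise the provider locally, a throwaway client like the one below can hit the endpoint and print what comes back, including the X-Cache-Status header. This is only a convenience sketch; it assumes the provider is running on localhost:9090 as configured above, with Consul and Memcached reachable.
// File: cmd/smoketest/main.go (optional local test client)
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

type endpoint struct {
	Host string `json:"host"`
	Port int    `json:"port"`
}

func main() {
	client := &http.Client{Timeout: 3 * time.Second}
	resp, err := client.Get("http://localhost:9090/v1/endpoints/worker-service")
	if err != nil {
		log.Fatalf("request failed: %v", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		log.Fatalf("unexpected status: %s", resp.Status)
	}

	var endpoints []endpoint
	if err := json.NewDecoder(resp.Body).Decode(&endpoints); err != nil {
		log.Fatalf("decode failed: %v", err)
	}

	// Report whether the provider answered from Memcached or from Consul.
	fmt.Printf("cache status: %s\n", resp.Header.Get("X-Cache-Status"))
	for _, ep := range endpoints {
		fmt.Printf("worker at %s:%d\n", ep.Host, ep.Port)
	}
}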
This Go application demonstrates several production-ready practices:
- Configuration Management: A ServiceConfig struct centralizes configuration.
- Dependency Injection: The ServiceLocator struct holds client dependencies, making handlers testable (see the test sketch after this list).
- Robust Error Handling: It differentiates between a cache miss and a true Memcached connection error. It also handles failures in querying Consul or marshalling JSON.
- Graceful Degradation: If setting the cache fails, the service still returns data to the client, prioritizing availability.
- Graceful Shutdown: It listens for termination signals to finish in-flight requests before exiting.
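As an example of the testability point above, the bad-request path of getEndpointsHandler can be exercised with the standard httptest package and no running Consul or Memcached, since that branch returns before either client is touched. A fuller suite would wrap the clients behind interfaces or run against local instances; this is only a minimal sketch.
// File: main_test.go (minimal sketch; covers only the path that needs no external clients)
package main

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

func TestGetEndpointsHandlerRejectsEmptyServiceName(t *testing.T) {
	// A zero-value ServiceLocator is enough here: the handler returns 400
	// before it ever touches the Consul or Memcached clients.
	s := &ServiceLocator{}

	req := httptest.NewRequest(http.MethodGet, "/v1/endpoints/", nil)
	rec := httptest.NewRecorder()

	s.getEndpointsHandler(rec, req)

	if rec.Code != http.StatusBadRequest {
		t.Fatalf("expected status %d, got %d", http.StatusBadRequest, rec.Code)
	}
}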
The iOS Client Implementation
On the iOS side, the implementation requires a service layer to fetch, parse, and manage the list of endpoints. We’ll use Swift’s async/await for clean asynchronous code and Codable for parsing.
// File: EndpointService.swift
import Foundation
// A Codable struct that must match the JSON structure from our Go service.
struct Endpoint: Codable, Hashable {
let host: String
let port: Int
}
// Custom error types provide more context than generic errors.
enum EndpointError: Error {
case invalidURL
case networkError(Error)
case decodingError(Error)
case serverError(statusCode: Int)
case noEndpointsAvailable
}
// The EndpointManager is the core component for the client.
// It can be used as a singleton or injected as a dependency.
@MainActor
class EndpointManager: ObservableObject {
// The list of available endpoints is published to the UI.
@Published private(set) var availableEndpoints: [Endpoint] = []
// A simple index for round-robin selection.
private var currentIndex = 0
// In a real project, this base URL would come from a configuration file.
private let providerBaseURL = "http://localhost:9090/v1/endpoints/"
private let urlSession: URLSession
init(session: URLSession = .shared) {
self.urlSession = session
}
// Fetches and updates the list of endpoints for a given service.
func refreshEndpoints(for serviceName: String) async throws {
guard let url = URL(string: "\(providerBaseURL)\(serviceName)") else {
throw EndpointError.invalidURL
}
var request = URLRequest(url: url)
request.timeoutInterval = 5.0 // A sensible timeout for a critical path.
do {
let (data, response) = try await urlSession.data(for: request)
guard let httpResponse = response as? HTTPURLResponse else {
throw EndpointError.networkError(URLError(.badServerResponse))
}
guard (200...299).contains(httpResponse.statusCode) else {
// If the provider returns a 4xx or 5xx, we handle it explicitly.
throw EndpointError.serverError(statusCode: httpResponse.statusCode)
}
let decoder = JSONDecoder()
let endpoints = try decoder.decode([Endpoint].self, from: data)
if endpoints.isEmpty {
throw EndpointError.noEndpointsAvailable
}
// Update the internal list and reset the index.
self.availableEndpoints = endpoints
self.currentIndex = 0
print("Successfully refreshed endpoints: \(endpoints)")
} catch let error as EndpointError {
// Re-throw our own errors (e.g., serverError, noEndpointsAvailable) unchanged
// instead of wrapping them in networkError below.
throw error
} catch let error as DecodingError {
throw EndpointError.decodingError(error)
} catch {
throw EndpointError.networkError(error)
}
}
// Provides the next available endpoint using a round-robin strategy.
// A pitfall here is not handling an empty list. We must guard against it.
func getNextEndpoint() -> Endpoint? {
guard !availableEndpoints.isEmpty else {
return nil
}
let endpoint = availableEndpoints[currentIndex]
currentIndex = (currentIndex + 1) % availableEndpoints.count
return endpoint
}
}
A SwiftUI view could use this manager as follows:
// File: ContentView.swift
import SwiftUI
struct ContentView: View {
@StateObject private var endpointManager = EndpointManager()
@State private var connectionTarget: String = ""
@State private var statusMessage: String = "Ready"
var body: some View {
VStack(spacing: 20) {
Text("Endpoint Discovery Client")
.font(.largeTitle)
Button("Refresh Worker Endpoints") {
Task {
await refresh(service: "worker-service")
}
}
.buttonStyle(.borderedProminent)
Button("Get Next Worker") {
if let endpoint = endpointManager.getNextEndpoint() {
self.connectionTarget = "Connecting to \(endpoint.host):\(endpoint.port)"
} else {
self.connectionTarget = "No endpoints available. Please refresh."
}
}
.buttonStyle(.bordered)
Text(connectionTarget)
.padding()
Text("Status: \(statusMessage)")
.font(.footnote)
.foregroundColor(.gray)
}
.padding()
.task {
// Initial refresh on view appearance
await refresh(service: "worker-service")
}
}
private func refresh(service: String) async {
do {
statusMessage = "Refreshing..."
try await endpointManager.refreshEndpoints(for: service)
statusMessage = "Endpoints updated successfully."
} catch let error as EndpointError {
statusMessage = "Error: \(error)"
} catch {
statusMessage = "An unexpected error occurred: \(error.localizedDescription)"
}
}
}
This client-side implementation correctly encapsulates the logic for fetching and cycling through endpoints. The calling code doesn’t need to know about JSON, HTTP, or caching; it just asks for the next available connection target.
Limitations and Future Iterations
This architecture, while robust, is not without its limitations. The Endpoint Provider service itself, while stateless, represents a potential single point of failure. In a true production environment, multiple instances of the Go service would be deployed behind a load balancer. This might seem circular, but the key difference is that this load balancer handles traffic for a simple, high-performance, stateless service, whereas the original problem involved load balancing for stateful or computationally heavy workers.
The client-side selection strategy is a basic round-robin. A more advanced implementation could involve the Endpoint Provider annotating the list with metadata from Consul (e.g., tags indicating geographic region or a custom load metric from a KV store). The iOS client could then use this metadata to make a more intelligent decision, such as preferring endpoints with the lowest latency or load.
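One possible first step is for the provider to pass through the Consul tags (which already carry version information in our registration) and reserve fields for richer metadata. The AnnotatedEndpoint type and annotate helper below are speculative, not part of the current service.
// File: annotate.go (speculative sketch of a future, metadata-rich payload)
package main

import (
	consulapi "github.com/hashicorp/consul/api"
)

// AnnotatedEndpoint is a hypothetical richer payload for a future iteration.
// Tags come straight from the Consul registration (e.g., "realtime", "v1.2");
// Region and Load would require extra data such as node metadata or a KV lookup.
type AnnotatedEndpoint struct {
	Host   string   `json:"host"`
	Port   int      `json:"port"`
	Tags   []string `json:"tags,omitempty"`
	Region string   `json:"region,omitempty"`
	Load   float64  `json:"load,omitempty"`
}

func annotate(entry *consulapi.ServiceEntry) AnnotatedEndpoint {
	// Same address fallback as the main handler: prefer the service address,
	// fall back to the node address.
	address := entry.Service.Address
	if address == "" {
		address = entry.Node.Address
	}
	return AnnotatedEndpoint{
		Host: address,
		Port: entry.Service.Port,
		Tags: entry.Service.Tags,
		// Region and Load are left unset; populating them is the future work described above.
	}
}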
Finally, the cache invalidation is purely TTL-based. While sufficient for many use cases, a 10-second TTL means a failed node might still be served to clients for up to 10 seconds. For systems requiring near-instant failover, a more complex solution involving Consul watches and a messaging bus to proactively invalidate the Memcached key could be engineered, though this adds significant operational complexity. The current design strikes a pragmatic balance between simplicity, performance, and freshness.
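For completeness, here is a rough sketch of what that proactive invalidation could look like using Consul’s blocking queries inside the provider process, a lighter-weight variant of the watch-plus-message-bus approach. The method below is illustrative only and assumes it runs alongside the existing ServiceLocator clients; it is not part of the current codebase.
// invalidateOnChange is a speculative background loop for near-instant cache invalidation.
// It long-polls Consul with a blocking query and deletes the Memcached key whenever
// the set of healthy instances changes, instead of waiting for the TTL to expire.
func (s *ServiceLocator) invalidateOnChange(ctx context.Context, serviceName string) {
	cacheKey := fmt.Sprintf("endpoints:%s", serviceName)
	var lastIndex uint64

	for {
		select {
		case <-ctx.Done():
			return
		default:
		}

		// WaitIndex turns this into a blocking query: Consul holds the request
		// until the health results change or WaitTime elapses.
		opts := &consulapi.QueryOptions{WaitIndex: lastIndex, WaitTime: 5 * time.Minute}
		_, meta, err := s.consulClient.Health().Service(serviceName, "", true, opts.WithContext(ctx))
		if err != nil {
			log.Printf("WARN: blocking query for %s failed: %v", serviceName, err)
			time.Sleep(2 * time.Second) // simple backoff before retrying
			continue
		}

		if meta.LastIndex != lastIndex {
			lastIndex = meta.LastIndex
			if err := s.memcacheClient.Delete(cacheKey); err != nil && err != memcache.ErrCacheMiss {
				log.Printf("WARN: failed to invalidate %s: %v", cacheKey, err)
			} else {
				log.Printf("Invalidated cache key %s after Consul change", cacheKey)
			}
		}
	}
}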