Our move to a Vercel-hosted Next.js application was meant to simplify deployments and improve frontend performance. It succeeded on those fronts, but it created a stark architectural boundary. A critical, revenue-generating legacy monolith, responsible for distributed batch processing, still lives inside our AWS VPC. Its control plane is orchestrated by a battle-hardened Zookeeper ensemble. New serverless functions, triggered by user actions on the Vercel side, needed to acquire locks and update configuration on this system. The immediate problem: a Vercel Function runs in a public, multi-tenant environment, while our Zookeeper ensemble is locked down within a private subnet, completely inaccessible from the outside world.
The first suggestion, opening Zookeeper’s 2181 port to Vercel’s published IP ranges, was a non-starter for security reasons. The second, a chain of Vercel Function -> API Gateway -> VPC Lambda -> Zookeeper, introduced unacceptable latency and operational complexity for what should be a simple coordination task. We needed a solution that felt more like a direct connection, without compromising the security of the VPC. This led us down the path of bridging Vercel’s environment directly with our AWS VPC and building a specialized proxy to reconcile the ephemeral, stateless nature of serverless with Zookeeper’s persistent, session-oriented protocol.
The core of our pain was the impedance mismatch. Zookeeper clients are designed to maintain long-lived TCP connections and sessions, managing heartbeats and session state. A Vercel Function, particularly on a consumption plan, is ephemeral. An invocation might last seconds or even milliseconds. Establishing a new Zookeeper session on every function invocation would be prohibitively slow and would flood the Zookeeper ensemble with thousands of short-lived, useless connections, a connection storm in the spirit of the classic “thundering herd” problem.
Our strategy coalesced around two major components:
- Network Plumbing: Establish a secure, private network path from Vercel’s execution environment into our target AWS VPC using Vercel’s Secure Compute feature.
- Session Mediation: Develop a lightweight, stateful TCP proxy service running inside the VPC. This proxy would maintain a stable pool of long-lived Zookeeper sessions and expose a simplified, stateless protocol for our ephemeral Vercel Functions to consume.
The architecture would look like this:
graph TD
    subgraph Vercel Platform
        A[Vercel Function]
    end
    subgraph AWS VPC
        subgraph Private Subnet 1
            B[ZK Proxy Service]
        end
        subgraph Private Subnet 2
            C[Zookeeper Ensemble]
        end
    end
    A -- "Private Network Path (Vercel Secure Compute)" --> B
    B -- "Persistent ZK Sessions" --> C
    style B fill:#f9f,stroke:#333,stroke-width:2px
This design isolates the complexity. The Vercel Function remains simple; it only needs to know how to speak to our proxy over a standard TCP socket. The proxy handles the entire lifecycle of Zookeeper connections, sessions, and reconnections.
Phase 1: Establishing the Secure Network Path
Vercel provides a mechanism to connect projects to private networks. The first step was configuring this bridge. In AWS, this required setting up a private connection endpoint, which Vercel’s infrastructure could then peer with.
The crucial part of the infrastructure-as-code setup (we use Terraform) was ensuring the security groups were correctly configured. The Zookeeper ensemble’s security group only allows inbound traffic on port 2181 from our proxy’s security group. The proxy’s security group, in turn, allows inbound traffic on its custom port (we chose 9876) only from the CIDR block assigned to our Vercel Secure Compute environment.
A simplified representation of the security group rules in HCL:
# Security group for the Zookeeper ensemble
resource "aws_security_group" "zookeeper_sg" {
name = "zookeeper-ensemble-sg"
description = "Allow Zookeeper client traffic"
vpc_id = aws_vpc.main.id
# Ingress from the ZK Proxy ONLY
ingress {
from_port = 2181
to_port = 2181
protocol = "tcp"
security_groups = [aws_security_group.zk_proxy_sg.id]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
# Security group for our custom TCP proxy service
resource "aws_security_group" "zk_proxy_sg" {
name = "zookeeper-proxy-sg"
description = "Allow traffic from Vercel Secure Compute"
vpc_id = aws_vpc.main.id
# Ingress from Vercel's private network CIDR
ingress {
description = "Allow Vercel Functions to connect to the proxy"
from_port = 9876
to_port = 9876
protocol = "tcp"
cidr_blocks = ["10.x.x.x/24"] # Example CIDR for Vercel connection
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
With the network path established and secured, a Vercel Function could theoretically open a TCP socket to a service running on a private IP within our VPC. The next step was to build that service.
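Before writing the proxy, it is worth sanity-checking the path itself. A minimal sketch of such a check from a Vercel Function, in TypeScript; checkPrivatePath is a hypothetical helper, and all it proves is that a TCP handshake to a private IP in the VPC succeeds:

import * as net from 'net';

// Hypothetical smoke test: can this function complete a TCP handshake with a host
// inside the VPC? Resolves true on connect, false on error or timeout.
export function checkPrivatePath(host: string, port: number, timeoutMs = 3000): Promise<boolean> {
  return new Promise((resolve) => {
    const socket = net.connect({ host, port });
    const timer = setTimeout(() => { socket.destroy(); resolve(false); }, timeoutMs);
    socket.once('connect', () => { clearTimeout(timer); socket.end(); resolve(true); });
    socket.once('error', () => { clearTimeout(timer); socket.destroy(); resolve(false); });
  });
}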
Phase 2: The Stateful Zookeeper Session Proxy
We decided to build the proxy in Node.js for its strong asynchronous I/O capabilities, running it as a containerized service on AWS Fargate for simplicity. The proxy has two main responsibilities: managing its own lifecycle with the Zookeeper ensemble and serving requests from Vercel Functions.
Managing the Zookeeper Connection
The proxy uses the node-zookeeper-client library. A common mistake here is to create a new client for every incoming request. Instead, our proxy creates a single, persistent client instance upon startup and manages its state transitions (connected, disconnected, expired).
zookeeper-manager.js:
const zookeeper = require('node-zookeeper-client');
const EventEmitter = require('events');
// A simple logger prefixing messages
const log = (message, ...args) => console.log(`[ZKManager] ${message}`, ...args);
class ZookeeperManager extends EventEmitter {
constructor(connectionString, options) {
super();
this.connectionString = connectionString;
this.options = options || {
sessionTimeout: 5000,
spinDelay: 1000,
retries: 5
};
this.client = null;
this.state = 'INITIAL'; // INITIAL, CONNECTED, DISCONNECTED, EXPIRED
}
connect() {
if (this.client) {
this.client.close();
}
log(`Attempting to connect to ${this.connectionString}`);
this.client = zookeeper.createClient(this.connectionString, this.options);
// Use 'on' rather than 'once' so the state flips back to CONNECTED after a transient disconnect.
this.client.on('connected', () => {
this.state = 'CONNECTED';
log('Successfully connected to Zookeeper.');
this.emit('connected');
});
this.client.on('disconnected', () => {
this.state = 'DISCONNECTED';
log('Disconnected from Zookeeper. Attempting to reconnect...');
// The client library handles reconnection logic internally.
// We just log the state change.
this.emit('disconnected');
});
this.client.on('expired', () => {
this.state = 'EXPIRED';
log('Session expired. Closing client and forcing a full reconnect.');
// When a session expires, we must create a new client instance.
this.emit('expired');
this.client.close();
// Trigger a reconnect after a short delay to avoid spamming connections.
setTimeout(() => this.connect(), this.options.spinDelay);
});
this.client.connect();
}
// Expose the raw client for proxy operations
getClient() {
if (this.state !== 'CONNECTED' || !this.client) {
return null;
}
return this.client;
}
// Graceful shutdown
close() {
if (this.client) {
log('Closing Zookeeper client connection.');
this.client.close();
}
}
}
// Singleton instance for the application
const ZK_CONNECTION_STRING = process.env.ZK_SERVERS || 'localhost:2181';
const manager = new ZookeeperManager(ZK_CONNECTION_STRING);
module.exports = manager;
This manager class encapsulates the stateful connection. It emits events that the main server can listen to, for instance, to stop accepting new requests if the Zookeeper connection is lost. The critical part is handling the expired event correctly, which requires creating a completely new client instance.
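As an illustration of that point, here is a sketch (not wired into the proxy shown below, which simply checks getClient() per request) of how the server could subscribe to these events and stop accepting new sockets while the session is down. gateOnZookeeper and its parameters are hypothetical, written in TypeScript like the client code later in the post:

import { EventEmitter } from 'events';
import type { Server } from 'net';

// Sketch: stop accepting new TCP connections while Zookeeper is unreachable and
// resume listening once the manager reports a healthy session again.
export function gateOnZookeeper(server: Server, manager: EventEmitter, port: number, host: string): void {
  manager.on('disconnected', () => {
    if (server.listening) {
      server.close(); // existing sockets still hit the per-request getClient() check
    }
  });
  manager.on('connected', () => {
    if (!server.listening) {
      server.listen(port, host);
    }
  });
}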
The TCP Server and Protocol
We needed a simple, text-based protocol for the Vercel Functions. A production-grade system might use gRPC or Protocol Buffers, but for this specific problem, a simple newline-delimited command protocol was sufficient and minimized dependencies on the Vercel Function side.
The protocol format: COMMAND|PATH|DATA|OPTIONS\n
- COMMAND: CREATE, GET, SET, EXISTS, DELETE, ACQUIRE_LOCK, RELEASE_LOCK
- PATH: The ZNode path.
- DATA: Base64-encoded data for CREATE or SET.
- OPTIONS: A JSON string for flags like EPHEMERAL, SEQUENCE.
The response format: STATUS|PAYLOAD\n
- STATUS: OK, ERROR, NODE_EXISTS, NO_NODE
- PAYLOAD: Base64-encoded data for GET, or an error message.
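To make the framing concrete, here is a small TypeScript sketch of the string handling (encodeCreate and decodeResponse are illustrative helpers, not functions from the proxy or client below), along with example wire traffic for a lock acquisition:

// Example exchange for acquiring a lock on /locks/jobs/42 with the data "locked":
//   request:  CREATE|/locks/jobs/42|bG9ja2Vk|{"ephemeral":true}\n
//   response: OK|/locks/jobs/42\n        (or NODE_EXISTS|\n if the lock is held)
export function encodeCreate(path: string, data: Buffer, options: object): string {
  return `CREATE|${path}|${data.toString('base64')}|${JSON.stringify(options)}\n`;
}

export function decodeResponse(line: string): { status: string; payload: string } {
  const [status, payload] = line.trim().split('|');
  return { status, payload: payload || '' };
}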
proxy-server.js:
const net = require('net');
const zookeeper = require('node-zookeeper-client');
const zkManager = require('./zookeeper-manager');
const PORT = process.env.PORT || 9876;
const HOST = '0.0.0.0';
// A simple logger prefixing messages
const log = (message, ...args) => console.log(`[ProxyServer] ${message}`, ...args);
const server = net.createServer((socket) => {
const clientAddr = `${socket.remoteAddress}:${socket.remotePort}`;
log(`Client connected: ${clientAddr}`);
socket.on('data', async (data) => {
const rawCommand = data.toString().trim();
if (!rawCommand) return;
log(`Received command from ${clientAddr}: ${rawCommand}`);
const zkClient = zkManager.getClient();
if (!zkClient) {
socket.write('ERROR|Zookeeper not connected\n');
return;
}
try {
const response = await handleCommand(zkClient, rawCommand);
socket.write(response + '\n');
} catch (error) {
log(`Error processing command "${rawCommand}":`, error.message);
socket.write(`ERROR|${Buffer.from(error.message).toString('base64')}\n`);
}
});
socket.on('end', () => {
log(`Client disconnected: ${clientAddr}`);
});
socket.on('error', (err) => {
log(`Socket error from ${clientAddr}:`, err.message);
});
});
async function handleCommand(client, rawCommand) {
const parts = rawCommand.split('|');
const [command, path, payload, optionsStr] = parts;
if (!command || !path) {
throw new Error('Invalid command format. Expected COMMAND|PATH|...');
}
const data = payload ? Buffer.from(payload, 'base64') : undefined;
const options = optionsStr ? JSON.parse(optionsStr) : {};
switch (command.toUpperCase()) {
case 'CREATE': {
// A pitfall: ephemeral nodes are tied to the proxy's session.
// This means if the proxy dies, the node is gone. This is actually
// the desired behavior for our locking mechanism.
const mode = options.ephemeral
? (options.sequence ? zookeeper.CreateMode.EPHEMERAL_SEQUENTIAL : zookeeper.CreateMode.EPHEMERAL)
: (options.sequence ? zookeeper.CreateMode.PERSISTENT_SEQUENTIAL : zookeeper.CreateMode.PERSISTENT);
return new Promise((resolve, reject) => {
client.create(path, data, mode, (error, createdPath) => {
if (error) {
if (error.getCode() === zookeeper.Exception.NODE_EXISTS) {
return resolve('NODE_EXISTS|');
}
return reject(error);
}
resolve(`OK|${createdPath}`);
});
});
}
case 'GET': {
return new Promise((resolve, reject) => {
client.getData(path, (error, data, stat) => {
if (error) {
if (error.getCode() === zookeeper.Exception.NO_NODE) {
return resolve('NO_NODE|');
}
return reject(error);
}
const payload = data ? data.toString('base64') : '';
resolve(`OK|${payload}`);
});
});
}
// ... Implement other commands like SET, EXISTS, DELETE similarly
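// A sketch of those commands, following the same promise-wrapping pattern
// (a version of -1 means "any version"):
case 'SET': {
  return new Promise((resolve, reject) => {
    client.setData(path, data || Buffer.alloc(0), -1, (error) => {
      if (error) {
        if (error.getCode() === zookeeper.Exception.NO_NODE) {
          return resolve('NO_NODE|');
        }
        return reject(error);
      }
      resolve('OK|');
    });
  });
}
case 'EXISTS': {
  return new Promise((resolve, reject) => {
    client.exists(path, (error, stat) => {
      if (error) {
        return reject(error);
      }
      // stat is null/undefined when the node does not exist.
      resolve(stat ? 'OK|' : 'NO_NODE|');
    });
  });
}
case 'DELETE': {
  return new Promise((resolve, reject) => {
    client.remove(path, -1, (error) => {
      if (error) {
        if (error.getCode() === zookeeper.Exception.NO_NODE) {
          return resolve('NO_NODE|');
        }
        return reject(error);
      }
      resolve('OK|');
    });
  });
}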
default:
throw new Error(`Unknown command: ${command}`);
}
}
server.listen(PORT, HOST, () => {
log(`Zookeeper proxy server listening on ${HOST}:${PORT}`);
// Start the Zookeeper connection manager
zkManager.connect();
});
process.on('SIGTERM', () => {
log('SIGTERM received. Shutting down gracefully.');
server.close(() => {
zkManager.close();
process.exit(0);
});
});
This server code is deliberately simple. It listens for TCP connections, parses our custom protocol, and translates the commands into calls to the node-zookeeper-client library. Error handling is crucial: Zookeeper-specific errors (like NODE_EXISTS) are translated into specific status codes in our protocol, while generic errors are returned as a general ERROR.
Phase 3: The Vercel Function Client
The final piece is the client code within the Vercel Function. This code’s responsibility is to connect to the proxy, send a command, and parse the response. We wrapped this logic in a small utility class to make it reusable across different serverless functions.
api/lib/zookeeper-proxy-client.ts:
import * as net from 'net';
// These would come from environment variables in Vercel
const PROXY_HOST = process.env.ZK_PROXY_HOST!; // The private IP of the proxy service
const PROXY_PORT = parseInt(process.env.ZK_PROXY_PORT!, 10);
interface ProxyResponse {
status: 'OK' | 'ERROR' | 'NODE_EXISTS' | 'NO_NODE';
payload: string;
}
/**
 * A client for the Zookeeper TCP proxy.
 * Each command opens a short-lived socket to the proxy over the private network.
 * That is acceptable here because the expensive, stateful part (the Zookeeper
 * session) lives in the proxy; the per-call TCP connect is cheap, and it keeps
 * this client free of pooling logic. A warm function instance reuses this module
 * (and the client object), but not the sockets themselves.
 */
export class ZookeeperProxyClient {
private host: string;
private port: number;
constructor(host: string, port: number) {
if (!host || !port) {
throw new Error('Proxy host and port must be configured.');
}
this.host = host;
this.port = port;
}
private sendCommand(command: string): Promise<ProxyResponse> {
return new Promise((resolve, reject) => {
const socket = new net.Socket();
let responseReceived = false;
// Set a timeout for the entire operation. This is critical in a serverless environment.
const timeout = setTimeout(() => {
if (!responseReceived) {
socket.destroy();
reject(new Error('Proxy command timed out after 5000ms'));
}
}, 5000);
socket.connect(this.port, this.host, () => {
socket.write(command + '\n');
});
socket.on('data', (data) => {
responseReceived = true;
clearTimeout(timeout);
const responseStr = data.toString().trim();
const [status, payload] = responseStr.split('|');
resolve({
status: status as ProxyResponse['status'],
payload: payload || '',
});
socket.end();
});
socket.on('error', (err) => {
clearTimeout(timeout);
reject(new Error(`Proxy connection error: ${err.message}`));
socket.destroy();
});
socket.on('close', () => {
if (!responseReceived) {
clearTimeout(timeout);
reject(new Error('Proxy connection closed before response'));
}
});
});
}
public async createNode(path: string, data: Buffer, options?: { ephemeral?: boolean; sequence?: boolean }): Promise<{ status: ProxyResponse['status']; path?: string }> {
const optionsStr = options ? JSON.stringify(options) : '';
const command = `CREATE|${path}|${data.toString('base64')}|${optionsStr}`;
const response = await this.sendCommand(command);
return { status: response.status, path: response.payload };
}
public async getData(path: string): Promise<{ status: ProxyResponse['status']; data?: Buffer }> {
const command = `GET|${path}||`;
const response = await this.sendCommand(command);
return {
status: response.status,
data: response.payload ? Buffer.from(response.payload, 'base64') : undefined,
};
}
}
With this client, a Vercel Function can now interact with Zookeeper in a clean, abstracted way. Here is an example of a serverless function acquiring a distributed lock.
api/acquire-job-lock.ts:
import type { VercelRequest, VercelResponse } from '@vercel/node';
import { ZookeeperProxyClient } from './lib/zookeeper-proxy-client';
const proxyClient = new ZookeeperProxyClient(
process.env.ZK_PROXY_HOST!,
parseInt(process.env.ZK_PROXY_PORT!, 10)
);
export default async function handler(
request: VercelRequest,
response: VercelResponse,
) {
const { jobId } = request.query;
if (!jobId || typeof jobId !== 'string') {
return response.status(400).json({ error: 'jobId is required' });
}
const lockPath = `/locks/jobs/${jobId}`;
try {
// Attempt to create an ephemeral node to act as a lock.
// The node being ephemeral is key: it is tied to the proxy's Zookeeper session, so if
// the proxy or its session dies, the lock is automatically released by Zookeeper.
// (A crash of this function alone does not remove it; that requires an explicit delete.)
const result = await proxyClient.createNode(lockPath, Buffer.from('locked'), { ephemeral: true });
if (result.status === 'OK') {
// We successfully acquired the lock.
// ... proceed with critical section logic ...
console.log(`Lock acquired for job ${jobId} at path ${result.path}`);
return response.status(200).json({ status: 'lock_acquired', path: result.path });
} else if (result.status === 'NODE_EXISTS') {
// Lock is already held by another process.
console.log(`Failed to acquire lock for job ${jobId}, already held.`);
return response.status(409).json({ status: 'lock_held' });
} else {
// Some other Zookeeper error occurred.
return response.status(500).json({ error: 'Failed to interact with Zookeeper', details: result.status });
}
} catch (error: any) {
console.error('Error acquiring lock:', error);
return response.status(503).json({ error: 'Service Unavailable', details: error.message });
}
}
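One loose end from the handler above: because the ephemeral node belongs to the proxy's Zookeeper session, the lock outlives the function invocation and is only cleaned up automatically if the proxy's session ends. To release it at the end of the critical section, a function can issue the protocol's DELETE command. A standalone sketch of that call (releaseLock is hypothetical; in practice it would be another method on ZookeeperProxyClient, and it assumes the proxy implements DELETE):

import * as net from 'net';

// Sketch: release a lock by deleting its ZNode through the proxy protocol.
// Resolves with the proxy's status string: 'OK' on success, 'NO_NODE' if the
// lock was already gone (for example, after a proxy session expiry).
export function releaseLock(host: string, port: number, lockPath: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const socket = net.createConnection(port, host, () => {
      socket.write(`DELETE|${lockPath}||\n`);
    });
    socket.setTimeout(5000, () => {
      socket.destroy();
      reject(new Error('Proxy command timed out'));
    });
    socket.once('data', (data) => {
      const [status] = data.toString().trim().split('|');
      socket.end();
      resolve(status);
    });
    socket.once('error', (err) => reject(err));
  });
}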
This solution worked. It provided a secure and reasonably performant bridge between our new serverless world and the legacy stateful system. The latency overhead was measurable—the round trip through the proxy added about 5-10ms compared to a direct client connection within the VPC—but it was predictable and far better than the alternatives.
The final architecture and flow for acquiring a lock:
sequenceDiagram
    participant VF as Vercel Function
    participant VercelNW as Vercel Secure Compute
    participant ZKP as ZK Proxy (Fargate)
    participant ZK as Zookeeper Ensemble
    VF->>VercelNW: net.connect(proxy_private_ip)
    VercelNW->>ZKP: TCP connection established
    VF->>ZKP: Send "CREATE|/locks/job1|...|{ephemeral:true}"
    ZKP-->>ZK: client.create("/locks/job1", ..., EPHEMERAL)
    alt Lock available
        ZK-->>ZKP: Success, path=/locks/job1
        ZKP->>VF: Send "OK|/locks/job1"
    else Lock held
        ZK-->>ZKP: Error: NODE_EXISTS
        ZKP->>VF: Send "NODE_EXISTS|"
    end
    VF->>ZKP: socket.end()
This approach is not without its own set of trade-offs and future considerations. The custom TCP proxy is now a critical piece of infrastructure; it must be made highly available by running multiple instances behind an AWS Network Load Balancer. The custom text protocol is also brittle; a move to gRPC would provide schema enforcement, better performance, and streaming capabilities, which could be useful for Zookeeper watches.
Fundamentally, this architecture is a bridge, not a destination. It allows us to modernize parts of our stack incrementally without being blocked by legacy dependencies. The long-term objective remains to decompose the monolith and replace Zookeeper-based coordination with cloud-native services like AWS Step Functions or DynamoDB’s conditional expressions, which are inherently more compatible with a serverless execution model. For now, however, this pragmatic solution unblocked our development and keeps two very different architectural paradigms working in concert.
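For context on that last point, the DynamoDB analogue of the ephemeral-node lock is a conditional write paired with a TTL. A rough sketch using the AWS SDK v3 (the job-locks table and its attribute names are invented for illustration):

import { DynamoDBClient, PutItemCommand, ConditionalCheckFailedException } from '@aws-sdk/client-dynamodb';

// Sketch: acquire a lock for jobId by writing an item only if it does not already
// exist. The table name "job-locks" and attribute names are illustrative only.
const ddb = new DynamoDBClient({});

export async function acquireJobLock(jobId: string, ttlSeconds = 300): Promise<boolean> {
  try {
    await ddb.send(new PutItemCommand({
      TableName: 'job-locks',
      Item: {
        lockId: { S: `jobs/${jobId}` },
        // A TTL attribute stands in for Zookeeper's session-based cleanup.
        expiresAt: { N: String(Math.floor(Date.now() / 1000) + ttlSeconds) },
      },
      ConditionExpression: 'attribute_not_exists(lockId)',
    }));
    return true; // lock acquired
  } catch (err) {
    if (err instanceof ConditionalCheckFailedException) {
      return false; // lock already held
    }
    throw err;
  }
}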