The operational stability of our Prometheus fleet had degraded to an unacceptable level. The root cause was classic cardinality explosion, driven by an increasingly ephemeral microservices architecture running on Kubernetes. Labels identifying individual pods, replica sets, and transient job identifiers were creating millions of unique time series, leading to memory exhaustion, slow queries, and frequent OOM kills on the Prometheus instances. Simply increasing instance sizes was a losing, expensive battle. A fundamental architectural change was required to pre-process our metrics stream before it ever reached the primary TSDB.
Our design objective was to build a stateless, high-performance ingestion service—a “Cardinality Governor”—that would sit between our metric producers and consumers. This service would intelligently split incoming metrics, routing low-cardinality, aggregated data to Prometheus for dashboarding and alerting, while shunting high-cardinality raw data to a more suitable long-term storage system. The relationships between the metric labels themselves, representing our system’s topology, needed to be stored in a way that was queryable and maintainable.
graph TD
  subgraph Producers
    A[Microservice A] --> N;
    B[Microservice B] --> N;
    C[Ephemeral Job] --> N;
  end
  subgraph Ingestion Pipeline
    N[NATS Message Queue] --> ZG{Zig Cardinality Governor};
  end
  subgraph Data Stores
    ZG -- "1. Resolve/Create Series ID" --> AR[(ArangoDB - Metadata Graph)];
    ZG -- "2. Store Raw Data" --> SC[(ScyllaDB - Raw Metrics)];
    ZG -- "3. Expose Aggregates" --> P([Prometheus TSDB]);
  end
  AR -- "Returns Stable Series ID" --> ZG;
  P -- Scrapes --> ZG;
  style ZG fill:#f9f,stroke:#333,stroke-width:2px
This architecture dictated a specific and somewhat unconventional set of technologies:
- Message Queue (NATS): To act as a durable, high-throughput buffer and decouple producers from the Governor. Its simplicity and performance were key factors.
- ArangoDB: The relationships between labels (e.g., pod -> replicaset -> deployment) form a graph. ArangoDB’s multi-model nature is perfect for storing and querying this metadata topology to generate stable, numerical series identifiers.
- ScyllaDB: For the raw, high-cardinality event stream. Its Cassandra-compatible API and proven low-latency performance at massive scale make it ideal for storing time-series data indexed by the stable series IDs from ArangoDB.
- Zig: The Governor itself had to be exceptionally fast, memory-efficient, and free of GC pauses that could disrupt the ingestion flow. Zig’s focus on performance, safety, and C interoperability made it a superior choice over Go or Java for this specific component.
The core of this post-mortem is the implementation of the Cardinality Governor in Zig.
The Zig Governor: Service Skeleton and Dependencies
The first step was setting up the Zig project structure. We needed clients for NATS, ScyllaDB (via its C/C++ driver), and a standard HTTP client for ArangoDB.
The build.zig file manages these dependencies. A key decision here was to statically link the ScyllaDB driver to produce a single, self-contained binary, simplifying deployment.
build.zig
const std = @import("std");

pub fn build(b: *std.Build) void {
    const target = b.standardTargetOptions(.{});
    const optimize = b.standardOptimizeOption(.{});

    const exe = b.addExecutable(.{
        .name = "cardinality-governor",
        .root_source_file = .{ .path = "src/main.zig" },
        .target = target,
        .optimize = optimize,
    });

    // Add necessary packages
    const nats_client = b.dependency("zig_nats", .{});
    exe.addModule("nats", nats_client.module("nats"));

    // Add ScyllaDB C++ driver dependency
    // In a real project, this would involve fetching and building the driver
    // For simplicity here, we assume pre-compiled libs
    exe.addIncludePath(.{ .path = "/usr/local/include" });
    exe.addLibraryPath(.{ .path = "/usr/local/lib" });
    exe.linkLibC(); // needed when linking against system C/C++ libraries
    exe.linkSystemLibrary("scylla-cpp-driver");
    exe.linkSystemLibrary("crypto");
    exe.linkSystemLibrary("ssl");
    exe.linkSystemLibrary("uv");
    exe.linkSystemLibrary("stdc++");

    b.installArtifact(exe);

    const run_cmd = b.addRunArtifact(exe);
    run_cmd.step.dependOn(b.getInstallStep());
    if (b.args) |args| {
        run_cmd.addArgs(args);
    }
    const run_step = b.step("run", "Run the application");
    run_step.dependOn(&run_cmd.step);

    const unit_tests = b.addTest(.{
        .root_source_file = .{ .path = "src/main.zig" },
        .target = target,
        .optimize = optimize,
    });
    unit_tests.addModule("nats", nats_client.module("nats"));
    unit_tests.linkLibC();
    unit_tests.linkSystemLibrary("scylla-cpp-driver");
    unit_tests.linkSystemLibrary("stdc++");

    const test_step = b.step("test", "Run unit tests");
    test_step.dependOn(&unit_tests.step);
}
This build script outlines the core components: the main executable, a NATS client package, and linking against the system-installed ScyllaDB driver. The pragmatic choice here is leveraging a mature C++ driver instead of writing a native one from scratch, a common pattern when working with newer languages like Zig.
Ingesting Metrics from NATS
The entry point for our service is a subscription to a NATS subject where raw metrics are published. The Zig NATS client provides an asynchronous interface that fits well into a main event loop.
src/nats_subscriber.zig
const std = @import("std");
const nats = @import("nats");
const processor = @import("metric_processor.zig");

const AppState = struct {
    allocator: std.mem.Allocator,
    metric_processor: *processor.MetricProcessor,
};

pub fn start(
    allocator: std.mem.Allocator,
    metric_processor: *processor.MetricProcessor,
    nats_url: []const u8,
    nats_subject: []const u8,
) !void {
    var nc = try nats.Connection.connect(allocator, .{ .url = nats_url });
    defer nc.close();
    std.log.info("Connected to NATS at {s}", .{nats_url});

    var app_state = AppState{
        .allocator = allocator,
        .metric_processor = metric_processor,
    };

    var sub = try nc.subscribe(nats_subject, messageHandler, &app_state);
    defer sub.unsubscribe();

    // Keep the application running. In a real service, this would be a proper event loop.
    std.log.info("Listening for messages on subject '{s}'", .{nats_subject});
    while (true) {
        std.time.sleep(std.time.ns_per_s);
    }
}

fn messageHandler(msg: nats.Message, context: ?*anyopaque) void {
    const app_state: *AppState = @ptrCast(@alignCast(context orelse @panic("Context is null")));
    const payload = msg.payload;
    std.log.debug("Received {d} bytes from subject {s}", .{ payload.len, msg.subject });

    // The payload can contain multiple metrics, separated by newlines.
    var lines = std.mem.splitScalar(u8, payload, '\n');
    while (lines.next()) |line| {
        if (line.len == 0 or line[0] == '#') continue; // Skip empty lines and comments
        if (app_state.metric_processor.processMetricLine(line)) |_| {
            // Successfully processed
        } else |err| {
            std.log.err("Failed to process metric line: {s}, error: {any}", .{ line, err });
        }
    }
}
This module establishes the connection and sets up a callback, messageHandler. The critical part here is delegating the actual parsing and processing logic to a separate MetricProcessor, keeping the concerns of transport and business logic cleanly separated. In a production environment, the while (true) loop would be replaced by a more robust mechanism tied to the async I/O event loop of the application.
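As an aside, one minimal way to make that loop interruptible using only the standard library is to block on a shared std.Thread.ResetEvent that a shutdown hook or supervising thread sets. The names below (shutdown_event, requestShutdown, waitForShutdown) are hypothetical and not part of the service code above:

const std = @import("std");

/// Hypothetical shutdown flag; a signal-handling thread or supervisor would call .set().
var shutdown_event = std.Thread.ResetEvent{};

pub fn requestShutdown() void {
    shutdown_event.set();
}

/// Replacement for the `while (true) sleep` loop in start():
/// blocks until shutdown is requested instead of spinning.
pub fn waitForShutdown() void {
    shutdown_event.wait();
}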
Parsing Prometheus Metrics without Regex
A common performance pitfall is using regular expressions on a hot path. For parsing the Prometheus exposition format, a simple, hand-rolled state machine parser is significantly more performant and uses less memory.
src/metric_parser.zig
const std = @import("std");

pub const Label = struct {
    name: []const u8,
    value: []const u8,
};

pub const ParsedMetric = struct {
    name: []const u8,
    labels: std.ArrayList(Label),
    value: f64,
    timestamp: ?u64 = null,
};

// ... state machine enum definition ...
// enum ParseState { Name, BeforeLabelName, LabelName, AfterLabelName, BeforeLabelValue, InLabelValue, ... }

// This is a simplified parser for demonstration. A production version would be more robust.
pub fn parseMetricLine(allocator: std.mem.Allocator, line: []const u8) !ParsedMetric {
    var result = ParsedMetric{
        .name = "",
        .labels = std.ArrayList(Label).init(allocator),
        .value = 0.0,
    };
    errdefer result.labels.deinit();

    // Metric names may contain letters, digits, underscores, and colons.
    var cursor: usize = 0;
    while (cursor < line.len and (std.ascii.isAlphanumeric(line[cursor]) or line[cursor] == '_' or line[cursor] == ':')) : (cursor += 1) {}
    result.name = line[0..cursor];
    if (cursor == line.len) return error.InvalidFormat;

    // Parse labels if they exist
    if (line[cursor] == '{') {
        cursor += 1;
        while (cursor < line.len and line[cursor] != '}') {
            while (cursor < line.len and std.ascii.isWhitespace(line[cursor])) : (cursor += 1) {}
            const label_name_start = cursor;
            // Label names may contain letters, digits, and underscores.
            while (cursor < line.len and (std.ascii.isAlphanumeric(line[cursor]) or line[cursor] == '_')) : (cursor += 1) {}
            const label_name = line[label_name_start..cursor];
            if (cursor + 1 >= line.len or line[cursor] != '=' or line[cursor + 1] != '"') return error.InvalidLabelFormat;
            cursor += 2;
            const label_value_start = cursor;
            while (cursor < line.len and line[cursor] != '"') : (cursor += 1) {}
            const label_value = line[label_value_start..cursor];
            try result.labels.append(.{ .name = label_name, .value = label_value });
            cursor += 1; // skip closing quote
            if (cursor < line.len and line[cursor] == ',') {
                cursor += 1;
            }
        }
        cursor += 1; // skip '}'
    }

    while (cursor < line.len and std.ascii.isWhitespace(line[cursor])) : (cursor += 1) {}
    const value_start = cursor;
    while (cursor < line.len and (std.ascii.isDigit(line[cursor]) or line[cursor] == '.' or line[cursor] == '-')) : (cursor += 1) {}
    result.value = try std.fmt.parseFloat(f64, line[value_start..cursor]);

    // Optional timestamp parsing would go here
    return result;
}

test "metric parser" {
    const allocator = std.testing.allocator;
    const line = "http_requests_total{method=\"POST\",code=\"200\",pod_name=\"abc-123\"} 1027";
    var metric = try parseMetricLine(allocator, line);
    defer metric.labels.deinit();

    try std.testing.expectEqualStrings("http_requests_total", metric.name);
    try std.testing.expectEqual(@as(f64, 1027), metric.value);
    try std.testing.expectEqual(@as(usize, 3), metric.labels.items.len);
    try std.testing.expectEqualStrings("method", metric.labels.items[0].name);
    try std.testing.expectEqualStrings("POST", metric.labels.items[0].value);
    try std.testing.expectEqualStrings("pod_name", metric.labels.items[2].name);
    try std.testing.expectEqualStrings("abc-123", metric.labels.items[2].value);
}
This parser is purpose-built and avoids the overhead of a generic regex engine. The unit test demonstrates its correctness for a typical metric line. The real value is its predictable performance inside the tight loop of the message handler.
The Core Logic: Cardinality Splitting and ArangoDB Interaction
This is the heart of the Governor. It separates labels into “stable” and “volatile” sets based on a predefined configuration. The stable labels are used to query ArangoDB for a persistent, numerical series ID.
src/metric_processor.zig
const std = @import("std");
const parser = @import("metric_parser.zig");
// Other imports for ArangoDB client, ScyllaDB client etc.

const StableLabelSet = std.ArrayList([]const u8);

pub const MetricProcessor = struct {
    allocator: std.mem.Allocator,
    stable_label_keys: std.StringArrayHashMapUnmanaged(void),
    arangodb_client: *ArangoDBClient, // Assume this client exists
    scylladb_client: *ScyllaDBClient, // Assume this client exists
    // In-memory cache for series IDs would be here
    series_id_cache: std.AutoHashMap(u64, u64),

    pub fn init(...) !*MetricProcessor { ... }
    pub fn deinit(self: *MetricProcessor) void { ... }

    // Comparator for sorting labels by name, giving a canonical order for hashing.
    fn labelLessThan(_: void, a: parser.Label, b: parser.Label) bool {
        return std.mem.lessThan(u8, a.name, b.name);
    }

    pub fn processMetricLine(self: *MetricProcessor, line: []const u8) !void {
        var metric = try parser.parseMetricLine(self.allocator, line);
        defer metric.labels.deinit();

        var stable_labels = std.ArrayList(parser.Label).init(self.allocator);
        defer stable_labels.deinit();
        var volatile_labels = std.ArrayList(parser.Label).init(self.allocator);
        defer volatile_labels.deinit();

        // 1. Split labels based on the configured stable keys
        for (metric.labels.items) |label| {
            if (self.stable_label_keys.contains(label.name)) {
                try stable_labels.append(label);
            } else {
                try volatile_labels.append(label);
            }
        }

        // 2. Generate a stable hash from metric name + stable labels
        var stable_hasher = std.hash.Wyhash.init(0);
        stable_hasher.update(metric.name);
        // Important: Sort stable labels by name to ensure a consistent hash
        std.sort.block(parser.Label, stable_labels.items, {}, labelLessThan);
        for (stable_labels.items) |label| {
            stable_hasher.update(label.name);
            stable_hasher.update(label.value);
        }
        const stable_hash = stable_hasher.final();

        // 3. Check local cache first. A real implementation would use a concurrent hash map.
        if (self.series_id_cache.get(stable_hash)) |series_id| {
            // Cache hit, proceed to write to ScyllaDB
            try self.scylladb_client.writeRawMetric(series_id, metric.value, volatile_labels.items);
            return;
        }

        // 4. Cache miss: Query/create series ID in ArangoDB
        const series_id = try self.arangodb_client.getOrCreateSeriesId(metric.name, stable_labels.items);

        // 5. Update cache and write to ScyllaDB
        try self.series_id_cache.put(stable_hash, series_id);
        try self.scylladb_client.writeRawMetric(series_id, metric.value, volatile_labels.items);

        // 6. Update in-memory aggregates for Prometheus scraping (not shown)
        // self.updatePrometheusAggregates(metric.name, stable_labels, metric.value);
    }
};
The logic is straightforward but powerful:
- Segregate: Labels are split. For http_requests_total{method="POST",pod="abc-123"}, method might be stable, but pod would be volatile.
- Identify: A hash is computed from the metric name and the sorted stable labels. Sorting is critical for a canonical representation.
- Cache: An in-memory cache is checked first to avoid hitting the database for every metric.
- Resolve: On a cache miss, we query ArangoDB. The AQL query for this is an UPSERT, which is atomic and efficient.
ArangoDB AQL Query (executed via HTTP client from Zig):
UPSERT { _key: @stable_hash_string }
INSERT { _key: @stable_hash_string, name: @metric_name, labels: @stable_labels, series_id: @new_series_id }
UPDATE {}
IN timeseries_metadata
RETURN { series_id: NEW ? NEW.series_id : OLD.series_id }
This AQL query uses the stable hash as the document key. If it exists, it does nothing. If it doesn’t, it inserts a new document containing the metric metadata and a new, unique series_id (which could be generated by a separate ArangoDB counter or by the Governor itself). It always returns the series_id, whether old or new. This atomic operation is the cornerstone of generating consistent IDs.
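For context, here is a minimal sketch of what the ArangoDB side of the Governor could look like, submitting the UPSERT above to ArangoDB's documented /_api/cursor HTTP endpoint. The postJson helper, and the extra stable_hash / candidate_id parameters (the Governor generating its own candidate ID, one of the options mentioned above), are assumptions for illustration rather than the exact client used in production:

const std = @import("std");
const parser = @import("metric_parser.zig");

pub const ArangoDBClient = struct {
    allocator: std.mem.Allocator,
    base_url: []const u8,

    const upsert_aql =
        \\UPSERT { _key: @stable_hash_string }
        \\INSERT { _key: @stable_hash_string, name: @metric_name, labels: @stable_labels, series_id: @new_series_id }
        \\UPDATE {}
        \\IN timeseries_metadata
        \\RETURN { series_id: NEW ? NEW.series_id : OLD.series_id }
    ;

    pub fn getOrCreateSeriesId(
        self: *ArangoDBClient,
        metric_name: []const u8,
        stable_labels: []const parser.Label,
        stable_hash: u64,
        candidate_id: u64,
    ) !u64 {
        // ArangoDB document keys are strings, so render the stable hash as text.
        var key_buf: [32]u8 = undefined;
        const key = try std.fmt.bufPrint(&key_buf, "{d}", .{stable_hash});

        // Request body for POST /_api/cursor: the AQL text plus its bind variables.
        var body = std.ArrayList(u8).init(self.allocator);
        defer body.deinit();
        try std.json.stringify(.{
            .query = upsert_aql,
            .bindVars = .{
                .stable_hash_string = key,
                .metric_name = metric_name,
                .stable_labels = stable_labels,
                .new_series_id = candidate_id,
            },
        }, .{}, body.writer());

        // postJson is a hypothetical helper performing the authenticated HTTP POST
        // and parsing the response body as JSON.
        var response = try self.postJson("/_api/cursor", body.items);
        defer response.deinit();

        // Documents produced by RETURN arrive in the cursor's "result" array.
        const first = response.value.object.get("result").?.array.items[0];
        return @intCast(first.object.get("series_id").?.integer);
    }

    // Placeholder so the sketch is self-contained; the real implementation would wrap
    // std.http (or libcurl) plus ArangoDB authentication.
    fn postJson(self: *ArangoDBClient, path: []const u8, body: []const u8) !std.json.Parsed(std.json.Value) {
        _ = self;
        _ = path;
        _ = body;
        return error.NotImplemented;
    }
};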
Storing Raw Data in ScyllaDB
With a stable series_id, the raw data point is written to ScyllaDB. The schema is optimized for time-series queries.
ScyllaDB Table Schema (CQL):
CREATE KEYSPACE IF NOT EXISTS observability WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'replication_factor' : 3 };

USE observability;

CREATE TABLE IF NOT EXISTS raw_metrics (
    series_id bigint,
    ts timestamp,
    value double,
    volatile_labels map<text, text>,
    PRIMARY KEY (series_id, ts)
) WITH CLUSTERING ORDER BY (ts DESC);
The partition key is series_id, meaning all data points for a single stable time series are co-located on the same node and sorted by timestamp. This makes queries for a specific series over a time range extremely fast.
Zig ScyllaDB Client Snippet (wrapping the C++ driver):
const std = @import("std");
const parser = @import("metric_parser.zig");

const scylla = @cImport({
    @cInclude("cassandra.h");
});

pub const ScyllaDBClient = struct {
    cluster: *scylla.CassCluster,
    session: *scylla.CassSession,
    // Prepared statement for insertion
    insert_statement: *const scylla.CassPrepared,

    pub fn writeRawMetric(self: *ScyllaDBClient, series_id: u64, value: f64, volatile_labels: []const parser.Label) !void {
        const statement = scylla.cass_prepared_bind(self.insert_statement);
        defer scylla.cass_statement_free(statement);

        // Bind the parameters
        _ = scylla.cass_statement_bind_int64(statement, 0, @as(i64, @intCast(series_id)));
        // CQL `timestamp` columns expect milliseconds since the epoch.
        _ = scylla.cass_statement_bind_int64(statement, 1, std.time.milliTimestamp()); // ts
        _ = scylla.cass_statement_bind_double(statement, 2, value);

        // Build the volatile labels map
        const collection = scylla.cass_collection_new(scylla.CASS_COLLECTION_TYPE_MAP, volatile_labels.len);
        defer scylla.cass_collection_free(collection);
        for (volatile_labels) |label| {
            _ = scylla.cass_collection_append_string_n(collection, label.name.ptr, label.name.len);
            _ = scylla.cass_collection_append_string_n(collection, label.value.ptr, label.value.len);
        }
        _ = scylla.cass_statement_bind_collection(statement, 3, collection);

        const future = scylla.cass_session_execute(self.session, statement);
        defer scylla.cass_future_free(future);

        // A production implementation would use async execution, not blocking waits.
        scylla.cass_future_wait(future);
        if (scylla.cass_future_error_code(future) != scylla.CASS_OK) {
            var message: [*c]const u8 = undefined;
            var message_length: usize = undefined;
            scylla.cass_future_error_message(future, &message, &message_length);
            std.log.err("ScyllaDB write failed: {s}", .{message[0..message_length]});
            return error.ScyllaDBWriteFailed;
        }
    }
};
This demonstrates the interaction with the C++ driver from Zig. The pitfall here is the complexity of C interop. Careful memory management is required, as we are now responsible for freeing objects created by the C library (cass_statement_free, etc.). Using prepared statements is a non-negotiable performance best practice when dealing with databases like ScyllaDB.
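For completeness, here is a sketch of how that client could be initialized, assuming the function lives inside the same ScyllaDBClient struct shown above: connect once, prepare the INSERT once, and reuse the prepared statement on the hot path. The CQL text and error handling are condensed for illustration:

    // Would live inside the ScyllaDBClient struct from the previous snippet.
    pub fn init(contact_points: [*:0]const u8) !ScyllaDBClient {
        const cluster = scylla.cass_cluster_new();
        _ = scylla.cass_cluster_set_contact_points(cluster, contact_points);

        const session = scylla.cass_session_new();
        const connect_future = scylla.cass_session_connect(session, cluster);
        defer scylla.cass_future_free(connect_future);
        // cass_future_error_code blocks until the future resolves.
        if (scylla.cass_future_error_code(connect_future) != scylla.CASS_OK) return error.ScyllaDBConnectFailed;

        const insert_cql = "INSERT INTO observability.raw_metrics (series_id, ts, value, volatile_labels) VALUES (?, ?, ?, ?)";
        const prepare_future = scylla.cass_session_prepare(session, insert_cql);
        defer scylla.cass_future_free(prepare_future);
        if (scylla.cass_future_error_code(prepare_future) != scylla.CASS_OK) return error.ScyllaDBPrepareFailed;

        return ScyllaDBClient{
            .cluster = cluster,
            .session = session,
            .insert_statement = scylla.cass_future_get_prepared(prepare_future),
        };
    }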
Limitations and Future Work
This system successfully solved our primary pain point: Prometheus stability. Ingestion is now decoupled and resilient, and we have a rich, queryable dataset in ScyllaDB and ArangoDB for deep forensic analysis. However, the architecture is not without its own set of trade-offs and complexities.
The query path is now federated. A simple query may require joining data from Prometheus (for recent, aggregated trends), ScyllaDB (for raw historical data), and ArangoDB (to resolve series IDs back to human-readable labels). This is a significant operational burden. The next logical evolution is to build a unified query layer, potentially using something like Cortex or building a custom proxy that understands how to query the different backends and merge the results.
Furthermore, the Zig Governor, while performant, is a single point of failure in its current form. Making it horizontally scalable requires a solution for sharing the series ID cache. A distributed cache like Redis or Hazelcast could work, but introduces another dependency. A more elegant solution might involve using consistent hashing to shard the metric stream across multiple Governor instances, ensuring that metrics for the same stable series always land on the same instance, thus keeping the cache local and effective.
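As a sketch of that idea (not something we have built yet), rendezvous (highest-random-weight) hashing over the stable series hash would give each series a deterministic home instance without any coordination; the instance names here are hypothetical:

const std = @import("std");

// Rendezvous hashing: each series hash deterministically maps to one Governor
// instance, so its series-ID cache stays local and effective.
fn pickGovernorInstance(instances: []const []const u8, stable_hash: u64) usize {
    var best_index: usize = 0;
    var best_weight: u64 = 0;
    for (instances, 0..) |instance, i| {
        var hasher = std.hash.Wyhash.init(stable_hash);
        hasher.update(instance);
        const weight = hasher.final();
        if (i == 0 or weight > best_weight) {
            best_weight = weight;
            best_index = i;
        }
    }
    return best_index;
}

test "same series hash always maps to the same instance" {
    const instances = [_][]const u8{ "governor-0", "governor-1", "governor-2" };
    const first = pickGovernorInstance(&instances, 0xDEADBEEF);
    const second = pickGovernorInstance(&instances, 0xDEADBEEF);
    try std.testing.expectEqual(first, second);
}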
Finally, the reliance on a C++ driver in Zig adds build complexity and a layer of unsafe code. As the native Zig ecosystem matures, migrating to a pure Zig driver for ScyllaDB would be a significant win for maintainability and safety.