The Prometheus remote-write endpoint was being throttled, and queries against labels like customer_id or session_id were timing out consistently. Our standard observability stack, built around Prometheus, was failing under the load of a new event-driven service generating millions of time series. The core issue was high-cardinality labels. For each unique combination of labels, Prometheus creates a new time series. When a label has unbounded values, the TSDB index balloons, memory usage spikes, and performance plummets. In a real-world project, this isn’t a theoretical problem; it’s a production outage waiting to happen. Our alerting was compromised, and forensic analysis of user-specific issues became impossible.
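To make the failure mode concrete, here is a minimal sketch of how a single unbounded label multiplies series, using the prometheus/client_golang instrumentation library (the metric and label names are illustrative, not our real instrumentation):
// cardinality_sketch.go -- illustrative only; not part of our production services.
package main

import (
    "fmt"

    "github.com/prometheus/client_golang/prometheus"
)

func main() {
    // One metric definition...
    requests := prometheus.NewCounterVec(
        prometheus.CounterOpts{Name: "http_requests_total", Help: "Total HTTP requests."},
        []string{"customer_id"}, // ...but an unbounded label.
    )
    prometheus.MustRegister(requests)

    // Every distinct customer_id value materializes a brand-new time series.
    for i := 0; i < 1_000_000; i++ {
        requests.WithLabelValues(fmt.Sprintf("cust-%d", i)).Inc()
    }
    // The scrape now exposes roughly a million series for a single metric name,
    // which is exactly what balloons the TSDB index and memory on the Prometheus side.
}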
Our initial concept was to build a dual-backend system. We’d keep Prometheus for what it excels at: low-cardinality, high-frequency metrics essential for real-time alerting and operational dashboards. For the high-cardinality data needed for deep analytics and debugging, we needed a different tool. We chose ClickHouse. Its columnar storage engine is exceptionally well-suited for the type of OLAP queries we needed to run over massive metric datasets. The plan was to create a data pipeline that could intelligently route metrics to the correct backend.
This decision cascaded into several others. To manage the configuration of this new distributed fleet of metric-routing agents, we leveraged our existing Zookeeper cluster. It was already a core part of our infrastructure for service discovery and configuration, making it a pragmatic choice. For the user interface, the requirements were split. Our Site Reliability Engineering (SRE) team needed a high-performance, native desktop tool for intensive analysis, leading us to SwiftUI for a macOS application. Concurrently, we needed a lightweight, embeddable web component for broader visibility within our existing internal portals, which led us to experiment with Turbopack for its promised build speed and development experience.
This is the log of how we built that system, from the data plane up to the user-facing clients.
The Data Plane: Schema Design in ClickHouse
The first step was to design a table in ClickHouse capable of efficiently storing and querying time-series data. A common mistake is to treat a columnar database like a relational one. For metrics, a denormalized, wide table is often the most performant approach. We settled on a single table for all metrics, leveraging the power of the MergeTree engine family.
Here is the core table schema. The choice of ReplicatedReplacingMergeTree is deliberate: it provides data replication across nodes, and it collapses rows that share the same sorting key during background merges, which absorbs the duplicate data points that can appear during pipeline failures and retries.
-- clickhouse-schema.sql
CREATE TABLE IF NOT EXISTS metrics.timeseries_dist ON CLUSTER '{cluster}'
(
-- The timestamp is the primary dimension for partitioning.
-- Using DateTime64 with nanosecond precision is crucial for high-frequency data.
`timestamp` DateTime64(9),
-- The metric name itself. This is a low-cardinality field and a primary query filter.
`name` LowCardinality(String),
-- Metric value. Float64 is standard for Prometheus.
`value` Float64,
-- Labels are stored in a Map. This is the key to handling high cardinality.
-- Instead of creating columns for each label, we store them in a flexible key-value structure.
-- This avoids schema alterations and handles sparse labels efficiently.
`labels` Map(LowCardinality(String), String),
-- A materialized column containing a hash of the labels map.
-- This is a critical optimization for the ORDER BY key. Sorting and merging
-- on a fixed-size hash is significantly faster than on a variable-size map.
`labels_hash` UInt64 MATERIALIZED sipHash64(labels)
) ENGINE = ReplicatedReplacingMergeTree('/clickhouse/tables/{shard}/metrics/timeseries_dist', '{replica}')
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (name, labels_hash, timestamp)
SETTINGS index_granularity = 8192;
The ORDER BY clause is the most critical part of this schema for performance. By ordering by (name, labels_hash, timestamp), we physically co-locate all data points for a single time series on disk. When a query filters by name and a set of labels (which we can hash to get labels_hash), ClickHouse can perform a very fast primary key scan and read a minimal amount of data. Partitioning by day (toYYYYMMDD(timestamp)) is a standard practice that aids in data management and retention policies.
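To illustrate how a query lines up with that sort key, here is a sketch of the kind of point lookup the gateway can issue when the full label set for a series is known. The helper and the label set are hypothetical; when only some labels are known, we fall back to filtering on the labels map as shown in the translator later.
// gateway/point_lookup_sketch.go -- illustrative sketch, not the production query path.
package gateway

import "fmt"

// pointLookupSQL builds a query for one fully-specified series. Because the complete
// label set is known, sipHash64 over the same map can be computed by ClickHouse itself,
// so the filter matches the (name, labels_hash, timestamp) ORDER BY key and the scan
// only touches the granules belonging to that series.
// (Worth verifying that the server-side hash of a map literal matches the materialized
// column before relying on this in production.)
func pointLookupSQL(name, labelsMapSQL, start, end string) string {
    return fmt.Sprintf(`
SELECT timestamp, value
FROM metrics.timeseries_dist
WHERE name = '%s'
  AND labels_hash = sipHash64(%s)
  AND timestamp BETWEEN toDateTime64('%s', 9) AND toDateTime64('%s', 9)
ORDER BY timestamp`, name, labelsMapSQL, start, end)
}

// Example:
//   pointLookupSQL("http_requests_total",
//       "map('customer_id', '42', 'region', 'eu-west-1')",
//       "2024-01-01 00:00:00", "2024-01-01 01:00:00")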
The Control Plane: Zookeeper for Dynamic Agent Configuration
With the storage backend defined, we needed to build the pipeline to get data into it. We designed a small, stateless agent in Go that would receive data via Prometheus’s remote_write protocol. The key requirement was the ability to reconfigure these agents dynamically, adding new filtering rules or changing routing logic, without a full deployment.
This is where Zookeeper comes in. Each agent, upon startup, registers itself and pulls its configuration from a specific ZNode path. It then places a watch on that ZNode, allowing for near-instantaneous configuration updates pushed from a central control point.
The ZNode structure looks like this:
/observability/agents/
├── config/
│ ├── agent-group-01.json
│ └── agent-group-02.json
└── assignments/
├── {agent-instance-id-1} -> agent-group-01.json
└── {agent-instance-id-2} -> agent-group-01.json
An agent’s configuration, stored as JSON in the config path, defines which metrics to route to ClickHouse.
// agent-group-01.json
{
"version": "1.2.0",
"clickhouse_endpoints": ["tcp://ch-node1:9000", "tcp://ch-node2:9000"],
"routing_rules": [
{
"description": "Route all metrics with a customer_id label to ClickHouse",
"matchers": [
{
"type": "EQUAL",
"name": "__name__",
"value": "http_requests_total"
},
{
"type": "EXISTS",
"name": "customer_id"
}
],
"action": "ROUTE_TO_CLICKHOUSE"
},
{
"description": "Drop high-frequency internal metrics",
"matchers": [
{
"type": "EQUAL",
"name": "__name__",
"value": "internal_rpc_latency_bucket"
}
],
"action": "DROP"
}
],
"default_action": "ROUTE_TO_PROMETHEUS"
}
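Rolling out a change to a whole agent group then amounts to writing new JSON to that group's config znode; every agent watching it picks the change up almost immediately. A minimal sketch of that control-side push (the hostnames and file path are illustrative, and the real controller is out of scope for this post) might look like this:
// controller/push_config_sketch.go -- illustrative sketch of the central control side.
package main

import (
    "log"
    "os"
    "time"

    "github.com/go-zookeeper/zk"
)

func main() {
    conn, _, err := zk.Connect([]string{"zk1:2181", "zk2:2181", "zk3:2181"}, 5*time.Second)
    if err != nil {
        log.Fatalf("zookeeper connect failed: %v", err)
    }
    defer conn.Close()

    newConfig, err := os.ReadFile("agent-group-01.json")
    if err != nil {
        log.Fatalf("reading config file: %v", err)
    }

    path := "/observability/config/agent-group-01.json"
    // Fetch the current version so the Set is a compare-and-swap: a concurrent
    // update by someone else makes this call fail instead of being silently overwritten.
    _, stat, err := conn.Get(path)
    if err != nil {
        log.Fatalf("reading current config znode: %v", err)
    }
    if _, err := conn.Set(path, newConfig, stat.Version); err != nil {
        log.Fatalf("pushing config: %v", err)
    }
    // Every agent watching this znode receives the change event and reloads.
    log.Printf("pushed new config to %s", path)
}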
The Go agent uses the go-zookeeper/zk client library to manage this. The code for initialization and live-reloading is critical for operational stability.
// agent/config/zookeeper_manager.go
package config
import (
"encoding/json"
"fmt"
"log"
"strings"
"time"

"github.com/go-zookeeper/zk"
)
type ZookeeperManager struct {
conn *zk.Conn
agentID string
configPath string
currentConfig *AgentConfig
configChan chan *AgentConfig
}
func NewZookeeperManager(servers []string, agentID string) (*ZookeeperManager, error) {
conn, _, err := zk.Connect(servers, 5*time.Second)
if err != nil {
return nil, fmt.Errorf("failed to connect to Zookeeper: %w", err)
}
manager := &ZookeeperManager{
conn: conn,
agentID: agentID,
configChan: make(chan *AgentConfig, 1), // Buffered channel to avoid blocking
}
// The assignment znode is ephemeral; its data tells the agent which config file to load.
assignmentPath := fmt.Sprintf("/observability/assignments/%s", agentID)
// A real implementation would have a more robust mechanism to assign configs.
// For this example, we assume it's pre-assigned or done by another process.
// We'll hardcode the assignment for demonstration purposes.
configFileName := "agent-group-01.json"
// Register the agent's assignment (could be done by a controller).
// Only the parent chain is pre-created here: creating the assignment node itself
// would make it persistent, and the ephemeral Create below would then hit
// ErrNodeExists without ever writing the assignment data.
if err := manager.ensurePath("/observability/assignments"); err != nil {
log.Printf("Could not ensure assignment parent path: %v", err)
}
_, err = manager.conn.Create(assignmentPath, []byte(configFileName), zk.FlagEphemeral, zk.WorldACL(zk.PermAll))
if err != nil && err != zk.ErrNodeExists {
return nil, fmt.Errorf("failed to create assignment znode: %w", err)
}
manager.configPath = fmt.Sprintf("/observability/config/%s", configFileName)
log.Printf("Agent %s assigned to config: %s", agentID, manager.configPath)
return manager, nil
}
// ensurePath creates every component of the given path as a persistent znode,
// skipping nodes that already exist, so the parent chain for the ephemeral
// assignment node is guaranteed to be present.
func (zm *ZookeeperManager) ensurePath(path string) error {
current := ""
for _, part := range strings.Split(strings.Trim(path, "/"), "/") {
current += "/" + part
exists, _, err := zm.conn.Exists(current)
if err != nil {
return err
}
if exists {
continue
}
if _, err := zm.conn.Create(current, []byte{}, 0, zk.WorldACL(zk.PermAll)); err != nil && err != zk.ErrNodeExists {
return err
}
}
return nil
}
// Start performs the initial configuration load, then watches for changes and
// pushes new configurations to the returned channel.
func (zm *ZookeeperManager) Start() (<-chan *AgentConfig, error) {
// Perform the initial load synchronously so the caller starts with a valid config
// before the watch loop takes over.
data, _, err := zm.conn.Get(zm.configPath)
if err != nil {
return nil, fmt.Errorf("initial config load failed: %w", err)
}
var initialConfig AgentConfig
if err := json.Unmarshal(data, &initialConfig); err != nil {
return nil, fmt.Errorf("initial config unmarshal failed: %w", err)
}
zm.currentConfig = &initialConfig
zm.configChan <- &initialConfig
// Start the watch loop only after the initial load, so currentConfig is touched
// by a single goroutine from here on.
go func() {
for {
// Get the config data and set a watch.
// The watch is a one-time trigger: the event channel receives a notification
// when the data on the node changes, and the loop re-sets the watch after each trigger.
data, _, eventChan, err := zm.conn.GetW(zm.configPath)
if err != nil {
log.Printf("Error getting config from Zookeeper: %v. Retrying in 10s.", err)
time.Sleep(10 * time.Second)
continue
}
var newConfig AgentConfig
if err := json.Unmarshal(data, &newConfig); err != nil {
log.Printf("Error unmarshalling config JSON: %v. Using previous config.", err)
} else if zm.currentConfig == nil || zm.currentConfig.Version != newConfig.Version {
// Compare versions to avoid pushing identical configs.
// A real-world implementation would use a content hash.
log.Printf("New configuration loaded, version: %s", newConfig.Version)
zm.currentConfig = &newConfig
zm.configChan <- &newConfig
}
// Block here until a change event occurs on the watched znode.
evt := <-eventChan
log.Printf("Zookeeper event received: %+v. Reloading configuration.", evt)
}
}()
return zm.configChan, nil
}
// Close gracefully closes the Zookeeper connection.
func (zm *ZookeeperManager) Close() {
zm.conn.Close()
}
// AgentConfig struct (simplified for brevity)
type AgentConfig struct {
Version string `json:"version"`
// ... other fields like ClickhouseEndpoints, RoutingRules
}
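One piece deliberately omitted above is how the agent applies routing_rules to an incoming series. Here is a sketch of that evaluation, with types that mirror the JSON shown earlier; the first-match-wins semantics are an assumption for illustration rather than something shown in the code above.
// agent/routing/rules_sketch.go -- illustrative sketch of rule evaluation, not the exact production code.
package routing

// Matcher and RoutingRule mirror the JSON config shown earlier.
type Matcher struct {
    Type  string `json:"type"`  // "EQUAL" or "EXISTS"
    Name  string `json:"name"`  // label name; "__name__" targets the metric name
    Value string `json:"value"` // only used for EQUAL
}

type RoutingRule struct {
    Description string    `json:"description"`
    Matchers    []Matcher `json:"matchers"`
    Action      string    `json:"action"` // ROUTE_TO_CLICKHOUSE, ROUTE_TO_PROMETHEUS, DROP
}

// Decide returns the action for one series, given its labels (including __name__).
// All matchers in a rule must match; the first matching rule wins.
func Decide(labels map[string]string, rules []RoutingRule, defaultAction string) string {
    for _, rule := range rules {
        matched := true
        for _, m := range rule.Matchers {
            v, ok := labels[m.Name]
            switch m.Type {
            case "EXISTS":
                matched = matched && ok
            case "EQUAL":
                matched = matched && ok && v == m.Value
            default:
                matched = false // unknown matcher types never match
            }
        }
        if matched {
            return rule.Action
        }
    }
    return defaultAction
}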
The Query Gateway: A Unified API Layer
With data flowing into two separate systems, we needed to present a unified query interface to the frontends. A simple backend service, also written in Go, was created for this purpose. It accepts a PromQL-like query and decides whether to proxy it to Prometheus or translate it into SQL for ClickHouse.
The routing logic is straightforward but effective: if the query contains a known high-cardinality label or the time range exceeds the retention period of Prometheus (e.g., 7 days), it’s routed to ClickHouse. Otherwise, it goes to Prometheus.
graph TD
    subgraph SRE/Developer
        A[SwiftUI Client]
        B[Turbopack Web Component]
    end
    subgraph Backend
        C[Query Gateway API]
    end
    subgraph Data Stores
        D[Prometheus]
        E[ClickHouse Cluster]
    end
    subgraph Ingestion Pipeline
        F[Services]
        G[Metrics Agent]
        H[Zookeeper]
    end
    A -->|gRPC/REST| C
    B -->|REST API| C
    C -->|PromQL| D
    C -->|SQL| E
    D -.->|Alerts| I{Alertmanager}
    F -->|remote_write| G
    G -->|low-cardinality| D
    G -->|high-cardinality| E
    H -.->|Config Watch| G
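A sketch of that routing decision, with the high-cardinality label list and the 7-day retention window hard-coded for illustration (in practice both would come from configuration), looks roughly like this:
// gateway/routing_sketch.go -- illustrative sketch of the backend selection described above.
package gateway

import "time"

// Labels we treat as high-cardinality; the real list would come from configuration.
var highCardinalityLabels = map[string]bool{
    "customer_id": true,
    "session_id":  true,
}

// How long Prometheus keeps data locally.
const prometheusRetention = 7 * 24 * time.Hour

// selectBackend returns "clickhouse" if the query touches a high-cardinality label
// or reaches back beyond Prometheus's retention window, and "prometheus" otherwise.
func selectBackend(labelNames []string, rangeStart time.Time) string {
    for _, name := range labelNames {
        if highCardinalityLabels[name] {
            return "clickhouse"
        }
    }
    if time.Since(rangeStart) > prometheusRetention {
        return "clickhouse"
    }
    return "prometheus"
}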
The most complex piece of this gateway is the translation from PromQL to ClickHouse SQL. We didn’t implement a full parser; instead, we handled the most common query patterns used by our teams.
// gateway/query_translator.go
package gateway
import (
"fmt"
"strings"
"time"
)
// A simplified model of a parsed PromQL query
type PromQuery struct {
MetricName string
LabelMatchers map[string]string
TimeRangeStart time.Time
TimeRangeEnd time.Time
IsRate bool
RateWindow time.Duration
}
// Translate converts a simplified PromQuery model into a ClickHouse SQL query.
// This is not a full PromQL parser, but handles common cases.
func (q *PromQuery) TranslateToClickHouseSQL() string {
var whereClauses []string
whereClauses = append(whereClauses, fmt.Sprintf("name = '%s'", q.MetricName))
whereClauses = append(whereClauses, fmt.Sprintf("timestamp >= toDateTime64('%s', 9)", q.TimeRangeStart.Format(time.RFC3339Nano)))
whereClauses = append(whereClauses, fmt.Sprintf("timestamp <= toDateTime64('%s', 9)", q.TimeRangeEnd.Format(time.RFC3339Nano)))
for label, value := range q.LabelMatchers {
// The `labels` map in ClickHouse allows for flexible querying on arbitrary label names.
// Values are interpolated directly here for brevity; a production gateway must
// escape or parameterize them to avoid SQL injection.
whereClauses = append(whereClauses, fmt.Sprintf("labels['%s'] = '%s'", label, value))
}
whereStatement := strings.Join(whereClauses, " AND ")
// A real implementation of rate() is much more complex: it must group points per
// series and difference consecutive samples, handling counter resets along the way.
// This simplified version only approximates the increase per window as max - min,
// which is enough to show the concept.
if q.IsRate {
return fmt.Sprintf(`
SELECT
    t,
    if(increase > 0, increase / %.0f, 0) AS value
FROM (
    SELECT
        toStartOfInterval(timestamp, INTERVAL %d SECOND) AS t,
        max(value) - min(value) AS increase
    FROM metrics.timeseries_dist
    WHERE %s
    GROUP BY t
)
ORDER BY t
`, q.RateWindow.Seconds(), int(q.RateWindow.Seconds()), whereStatement)
}
// Simple time series selection
return fmt.Sprintf(`
SELECT
timestamp,
value,
labels
FROM metrics.timeseries_dist
WHERE %s
ORDER BY timestamp
`, whereStatement)
}
The pitfall here is underestimating the complexity of accurately replicating PromQL functions like rate() or histogram_quantile() in SQL. Our initial implementation was naive and produced slightly different results. A production-grade system requires significant investment in this translation layer to ensure consistency.
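To give a sense of the investment required, a more faithful increase() has to operate per series and difference consecutive samples. One rough shape such a translation could take, expressed the same way the translator emits SQL (the array-function approach and the crude handling of counter resets are a sketch that would still need validation against real data, not what we shipped):
// gateway/rate_sketch.go -- a sketch of a per-series increase() translation; untested against production data.
package gateway

// perSeriesIncreaseSQL groups samples by series (labels_hash), sorts each series'
// values by timestamp, takes consecutive differences, and drops negative deltas as a
// crude way of ignoring counter resets. %s stands for the WHERE clause the translator builds.
const perSeriesIncreaseSQL = `
SELECT
    labels_hash,
    arraySum(
        arrayMap(d -> if(d < 0, 0, d),
                 arrayDifference(arraySort((v, t) -> t, groupArray(value), groupArray(timestamp))))
    ) AS increase
FROM metrics.timeseries_dist
WHERE %s
GROUP BY labels_hash
`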
The Native Frontend: SwiftUI for High-Fidelity Analysis
For the SRE team, a responsive UI that can handle streaming large volumes of data without stuttering was paramount. SwiftUI, combined with the Combine framework, was a natural fit. We defined a ViewModel that communicates with our Query Gateway and publishes data streams that the UI can subscribe to.
// ObservabilityViewModel.swift
import Foundation
import Combine
class ObservabilityViewModel: ObservableObject {
// @Published properties automatically notify SwiftUI views of any changes.
@Published var dataPoints: [DataPoint] = []
@Published var isLoading: Bool = false
@Published var errorMessage: String? = nil
private var cancellables = Set<AnyCancellable>()
private let apiClient: APIClient
struct DataPoint: Identifiable {
let id = UUID()
let timestamp: Date
let value: Double
}
init(apiClient: APIClient = APIClient()) {
self.apiClient = apiClient
}
func executeQuery(query: String, start: Date, end: Date) {
self.isLoading = true
self.errorMessage = nil
self.dataPoints = []
apiClient.fetchMetrics(query: query, start: start, end: end)
// Perform the subscription (and upstream work) off the main thread.
.subscribe(on: DispatchQueue.global(qos: .userInitiated))
// Receive results on the main thread to update the UI.
.receive(on: DispatchQueue.main)
.sink(receiveCompletion: { [weak self] completion in
self?.isLoading = false
switch completion {
case .failure(let error):
// A proper implementation would have user-friendly error messages.
self?.errorMessage = "Failed to fetch data: \(error.localizedDescription)"
print("Error fetching metrics: \(error)")
case .finished:
break
}
}, receiveValue: { [weak self] fetchedPoints in
// Map the API response model to our UI model.
self?.dataPoints = fetchedPoints.map {
DataPoint(timestamp: $0.timestamp, value: $0.value)
}
})
// Store the subscription to keep it alive.
.store(in: &cancellables)
}
}
// A simplified API Client to interact with the Query Gateway
struct APIClient {
// This would typically use URLSession and Codable to decode JSON.
func fetchMetrics(query: String, start: Date, end: Date) -> AnyPublisher<[APIResponseDataPoint], Error> {
// ... implementation of network request to Query Gateway ...
// For this example, we return a mock publisher.
let mockData = [
APIResponseDataPoint(timestamp: Date().addingTimeInterval(-100), value: 50.0),
APIResponseDataPoint(timestamp: Date().addingTimeInterval(-50), value: 75.0),
APIResponseDataPoint(timestamp: Date(), value: 60.0)
]
return Just(mockData)
.setFailureType(to: Error.self)
.delay(for: 1.0, scheduler: DispatchQueue.main) // Simulate network latency
.eraseToAnyPublisher()
}
}
struct APIResponseDataPoint: Decodable {
let timestamp: Date
let value: Double
}
The SwiftUI view itself becomes declarative and simple, focusing on layout and presentation while the ViewModel handles the state and logic. This separation of concerns is critical for building a maintainable application.
The Web Component: Turbopack for Rapid Iteration
While the SwiftUI app served our power users, we needed a more accessible view for other engineers. We chose to build a React component and used Turbopack as the bundler. The primary driver for this choice was developer velocity. In a large monorepo, the slow feedback loop of traditional bundlers can be a significant drag on productivity.
The component’s structure is standard for a React application. We use a data-fetching library like React Query to handle state management, caching, and background refetching.
// components/MetricsChart.tsx
import React from 'react';
import { useQuery } from '@tanstack/react-query';
// Assuming a charting library like Recharts or D3 is used
// import { LineChart, Line, XAxis, YAxis, Tooltip } from 'recharts';
interface MetricsChartProps {
query: string;
timeRange: { start: string; end: string; };
}
const fetchMetricsData = async (query: string, timeRange: { start: string; end: string; }) => {
// A real implementation would validate inputs and handle API specifics.
const apiUrl = `/api/query?query=${encodeURIComponent(query)}&start=${timeRange.start}&end=${timeRange.end}`;
const response = await fetch(apiUrl);
if (!response.ok) {
// In a production app, error logging/reporting would happen here.
throw new Error(`Network response was not ok: ${response.statusText}`);
}
return response.json();
};
export const MetricsChart: React.FC<MetricsChartProps> = ({ query, timeRange }) => {
const { data, error, isLoading, isError } = useQuery({
// The query key ensures that React Query refetches when props change.
queryKey: ['metrics', query, timeRange],
queryFn: () => fetchMetricsData(query, timeRange),
// Configuration like stale time and cache time is important for performance.
staleTime: 5 * 60 * 1000, // 5 minutes
refetchOnWindowFocus: true,
});
if (isLoading) {
return <div>Loading chart data...</div>;
}
if (isError) {
// A more sophisticated error boundary component would be used in a real app.
return <div>Error loading data: {(error as Error).message}</div>;
}
// Render the chart with the fetched data.
// The charting library implementation is omitted for brevity.
return (
<div className="chart-container">
{/* <LineChart data={data}> ... </LineChart> */}
<pre>{JSON.stringify(data, null, 2)}</pre>
</div>
);
};
The experience with Turbopack was notable primarily for its speed. Hot Module Replacement (HMR) felt instantaneous, even as the component grew in complexity. While it was still in beta and lacked some of the plugin ecosystem of Webpack, for a focused component library like ours, the performance gain was a clear win.
The architecture we developed is not without its limitations. The PromQL-to-SQL translation layer remains a significant technical challenge and a potential source of inconsistency. Relying on Zookeeper, while pragmatic for us, introduces a stateful dependency that requires careful operational management. Furthermore, the system currently only addresses metrics, leaving logs and traces to be handled by separate stacks. The logical next step is to integrate a logging solution like Loki or a tracing system like Jaeger, routing their data into ClickHouse as well to create a truly unified observability platform. This would require extending the Query Gateway to support LogQL and trace queries, further increasing its complexity but also its value.