Implementing a High-Throughput Observability Ingestion Service with Micronaut and ClickHouse under Podman


The project mandate was clear: build a zero-instrumentation observability pipeline. We were drowning in bespoke application metrics and logs, each with its own format, and blind to kernel-level network interactions. Traditional APM agents were deemed too invasive, carrying performance overhead and requiring constant updates across a polyglot microservices landscape. The technical pain point wasn’t a lack of data, but a lack of unified, low-overhead, and universally applicable visibility.

Our initial concept gravitated towards eBPF. By tapping directly into kernel syscalls, we could capture a pristine stream of network events—TCP connections, HTTP requests, DNS lookups—for every process on a host, without modifying a single line of application code. This raw data stream, however, presented the real engineering challenge: building a performant, scalable, and resource-efficient pipeline to ingest, store, and query terabytes of event data in near real-time.

Technology Selection Rationale

Choosing the right tools for the ingestion and analytics backend was critical. Every component had to be lean and purposeful.

  • Ingestion Service: Micronaut. The service responsible for receiving batched eBPF events needed to be incredibly lightweight. Its sole purpose: validate, transform, and write data to the database. A standard Spring Boot application, with its reflective runtime and higher memory footprint, felt like overkill. Micronaut, with its ahead-of-time (AOT) compilation model, promised minimal memory usage and near-instant startup times. This was a perfect fit for a containerized utility service, especially when considering a native image build with GraalVM.

  • Analytics Database: ClickHouse. The data volume from eBPF is immense. We anticipated tens of millions of events per minute from our fleet. A traditional row-based RDBMS like PostgreSQL would crumble under this write load and the analytical query patterns (e.g., “show me the P99 latency for all HTTP requests to service X, grouped by source service, in the last 15 minutes”). ClickHouse, a columnar store built from the ground up for OLAP, was the obvious choice. Its ability to ingest data at massive scale while providing millisecond-latency aggregations was precisely what we needed.

  • Containerization: Podman. Our infrastructure is primarily RHEL-based. Podman offered a daemonless, rootless container workflow that was a significant security improvement over the Docker daemon model. For an observability platform that handles potentially sensitive operational data, running services as a non-root user within a user namespace is a non-negotiable security posture. The goal was to prove the entire stack could be managed declaratively using podman-compose.

  • Visualization UI: UnoCSS. The front-end was a simple dashboard to visualize the aggregated data. We explicitly wanted to avoid heavy frameworks like React or Angular. The UI is a thin layer served directly by the Micronaut application. UnoCSS provided an on-demand, atomic CSS engine, allowing for a custom, highly performant UI with minimal CSS bloat and no complex front-end build chain. It keeps the entire solution self-contained and lean.

The proposed architecture is straightforward: an (unspecified) eBPF agent pushes JSON event batches to the Micronaut ingestion service. The service asynchronously writes these events into ClickHouse. A separate API endpoint on the same Micronaut service queries ClickHouse to provide aggregated data for a simple, static HTML/JS dashboard styled with UnoCSS. The whole system runs in containers managed by Podman.

graph TD
    subgraph Host Machine
        eBPF_Agent[eBPF Agent on Kernel]
    end

    subgraph Podman Pod
        IngestionService[Micronaut Ingestion Service]
        ClickHouseDB[(ClickHouse)]
    end

    subgraph Browser
        Dashboard[UI with UnoCSS]
    end

    eBPF_Agent -- Raw Events (JSON Batch) --> IngestionService
    IngestionService -- JDBC Batch Insert --> ClickHouseDB
    Dashboard -- REST API Call --> IngestionService
    IngestionService -- Aggregation Query --> ClickHouseDB
    IngestionService -- Aggregated JSON --> Dashboard

Environment Setup with Podman Compose

The first step is defining the entire stack declaratively. In a real-world project, stringing together manual podman run commands is brittle and hard to reproduce, so we use podman-compose for local development and testing.

Here is the podman-compose.yml file. A key detail is that both services share the compose-created network, so the Micronaut service can resolve ClickHouse by its service name, clickhouse-server.

# podman-compose.yml
version: '3.8'

services:
  clickhouse-server:
    image: clickhouse/clickhouse-server:23.8
    container_name: clickhouse-server
    ports:
      - "8123:8123" # HTTP interface
      - "9000:9000" # Native client interface
    ulimits:
      nproc: 65535
      nofile:
        soft: 262144
        hard: 262144
    volumes:
      - clickhouse_data:/var/lib/clickhouse
      - ./config/clickhouse/users.xml:/etc/clickhouse-server/users.xml:ro
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:8123/ping"]
      interval: 30s
      timeout: 10s
      retries: 3

  ingestion-service:
    # We will build this image locally using the Containerfile
    image: localhost/ingestion-service:0.1
    container_name: ingestion-service
    build:
      context: .
      dockerfile: Containerfile
    ports:
      - "8080:8080"
    depends_on:
      clickhouse-server:
        condition: service_healthy
    environment:
      # Micronaut configuration via environment variables
      CLICKHOUSE_URL: "jdbc:ch://clickhouse-server:8123/default"
      CLICKHOUSE_USER: "default"
      CLICKHOUSE_PASSWORD: "" # No password for local dev
      MICRONAUT_SERVER_PORT: 8080

volumes:
  clickhouse_data:

The custom users.xml for ClickHouse is a good practice even for local development, allowing us to manage users and profiles without altering the default configuration.

ClickHouse Schema Design: The Foundation of Performance

The performance of the entire system hinges on the ClickHouse table schema. A poorly designed schema will lead to slow writes and even slower queries, regardless of hardware. Our raw event data is modeled as network flows.

The critical decisions are the partitioning key (PARTITION BY) and the sorting/primary key (ORDER BY).

  • Partitioning: We partition by month (toYYYYMM(timestamp)). This is a coarse-grained mechanism to group data into manageable chunks on disk. It helps with maintenance tasks like dropping old data.
  • Sorting Key: This is the most important optimization. ClickHouse stores data physically sorted by this key. Queries that filter or aggregate on the prefix of this key are extremely fast. We chose (source_service, dest_service, event_type, timestamp). This structure optimizes for our most common query pattern: “analyze traffic between two specific services for a given event type within a time range.”

Here is the DDL for our network_events table:

-- Executed via clickhouse-client or any DB tool
CREATE TABLE default.network_events
(
    `timestamp` DateTime64(3, 'UTC'),
    `trace_id` String,
    `source_service` LowCardinality(String),
    `source_ip` String,
    `dest_service` LowCardinality(String),
    `dest_ip` String,
    `dest_port` UInt16,
    `event_type` Enum8('HTTP_REQUEST' = 1, 'TCP_CONNECT' = 2, 'DNS_LOOKUP' = 3),
    `latency_micros` UInt64,
    `http_status` Nullable(UInt16),
    `http_method` Nullable(LowCardinality(String))
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (source_service, dest_service, event_type, timestamp)
TTL timestamp + INTERVAL 30 DAY;

A common mistake is to ignore the LowCardinality data type. For columns with a limited number of unique string values (like service names or HTTP methods), this dramatically reduces storage footprint and speeds up queries by using dictionary encoding internally. The TTL clause is a production-grade necessity for managing data retention automatically.

The Micronaut Ingestion Service

The core of our pipeline is the Micronaut application. It’s built with Java and Gradle.

1. Project Structure and Dependencies

The build.gradle.kts file includes the necessary dependencies. We use micronaut-jdbc-hikari for connection pooling and the official clickhouse-jdbc driver.

// build.gradle.kts (partial)
plugins {
    id("com.github.johnrengelman.shadow") version "8.1.1"
    id("io.micronaut.application") version "4.1.0"
    id("io.micronaut.aot") version "4.1.0"
}

dependencies {
    implementation("io.micronaut:micronaut-http-client")
    implementation("io.micronaut.serde:micronaut-serde-jackson")
    implementation("io.micronaut.sql:micronaut-jdbc-hikari")
    runtimeOnly("com.clickhouse:clickhouse-jdbc:0.5.0:all")
    runtimeOnly("ch.qos.logback:logback-classic")
}

2. The Ingestion Controller

This controller exposes a single endpoint, /ingest, which accepts a batch of events. The critical design choice here is to avoid blocking the HTTP worker thread with database operations. We immediately hand off the processing to a separate thread pool dedicated to I/O tasks. A common pitfall in high-throughput services is saturating the event loop with blocking calls.

// src/main/java/com/observability/ingest/IngestionController.java
package com.observability.ingest;

import io.micronaut.http.HttpResponse;
import io.micronaut.http.annotation.Body;
import io.micronaut.http.annotation.Controller;
import io.micronaut.http.annotation.Post;
import io.micronaut.scheduling.TaskExecutors;
import io.micronaut.scheduling.annotation.ExecuteOn;
import jakarta.inject.Inject;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.List;

@Controller("/ingest")
public class IngestionController {

    private static final Logger LOG = LoggerFactory.getLogger(IngestionController.class);

    private final EventRepository eventRepository;

    @Inject
    public IngestionController(EventRepository eventRepository) {
        this.eventRepository = eventRepository;
    }

    @Post
    @ExecuteOn(TaskExecutors.IO) // VERY IMPORTANT: Offload blocking work from the event loop
    public HttpResponse<Void> ingestEvents(@Body List<NetworkEvent> events) {
        if (events == null || events.isEmpty()) {
            return HttpResponse.badRequest();
        }

        try {
            // The repository handles the actual batch insertion logic
            eventRepository.saveBatch(events);
            if (LOG.isDebugEnabled()) {
                LOG.debug("Successfully ingested batch of {} events", events.size());
            }
            return HttpResponse.accepted();
        } catch (Exception e) {
            LOG.error("Failed to ingest event batch", e);
            // In a production system, you might push to a dead-letter queue here
            return HttpResponse.serverError();
        }
    }
}

// Simple DTO for deserialization
// import com.fasterxml.jackson.annotation.JsonProperty;
// record NetworkEvent(...) {}
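
Filling in that placeholder, here is a minimal sketch of the NetworkEvent record. The field names and types are inferred from the accessors that EventRepository (shown in the next section) relies on; the @Serdeable annotation from micronaut-serde and the exact JSON mapping of the agent's payload are assumptions.

// src/main/java/com/observability/ingest/NetworkEvent.java (illustrative sketch)
package com.observability.ingest;

import io.micronaut.serde.annotation.Serdeable;
import java.time.Instant;

@Serdeable // reflection-free (de)serialization, which also matters for the native image later
public record NetworkEvent(
        Instant timestamp,
        String traceId,
        String sourceService,
        String sourceIp,
        String destService,
        String destIp,
        int destPort,
        String eventType,
        long latencyMicros,
        Integer httpStatus,   // nullable for non-HTTP events
        String httpMethod     // nullable for non-HTTP events
) {}

If the agent emits snake_case field names, Jackson annotations such as @JsonProperty("source_service") would be needed on the corresponding record components.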

3. The ClickHouse Repository: Batching is Key

The EventRepository is where the database interaction happens. The single most important performance factor when writing to ClickHouse is batching: every INSERT creates a new data part that ClickHouse must merge in the background, so inserting rows one at a time will destroy performance. The JDBC driver provides an efficient way to do this.

// src/main/java/com/observability/ingest/EventRepository.java
package com.observability.ingest;

import jakarta.inject.Singleton;
import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.util.List;
import java.util.Objects;

@Singleton
public class EventRepository {

    private final DataSource dataSource;

    public EventRepository(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public void saveBatch(List<NetworkEvent> events) {
        // A common mistake is not using a try-with-resources statement here, leading to connection leaks.
        try (Connection connection = dataSource.getConnection()) {
            // Using the ClickHouse specific batching is more efficient
            // but for portability, standard JDBC batching is shown here.
            connection.setAutoCommit(false);

            String sql = "INSERT INTO network_events (timestamp, trace_id, source_service, source_ip, dest_service, dest_ip, dest_port, event_type, latency_micros, http_status, http_method) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)";

            try (PreparedStatement ps = connection.prepareStatement(sql)) {
                for (NetworkEvent event : events) {
                    ps.setTimestamp(1, Timestamp.from(event.timestamp()));
                    ps.setString(2, event.traceId());
                    ps.setString(3, event.sourceService());
                    ps.setString(4, event.sourceIp());
                    ps.setString(5, event.destService());
                    ps.setString(6, event.destIp());
                    ps.setInt(7, event.destPort());
                    ps.setString(8, event.eventType());
                    ps.setLong(9, event.latencyMicros());

                    // Handle nullable fields carefully to avoid NullPointerExceptions
                    if (event.httpStatus() != null) {
                        ps.setInt(10, event.httpStatus());
                    } else {
                        ps.setNull(10, java.sql.Types.INTEGER);
                    }
                    ps.setString(11, event.httpMethod());

                    ps.addBatch();
                }
                ps.executeBatch();
                connection.commit();
            } catch (SQLException e) {
                connection.rollback();
                throw new RuntimeException("Error executing batch insert", e);
            }
        } catch (SQLException e) {
            throw new RuntimeException("Could not get database connection", e);
        }
    }
}

Error handling is crucial. If any part of the batch fails, we roll back the entire transaction. In a more resilient architecture, this would also involve a retry mechanism with exponential backoff or forwarding the failed batch to a persistent queue.
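
As a rough sketch of that idea, the wrapper below retries saveBatch a bounded number of times with exponential backoff before surfacing the failure. The class and its tuning values (RetryingEventWriter, MAX_ATTEMPTS, BASE_DELAY) are illustrative only; micronaut-retry's @Retryable annotation or a proper dead-letter queue would be the more production-ready choices.

// src/main/java/com/observability/ingest/RetryingEventWriter.java (illustrative sketch)
package com.observability.ingest;

import jakarta.inject.Singleton;
import java.time.Duration;
import java.util.List;

@Singleton
public class RetryingEventWriter {

    private static final int MAX_ATTEMPTS = 3;
    private static final Duration BASE_DELAY = Duration.ofMillis(200);

    private final EventRepository eventRepository;

    public RetryingEventWriter(EventRepository eventRepository) {
        this.eventRepository = eventRepository;
    }

    public void writeWithRetry(List<NetworkEvent> events) {
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                eventRepository.saveBatch(events);
                return; // batch persisted successfully
            } catch (RuntimeException e) {
                lastFailure = e;
                if (attempt == MAX_ATTEMPTS) {
                    break; // out of attempts
                }
                try {
                    // Exponential backoff: 200 ms, 400 ms, 800 ms, ...
                    Thread.sleep(BASE_DELAY.toMillis() << (attempt - 1));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        // This is where a persistent dead-letter queue would take over.
        throw new RuntimeException("Batch insert failed after " + MAX_ATTEMPTS + " attempts", lastFailure);
    }
}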

4. The API Endpoint for Visualization

To feed the dashboard, we need an endpoint that runs analytical queries against ClickHouse.

// src/main/java/com/observability/api/MetricsController.java
package com.observability.api;

import io.micronaut.http.annotation.Controller;
import io.micronaut.http.annotation.Get;
import io.micronaut.http.annotation.QueryValue;
import jakarta.inject.Inject;
import javax.sql.DataSource;
import java.sql.*;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

@Controller("/api/metrics")
public class MetricsController {

    private final DataSource dataSource;

    @Inject
    public MetricsController(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    @Get("/service-latency")
    public List<Map<String, Object>> getServiceLatency(@QueryValue String destService) {
        // dest_service and event_type appear in the sorting key, so ClickHouse can prune granules,
        // even though the key prefix itself is source_service
        String query = """
            SELECT
                source_service,
                count() AS request_count,
                avg(latency_micros) AS avg_latency_micros,
                quantile(0.95)(latency_micros) AS p95_latency,
                quantile(0.99)(latency_micros) AS p99_latency
            FROM network_events
            WHERE dest_service = ?
              AND event_type = 'HTTP_REQUEST'
              AND timestamp >= now() - INTERVAL 1 HOUR
            GROUP BY source_service
            ORDER BY request_count DESC
            LIMIT 10
        """;
        List<Map<String, Object>> results = new ArrayList<>();
        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(query)) {
            ps.setString(1, destService);
            ResultSet rs = ps.executeQuery();
            ResultSetMetaData metaData = rs.getMetaData();
            int columnCount = metaData.getColumnCount();

            while (rs.next()) {
                Map<String, Object> row = new HashMap<>();
                for (int i = 1; i <= columnCount; i++) {
                    row.put(metaData.getColumnName(i), rs.getObject(i));
                }
                results.add(row);
            }
        } catch (SQLException e) {
            throw new RuntimeException("Failed to query metrics", e);
        }
        return results;
    }
}

This query is a prime example of why we chose ClickHouse. Calculating quantiles (P95, P99) over millions of rows would be prohibitively slow in most other databases, but ClickHouse handles it with ease.

Building a Lean Container with AOT

Micronaut’s real power shines when building a native executable with GraalVM. This produces a self-contained binary with no JVM dependency, resulting in a tiny container image and drastically reduced startup time and memory consumption.

The Containerfile is deceptively simple, but it leverages a multi-stage build.

# Containerfile
# Stage 1: Build the native executable using the GraalVM builder image
FROM ghcr.io/graalvm/native-image-community:17-ol8 as graal
WORKDIR /home/app

COPY . .
# The Micronaut Gradle plugin simplifies this. It creates a nativeCompile task.
RUN ./gradlew nativeCompile

# Stage 2: Create the final, minimal image
# We use a distroless image for a minimal attack surface.
FROM gcr.io/distroless/cc-debian12
WORKDIR /app
COPY --from=graal /home/app/build/native/nativeCompile/ingestion-service .

# Metadata
EXPOSE 8080
ENTRYPOINT ["/app/ingestion-service"]

After building with podman build -t localhost/ingestion-service:0.1 ., the resulting image is often under 100MB, compared to a 300-400MB image for a traditional fat-JAR JVM application. The service starts in milliseconds instead of seconds. For a high-density microservices environment, this resource efficiency is a massive operational win.

The UnoCSS Dashboard

The final piece is the visualization layer. We don’t need a complex Single Page Application. Micronaut can serve static assets directly from the src/main/resources/public directory.

The index.html is minimal. It includes the UnoCSS reset and the main script via a CDN for simplicity (in a production build, this would be self-hosted).

<!-- src/main/resources/public/index.html -->
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Service Dashboard</title>
    <script src="https://cdn.jsdelivr.net/npm/@unocss/runtime"></script>
    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/@unocss/reset/tailwind.css">
</head>
<body class="bg-gray-900 text-gray-200 font-sans">
    <div class="container mx-auto p-4">
        <h1 class="text-3xl font-bold mb-4">Service Call Latency</h1>
        <div class="bg-gray-800 p-4 rounded-lg shadow-lg">
            <!-- Table will be populated by JavaScript -->
            <table class="w-full text-left">
                <thead class="border-b border-gray-600">
                    <tr>
                        <th class="p-2">Source Service</th>
                        <th class="p-2">Request Count</th>
                        <th class="p-2">Avg Latency (µs)</th>
                        <th class="p-2">P95 Latency (µs)</th>
                    </tr>
                </thead>
                <tbody id="metrics-table">
                    <!-- Data Rows Here -->
                </tbody>
            </table>
        </div>
    </div>
    <script>
        async function fetchMetrics() {
            // A simple fetch call to our backend API
            const response = await fetch('/api/metrics/service-latency?destService=api-gateway');
            const data = await response.json();
            const tableBody = document.getElementById('metrics-table');
            tableBody.innerHTML = ''; // Clear old data
            data.forEach(row => {
                tableBody.innerHTML += `
                    <tr class="border-b border-gray-700">
                        <td class="p-2 font-mono">${row.source_service}</td>
                        <td class="p-2">${row.request_count}</td>
                        <td class="p-2">${Math.round(row.avg_latency_micros)}</td>
                        <td class="p-2">${row.p95_latency}</td>
                    </tr>
                `;
            });
        }
        setInterval(fetchMetrics, 5000);
        fetchMetrics();
    </script>
</body>
</html>

The beauty of UnoCSS is visible in the class attributes like bg-gray-900, text-gray-200, container, mx-auto, p-4. There is no separate CSS file to manage. The runtime script generates the necessary styles on the fly based on the classes it finds in the HTML. This is perfect for a small, self-contained dashboard.

Lingering Issues and Future Iterations

This implementation serves as a robust proof-of-concept, but a production-grade system would require several enhancements. The current direct-to-database ingestion path is a single point of failure and lacks backpressure handling. The logical next step is to introduce a message broker like Apache Kafka or Pulsar between the eBPF agent and the Micronaut service. This would decouple the components, provide durable storage for events, and allow the ingestion service to consume data at its own pace.

Furthermore, the ClickHouse deployment is a single node. For high availability and scalability, a clustered setup with ZooKeeper (or the built-in ClickHouse Keeper) for coordination and data replication is necessary. The schema itself could be enhanced with Materialized Views to pre-aggregate data for the most common dashboard queries, reducing query latency even further. Finally, the security of the API endpoints is non-existent; adding authentication and authorization would be a prerequisite for any real-world deployment.

