The mandate was clear: gain real-time visibility into pod-to-pod network connections across our Kubernetes clusters. The constraint was the problem: we could not afford the performance overhead of service mesh sidecars, nor did we have the organizational bandwidth to enforce application-level instrumentation across hundreds of microservices. We needed a solution that was transparent to applications, lightweight, and provided kernel-level ground truth. This is the log of how we built it.
Technical Pain Point and Initial Concept
Our primary operational blindness was with transient network connections. A pod might spin up, make a brief connection to a database or another service, and then terminate. Standard metrics often missed these ephemeral events. Debugging network policy issues was a nightmare of tcpdump sessions and log correlation. We needed to see every connect() call, in real time, attributed to the correct source and destination pods.
The initial concept was a two-part system:
- A Kernel-Level Collector: A daemon running on every Kubernetes node that could intercept network system calls without modifying the application or its container.
- A High-Performance UI: A web-based dashboard that could handle a high volume of real-time events from all nodes without collapsing under rendering pressure.
This immediately pushed us toward a specific, albeit unconventional, combination of technologies.
Technology Selection Rationale
For the collector, eBPF was the only viable choice. It allows us to run sandboxed programs in the Linux kernel, attaching them to hooks like system calls (syscalls) and tracepoints. This provides the raw kernel-level data we need without any application changes or in-path proxies. We chose Go to write the userspace agent that loads and manages the eBPF program, primarily for its strong concurrency primitives, robust Kubernetes client libraries (client-go), and the existence of mature eBPF libraries such as cilium/ebpf (which the agent below uses) and libbpfgo.
For the front end, the requirements were performance and developer efficiency. The dashboard would be displaying a potentially high-velocity stream of data. A traditional Single Page Application (SPA) felt like overkill and could introduce significant client-side overhead. We selected Astro. Its island architecture allows for creating mostly static HTML sites with isolated pockets of client-side JavaScript. This was perfect; the dashboard shell could be static and load instantly, while the real-time data table would be a single, self-contained interactive “island.”
To build the UI for this island, writing components from scratch would be a waste of time. We needed a comprehensive, production-ready component library. Ant Design was selected for its rich set of components, particularly its powerful and configurable Table, which we knew would be the core of our visualization. The main technical risk was integrating a React-based library like Ant Design cleanly into Astro’s architecture.
The entire system would, of course, run on Kubernetes. The collector agent would be deployed as a DaemonSet to ensure it ran on every node, and the Astro front end would be a standard Deployment and Service.
Here is the high-level architecture we landed on:
graph TD
  subgraph Kubernetes Cluster
    subgraph Node 1
      Kubelet1[Kubelet]
      PodA[App Pod A]
      PodB[App Pod B]
      Agent1[eBPF Agent Pod] -- Manages --> eBPF1[eBPF Program @ Kernel]
    end
    subgraph Node 2
      Kubelet2[Kubelet]
      PodC[App Pod C]
      PodD[App Pod D]
      Agent2[eBPF Agent Pod] -- Manages --> eBPF2[eBPF Program @ Kernel]
    end
  end
  eBPF1 -- Captures syscalls --> Agent1
  eBPF2 -- Captures syscalls --> Agent2
  Agent1 -- Enriches with Pod Metadata --> WSS1[WebSocket Stream]
  Agent2 -- Enriches with Pod Metadata --> WSS2[WebSocket Stream]
  WSS1 --> K8sIngress[K8s Ingress/LoadBalancer]
  WSS2 --> K8sIngress
  subgraph User Browser
    AstroUI[Astro + Ant Design UI]
  end
  K8sIngress -- Serves WebSocket --> AstroUI
  PodA -- TCP Connect --> PodD
  style Agent1 fill:#282c34,stroke:#61dafb,stroke-width:2px,color:#fff
  style Agent2 fill:#282c34,stroke:#61dafb,stroke-width:2px,color:#fff
Step 1: The eBPF Kernel Collector
The core of the system is the eBPF program. We decided to write this in C, as it’s the most direct way to interact with kernel helpers and data structures. The goal is to hook the kernel’s TCP connect path, specifically the tcp_v4_connect function, to capture outbound IPv4 connections.
Here is the C code for our eBPF program (bpf_bpfel_x86_64.c). In a real-world project, this would be generated from a higher-level tool or written more carefully, but for this log, the direct C is clearer.
// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>
char LICENSE[] SEC("license") = "Dual BSD/GPL";
// Define the data structure that will be sent from kernel to userspace.
// This is the contract between our eBPF program and the Go agent.
struct event {
__u32 pid;
__u32 ppid;
__u32 uid;
__u32 saddr;
__u32 daddr;
__u16 dport;
char comm[16]; // TASK_COMM_LEN
};
// The ring buffer is the modern and efficient way to send data to userspace.
// BPF_MAP_TYPE_RINGBUF is preferred over BPF_MAP_TYPE_PERF_EVENT_ARRAY.
struct {
__uint(type, BPF_MAP_TYPE_RINGBUF);
__uint(max_entries, 256 * 1024); // 256 KB
} rb SEC(".maps");
// Helper function to submit an event
static __always_inline int submit_event(void *ctx, __u32 saddr, __u32 daddr, __u16 dport) {
struct event *e;
e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
if (!e) {
return 0; // Failed to reserve space
}
    // Get process metadata
    __u64 id = bpf_get_current_pid_tgid();
    e->pid = id >> 32;
    // The parent PID comes from the current task_struct.
    struct task_struct *task = (struct task_struct *)bpf_get_current_task();
    e->ppid = BPF_CORE_READ(task, real_parent, tgid);
    // The lower 32 bits of the helper's return value hold the UID.
    e->uid = (__u32)bpf_get_current_uid_gid();
    bpf_get_current_comm(&e->comm, sizeof(e->comm));
// Fill network data
e->saddr = saddr;
e->daddr = daddr;
e->dport = dport;
bpf_ringbuf_submit(e, 0);
return 0;
}
// Attach to the entry of the tcp_v4_connect kernel function.
// This is where the kernel prepares to make a TCP connection for IPv4.
SEC("kprobe/tcp_v4_connect")
int BPF_KPROBE(kprobe__tcp_v4_connect, struct sock *sk)
{
__u16 dport;
__u32 saddr, daddr;
// Use BPF helpers to read kernel struct members safely.
saddr = BPF_CORE_READ(sk, __sk_common.skc_rcv_saddr);
daddr = BPF_CORE_READ(sk, __sk_common.skc_daddr);
dport = BPF_CORE_READ(sk, __sk_common.skc_dport);
// The destination port is network byte order, so we need to flip it.
return submit_event(ctx, saddr, daddr, bpf_ntohs(dport));
}
This C code is then compiled into an eBPF object file using clang; in our setup the bpf2go go:generate directive in the agent drives that compilation. We then need a userspace program to load it.
Step 2: The Go Userspace Agent
The Go agent has several responsibilities:
- Load the compiled eBPF object file into the kernel.
- Attach the eBPF programs to the specified kprobes.
- Listen for events from the eBPF ring buffer.
- For each event, enrich it with Kubernetes metadata (like pod name, namespace, etc.). This is the critical step that makes the data useful.
- Serve this enriched data over a WebSocket connection.
Here is a significant portion of the Go agent (main.go). It uses cilium/ebpf for loading and interacting with the eBPF program and gorilla/websocket for the server.
package main
import (
"bytes"
"encoding/binary"
"errors"
"log"
"net"
"net/http"
"os"
"os/signal"
"sync"
"syscall"
"github.com/cilium/ebpf/link"
"github.com/cilium/ebpf/ringbuf"
"github.com/gorilla/websocket"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/rest"
)
//go:generate go run github.com/cilium/ebpf/cmd/bpf2go bpf bpf_bpfel_x86_64.c -- -I./headers
// This is the Go representation of the C struct in our eBPF code.
// It MUST match perfectly.
type Event struct {
PID uint32
PPID uint32
UID uint32
Saddr uint32
Daddr uint32
Dport uint16
Comm [16]byte
}
// This is the enriched event we'll send over WebSocket.
type EnrichedEvent struct {
Timestamp string `json:"timestamp"`
NodeName string `json:"nodeName"`
SourcePod string `json:"sourcePod"`
SourceNS string `json:"sourceNamespace"`
DestAddress string `json:"destAddress"`
DestPort uint16 `json:"destPort"`
ProcessName string `json:"processName"`
}
var upgrader = websocket.Upgrader{
CheckOrigin: func(r *http.Request) bool {
return true // In production, lock this down.
},
}
// A simple connection manager for our WebSocket clients.
var (
clients = make(map[*websocket.Conn]bool)
clientsMu sync.Mutex
)
func main() {
// Setup signal handling for graceful shutdown.
stopper := make(chan os.Signal, 1)
signal.Notify(stopper, os.Interrupt, syscall.SIGTERM)
// Load the compiled eBPF objects.
objs := bpfObjects{}
if err := loadBpfObjects(&objs, nil); err != nil {
log.Fatalf("loading objects: %v", err)
}
defer objs.Close()
// Attach the kprobe.
kp, err := link.Kprobe("tcp_v4_connect", objs.KprobeTcpV4Connect, nil)
if err != nil {
log.Fatalf("attaching kprobe: %s", err)
}
defer kp.Close()
// Open the ring buffer from the eBPF map.
rd, err := ringbuf.NewReader(objs.Rb)
if err != nil {
log.Fatalf("opening ringbuf reader: %s", err)
}
defer rd.Close()
// Setup Kubernetes client
// In-cluster config will be used when running inside a pod.
config, err := rest.InClusterConfig()
if err != nil {
log.Fatalf("failed to get in-cluster config: %v", err)
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
log.Fatalf("failed to create k8s clientset: %v", err)
}
// Create a pod cache for faster lookups
podCache := NewPodCache(clientset)
go podCache.Run()
// Goroutine to read from the ring buffer and broadcast to clients.
go func() {
nodeName := os.Getenv("NODE_NAME")
if nodeName == "" {
nodeName = "unknown"
}
for {
record, err := rd.Read()
if err != nil {
if errors.Is(err, ringbuf.ErrClosed) {
log.Println("Received signal, exiting..")
return
}
log.Printf("reading from reader: %s", err)
continue
}
var event Event
if err := binary.Read(bytes.NewBuffer(record.RawSample), binary.LittleEndian, &event); err != nil {
log.Printf("parsing ringbuf event: %s", err)
continue
}
// Enrich the data
// This is a simplified enrichment. A production system would be more robust.
sourcePod, sourceNS := podCache.GetPodForPID(int(event.PID))
enriched := EnrichedEvent{
Timestamp: time.Now().Format(time.RFC3339),
NodeName: nodeName,
SourcePod: sourcePod,
SourceNS: sourceNS,
DestAddress: intToIP(event.Daddr).String(),
DestPort: event.Dport,
ProcessName: string(bytes.TrimRight(event.Comm[:], "\x00")),
}
// Broadcast to all connected WebSocket clients.
clientsMu.Lock()
for client := range clients {
if err := client.WriteJSON(enriched); err != nil {
log.Printf("websocket write error: %v", err)
client.Close()
delete(clients, client)
}
}
clientsMu.Unlock()
}
}()
// Setup and start the WebSocket server.
http.HandleFunc("/ws", handleConnections)
go func() {
log.Println("http server started on :8080")
if err := http.ListenAndServe(":8080", nil); err != nil {
log.Fatalf("failed to start http server: %v", err)
}
}()
// Wait for shutdown signal.
<-stopper
log.Println("Shutting down.")
rd.Close()
}
func handleConnections(w http.ResponseWriter, r *http.Request) {
ws, err := upgrader.Upgrade(w, r, nil)
if err != nil {
		log.Printf("websocket upgrade failed: %v", err)
		return
}
defer ws.Close()
clientsMu.Lock()
clients[ws] = true
clientsMu.Unlock()
log.Println("client connected")
// The read loop is necessary to detect when a client disconnects.
for {
_, _, err := ws.ReadMessage()
if err != nil {
clientsMu.Lock()
delete(clients, ws)
clientsMu.Unlock()
log.Printf("client disconnected: %v", err)
break
}
}
}
// Simple utility to convert uint32 IP to net.IP
func intToIP(ipInt uint32) net.IP {
ip := make(net.IP, 4)
binary.LittleEndian.PutUint32(ip, ipInt)
return ip
}
// NOTE: The PodCache implementation is omitted from this listing. It keeps a
// local cache of pods via a Kubernetes informer and resolves a PID to a pod by
// reading the pod UID out of /proc/<pid>/cgroup on the host (the agent runs
// with hostPID: true). This avoids hitting the API server for every single
// event, which is critical for performance.
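Since that cache is where most of the enrichment value lives, here is a minimal sketch of how it could be structured. The exported names (NewPodCache, Run, GetPodForPID) are the ones main.go expects; everything inside, the informer-backed map keyed by pod UID and especially the cgroup regex, is a simplified assumption you would want to harden for your container runtime and cgroup driver.

// podcache.go: a sketch of the pod metadata cache used by the agent above.
package main

import (
	"fmt"
	"os"
	"regexp"
	"strings"
	"sync"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// podUIDRe pulls the pod UID out of a cgroup path. It covers both the cgroupfs
// form (".../kubepods/burstable/pod<uid>/...") and the systemd form, where the
// UID's dashes are replaced by underscores.
var podUIDRe = regexp.MustCompile(`pod([0-9a-f]{8}[-_][0-9a-f]{4}[-_][0-9a-f]{4}[-_][0-9a-f]{4}[-_][0-9a-f]{12})`)

type PodCache struct {
	mu      sync.RWMutex
	byUID   map[string]*corev1.Pod
	factory informers.SharedInformerFactory
}

func NewPodCache(clientset kubernetes.Interface) *PodCache {
	pc := &PodCache{
		byUID:   make(map[string]*corev1.Pod),
		factory: informers.NewSharedInformerFactory(clientset, 30*time.Second),
	}
	inf := pc.factory.Core().V1().Pods().Informer()
	inf.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { pc.upsert(obj) },
		UpdateFunc: func(_, obj interface{}) { pc.upsert(obj) },
		DeleteFunc: func(obj interface{}) {
			if pod, ok := obj.(*corev1.Pod); ok {
				pc.mu.Lock()
				delete(pc.byUID, string(pod.UID))
				pc.mu.Unlock()
			}
		},
	})
	return pc
}

func (pc *PodCache) upsert(obj interface{}) {
	if pod, ok := obj.(*corev1.Pod); ok {
		pc.mu.Lock()
		pc.byUID[string(pod.UID)] = pod
		pc.mu.Unlock()
	}
}

// Run starts the informer and blocks, so the agent calls it in a goroutine.
func (pc *PodCache) Run() {
	stop := make(chan struct{})
	pc.factory.Start(stop)
	pc.factory.WaitForCacheSync(stop)
	<-stop
}

// GetPodForPID maps a host PID to (pod name, namespace) by reading the pod UID
// out of /proc/<pid>/cgroup. This relies on the DaemonSet running with hostPID.
func (pc *PodCache) GetPodForPID(pid int) (string, string) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/cgroup", pid))
	if err != nil {
		return "unknown", "unknown" // process already gone, or not visible
	}
	m := podUIDRe.FindStringSubmatch(string(data))
	if m == nil {
		return "host-process", "" // not in a pod cgroup, e.g. kubelet itself
	}
	uid := strings.ReplaceAll(m[1], "_", "-") // undo the systemd driver's underscores
	pc.mu.RLock()
	defer pc.mu.RUnlock()
	if pod, ok := pc.byUID[uid]; ok {
		return pod.Name, pod.Namespace
	}
	return "unknown", "unknown"
}

The important property is that the per-event lookup never touches the API server; only the informer's watch stream does.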
Step 3: Kubernetes Deployment
To deploy this agent, we use a DaemonSet. This ensures one instance of our agent pod runs on each node in the cluster. Crucially, we need to give the pod hostPID: true and a privileged security context so it can load eBPF programs and read the host’s process IDs. In a real-world scenario, you’d use Linux capabilities (CAP_BPF, CAP_PERFMON, etc.) for a more fine-grained security posture instead of full privileged: true. The agent also needs a ServiceAccount bound to RBAC rules that allow it to list and watch pods, since it uses the in-cluster client-go config to enrich events with pod metadata (omitted here for brevity).
Here’s the daemonset.yaml:
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: net-observer-agent
namespace: observability
spec:
selector:
matchLabels:
name: net-observer-agent
template:
metadata:
labels:
name: net-observer-agent
spec:
tolerations:
- operator: Exists
hostNetwork: true
hostPID: true
containers:
- name: agent
image: your-repo/net-observer-agent:latest # Replace with your image
securityContext:
privileged: true # Required for BPF. Use capabilities in production.
env:
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
volumeMounts:
- name: bpf-fs
mountPath: /sys/fs/bpf
readOnly: false
volumes:
- name: bpf-fs
hostPath:
path: /sys/fs/bpf
---
apiVersion: v1
kind: Service
metadata:
name: net-observer-service
namespace: observability
spec:
selector:
name: net-observer-agent
ports:
- protocol: TCP
port: 80
targetPort: 8080
This configuration deploys our agent. Now we need a front end to consume the WebSocket stream.
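Before building that front end, it is worth sanity-checking the stream directly. The throwaway client below is a sketch: it assumes the agent’s WebSocket has been made reachable on localhost:8080 (for example with a kubectl port-forward against the Service above), dials /ws with gorilla/websocket, and prints every enriched event it receives.

// wscheck.go: a throwaway client used to eyeball the agent's stream.
// Assumes the WebSocket is reachable on ws://localhost:8080/ws; adjust the URL
// for your environment.
package main

import (
	"log"
	"os"
	"os/signal"

	"github.com/gorilla/websocket"
)

func main() {
	interrupt := make(chan os.Signal, 1)
	signal.Notify(interrupt, os.Interrupt)

	conn, _, err := websocket.DefaultDialer.Dial("ws://localhost:8080/ws", nil)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	done := make(chan struct{})
	go func() {
		defer close(done)
		for {
			// Each message is one EnrichedEvent, JSON-encoded by the agent.
			_, msg, err := conn.ReadMessage()
			if err != nil {
				log.Printf("read: %v", err)
				return
			}
			log.Printf("event: %s", msg)
		}
	}()

	select {
	case <-interrupt:
		log.Println("interrupted, closing")
	case <-done:
	}
}

Seeing real pod names and destinations scroll past here is the quickest confirmation that the eBPF probe, the ring buffer, and the metadata enrichment are all wired up correctly.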
Step 4: The Astro and Ant Design Front End
The front-end setup was where we anticipated friction, but Astro’s component model handled it surprisingly well.
First, we create a new Astro project and add the React and Ant Design dependencies:
npx astro add react
npm install antd
The main page (src/pages/index.astro) will be simple. It provides the overall page layout and, critically, imports our React-based dashboard component as a client-side island.
---
import Layout from '../layouts/Layout.astro';
import Dashboard from '../components/Dashboard';
---
<Layout title="K8s Network Observer">
<main>
<h1>Real-Time Kubernetes Network Connections</h1>
<p>Live data captured from node agents via eBPF.</p>
<!-- This is the key part. 'client:load' tells Astro to hydrate this
React component as soon as the page loads. It will be the only
interactive part of this page. -->
<Dashboard client:load />
</main>
</Layout>
<style>
main {
margin: auto;
padding: 1.5rem;
max-width: 120ch;
}
</style>
The real work happens inside src/components/Dashboard.tsx. This is a standard React component that will manage the WebSocket connection and render the data using Ant Design’s Table.
import React, { useState, useEffect, useRef } from 'react';
import { Table, Tag } from 'antd';
import type { ColumnsType } from 'antd/es/table';
import 'antd/dist/reset.css'; // Ant Design styles
// This type must match the JSON structure from our Go agent.
interface ConnectionEvent {
key: string;
timestamp: string;
nodeName: string;
sourcePod: string;
sourceNamespace: string;
destAddress: string;
destPort: number;
processName: string;
}
const columns: ColumnsType<ConnectionEvent> = [
{ title: 'Timestamp', dataIndex: 'timestamp', key: 'timestamp', width: 200 },
{ title: 'Node', dataIndex: 'nodeName', key: 'nodeName', width: 150 },
{ title: 'Source Namespace', dataIndex: 'sourceNamespace', key: 'sourceNamespace', width: 150 },
{ title: 'Source Pod', dataIndex: 'sourcePod', key: 'sourcePod', width: 250 },
{ title: 'Process', dataIndex: 'processName', key: 'processName', width: 120 },
{
title: 'Destination',
dataIndex: 'destAddress',
key: 'destAddress',
render: (text, record) => `${record.destAddress}:${record.destPort}`,
},
];
const MAX_EVENTS = 500; // Limit the number of events in the table to prevent browser slowdown.
const Dashboard: React.FC = () => {
const [events, setEvents] = useState<ConnectionEvent[]>([]);
const [isConnected, setIsConnected] = useState(false);
const ws = useRef<WebSocket | null>(null);
  useEffect(() => {
    let cancelled = false; // guards against reconnect attempts after unmount
// The WebSocket URL would point to our Kubernetes Ingress.
// For local dev, you might port-forward the service.
const WS_URL = 'ws://localhost:8080/ws'; // Replace with your service URL
    function connect() {
      if (cancelled) return; // a reconnect timer may fire after unmount
      ws.current = new WebSocket(WS_URL);
ws.current.onopen = () => {
console.log('WebSocket Connected');
setIsConnected(true);
};
ws.current.onmessage = (event) => {
try {
const data = JSON.parse(event.data);
const newEvent: ConnectionEvent = { ...data, key: `${data.timestamp}-${Math.random()}` };
setEvents(prevEvents => {
const updatedEvents = [newEvent, ...prevEvents];
if (updatedEvents.length > MAX_EVENTS) {
return updatedEvents.slice(0, MAX_EVENTS);
}
return updatedEvents;
});
} catch (error) {
console.error('Failed to parse message:', error);
}
};
      ws.current.onclose = () => {
        setIsConnected(false);
        if (!cancelled) {
          console.log('WebSocket Disconnected. Reconnecting...');
          setTimeout(connect, 3000); // Attempt to reconnect after 3 seconds.
        }
      };
ws.current.onerror = (err) => {
console.error('WebSocket Error:', err);
ws.current?.close();
};
}
connect();
    return () => {
      cancelled = true;
      ws.current?.close();
    };
}, []);
return (
<div>
<div>
<strong>Status:</strong>
<Tag color={isConnected ? 'green' : 'red'}>
{isConnected ? 'Connected' : 'Disconnected'}
</Tag>
</div>
<Table
columns={columns}
dataSource={events}
size="small"
pagination={false}
scroll={{ y: 600 }}
style={{ marginTop: '20px' }}
/>
</div>
);
};
export default Dashboard;
The final result is a clean, performant dashboard. The Astro shell loads instantly, and the Dashboard component, our interactive island, takes over to fetch and display the live stream of network data from the eBPF agents running across the cluster.
Lingering Issues and Future Iterations
This solution successfully met our core requirement for zero-instrumentation network visibility. However, in its current state, it has clear limitations. The first is that it only captures the initiation of IPv4 TCP connections. It doesn’t track connection duration, data volume, IPv6, or UDP/ICMP traffic. Extending the eBPF program to hook tcp_close, tcp_v6_connect, and other protocols is a logical next step.
The architecture itself has scalability boundaries. Every front-end client connects directly to a WebSocket service that fans out to all node agents. A better approach for larger clusters would be to have the agents publish events to a message bus like Kafka or NATS. A central aggregation service could then consume from this bus, perform more complex processing, and serve the data to clients. This would decouple the front end from the agents and provide a buffer for event spikes.
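We haven’t built that yet, but the agent-side change is small. The sketch below shows what a NATS publisher could look like inside the agent’s package, replacing the WebSocket broadcast in the ring-buffer loop; the NATS URL, the subject, and the nats.go dependency are assumptions for this future iteration.

// publisher.go: a sketch of swapping the WebSocket fan-out for a NATS publisher.
// It lives alongside main.go, so EnrichedEvent is the struct defined there.
package main

import (
	"encoding/json"
	"log"

	"github.com/nats-io/nats.go"
)

type EventPublisher struct {
	nc      *nats.Conn
	subject string
}

func NewEventPublisher(url, subject string) (*EventPublisher, error) {
	// Keep reconnecting forever so a NATS restart doesn't take the agent down.
	nc, err := nats.Connect(url, nats.MaxReconnects(-1))
	if err != nil {
		return nil, err
	}
	return &EventPublisher{nc: nc, subject: subject}, nil
}

// Publish takes the place of the clients loop in the ring-buffer goroutine:
// instead of writing to every WebSocket, the agent hands the event to the bus
// and lets a central aggregator fan it out to consumers.
func (p *EventPublisher) Publish(e EnrichedEvent) {
	data, err := json.Marshal(e)
	if err != nil {
		log.Printf("marshal event: %v", err)
		return
	}
	if err := p.nc.Publish(p.subject, data); err != nil {
		log.Printf("publish event: %v", err)
	}
}

func (p *EventPublisher) Close() {
	p.nc.Drain()
}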
Finally, the data is entirely ephemeral. For forensic analysis and long-term trend monitoring, these events need to be persisted. The aggregator service mentioned above would be the perfect place to sink this data into a time-series database or a log analytics platform for historical querying and alerting.
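To make that concrete, here is a rough sketch of the consuming side of such an aggregator: it subscribes to the same placeholder subject, buffers events, and flushes them in batches toward whatever sink we eventually choose; writeBatch is a stand-in that only logs.

// aggregator.go: a sketch of a central service consuming agent events from
// NATS and flushing them in batches toward persistent storage.
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://nats.observability.svc:4222") // placeholder URL
	if err != nil {
		log.Fatalf("connect to NATS: %v", err)
	}
	defer nc.Drain()

	events := make(chan []byte, 4096)
	if _, err := nc.Subscribe("netobserver.events", func(m *nats.Msg) {
		select {
		case events <- m.Data:
		default:
			// Shed load rather than block the NATS callback if the sink stalls.
		}
	}); err != nil {
		log.Fatalf("subscribe: %v", err)
	}

	batch := make([][]byte, 0, 1000)
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case e := <-events:
			batch = append(batch, e)
			if len(batch) >= 1000 {
				writeBatch(batch)
				batch = batch[:0]
			}
		case <-ticker.C:
			if len(batch) > 0 {
				writeBatch(batch)
				batch = batch[:0]
			}
		}
	}
}

// writeBatch is a stand-in for the real sink, e.g. a bulk insert into a
// time-series database or a log analytics platform.
func writeBatch(batch [][]byte) {
	log.Printf("flushing %d events to storage", len(batch))
}

Until something like that exists, the dashboard remains a live window into the cluster rather than a system of record.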