The initial architecture for our real-time analytics platform was causing production incidents. Built on a shared-everything model, a single tenant’s poorly optimized dashboard query could saturate our Cassandra cluster, leading to cascading latency across all customers. Worse, a seemingly innocuous CSS change pushed for one tenant’s custom theme once leaked globally, breaking the UI layout for everyone. Debugging was a nightmare; tracing a single slow API call involved manually correlating logs from our ingress controller, multiple microservices, and the database, a process that could take hours. This operational fragility was a direct threat to the business. We needed a new architecture founded on the principle of “hard” isolation at every single layer: UI, API, network, and database.
Our new design philosophy dictated that a failure or performance issue within one tenant’s scope must be completely contained, invisible to all other tenants. This led to a technology selection process driven by isolation guarantees and observability. For the database, we needed raw performance to handle high-throughput writes and low-latency queries, which pushed us towards ScyllaDB for its thread-per-core architecture, a significant advantage over Cassandra’s JVM-based approach in our benchmarks. At the edge, Apache APISIX was chosen for its performance and extensible plugin system, allowing us to build custom tenant-aware routing and authentication. For the front-end, which was a complex React single-page application, CSS Modules became the non-negotiable choice for guaranteeing build-time style encapsulation, preventing the kind of global CSS leakage we’d experienced before.
The linchpin, however, was the network layer within Kubernetes. Standard Kubernetes NetworkPolicies, while useful, are typically implemented with iptables by the CNI plugin, which becomes unwieldy and inefficient at scale. More importantly, they lack the deep observability needed to diagnose cross-service performance issues. This is where Cilium came in. Its eBPF-based data path promised not only more efficient and powerful network policy enforcement but also kernel-level visibility into traffic flows without requiring any application instrumentation or service mesh sidecars. This was the key to building the unified observability plane our operations team desperately needed.
Foundation: Namespace and Network Isolation with Cilium
The first step was to establish the foundational security posture in our Kubernetes cluster. We decided on a namespace-per-tenant model. This provides a natural boundary for Kubernetes resources like Deployments and Services. However, by default, Kubernetes allows all pods to communicate with each other across namespaces. Our first task was to lock this down with a default-deny policy using Cilium.
We deployed Cilium and then applied the following CiliumClusterwideNetworkPolicy. In a real-world project, it’s critical to start with a default-deny stance and explicitly allow traffic. A common mistake is to operate in a default-allow mode, which inevitably leads to security gaps.
# ccnp-default-deny.yaml
apiVersion: "cilium.io/v2"
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: "default-deny-all"
spec:
  description: "Deny all ingress and egress traffic cluster-wide by default"
  # An empty endpointSelector selects every Cilium-managed endpoint.
  endpointSelector: {}
  # enableDefaultDeny (Cilium >= 1.15) puts the selected endpoints into
  # default-deny for both directions without allowing anything. Beware the
  # classic trap: a rule like `ingress: [{fromEndpoints: [{}]}]` would instead
  # ALLOW traffic from every endpoint in the cluster.
  enableDefaultDeny:
    ingress: true
    egress: true
Applying this policy effectively silences the cluster. No pod can talk to any other pod unless an explicit policy allows it. This is our clean slate.
Next, we create namespaces for two tenants, tenant-alpha and tenant-beta, a shared database namespace for ScyllaDB, and an apisix namespace for the gateway.
kubectl create namespace tenant-alpha
kubectl create namespace tenant-beta
kubectl create namespace database
kubectl create namespace apisix
Now, we can demonstrate the isolation. We deploy a simple netshoot pod in each tenant namespace and try to communicate between them.
kubectl run --namespace=tenant-alpha netshoot-alpha --image=nicolaka/netshoot -- /bin/bash -c "sleep 10000"
kubectl run --namespace=tenant-beta netshoot-beta --image=nicolaka/netshoot -- /bin/bash -c "sleep 10000"
# Get the IP of the beta pod
BETA_POD_IP=$(kubectl get pod -n tenant-beta netshoot-beta -o jsonpath='{.status.podIP}')
# Attempt to ping from alpha to beta
kubectl exec -n tenant-alpha netshoot-alpha -- ping -c 3 $BETA_POD_IP
# This will time out, as expected.
Using Cilium’s Hubble CLI, we can observe these dropped packets in real time. This immediate feedback loop is invaluable during development and troubleshooting.
# In another terminal, after installing hubble cli
hubble observe --namespace tenant-alpha -f --verdict DROPPED
# Output similar to (ping is ICMP, so that is what Hubble reports):
# Oct 27 10:35:01.458: tenant-alpha/netshoot-alpha -> tenant-beta/netshoot-beta
#   ICMPv4 EchoRequest DROPPED (Policy denied)
This confirms our network foundation is secure. No tenant can ever interfere with another at the network level unless we explicitly permit it.
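Depending on your workloads, you may also want a baseline policy in each tenant namespace that re-allows traffic between pods of the same tenant. A minimal sketch (the policy name is our own choice; the key property is that in a namespaced CiliumNetworkPolicy, an empty fromEndpoints selector only matches endpoints in the policy’s own namespace):

# allow-intra-namespace.yaml (one per tenant namespace)
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "allow-intra-namespace"
  namespace: tenant-alpha
spec:
  endpointSelector: {}
  ingress:
  - fromEndpoints:
    # Scoped to tenant-alpha because the policy itself is namespaced.
    - {}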
Deploying and Securing the Data Tier: ScyllaDB
With the network foundation in place, we deployed our ScyllaDB cluster into the database namespace using the Scylla Operator, which simplifies management significantly.
# scylladb-cluster.yaml
apiVersion: scylla.scylladb.com/v1
kind: ScyllaCluster
metadata:
  name: scylla-cluster
  namespace: database
spec:
  version: 5.2.0
  agentVersion: 3.1.0
  datacenter:
    name: dc1
    racks:
    - name: rack-a
      members: 3
      storage:
        capacity: 10Gi
        storageClassName: standard
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "1"
          memory: 2Gi
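Bringing up a rack takes a few minutes. A quick way to watch it, assuming the operator’s usual cluster-datacenter-rack StatefulSet naming:

kubectl -n database rollout status statefulset/scylla-cluster-dc1-rack-a
kubectl -n database get pods -l app.kubernetes.io/name=scylla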
Once the cluster is up, its client service is exposed within the database namespace. Our default-deny policy means nothing can connect to it yet; we need a specific CiliumNetworkPolicy to allow our future tenant backend services to access the database.
A key principle of Cilium policies is matching on labels and identity rather than IP addresses. We’ll define a policy that allows any pod with the label app.kubernetes.io/name: analytics-backend to connect to the ScyllaDB pods on the CQL port (9042). The internode ports (7000/7001) are a different matter: they should only ever be open between Scylla nodes themselves, never to application pods, so they get a policy of their own below.
# scylladb-access-policy.yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "allow-backend-to-scylla"
  namespace: database
spec:
  description: "Allow analytics backends to connect to ScyllaDB over CQL"
  endpointSelector:
    matchLabels:
      app.kubernetes.io/name: scylla
  ingress:
  - fromEndpoints:
    - matchLabels:
        app.kubernetes.io/name: analytics-backend
      # In a namespaced CiliumNetworkPolicy, fromEndpoints only matches pods in
      # the policy's own namespace unless the namespace label is part of the
      # selector. This matchExpression opts in to matching any namespace.
      matchExpressions:
      - key: k8s:io.kubernetes.pod.namespace
        operator: Exists
    toPorts:
    - ports:
      - port: "9042"
        protocol: TCP
This policy is precise. Any pod carrying the analytics-backend label, in any tenant namespace, is granted access to the CQL port, while a pod without that label, even one sitting in the database namespace itself, is blocked. This is identity-based security, far superior to fragile IP-based rules. The namespace matchExpression matters: without it, fromEndpoints in a namespaced policy only matches pods in the policy’s own namespace, and our backends would silently be denied.
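Two companion policies are easy to forget under a cluster-wide default deny. First, ingress rules alone are not enough: the backends’ egress is still denied, so they need permission to resolve DNS and to dial out to ScyllaDB. Second, the Scylla nodes must be allowed to talk to each other on the internode ports. A sketch of both, assuming standard kube-dns labels (Scylla also needs its own egress and DNS rules, trimmed here for brevity):

# tenant-backend-egress-policy.yaml (repeat per tenant namespace)
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "allow-backend-egress"
  namespace: tenant-alpha
spec:
  endpointSelector:
    matchLabels:
      app.kubernetes.io/name: analytics-backend
  egress:
  - toEndpoints:
    - matchLabels:
        k8s:io.kubernetes.pod.namespace: kube-system
        k8s-app: kube-dns
    toPorts:
    - ports:
      - port: "53"
        protocol: UDP
  - toEndpoints:
    - matchLabels:
        k8s:io.kubernetes.pod.namespace: database
        app.kubernetes.io/name: scylla
    toPorts:
    - ports:
      - port: "9042"
        protocol: TCP
---
# scylla-internode-policy.yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "allow-scylla-internode"
  namespace: database
spec:
  endpointSelector:
    matchLabels:
      app.kubernetes.io/name: scylla
  ingress:
  - fromEndpoints:
    - matchLabels:
        app.kubernetes.io/name: scylla
    toPorts:
    - ports:
      - port: "7000"
        protocol: TCP
      - port: "7001"
        protocol: TCP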
The Gateway: Tenant-Aware Routing with APISIX
APISIX, running in its own apisix namespace, will be our single point of entry. We’ll configure it to inspect JWT tokens, identify the tenant from a custom claim (tid), and then route the request to the correct backend service in the corresponding tenant namespace.
First, we define ApisixConsumer objects for each tenant, which hold their credentials. In this case, we’re using the jwt-auth plugin.
# apisix-consumers.yaml
apiVersion: apisix.apache.org/v2
kind: ApisixConsumer
metadata:
  name: tenant-alpha-consumer
  namespace: apisix
spec:
  authParameter:
    jwtAuth:
      value:
        key: tenant-alpha
        secret: tenant-alpha-secret-key
---
apiVersion: apisix.apache.org/v2
kind: ApisixConsumer
metadata:
  name: tenant-beta-consumer
  namespace: apisix
spec:
  authParameter:
    jwtAuth:
      value:
        key: tenant-beta
        secret: tenant-beta-secret-key
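Embedding the signing secret in the CR works for a walkthrough, but it puts credentials in Git. The ingress controller also supports referencing a Kubernetes Secret; a hedged sketch, assuming the secretRef form of authParameter in the v2 API:

# tenant-alpha-consumer-secretref.yaml
apiVersion: v1
kind: Secret
metadata:
  name: tenant-alpha-jwt
  namespace: apisix
stringData:
  key: tenant-alpha
  secret: tenant-alpha-secret-key
---
apiVersion: apisix.apache.org/v2
kind: ApisixConsumer
metadata:
  name: tenant-alpha-consumer
  namespace: apisix
spec:
  authParameter:
    jwtAuth:
      secretRef:
        name: tenant-alpha-jwt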
Now, the core routing logic. We create a single ApisixRoute that handles all traffic to /api/v1/metrics. The magic lies in APISIX’s variables and expression-based routing: the upstream name is constructed dynamically from the jwt_claim_tid variable, populated once the token has been validated. Be aware that stock APISIX does not expose JWT claims as routing variables out of the box; depending on your version, this takes a small custom plugin or a serverless pre-function.
# apisix-tenant-route.yaml
apiVersion: apisix.apache.org/v2
kind: ApisixRoute
metadata:
  name: analytics-api-route
  namespace: apisix
spec:
  http:
  - name: tenant-routing
    match:
      hosts:
      - "api.analytics.local"
      paths:
      - "/api/v1/metrics"
    backends:
    - serviceName: analytics-backend-{{jwt_claim_tid}}
      servicePort: 8080
    authentication:
      enable: true
      type: jwtAuth
    plugins:
    - name: "proxy-rewrite"
      enable: true
      config:
        # We rewrite the upstream dynamically based on the tenant ID claim in
        # the JWT. Note that upstream_name is not a stock proxy-rewrite option;
        # it is wired up by the claim-to-variable plugin mentioned above.
        # APISIX needs RBAC permissions to discover services cluster-wide, and
        # because the services live outside the apisix namespace we reference
        # the fully qualified domain name (FQDN).
        upstream_name: "analytics-backend-{{jwt_claim_tid}}.tenant-{{jwt_claim_tid}}.svc.cluster.local"
The pitfall here is that service discovery for this dynamic upstream is tricky. The expression analytics-backend-{{jwt_claim_tid}}.tenant-{{jwt_claim_tid}}.svc.cluster.local constructs the FQDN of the backend service, and for it to resolve, the APISIX instance must have the necessary permissions to discover services across different namespaces.
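For testing, a token only needs the consumer’s key claim plus our custom tid claim, signed with the consumer’s secret. A sketch using the jwt-cli tool (any JWT library works; the claim names follow the consumers defined above):

# Mint an HS256 token for tenant-alpha. jwt-auth matches the "key" claim
# against the ApisixConsumer; "tid" is our custom tenant claim.
TOKEN=$(jwt encode --secret tenant-alpha-secret-key '{"key": "tenant-alpha", "tid": "alpha"}')
echo $TOKEN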
Finally, we need another CiliumNetworkPolicy to allow the APISIX pods to communicate with the backend services in the tenant namespaces.
# allow-apisix-to-backends.yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "allow-apisix-to-backends-alpha"
  namespace: tenant-alpha
spec:
  endpointSelector:
    matchLabels:
      app.kubernetes.io/name: analytics-backend
      tenant: alpha
  ingress:
  - fromEndpoints:
    - matchLabels:
        app.kubernetes.io/name: apisix-gateway
        # Cilium selectors have no namespaceSelector field; the namespace is
        # itself a label on the endpoint's identity.
        k8s:io.kubernetes.pod.namespace: apisix
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
# The equivalent policy for tenant-beta follows.
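The mirror policy for tenant-beta differs only in its namespace and tenant label:

# allow-apisix-to-backends-beta.yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "allow-apisix-to-backends-beta"
  namespace: tenant-beta
spec:
  endpointSelector:
    matchLabels:
      app.kubernetes.io/name: analytics-backend
      tenant: beta
  ingress:
  - fromEndpoints:
    - matchLabels:
        app.kubernetes.io/name: apisix-gateway
        k8s:io.kubernetes.pod.namespace: apisix
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP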
The Tenant Backend Service
Each tenant gets their own isolated backend service. Here is a production-grade Go service that connects to ScyllaDB and exposes a simple metrics endpoint. It includes structured logging, configuration management, and graceful shutdown.
// main.go
package main

import (
	"context"
	"encoding/json"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"

	"github.com/gocql/gocql"
	"github.com/scylladb/gocqlx/v2"
	"github.com/scylladb/gocqlx/v2/table"
	"go.uber.org/zap"
)

// Metric represents a simple time-series data point.
type Metric struct {
	TenantID  string    `db:"tenant_id"`
	Timestamp time.Time `db:"ts"`
	Value     float64   `db:"value"`
}

// metricMetadata specifies the table name and partition/clustering keys.
// In a real application, this would be part of a proper data model package.
var metricMetadata = table.Metadata{
	Name:    "metrics",
	Columns: []string{"tenant_id", "ts", "value"},
	PartKey: []string{"tenant_id"},
	SortKey: []string{"ts"},
}

var metricTable = table.New(metricMetadata)

func main() {
	// Production-grade logging with Zap.
	logger, err := zap.NewProduction()
	if err != nil {
		log.Fatalf("can't initialize zap logger: %v", err)
	}
	defer logger.Sync()
	sugar := logger.Sugar()

	// Configuration from environment variables.
	scyllaHost := os.Getenv("SCYLLA_HOST")
	if scyllaHost == "" {
		scyllaHost = "scylla-cluster-client.database.svc.cluster.local"
	}
	tenantID := os.Getenv("TENANT_ID")
	if tenantID == "" {
		sugar.Fatal("TENANT_ID environment variable is not set")
	}

	// Create a ScyllaDB session.
	cluster := gocql.NewCluster(scyllaHost)
	cluster.Keyspace = "analytics"
	cluster.Consistency = gocql.Quorum
	cluster.Timeout = 5 * time.Second

	sugar.Infof("Connecting to ScyllaDB at %s...", scyllaHost)
	session, err := gocqlx.WrapSession(cluster.CreateSession())
	if err != nil {
		sugar.Fatalf("failed to connect to scylla: %v", err)
	}
	defer session.Close()
	sugar.Info("ScyllaDB connection successful")

	// Set up HTTP server and handler.
	mux := http.NewServeMux()
	mux.HandleFunc("/api/v1/metrics", metricsHandler(session, tenantID, sugar))
	server := &http.Server{
		Addr:    ":8080",
		Handler: mux,
	}

	// Graceful shutdown handling.
	go func() {
		sugar.Infof("Server starting on port 8080 for tenant %s", tenantID)
		if err := server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			sugar.Fatalf("could not listen on :8080: %v", err)
		}
	}()

	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGINT, syscall.SIGTERM)
	<-stop

	sugar.Info("Shutting down server...")
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := server.Shutdown(ctx); err != nil {
		sugar.Fatalf("server shutdown failed: %+v", err)
	}
	sugar.Info("Server exited properly")
}

func metricsHandler(session gocqlx.Session, tenantID string, logger *zap.SugaredLogger) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		logger.Infow("received metrics request", "tenant", tenantID, "remote_addr", r.RemoteAddr)

		// Query ScyllaDB for metrics for this specific tenant. The service is
		// hard-wired to its tenant ID, so it cannot query another tenant's data.
		q := session.Query(metricTable.Select()).BindMap(map[string]interface{}{
			"tenant_id": tenantID,
		})
		var metrics []Metric
		if err := q.Select(&metrics); err != nil {
			logger.Errorw("failed to query metrics from ScyllaDB", "error", err, "tenant", tenantID)
			http.Error(w, "Internal Server Error", http.StatusInternalServerError)
			return
		}

		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusOK)
		if err := json.NewEncoder(w).Encode(metrics); err != nil {
			logger.Errorw("failed to encode response", "error", err, "tenant", tenantID)
		}
	}
}
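The code assumes an analytics keyspace with a metrics table partitioned by tenant, which is what guarantees one tenant’s queries never touch another tenant’s partitions. For reference, a matching schema (the replication settings are illustrative; dc1 is the datacenter defined in the ScyllaCluster manifest above):

CREATE KEYSPACE IF NOT EXISTS analytics
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};

CREATE TABLE IF NOT EXISTS analytics.metrics (
  tenant_id text,
  ts        timestamp,
  value     double,
  PRIMARY KEY ((tenant_id), ts)
) WITH CLUSTERING ORDER BY (ts DESC);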
We build this into a container and create a Kubernetes Deployment for each tenant in their respective namespace, making sure to apply the correct labels.
# tenant-alpha-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: analytics-backend-alpha
  namespace: tenant-alpha
  labels:
    app.kubernetes.io/name: analytics-backend
    tenant: alpha
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: analytics-backend
      tenant: alpha
  template:
    metadata:
      labels:
        app.kubernetes.io/name: analytics-backend
        tenant: alpha
    spec:
      containers:
      - name: backend
        image: your-repo/analytics-backend:1.0.0
        ports:
        - containerPort: 8080
        env:
        - name: TENANT_ID
          value: "alpha"
        - name: SCYLLA_HOST
          value: "scylla-cluster-client.database.svc.cluster.local"
---
apiVersion: v1
kind: Service
metadata:
  name: analytics-backend-alpha
  namespace: tenant-alpha
spec:
  selector:
    app.kubernetes.io/name: analytics-backend
    tenant: alpha
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 8080
A corresponding set of manifests is created for tenant-beta. Now our backend is fully isolated.
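Two quick checks validate the whole chain. The positive path goes through the gateway with the token minted earlier (the gateway address placeholder depends on how APISIX is exposed in your cluster); the negative path tries to reach the beta backend directly from the alpha tenant’s netshoot pod and should be dropped, since no policy admits it:

# Positive path: through APISIX with a tenant-alpha JWT
curl -H "Host: api.analytics.local" -H "Authorization: ${TOKEN}" \
  http://<apisix-gateway-address>/api/v1/metrics

# Negative path: direct cross-tenant access is dropped by Cilium
BETA_SVC_IP=$(kubectl get svc -n tenant-beta analytics-backend-beta -o jsonpath='{.spec.clusterIP}')
kubectl exec -n tenant-alpha netshoot-alpha -- \
  curl -s --max-time 3 http://${BETA_SVC_IP}:8080/api/v1/metrics \
  || echo "blocked, as expected"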
Final Layer: UI Styling Isolation with CSS Modules
On the front-end, tenants can upload their own color schemes and logos. The biggest fear is a tenant theme defining a broad CSS rule like div { background-color: black !important; } that breaks the UI for everyone. Our defense is two-fold: CSS Modules give every component compile-time class-name encapsulation, and tenant customization is confined to CSS variables rather than arbitrary stylesheets.
Consider a simple React chart component:
// components/Chart/Chart.js
import React from 'react';
import styles from './Chart.module.css';

// The 'styles' object contains class names that are unique to this component.
// For example, styles.chartContainer might become 'Chart_chartContainer__2g5Zk'.
const Chart = ({ title, data }) => {
  return (
    <div className={styles.chartContainer}>
      <h3 className={styles.chartTitle}>{title}</h3>
      {/* Chart rendering logic here */}
    </div>
  );
};

export default Chart;
The corresponding CSS file uses standard syntax:
/* components/Chart/Chart.module.css */
.chartContainer {
  border: 1px solid #ccc;
  padding: 16px;
  border-radius: 8px;
  background-color: var(--chart-background-color, #fff); /* Uses CSS variables for theming */
}

.chartTitle {
  font-size: 1.2em;
  color: var(--chart-title-color, #333);
  margin-top: 0;
}
During the build process (via Webpack or Vite), Chart.module.css is processed: the class .chartContainer is transformed into a unique hash, like Chart_chartContainer__2g5Zk, and the styles object imported into the component maps the original name to the hashed one. This makes it practically impossible for one component’s class names to collide with another’s. Note that CSS Modules scope class names, not elements: a global div { } rule would still apply everywhere, which is why tenant themes may only set the CSS variables consumed above and never ship their own selectors. A common mistake is to mix global CSS with CSS Modules without a clear strategy, which negates the benefits. Our rule is simple: all component styles must be in a .module.css file. Global styles are only for foundational elements like fonts and resets.
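The var() defaults above hint at how a tenant “theme” is applied: it is just a set of CSS custom property values set on the document root. A minimal sketch (the allow-list and function name are ours, not a library API):

// theme.js -- apply a tenant palette by setting CSS custom properties on :root.
// Tenants supply values for an allow-listed set of variables, never selectors,
// so a theme cannot ship a rule like `div { background: black !important; }`.
const ALLOWED_THEME_VARS = new Set([
  '--chart-background-color',
  '--chart-title-color',
]);

export function applyTenantTheme(theme) {
  for (const [name, value] of Object.entries(theme)) {
    if (!ALLOWED_THEME_VARS.has(name)) continue;  // drop unknown variables
    if (!CSS.supports('color', value)) continue;  // drop non-color payloads
    document.documentElement.style.setProperty(name, value);
  }
}

// Usage: applyTenantTheme({ '--chart-background-color': '#0b1e33' });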
The Observability Breakthrough: Tying It All Together
The system is now fully isolated. But how do we debug it? A tenant reports their dashboard is slow.
This is where the investment in Cilium and its eBPF-based observability pays off. We can use Hubble to trace a request from the moment it leaves APISIX to the backend and from the backend to ScyllaDB, all without adding a single line of tracing code to our applications.
graph TD
    User --JWT for Tenant Alpha--> A[api.analytics.local];
    A --> B{APISIX Ingress};
    B --Valid JWT, tid=alpha--> C[Service: analytics-backend-alpha];
    C --Checks Cilium Policy--> D{Policy Allows};
    D --> E[Pod: analytics-backend-alpha-xyz];
    E --CQL Query for tenant_id=alpha--> F[Service: scylla-cluster];
    F --Checks Cilium Policy--> G{Policy Allows};
    G --> H[Pod: scylla-cluster-dc1-rack-a-0];
    H --Returns Data--> E;
    E --Returns JSON--> B;
    B --Returns JSON--> User;

    subgraph "Namespace: apisix"
        B
    end
    subgraph "Namespace: tenant-alpha"
        C
        E
    end
    subgraph "Namespace: database"
        F
        H
    end
We run a Hubble query to observe traffic flowing from the APISIX pod to the tenant-alpha backend:
hubble observe --from-pod apisix/apisix-7f8d9... --to-namespace tenant-alpha --verdict FORWARDED --protocol http
The output gives us HTTP-level visibility (method, path, status code, latency) for that specific network hop, extracted directly from kernel space by eBPF.
Oct 27 11:15:01.123 [FORWARDED] apisix/apisix-7f8d9... -> tenant-alpha/analytics-backend-alpha-5b6c7... L7/HTTP Request
Path: /api/v1/metrics
Method: GET
Latency: 250ms
A 250ms latency for an internal service call is high, so we drill down into the next hop: from the analytics-backend-alpha pod to ScyllaDB.
hubble observe --from-pod tenant-alpha/analytics-backend-alpha-5b6c7... --to-pod database/scylla-cluster-dc1-rack-a-0 -f
Hubble might not parse the CQL protocol by default, but it gives us invaluable TCP-level information. We might see a high number of TCP retransmits or slow SYN/SYN-ACK handshakes, indicating either network congestion or a ScyllaDB pod too busy to accept new connections quickly. This points us directly at the database as the bottleneck. Armed with this information, we can then use ScyllaDB’s own monitoring tools to investigate the specific query patterns for tenant-alpha, having already ruled out the network path and the API gateway as the source of the delay. This entire diagnostic process took minutes, not hours.
The current implementation achieves our primary goals of isolation and observability. However, running a separate deployment for every tenant is not resource-efficient and can lead to pod sprawl. A potential future iteration is to move to a single, multi-tenant backend service that performs application-level data isolation. This would require rigorous code reviews and testing to prevent data leakage, but would be more cost-effective. Cilium’s identity-based policies would still be crucial here, perhaps enforcing that only the unified backend service identity can talk to the database, while still isolating tenants at the ingress level via APISIX. Furthermore, the tenant onboarding process is still manual; building a GitOps workflow to automatically provision namespaces, policies, and deployments is the next critical step toward operational maturity.