The initial architecture for our real-time analytics platform was causing production incidents. Built on a shared-everything model, a single tenant’s poorly optimized dashboard query could saturate our Cassandra cluster, leading to cascading latency across all customers. Worse, a seemingly innocuous CSS change pushed for one tenant’s custom theme once leaked globally, breaking the UI layout for everyone. Debugging was a nightmare; tracing a single slow API call involved manually correlating logs from our ingress controller, multiple microservices, and the database, a process that could take hours. This operational fragility was a direct threat to the business. We needed a new architecture founded on the principle of “hard” isolation at every single layer: UI, API, network, and database.
Our new design philosophy dictated that a failure or performance issue within one tenant’s scope must be completely contained, invisible to all other tenants. This led to a technology selection process driven by isolation guarantees and observability. For the database, we needed raw performance to handle high-throughput writes and low-latency queries, which pushed us towards ScyllaDB for its thread-per-core architecture, a significant advantage over Cassandra’s JVM-based approach in our benchmarks. At the edge, Apache APISIX was chosen for its performance and extensible plugin system, allowing us to build custom tenant-aware routing and authentication. For the front-end, which was a complex React single-page application, CSS Modules became the non-negotiable choice for guaranteeing build-time style encapsulation, preventing the kind of global CSS leakage we’d experienced before.
The linchpin, however, was the network layer within Kubernetes. Standard Kubernetes NetworkPolicies, while useful, are typically implemented with iptables by the CNI plugin, which becomes unwieldy and inefficient at scale. More importantly, they lack the deep observability needed to diagnose cross-service performance issues. This is where Cilium came in. Its eBPF-based data path promised not only more efficient and powerful network policy enforcement but also kernel-level visibility into traffic flows without requiring any application instrumentation or service mesh sidecars. This was the key to building the unified observability plane our operations team desperately needed.
Foundation: Namespace and Network Isolation with Cilium
The first step was to establish the foundational security posture in our Kubernetes cluster. We decided on a namespace-per-tenant model. This provides a natural boundary for Kubernetes resources like Deployments and Services. However, by default, Kubernetes allows all pods to communicate with each other across namespaces. Our first task was to lock this down with a default-deny policy using Cilium.
We deployed Cilium and then applied the following CiliumClusterwideNetworkPolicy. In a real-world project, it’s critical to start with a default-deny stance and explicitly allow traffic. A common mistake is to operate in a default-allow mode, which inevitably leads to security gaps.
# ccnp-default-deny.yaml
apiVersion: "cilium.io/v2"
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: "default-deny-all"
spec:
  description: "Deny all ingress and egress traffic cluster-wide by default"
  # An empty endpointSelector selects every Cilium-managed endpoint.
  endpointSelector: {}
  # enableDefaultDeny (Cilium >= 1.15) puts the selected endpoints into
  # default-deny for both directions without allowing anything. Beware the
  # classic trap: a rule like `ingress: [{fromEndpoints: [{}]}]` would instead
  # ALLOW traffic from every endpoint in the cluster.
  enableDefaultDeny:
    ingress: true
    egress: true
Applying this policy effectively silences the cluster. No pod can talk to any other pod unless an explicit policy allows it. This is our clean slate.
Next, we create namespaces for two tenants, tenant-alpha and tenant-beta, a shared database namespace for ScyllaDB, and an apisix namespace for the gateway.
kubectl create namespace tenant-alpha
kubectl create namespace tenant-beta
kubectl create namespace database
kubectl create namespace apisix
Now, we can demonstrate the isolation. We deploy a simple netshoot pod in each tenant namespace and try to communicate between them.
kubectl run --namespace=tenant-alpha netshoot-alpha --image=nicolaka/netshoot -- /bin/bash -c "sleep 10000"
kubectl run --namespace=tenant-beta netshoot-beta --image=nicolaka/netshoot -- /bin/bash -c "sleep 10000"
# Get the IP of the beta pod
BETA_POD_IP=$(kubectl get pod -n tenant-beta netshoot-beta -o jsonpath='{.status.podIP}')
# Attempt to ping from alpha to beta
kubectl exec -n tenant-alpha netshoot-alpha -- ping -c 3 $BETA_POD_IP
# This will time out, as expected.
Using Cilium’s Hubble CLI, we can observe these dropped packets in real time. This immediate feedback loop is invaluable during development and troubleshooting.
# In another terminal, after installing hubble cli
hubble observe --namespace tenant-alpha -f --verdict DROPPED
# Output similar to (ping is ICMP, so that is what Hubble reports):
# Oct 27 10:35:01.458: tenant-alpha/netshoot-alpha -> tenant-beta/netshoot-beta
#   ICMPv4 EchoRequest DROPPED (Policy denied)
This confirms our network foundation is secure. No tenant can ever interfere with another at the network level unless we explicitly permit it.
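Depending on your workloads, you may also want a baseline policy in each tenant namespace that re-allows traffic between pods of the same tenant. A minimal sketch (the policy name is our own choice; the key property is that in a namespaced CiliumNetworkPolicy, an empty fromEndpoints selector only matches endpoints in the policy’s own namespace):

# allow-intra-namespace.yaml (one per tenant namespace)
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "allow-intra-namespace"
  namespace: tenant-alpha
spec:
  endpointSelector: {}
  ingress:
  - fromEndpoints:
    # Scoped to tenant-alpha because the policy itself is namespaced.
    - {}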
Deploying and Securing the Data Tier: ScyllaDB
With the network foundation in place, we deployed our ScyllaDB cluster into the database namespace using the Scylla Operator, which simplifies management significantly.
# scylladb-cluster.yaml
apiVersion: scylla.scylladb.com/v1
kind: ScyllaCluster
metadata:
  name: scylla-cluster
  namespace: database
spec:
  version: 5.2.0
  agentVersion: 3.1.0
  datacenter:
    name: dc1
    racks:
    - name: rack-a
      members: 3
      storage:
        capacity: 10Gi
        storageClassName: standard
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "1"
          memory: 2Gi
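Bringing up a rack takes a few minutes. A quick way to watch it, assuming the operator’s usual cluster-datacenter-rack StatefulSet naming:

kubectl -n database rollout status statefulset/scylla-cluster-dc1-rack-a
kubectl -n database get pods -l app.kubernetes.io/name=scylla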
Once the cluster is up, its client service is exposed within the database namespace. Our default-deny policy means nothing can connect to it yet; we need a specific CiliumNetworkPolicy to allow our future tenant backend services to access the database.
A key principle of Cilium policies is matching on labels and identity rather than IP addresses. We’ll define a policy that allows any pod with the label app.kubernetes.io/name: analytics-backend to connect to the ScyllaDB pods on the CQL port (9042). The internode ports (7000/7001) are a different matter: they should only ever be open between Scylla nodes themselves, never to application pods, so they get a policy of their own below.
# scylladb-access-policy.yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "allow-backend-to-scylla"
  namespace: database
spec:
  description: "Allow analytics backends to connect to ScyllaDB over CQL"
  endpointSelector:
    matchLabels:
      app.kubernetes.io/name: scylla
  ingress:
  - fromEndpoints:
    - matchLabels:
        app.kubernetes.io/name: analytics-backend
      # In a namespaced CiliumNetworkPolicy, fromEndpoints only matches pods in
      # the policy's own namespace unless the namespace label is part of the
      # selector. This matchExpression opts in to matching any namespace.
      matchExpressions:
      - key: k8s:io.kubernetes.pod.namespace
        operator: Exists
    toPorts:
    - ports:
      - port: "9042"
        protocol: TCP
This policy is precise. Any pod carrying the analytics-backend label, in any tenant namespace, is granted access to the CQL port, while a pod without that label, even one sitting in the database namespace itself, is blocked. This is identity-based security, far superior to fragile IP-based rules. The namespace matchExpression matters: without it, fromEndpoints in a namespaced policy only matches pods in the policy’s own namespace, and our backends would silently be denied.
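Two companion policies are easy to forget under a cluster-wide default deny. First, ingress rules alone are not enough: the backends’ egress is still denied, so they need permission to resolve DNS and to dial out to ScyllaDB. Second, the Scylla nodes must be allowed to talk to each other on the internode ports. A sketch of both, assuming standard kube-dns labels (Scylla also needs its own egress and DNS rules, trimmed here for brevity):

# tenant-backend-egress-policy.yaml (repeat per tenant namespace)
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "allow-backend-egress"
  namespace: tenant-alpha
spec:
  endpointSelector:
    matchLabels:
      app.kubernetes.io/name: analytics-backend
  egress:
  - toEndpoints:
    - matchLabels:
        k8s:io.kubernetes.pod.namespace: kube-system
        k8s-app: kube-dns
    toPorts:
    - ports:
      - port: "53"
        protocol: UDP
  - toEndpoints:
    - matchLabels:
        k8s:io.kubernetes.pod.namespace: database
        app.kubernetes.io/name: scylla
    toPorts:
    - ports:
      - port: "9042"
        protocol: TCP
---
# scylla-internode-policy.yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "allow-scylla-internode"
  namespace: database
spec:
  endpointSelector:
    matchLabels:
      app.kubernetes.io/name: scylla
  ingress:
  - fromEndpoints:
    - matchLabels:
        app.kubernetes.io/name: scylla
    toPorts:
    - ports:
      - port: "7000"
        protocol: TCP
      - port: "7001"
        protocol: TCP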
The Gateway: Tenant-Aware Routing with APISIX
APISIX, running in its own apisix namespace, will be our single point of entry. We’ll configure it to inspect JWT tokens, identify the tenant from a custom claim (tid), and then route the request to the correct backend service in the corresponding tenant namespace.
First, we define ApisixConsumer objects for each tenant, which hold their credentials. In this case, we’re using the jwt-auth plugin.
# apisix-consumers.yaml
apiVersion: apisix.apache.org/v2
kind: ApisixConsumer
metadata:
  name: tenant-alpha-consumer
  namespace: apisix
spec:
  authParameter:
    jwtAuth:
      value:
        key: tenant-alpha
        secret: tenant-alpha-secret-key
---
apiVersion: apisix.apache.org/v2
kind: ApisixConsumer
metadata:
  name: tenant-beta-consumer
  namespace: apisix
spec:
  authParameter:
    jwtAuth:
      value:
        key: tenant-beta
        secret: tenant-beta-secret-key
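Embedding the signing secret in the CR works for a walkthrough, but it puts credentials in Git. The ingress controller also supports referencing a Kubernetes Secret; a hedged sketch, assuming the secretRef form of authParameter in the v2 API:

# tenant-alpha-consumer-secretref.yaml
apiVersion: v1
kind: Secret
metadata:
  name: tenant-alpha-jwt
  namespace: apisix
stringData:
  key: tenant-alpha
  secret: tenant-alpha-secret-key
---
apiVersion: apisix.apache.org/v2
kind: ApisixConsumer
metadata:
  name: tenant-alpha-consumer
  namespace: apisix
spec:
  authParameter:
    jwtAuth:
      secretRef:
        name: tenant-alpha-jwt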
Now, the core routing logic. We create a single ApisixRoute that handles all traffic to /api/v1/metrics. The magic lies in APISIX’s variables and expression-based routing: the upstream name is constructed dynamically from the jwt_claim_tid variable, populated once the token has been validated. Be aware that stock APISIX does not expose JWT claims as routing variables out of the box; depending on your version, this takes a small custom plugin or a serverless pre-function.
# apisix-tenant-route.yaml
apiVersion: apisix.apache.org/v2
kind: ApisixRoute
metadata:
  name: analytics-api-route
  namespace: apisix
spec:
  http:
  - name: tenant-routing
    match:
      hosts:
      - "api.analytics.local"
      paths:
      - "/api/v1/metrics"
    backends:
    - serviceName: analytics-backend-{{jwt_claim_tid}}
      servicePort: 8080
    authentication:
      enable: true
      type: jwtAuth
    plugins:
    - name: "proxy-rewrite"
      enable: true
      config:
        # We rewrite the upstream dynamically based on the tenant ID claim in
        # the JWT. Note that upstream_name is not a stock proxy-rewrite option;
        # it is wired up by the claim-to-variable plugin mentioned above.
        # APISIX needs RBAC permissions to discover services cluster-wide, and
        # because the services live outside the apisix namespace we reference
        # the fully qualified domain name (FQDN).
        upstream_name: "analytics-backend-{{jwt_claim_tid}}.tenant-{{jwt_claim_tid}}.svc.cluster.local"
The pitfall here is that service discovery for this dynamic upstream is tricky. The expression analytics-backend-{{jwt_claim_tid}}.tenant-{{jwt_claim_tid}}.svc.cluster.local constructs the FQDN of the backend service, and for it to resolve, the APISIX instance must have the necessary permissions to discover services across different namespaces.
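For testing, a token only needs the consumer’s key claim plus our custom tid claim, signed with the consumer’s secret. A sketch using the jwt-cli tool (any JWT library works; the claim names follow the consumers defined above):

# Mint an HS256 token for tenant-alpha. jwt-auth matches the "key" claim
# against the ApisixConsumer; "tid" is our custom tenant claim.
TOKEN=$(jwt encode --secret tenant-alpha-secret-key '{"key": "tenant-alpha", "tid": "alpha"}')
echo $TOKEN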
Finally, we need another CiliumNetworkPolicy to allow the APISIX pods to communicate with the backend services in the tenant namespaces.
# allow-apisix-to-backends.yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "allow-apisix-to-backends-alpha"
  namespace: tenant-alpha
spec:
  endpointSelector:
    matchLabels:
      app.kubernetes.io/name: analytics-backend
      tenant: alpha
  ingress:
  - fromEndpoints:
    - matchLabels:
        app.kubernetes.io/name: apisix-gateway
        # Cilium selectors have no namespaceSelector field; the namespace is
        # itself a label on the endpoint's identity.
        k8s:io.kubernetes.pod.namespace: apisix
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
# The equivalent policy for tenant-beta follows.
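The mirror policy for tenant-beta differs only in its namespace and tenant label:

# allow-apisix-to-backends-beta.yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "allow-apisix-to-backends-beta"
  namespace: tenant-beta
spec:
  endpointSelector:
    matchLabels:
      app.kubernetes.io/name: analytics-backend
      tenant: beta
  ingress:
  - fromEndpoints:
    - matchLabels:
        app.kubernetes.io/name: apisix-gateway
        k8s:io.kubernetes.pod.namespace: apisix
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP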
The Tenant Backend Service
Each tenant gets their own isolated backend service. Here is a production-grade Go service that connects to ScyllaDB and exposes a simple metrics endpoint. It includes structured logging, configuration management, and graceful shutdown.
// main.go
package main

import (
	"context"
	"encoding/json"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"

	"github.com/gocql/gocql"
	"github.com/scylladb/gocqlx/v2"
	"github.com/scylladb/gocqlx/v2/table"
	"go.uber.org/zap"
)

// Metric represents a simple time-series data point.
type Metric struct {
	TenantID  string    `db:"tenant_id"`
	Timestamp time.Time `db:"ts"`
	Value     float64   `db:"value"`
}

// metricMetadata specifies the table name and partition/clustering keys.
// In a real application, this would be part of a proper data model package.
var metricMetadata = table.Metadata{
	Name:    "metrics",
	Columns: []string{"tenant_id", "ts", "value"},
	PartKey: []string{"tenant_id"},
	SortKey: []string{"ts"},
}

var metricTable = table.New(metricMetadata)

func main() {
	// Production-grade logging with Zap.
	logger, err := zap.NewProduction()
	if err != nil {
		log.Fatalf("can't initialize zap logger: %v", err)
	}
	defer logger.Sync()
	sugar := logger.Sugar()

	// Configuration from environment variables.
	scyllaHost := os.Getenv("SCYLLA_HOST")
	if scyllaHost == "" {
		scyllaHost = "scylla-cluster-client.database.svc.cluster.local"
	}
	tenantID := os.Getenv("TENANT_ID")
	if tenantID == "" {
		sugar.Fatal("TENANT_ID environment variable is not set")
	}

	// Create a ScyllaDB session.
	cluster := gocql.NewCluster(scyllaHost)
	cluster.Keyspace = "analytics"
	cluster.Consistency = gocql.Quorum
	cluster.Timeout = 5 * time.Second

	sugar.Infof("Connecting to ScyllaDB at %s...", scyllaHost)
	session, err := gocqlx.WrapSession(cluster.CreateSession())
	if err != nil {
		sugar.Fatalf("failed to connect to scylla: %v", err)
	}
	defer session.Close()
	sugar.Info("ScyllaDB connection successful")

	// Set up HTTP server and handler.
	mux := http.NewServeMux()
	mux.HandleFunc("/api/v1/metrics", metricsHandler(session, tenantID, sugar))
	server := &http.Server{
		Addr:    ":8080",
		Handler: mux,
	}

	// Graceful shutdown handling.
	go func() {
		sugar.Infof("Server starting on port 8080 for tenant %s", tenantID)
		if err := server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			sugar.Fatalf("could not listen on :8080: %v", err)
		}
	}()

	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGINT, syscall.SIGTERM)
	<-stop

	sugar.Info("Shutting down server...")
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := server.Shutdown(ctx); err != nil {
		sugar.Fatalf("server shutdown failed: %+v", err)
	}
	sugar.Info("Server exited properly")
}

func metricsHandler(session gocqlx.Session, tenantID string, logger *zap.SugaredLogger) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		logger.Infow("received metrics request", "tenant", tenantID, "remote_addr", r.RemoteAddr)

		// Query ScyllaDB for metrics for this specific tenant. The service is
		// hard-wired to its tenant ID, so it cannot query another tenant's data.
		q := session.Query(metricTable.Select()).BindMap(map[string]interface{}{
			"tenant_id": tenantID,
		})
		var metrics []Metric
		if err := q.Select(&metrics); err != nil {
			logger.Errorw("failed to query metrics from ScyllaDB", "error", err, "tenant", tenantID)
			http.Error(w, "Internal Server Error", http.StatusInternalServerError)
			return
		}

		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusOK)
		if err := json.NewEncoder(w).Encode(metrics); err != nil {
			logger.Errorw("failed to encode response", "error", err, "tenant", tenantID)
		}
	}
}
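The code assumes an analytics keyspace with a metrics table partitioned by tenant, which is what guarantees one tenant’s queries never touch another tenant’s partitions. For reference, a matching schema (the replication settings are illustrative; dc1 is the datacenter defined in the ScyllaCluster manifest above):

CREATE KEYSPACE IF NOT EXISTS analytics
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};

CREATE TABLE IF NOT EXISTS analytics.metrics (
  tenant_id text,
  ts        timestamp,
  value     double,
  PRIMARY KEY ((tenant_id), ts)
) WITH CLUSTERING ORDER BY (ts DESC);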
We build this into a container and create a Kubernetes Deployment for each tenant in their respective namespace, making sure to apply the correct labels.
# tenant-alpha-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: analytics-backend-alpha
  namespace: tenant-alpha
  labels:
    app.kubernetes.io/name: analytics-backend
    tenant: alpha
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: analytics-backend
      tenant: alpha
  template:
    metadata:
      labels:
        app.kubernetes.io/name: analytics-backend
        tenant: alpha
    spec:
      containers:
      - name: backend
        image: your-repo/analytics-backend:1.0.0
        ports:
        - containerPort: 8080
        env:
        - name: TENANT_ID
          value: "alpha"
        - name: SCYLLA_HOST
          value: "scylla-cluster-client.database.svc.cluster.local"
---
apiVersion: v1
kind: Service
metadata:
  name: analytics-backend-alpha
  namespace: tenant-alpha
spec:
  selector:
    app.kubernetes.io/name: analytics-backend
    tenant: alpha
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 8080
A corresponding set of manifests is created for tenant-beta. Now our backend is fully isolated.
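Two quick checks validate the whole chain. The positive path goes through the gateway with the token minted earlier (the gateway address placeholder depends on how APISIX is exposed in your cluster); the negative path tries to reach the beta backend directly from the alpha tenant’s netshoot pod and should be dropped, since no policy admits it:

# Positive path: through APISIX with a tenant-alpha JWT
curl -H "Host: api.analytics.local" -H "Authorization: ${TOKEN}" \
  http://<apisix-gateway-address>/api/v1/metrics

# Negative path: direct cross-tenant access is dropped by Cilium
BETA_SVC_IP=$(kubectl get svc -n tenant-beta analytics-backend-beta -o jsonpath='{.spec.clusterIP}')
kubectl exec -n tenant-alpha netshoot-alpha -- \
  curl -s --max-time 3 http://${BETA_SVC_IP}:8080/api/v1/metrics \
  || echo "blocked, as expected"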
Final Layer: UI Styling Isolation with CSS Modules
On the front-end, tenants can upload their own color schemes and logos. The biggest fear is a tenant theme defining a broad CSS rule like div { background-color: black !important; } that breaks the UI for everyone. Our defense is two-fold: CSS Modules give every component compile-time class-name encapsulation, and tenant customization is confined to CSS variables rather than arbitrary stylesheets.
Consider a simple React chart component:
// components/Chart/Chart.js
import React from 'react';
import styles from './Chart.module.css';

// The 'styles' object contains class names that are unique to this component.
// For example, styles.chartContainer might become 'Chart_chartContainer__2g5Zk'.
const Chart = ({ title, data }) => {
  return (
    <div className={styles.chartContainer}>
      <h3 className={styles.chartTitle}>{title}</h3>
      {/* Chart rendering logic here */}
    </div>
  );
};

export default Chart;
The corresponding CSS file uses standard syntax:
/* components/Chart/Chart.module.css */
.chartContainer {
  border: 1px solid #ccc;
  padding: 16px;
  border-radius: 8px;
  background-color: var(--chart-background-color, #fff); /* Uses CSS variables for theming */
}

.chartTitle {
  font-size: 1.2em;
  color: var(--chart-title-color, #333);
  margin-top: 0;
}
During the build process (via Webpack or Vite), Chart.module.css is processed: the class .chartContainer is transformed into a unique hash, like Chart_chartContainer__2g5Zk, and the styles object imported into the component maps the original name to the hashed one. This makes it practically impossible for one component’s class names to collide with another’s. Note that CSS Modules scope class names, not elements: a global div { } rule would still apply everywhere, which is why tenant themes may only set the CSS variables consumed above and never ship their own selectors. A common mistake is to mix global CSS with CSS Modules without a clear strategy, which negates the benefits. Our rule is simple: all component styles must be in a .module.css file. Global styles are only for foundational elements like fonts and resets.
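The var() defaults above hint at how a tenant “theme” is applied: it is just a set of CSS custom property values set on the document root. A minimal sketch (the allow-list and function name are ours, not a library API):

// theme.js -- apply a tenant palette by setting CSS custom properties on :root.
// Tenants supply values for an allow-listed set of variables, never selectors,
// so a theme cannot ship a rule like `div { background: black !important; }`.
const ALLOWED_THEME_VARS = new Set([
  '--chart-background-color',
  '--chart-title-color',
]);

export function applyTenantTheme(theme) {
  for (const [name, value] of Object.entries(theme)) {
    if (!ALLOWED_THEME_VARS.has(name)) continue;  // drop unknown variables
    if (!CSS.supports('color', value)) continue;  // drop non-color payloads
    document.documentElement.style.setProperty(name, value);
  }
}

// Usage: applyTenantTheme({ '--chart-background-color': '#0b1e33' });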
The Observability Breakthrough: Tying It All Together
The system is now fully isolated. But how do we debug it? A tenant reports their dashboard is slow.
This is where the investment in Cilium and its eBPF-based observability pays off. We can use Hubble to trace a request from the moment it leaves APISIX to the backend and from the backend to ScyllaDB, all without adding a single line of tracing code to our applications.
graph TD
    User --JWT for Tenant Alpha--> A[api.analytics.local];
    A --> B{APISIX Ingress};
    B --Valid JWT, tid=alpha--> C[Service: analytics-backend-alpha];
    C --Checks Cilium Policy--> D{Policy Allows};
    D --> E[Pod: analytics-backend-alpha-xyz];
    E --CQL Query for tenant_id=alpha--> F[Service: scylla-cluster];
    F --Checks Cilium Policy--> G{Policy Allows};
    G --> H[Pod: scylla-cluster-dc1-rack-a-0];
    H --Returns Data--> E;
    E --Returns JSON--> B;
    B --Returns JSON--> User;

    subgraph "Namespace: apisix"
        B
    end
    subgraph "Namespace: tenant-alpha"
        C
        E
    end
    subgraph "Namespace: database"
        F
        H
    end
We run a Hubble query to observe traffic flowing from the APISIX pod to the tenant-alpha backend:
hubble observe --from-pod apisix/apisix-7f8d9... --to-namespace tenant-alpha --verdict FORWARDED --protocol http
The output gives us HTTP-level visibility (method, path, status code, latency) for that specific network hop, extracted directly from kernel space by eBPF.
Oct 27 11:15:01.123 [FORWARDED] apisix/apisix-7f8d9... -> tenant-alpha/analytics-backend-alpha-5b6c7... L7/HTTP Request
Path: /api/v1/metrics
Method: GET
Latency: 250ms
A 250ms latency for an internal service call is high, so we drill down into the next hop: from the analytics-backend-alpha pod to ScyllaDB.
hubble observe --from-pod tenant-alpha/analytics-backend-alpha-5b6c7... --to-pod database/scylla-cluster-dc1-rack-a-0 -f
Hubble might not parse the CQL protocol by default, but it gives us invaluable TCP-level information. We might see a high number of TCP retransmits or slow SYN/SYN-ACK handshakes, indicating either network congestion or a ScyllaDB pod too busy to accept new connections quickly. This points us directly at the database as the bottleneck. Armed with this information, we can then use ScyllaDB’s own monitoring tools to investigate the specific query patterns for tenant-alpha, having already ruled out the network path and the API gateway as the source of the delay. This entire diagnostic process took minutes, not hours.
The current implementation achieves our primary goals of isolation and observability. However, running a separate deployment for every tenant is not resource-efficient and can lead to pod sprawl. A potential future iteration is to move to a single, multi-tenant backend service that performs application-level data isolation. This would require rigorous code reviews and testing to prevent data leakage, but would be more cost-effective. Cilium’s identity-based policies would still be crucial here, perhaps enforcing that only the unified backend service identity can talk to the database, while still isolating tenants at the ingress level via APISIX. Furthermore, the tenant onboarding process is still manual; building a GitOps workflow to automatically provision namespaces, policies, and deployments is the next critical step toward operational maturity.