Our centralized MLflow Tracking Server became a victim of its own success. Initially a tool for a small data science team, it was now a critical, shared resource for dozens of teams, each running hundreds of experiments. The operational burden was immense, but the primary pain point was the complete lack of multi-tenant isolation. The MLflow OSS API and UI offer no meaningful namespacing; any user with access could see, and potentially alter, any other team’s experiments. This wasn’t just an inconvenience; it was a compliance and security nightmare waiting to happen.
Our initial thought was to fork MLflow and build in an authentication and authorization layer. This was quickly dismissed. The maintenance overhead would be prohibitive, and we’d constantly be chasing upstream updates. The next idea was an API gateway with a complex authorization plugin. While feasible, this approach tightly couples our security logic with the gateway’s application code. Any change to tenancy rules would require a full redeployment of a critical infrastructure component.
The final architecture we settled on pushes the tenancy enforcement down to the CNI layer, completely decoupling it from the application logic. We would build a GraphQL gateway to provide a tailored API for our internal observability dashboards, but the actual security enforcement would be handled by Cilium’s eBPF-powered L7 network policies. This allows us to define and enforce access control based on cryptographic workload identity and HTTP headers, without the gateway or MLflow needing to know anything about tenants.
Here is the high-level flow we implemented:
graph TD
    subgraph Kubernetes Cluster
        subgraph Tenant-A Namespace
            RelayDashboardA[React/Relay Dashboard Pod<br/>SA: tenant-a-sa]
        end
        subgraph Tenant-B Namespace
            RelayDashboardB[React/Relay Dashboard Pod<br/>SA: tenant-b-sa]
        end
        subgraph MLOps Namespace
            CiliumProxy[Cilium Envoy Proxy]
            GraphQLGateway[GraphQL Gateway Pod]
            MLflowServer[MLflow Tracking Server Pod]
        end
        RelayDashboardA -- "GraphQL Query (X-Tenant-ID: tenant-a)" --> CiliumProxy
        RelayDashboardB -- "GraphQL Query (X-Tenant-ID: tenant-b)" --> CiliumProxy
        CiliumProxy -- "Policy Enforced" --> GraphQLGateway
        GraphQLGateway -- "Upstream REST Call" --> MLflowServer
    end
    style RelayDashboardA fill:#cde4ff
    style RelayDashboardB fill:#cde4ff
    style CiliumProxy fill:#f9d5a9
This post-mortem documents the build process, the critical configuration snippets, and the pitfalls encountered while weaving together these three distinct technologies.
1. Establishing the Baseline Environment
A real-world project requires a reproducible local environment. We use `kind` for this. The key requirement is to disable the default CNI (`disableDefaultCNI: true`) so we can install Cilium.
kind-config.yaml
# kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
networking:
disableDefaultCNI: true # IMPORTANT: Disable default CNI
podSubnet: "10.244.0.0/16"
serviceSubnet: "10.96.0.0/12"
With this configuration, cluster creation is straightforward.
# Create the kind cluster
kind create cluster --config=kind-config.yaml --name=mlops-secure
# Verify nodes are ready (will show NotReady until CNI is installed)
kubectl get nodes
Next, we install Cilium via Helm. For L7 policy enforcement on HTTP, Hubble with UI is invaluable for debugging, and enabling Prometheus metrics is essential for production monitoring.
# Add Cilium Helm repo
helm repo add cilium https://helm.cilium.io/
# Install Cilium with necessary features
helm install cilium cilium/cilium --version 1.14.5 \
--namespace kube-system \
--set hubble.relay.enabled=true \
--set hubble.ui.enabled=true \
--set prometheus.enabled=true \
--set operator.prometheus.enabled=true \
--set kubeProxyReplacement=strict \
--set securityContext.privileged=true \
--set bpf.masquerade=true \
--set extraConfig.enable-envoy-config=true # Not always needed, but useful for debugging
After a few minutes, `cilium status` should report everything is healthy. Now we can set up our application namespaces and MLflow itself.
mlflow-deployment.yaml
# namespaces.yaml
apiVersion: v1
kind: Namespace
metadata:
name: mlops
---
apiVersion: v1
kind: Namespace
metadata:
name: tenant-a
---
apiVersion: v1
kind: Namespace
metadata:
name: tenant-b
---
# mlflow-server.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: mlflow-server
namespace: mlops
labels:
app: mlflow-server
spec:
replicas: 1
selector:
matchLabels:
app: mlflow-server
template:
metadata:
labels:
app: mlflow-server
spec:
      containers:
      - name: mlflow-server
        image: ghcr.io/mlflow/mlflow:v2.8.0
        # The image does not launch the tracking server on its own;
        # be explicit about the command so args are passed to `mlflow server`
        command: ["mlflow", "server"]
        args:
        - "--host=0.0.0.0"
        - "--port=5000"
        - "--backend-store-uri=sqlite:///mlflow.db"
        - "--default-artifact-root=./mlartifacts"
        ports:
        - containerPort: 5000
          name: http
---
apiVersion: v1
kind: Service
metadata:
name: mlflow-server-svc
namespace: mlops
spec:
selector:
app: mlflow-server
ports:
- protocol: TCP
port: 5000
targetPort: http
Apply this manifest to get our backend service running: `kubectl apply -f mlflow-deployment.yaml`.
2. The GraphQL Gateway: Abstraction and Decoupling
The gateway’s purpose is to translate GraphQL queries into REST API calls against the MLflow backend. This abstraction lets us design a schema that is ergonomic for our front-end developers, hiding the complexities of the MLflow API. We chose Apollo Server on Node.js for its maturity and ease of use.
package.json
{
"name": "mlflow-gateway",
"version": "1.0.0",
"description": "GraphQL Gateway for MLflow",
"main": "index.js",
"type": "module",
"scripts": {
"start": "node index.js"
},
"dependencies": {
"@apollo/server": "^4.9.5",
"axios": "^1.6.0",
"graphql": "^16.8.1",
"pino": "^8.16.1",
"pino-pretty": "^10.2.3"
}
}
The core logic resides in the server setup and the resolvers. A critical detail in a production system is structured logging. We use `pino` for this.
index.js
import { ApolloServer } from '@apollo/server';
import { startStandaloneServer } from '@apollo/server/standalone';
import axios from 'axios';
import pino from 'pino';
// Production-grade logger
const logger = pino({
level: process.env.LOG_LEVEL || 'info',
transport: {
target: 'pino-pretty',
options: {
colorize: true,
},
},
});
// MLflow service URL from environment variable for configuration flexibility
const MLFLOW_API_URL = process.env.MLFLOW_API_URL || 'http://mlflow-server-svc.mlops:5000/api/2.0/mlflow';
const typeDefs = `#graphql
type Experiment {
experiment_id: String!
name: String!
artifact_location: String
lifecycle_stage: String
tags: [Tag]
}
type Tag {
key: String
value: String
}
type Query {
experiments(filter: String): [Experiment]
experimentByName(name: String!): Experiment
}
`;
const resolvers = {
Query: {
experiments: async (_, { filter }) => {
try {
logger.info({ filter }, 'Fetching all experiments');
        // MLflow 2.x removed the `experiments/list` endpoint; use search instead
        const response = await axios.post(`${MLFLOW_API_URL}/experiments/search`, {
          max_results: 1000,
        });
let experiments = response.data.experiments || [];
// In a real application, filtering would be more robust.
// This is a simple implementation for demonstration.
if (filter) {
experiments = experiments.filter(exp => exp.name.includes(filter));
}
return experiments;
} catch (error) {
logger.error({ err: error.message }, 'Failed to fetch experiments from MLflow');
// Propagate a user-friendly error
throw new Error('Could not connect to the MLflow backend.');
}
},
experimentByName: async (_, { name }) => {
try {
logger.info({ experimentName: name }, 'Fetching experiment by name');
const response = await axios.get(`${MLFLOW_API_URL}/experiments/get-by-name`, {
params: { experiment_name: name },
});
return response.data.experiment;
} catch (error) {
if (error.response && error.response.status === 404) {
logger.warn({ experimentName: name }, 'Experiment not found');
return null;
}
logger.error({ err: error.message, experimentName: name }, 'Failed to fetch experiment by name');
throw new Error('Error fetching specific experiment from MLflow.');
}
},
},
};
const server = new ApolloServer({
typeDefs,
resolvers,
});
const { url } = await startStandaloneServer(server, {
listen: { port: 4000 },
context: async ({ req }) => {
// Extract tenant ID for potential future use in application logic
// For now, we rely solely on Cilium, but this is good practice.
const tenantId = req.headers['x-tenant-id'] || 'unknown';
logger.info({ tenantId, path: req.url }, 'Received request');
return { tenantId };
},
});
logger.info(`🚀 Gateway ready at: ${url}`);
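The context function above only logs the tenant ID. As defense in depth, the gateway could also use that identity to scope queries before they ever reach MLflow. A minimal sketch of the idea — the `scopeFilterToTenant` helper and the `<tenant>-` project-naming convention are hypothetical, not part of the gateway code above:

```javascript
// Hypothetical defense-in-depth helper: Cilium already enforces the
// X-Tenant-ID header at the network layer, but the gateway could also
// constrain experiment filters to the caller's tenant prefix.
// The "<tenant>-" naming convention is an assumption for illustration.
function scopeFilterToTenant(tenantId, filter) {
  if (!tenantId || tenantId === 'unknown') {
    throw new Error('Missing tenant identity');
  }
  const prefix = `${tenantId}-`;
  // No filter supplied: default to everything under the tenant's prefix.
  if (!filter) return prefix;
  // Reject filters that reach outside the tenant's own projects.
  if (!filter.startsWith(prefix)) {
    throw new Error(`Filter "${filter}" escapes tenant "${tenantId}"`);
  }
  return filter;
}
```

A resolver could then call `scopeFilterToTenant(context.tenantId, filter)` before querying MLflow, so even a misconfigured network policy never exposes another tenant's experiments.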
To deploy this, we need a `Dockerfile` and a Kubernetes `Deployment`.
Dockerfile
FROM node:18-alpine
WORKDIR /app
COPY package*.json ./
# Reproducible, production-only install (requires a committed package-lock.json)
RUN npm ci --omit=dev
COPY . .
EXPOSE 4000
CMD ["npm", "start"]
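A `kind` cluster cannot pull from a private registry without extra setup, so a common local workflow is to build the image and side-load it directly into the cluster nodes. A sketch, reusing the image and cluster names from the examples above (adjust if you push to a real registry instead):

```shell
# Build the gateway image locally
docker build -t your-registry/mlflow-gateway:1.0.0 .

# Side-load it into every node of the kind cluster, bypassing a registry pull
kind load docker-image your-registry/mlflow-gateway:1.0.0 --name=mlops-secure

# With a side-loaded image, set imagePullPolicy: IfNotPresent on the
# gateway container so Kubernetes does not attempt a registry pull
```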
gateway-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: gql-gateway
namespace: mlops
labels:
app: gql-gateway
spec:
replicas: 1
selector:
matchLabels:
app: gql-gateway
template:
metadata:
labels:
app: gql-gateway
spec:
containers:
- name: gateway
# Replace with your actual registry path
image: your-registry/mlflow-gateway:1.0.0
ports:
- containerPort: 4000
name: http-gql
env:
- name: MLFLOW_API_URL
value: "http://mlflow-server-svc.mlops.svc.cluster.local:5000/api/2.0/mlflow"
- name: LOG_LEVEL
value: "debug"
---
apiVersion: v1
kind: Service
metadata:
name: gql-gateway-svc
namespace: mlops
spec:
selector:
app: gql-gateway
ports:
- protocol: TCP
port: 4000
targetPort: http-gql
After building and pushing the image, apply the manifest: `kubectl apply -f gateway-deployment.yaml`. At this point, any pod in the cluster can access the gateway and see all MLflow data. The system is functional but insecure.
3. Cilium L7 Policies: The Core of Enforcement
This is where the magic happens. We will define `CiliumNetworkPolicy` resources that control traffic based on Kubernetes identities (namespace, pod labels, and optionally the `ServiceAccount`) and L7 properties (HTTP headers).
First, let’s create identities for our tenant workloads. We’ll deploy simple `curl` pods in each tenant namespace, each with a unique `ServiceAccount`.
tenant-workloads.yaml
# tenant-a-resources.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: tenant-a-sa
namespace: tenant-a
---
apiVersion: v1
kind: Pod
metadata:
name: tenant-a-client
namespace: tenant-a
labels:
app: tenant-a-client
spec:
serviceAccountName: tenant-a-sa
containers:
- name: curl
image: curlimages/curl
command: ["sleep", "3600"]
---
# tenant-b-resources.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: tenant-b-sa
namespace: tenant-b
---
apiVersion: v1
kind: Pod
metadata:
name: tenant-b-client
namespace: tenant-b
labels:
app: tenant-b-client
spec:
serviceAccountName: tenant-b-sa
containers:
- name: curl
image: curlimages/curl
command: ["sleep", "3600"]
Apply this: `kubectl apply -f tenant-workloads.yaml`.
Now, we define the network policies. The strategy is:
- A default-deny policy on the `gql-gateway` to block all ingress traffic.
- A specific L7 policy to allow ingress traffic that meets our tenant criteria.
cilium-policies.yaml
# 01-default-deny.yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
name: gql-gateway-deny-all
namespace: mlops
spec:
endpointSelector:
matchLabels:
app: gql-gateway
  ingress: [] # No allow rules: selected endpoints go into default-deny for ingress
---
# 02-allow-tenant-access.yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
name: allow-tenant-access-to-gateway
namespace: mlops
spec:
endpointSelector:
matchLabels:
app: gql-gateway
ingress:
- fromEndpoints:
# Rule for Tenant A
- matchLabels:
"k8s:io.kubernetes.pod.namespace": "tenant-a"
"k8s:app": "tenant-a-client" # We could also match on service account
toPorts:
- ports:
- port: "4000"
protocol: TCP
rules:
http:
- method: "POST"
path: "/graphql"
headers:
- "X-Tenant-ID: tenant-a"
  - fromEndpoints:
    # Rule for Tenant B, mirroring the Tenant A rule
    - matchLabels:
        "k8s:io.kubernetes.pod.namespace": "tenant-b"
        "k8s:app": "tenant-b-client"
toPorts:
- ports:
- port: "4000"
protocol: TCP
rules:
http:
- method: "POST"
path: "/graphql"
headers:
- "X-Tenant-ID: tenant-b"
Let’s break down the `allow-tenant-access-to-gateway` policy. It applies to pods with the label `app: gql-gateway`. It defines two ingress rules:
- The first `fromEndpoints` block allows traffic from pods in the `tenant-a` namespace.
- The crucial part is the `rules.http` section. It specifies that for this traffic to be allowed, it must be an HTTP `POST` to `/graphql` and must contain the header `X-Tenant-ID: tenant-a`.
- The second block mirrors this logic for `tenant-b`.
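The comment in the policy notes that we could also match on service accounts. Cilium derives a label from each pod's `serviceAccountName`, so the selector can pin the workload identity directly instead of relying on a pod label that any manifest author could copy. A sketch of that alternative selector for the Tenant A rule (same policy shape, different `matchLabels`):

```yaml
ingress:
- fromEndpoints:
  - matchLabels:
      "k8s:io.kubernetes.pod.namespace": "tenant-a"
      # Derived by Cilium from the pod's serviceAccountName, so it cannot
      # be spoofed by simply relabeling a pod
      "k8s:io.cilium.k8s.policy.serviceaccount": "tenant-a-sa"
```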
Apply these policies: `kubectl apply -f cilium-policies.yaml`. Immediately, all traffic to the gateway that does not match the allow rules should be blocked.
We can now verify enforcement.
Verification inside the `tenant-a-client` pod:
# Get a shell into the tenant-a pod
kubectl exec -it -n tenant-a tenant-a-client -- sh
# Define the GraphQL query
GQL_QUERY='{"query":"query { experiments(filter: \"tenant-a-project\") { name experiment_id } }"}'
# 1. Correct request: Correct tenant ID header. This should succeed.
curl -X POST \
-H "Content-Type: application/json" \
-H "X-Tenant-ID: tenant-a" \
-d "$GQL_QUERY" \
http://gql-gateway-svc.mlops:4000/graphql
# Expected Output (if experiments exist): {"data":{"experiments":[{"name":"tenant-a-project-1",...}]}}
# 2. Incorrect request: wrong tenant ID header. This should be rejected by Cilium at L7.
curl -X POST \
-H "Content-Type: application/json" \
-H "X-Tenant-ID: tenant-b" \
-d "$GQL_QUERY" \
http://gql-gateway-svc.mlops:4000/graphql
# Expected Output: Access denied (HTTP 403 from the Cilium-managed Envoy proxy)
# 3. Malformed request: no tenant ID header. This should also be rejected at L7.
curl -X POST \
-H "Content-Type: application/json" \
-d "$GQL_QUERY" \
http://gql-gateway-svc.mlops:4000/graphql
# Expected Output: Access denied (HTTP 403 from the Cilium-managed Envoy proxy)
The “Access denied” response is generated by the Cilium-managed Envoy proxy: the TCP connection itself is permitted by the L3/L4 portion of the policy, but the HTTP request fails the L7 rules, so Envoy answers with a 403 before the gateway ever sees the request. The Hubble UI, or `hubble observe --namespace mlops --verdict DROPPED`, will show these denied L7 flows, providing clear evidence that the policy is enforced at the network layer.
4. The Relay Front-end: Consuming the Secured API
The final piece is the client application. Relay is a perfect fit for a data-intensive dashboard. The most relevant piece of code for this discussion is the Relay Environment setup, where we configure the network layer to inject the required `X-Tenant-ID` header.
RelayEnvironment.js
import {
Environment,
Network,
RecordSource,
Store,
} from 'relay-runtime';
// This would typically come from an auth context or environment config
const TENANT_ID = 'tenant-a';
async function fetchGraphQL(params, variables) {
const response = await fetch('http://localhost:8080/graphql', { // Assuming a local proxy for dev
method: 'POST',
headers: {
'Content-Type': 'application/json',
// This is the critical part that satisfies the Cilium L7 policy
'X-Tenant-ID': TENANT_ID,
},
body: JSON.stringify({
query: params.text,
variables,
}),
});
if (response.status !== 200) {
console.error(`Network request failed with status ${response.status}`);
// In a real app, you'd have more robust error handling
throw new Error('Failed to fetch data from the GraphQL gateway.');
}
return await response.json();
}
export default new Environment({
network: Network.create(fetchGraphQL),
store: new Store(new RecordSource()),
});
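The hardcoded `TENANT_ID` above is a placeholder; in practice the value would be injected per deployment (for example, a ConfigMap-templated global baked into the served bundle). A minimal validation helper for that pattern — the `resolveTenantId` function and the `tenant-<slug>` naming scheme are illustrative assumptions, not part of the environment code above:

```javascript
// Hypothetical: resolve and validate the tenant ID from injected config
// instead of hardcoding it in the bundle. The tenant-<slug> naming
// scheme is an assumption for illustration.
function resolveTenantId(config) {
  const id = config && config.TENANT_ID;
  if (typeof id !== 'string' || !/^tenant-[a-z0-9]+(-[a-z0-9]+)*$/.test(id)) {
    // Fail fast: a missing or malformed header would be silently
    // rejected by Cilium, which is much harder to debug from a browser.
    throw new Error('TENANT_ID missing or malformed');
  }
  return id;
}
```

`fetchGraphQL` could then send `'X-Tenant-ID': resolveTenantId(window.__APP_CONFIG__)` (where `__APP_CONFIG__` is a hypothetical injected global), surfacing misconfiguration at startup rather than as opaque 403s from the CNI.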
A component using this environment would look something like this:
ExperimentDashboard.js
import React from 'react';
import { graphql, useLazyLoadQuery } from 'react-relay';
const ExperimentsQuery = graphql`
query ExperimentDashboardQuery($filter: String) {
experiments(filter: $filter) {
experiment_id
name
lifecycle_stage
}
}
`;
function ExperimentDashboard({ tenantProjectFilter }) {
// The tenant filter is passed as a GraphQL variable.
// The tenant identity is handled by the network layer.
const data = useLazyLoadQuery(ExperimentsQuery, { filter: tenantProjectFilter });
if (!data || !data.experiments) {
return <div>Loading experiments... or none found.</div>;
}
return (
<div>
<h1>Experiments for Filter: {tenantProjectFilter}</h1>
<ul>
{data.experiments.map(exp => (
<li key={exp.experiment_id}>
<strong>{exp.name}</strong> ({exp.lifecycle_stage})
</li>
))}
</ul>
</div>
);
}
When this component renders, Relay’s network layer automatically adds the `X-Tenant-ID` header. If the dashboard pod is running with the `tenant-a-sa` service account, the request will be allowed by Cilium. If it were somehow deployed with the `tenant-b-sa` service account, the request would fail at the CNI level, preventing any data leakage.
Limitations and Future Considerations
While this architecture provides a robust, decoupled security model, it is not without its own set of trade-offs. The reliance on an HTTP header for tenancy assumes that the client workload (the Relay dashboard pod in this case) is trusted to send the correct header. In our controlled environment, where pod identities are tied to namespaces and service accounts, this is an acceptable risk. For a zero-trust environment, this could be hardened by using a service mesh like Istio or Linkerd in conjunction with Cilium to enforce mTLS and use SPIFFE identities, which are cryptographically verifiable.
Furthermore, L7 policy inspection, even when accelerated by eBPF, introduces a marginal amount of latency compared to pure L3/L4 policies. For our use case—an internal observability dashboard—this is negligible. For high-frequency, low-latency APIs, this would require careful performance testing.
The next evolutionary step for this system is to move beyond simple header matching. We are exploring using Cilium’s integration with Open Policy Agent (OPA) to enforce more complex, attribute-based access control rules. This would allow us to define policies like “allow access only if the experiment tag ‘cost-center’ matches the user’s group membership,” which could be retrieved from an external identity provider. This would bring even finer-grained control, still without writing a single line of authorization code in the gateway application itself.