Implementing L7-Aware Multi-Tenancy for MLflow Using Cilium and a Relay-Powered GraphQL Gateway


Our centralized MLflow Tracking Server became a victim of its own success. Initially a tool for a small data science team, it was now a critical, shared resource for dozens of teams, each running hundreds of experiments. The operational burden was immense, but the primary pain point was the complete lack of multi-tenant isolation. The MLflow OSS API and UI offer no meaningful namespacing; any user with access could see, and potentially alter, any other team’s experiments. This wasn’t just an inconvenience; it was a compliance and security nightmare waiting to happen.

Our initial thought was to fork MLflow and build in an authentication and authorization layer. This was quickly dismissed. The maintenance overhead would be prohibitive, and we’d constantly be chasing upstream updates. The next idea was an API gateway with a complex authorization plugin. While feasible, this approach tightly couples our security logic with the gateway’s application code. Any change to tenancy rules would require a full redeployment of a critical infrastructure component.

The final architecture we settled on pushes the tenancy enforcement down to the CNI layer, completely decoupling it from the application logic. We would build a GraphQL gateway to provide a tailored API for our internal observability dashboards, but the actual security enforcement would be handled by Cilium’s eBPF-powered L7 network policies. This allows us to define and enforce access control based on cryptographic workload identity and HTTP headers, without the gateway or MLflow needing to know anything about tenants.

Here is the high-level flow we implemented:

graph TD
    subgraph Kubernetes Cluster
        subgraph Tenant-A Namespace
            RelayDashboardA[React/Relay Dashboard Pod<br/>SA: tenant-a-sa]
        end
        subgraph Tenant-B Namespace
            RelayDashboardB[React/Relay Dashboard Pod<br/>SA: tenant-b-sa]
        end
        subgraph MLOps Namespace
            CiliumProxy[Cilium Envoy Proxy]
            GraphQLGateway[GraphQL Gateway Pod]
            MLflowServer[MLflow Tracking Server Pod]
        end
        RelayDashboardA -- "GraphQL Query (X-Tenant-ID: tenant-a)" --> CiliumProxy
        RelayDashboardB -- "GraphQL Query (X-Tenant-ID: tenant-b)" --> CiliumProxy
        CiliumProxy -- "Policy Enforced" --> GraphQLGateway
        GraphQLGateway -- "Upstream REST Call" --> MLflowServer
    end
    style RelayDashboardA fill:#cde4ff
    style RelayDashboardB fill:#cde4ff
    style CiliumProxy fill:#f9d5a9

This post-mortem documents the build process, the critical configuration snippets, and the pitfalls encountered while weaving together these three distinct technologies.

1. Establishing the Baseline Environment

A real-world project requires a reproducible local environment. We use kind for this. The key requirement is to disable the default CNI (disableDefaultCNI: true) so we can install Cilium.

kind-config.yaml

# kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
networking:
  disableDefaultCNI: true # IMPORTANT: Disable default CNI
  podSubnet: "10.244.0.0/16"
  serviceSubnet: "10.96.0.0/12"

With this configuration, cluster creation is straightforward.

# Create the kind cluster
kind create cluster --config=kind-config.yaml --name=mlops-secure

# Verify nodes are ready (will show NotReady until CNI is installed)
kubectl get nodes

Next, we install Cilium via Helm. For debugging L7 HTTP policy enforcement, Hubble and its UI are invaluable, and enabling Prometheus metrics is essential for production monitoring.

# Add Cilium Helm repo
helm repo add cilium https://helm.cilium.io/

# Install Cilium with necessary features
helm install cilium cilium/cilium --version 1.14.5 \
   --namespace kube-system \
   --set hubble.relay.enabled=true \
   --set hubble.ui.enabled=true \
   --set prometheus.enabled=true \
   --set operator.prometheus.enabled=true \
   --set kubeProxyReplacement=strict \
   --set securityContext.privileged=true \
   --set bpf.masquerade=true \
   --set extraConfig.enable-envoy-config=true # Not always needed, but useful for debugging

After a few minutes, cilium status should report everything is healthy. Now we can set up our application namespaces and MLflow itself.

mlflow-deployment.yaml

# namespaces.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: mlops
---
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-a
---
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-b
---
# mlflow-server.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-server
  namespace: mlops
  labels:
    app: mlflow-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mlflow-server
  template:
    metadata:
      labels:
        app: mlflow-server
    spec:
      containers:
      - name: mlflow-server
        image: ghcr.io/mlflow/mlflow:v2.8.0
        # Setting args alone replaces the image's default command and drops the
        # "mlflow server" invocation, so spell it out explicitly.
        command: ["mlflow", "server"]
        args:
          - "--host=0.0.0.0"
          - "--port=5000"
          - "--backend-store-uri=sqlite:///mlflow.db"
          - "--default-artifact-root=./mlartifacts"
        ports:
        - containerPort: 5000
          name: http
---
apiVersion: v1
kind: Service
metadata:
  name: mlflow-server-svc
  namespace: mlops
spec:
  selector:
    app: mlflow-server
  ports:
  - protocol: TCP
    port: 5000
    targetPort: http

Apply this manifest to get our backend service running: kubectl apply -f mlflow-deployment.yaml.

2. The GraphQL Gateway: Abstraction and Decoupling

The gateway’s purpose is to translate GraphQL queries into REST API calls against the MLflow backend. This abstraction lets us design a schema that is ergonomic for our front-end developers, hiding the complexities of the MLflow API. We chose Apollo Server on Node.js for its maturity and ease of use.
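Conceptually, the gateway is a thin routing layer: each GraphQL field resolves to exactly one upstream REST call. A minimal sketch of that mapping (the helper itself is illustrative; the endpoint paths follow the MLflow 2.x REST API):

```javascript
// Illustrative route table: each GraphQL field maps to one MLflow REST call.
const fieldToRest = {
  experiments:      { method: 'POST', path: '/experiments/search' },
  experimentByName: { method: 'GET',  path: '/experiments/get-by-name' },
};

// Resolve a GraphQL field name to an upstream request descriptor.
function restCallFor(field, baseUrl) {
  const route = fieldToRest[field];
  if (!route) throw new Error(`No upstream mapping for GraphQL field: ${field}`);
  return { method: route.method, url: `${baseUrl}${route.path}` };
}

console.log(restCallFor('experiments', 'http://mlflow:5000/api/2.0/mlflow'));
```

Keeping this mapping explicit makes it obvious which parts of the MLflow surface area the front-end can reach, which matters later when we reason about what the L7 policy actually protects.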

package.json

{
  "name": "mlflow-gateway",
  "version": "1.0.0",
  "description": "GraphQL Gateway for MLflow",
  "main": "index.js",
  "type": "module",
  "scripts": {
    "start": "node index.js"
  },
  "dependencies": {
    "@apollo/server": "^4.9.5",
    "axios": "^1.6.0",
    "graphql": "^16.8.1",
    "pino": "^8.16.1",
    "pino-pretty": "^10.2.3"
  }
}

The core logic resides in the server setup and the resolvers. A critical detail in a production system is structured logging. We use pino for this.

index.js

import { ApolloServer } from '@apollo/server';
import { startStandaloneServer } from '@apollo/server/standalone';
import axios from 'axios';
import pino from 'pino';

// Production-grade logger
const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  transport: {
    target: 'pino-pretty',
    options: {
      colorize: true,
    },
  },
});

// MLflow service URL from environment variable for configuration flexibility
const MLFLOW_API_URL = process.env.MLFLOW_API_URL || 'http://mlflow-server-svc.mlops:5000/api/2.0/mlflow';

const typeDefs = `#graphql
  type Experiment {
    experiment_id: String!
    name: String!
    artifact_location: String
    lifecycle_stage: String
    tags: [Tag]
  }

  type Tag {
      key: String
      value: String
  }

  type Query {
    experiments(filter: String): [Experiment]
    experimentByName(name: String!): Experiment
  }
`;

const resolvers = {
  Query: {
    experiments: async (_, { filter }) => {
      try {
        logger.info({ filter }, 'Fetching all experiments');
        // MLflow 2.x removed the `experiments/list` endpoint; use search instead.
        const response = await axios.post(`${MLFLOW_API_URL}/experiments/search`, {
          max_results: 1000,
        });
        let experiments = response.data.experiments || [];

        // In a real application, filtering would be more robust.
        // This is a simple implementation for demonstration.
        if (filter) {
            experiments = experiments.filter(exp => exp.name.includes(filter));
        }

        return experiments;
      } catch (error) {
        logger.error({ err: error.message }, 'Failed to fetch experiments from MLflow');
        // Propagate a user-friendly error
        throw new Error('Could not connect to the MLflow backend.');
      }
    },
    experimentByName: async (_, { name }) => {
      try {
        logger.info({ experimentName: name }, 'Fetching experiment by name');
        const response = await axios.get(`${MLFLOW_API_URL}/experiments/get-by-name`, {
          params: { experiment_name: name },
        });
        return response.data.experiment;
      } catch (error) {
        if (error.response && error.response.status === 404) {
            logger.warn({ experimentName: name }, 'Experiment not found');
            return null;
        }
        logger.error({ err: error.message, experimentName: name }, 'Failed to fetch experiment by name');
        throw new Error('Error fetching specific experiment from MLflow.');
      }
    },
  },
};

const server = new ApolloServer({
  typeDefs,
  resolvers,
});

const { url } = await startStandaloneServer(server, {
  listen: { port: 4000 },
  context: async ({ req }) => {
    // Extract tenant ID for potential future use in application logic
    // For now, we rely solely on Cilium, but this is good practice.
    const tenantId = req.headers['x-tenant-id'] || 'unknown';
    logger.info({ tenantId, path: req.url }, 'Received request');
    return { tenantId };
  },
});

logger.info(`🚀 Gateway ready at: ${url}`);
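
Since the gateway reads the header anyway, cheap defense-in-depth is to validate its shape before trusting it in logs or future application logic. A sketch (the allowlist and pattern below are our own assumptions, not part of the deployment above):

```javascript
// Defense-in-depth: normalize and validate the X-Tenant-ID header value
// before using it. The pattern and allowlist here are illustrative.
const KNOWN_TENANTS = new Set(['tenant-a', 'tenant-b']);
const TENANT_ID_PATTERN = /^[a-z0-9][a-z0-9-]{1,62}$/;

function resolveTenantId(rawHeader) {
  // Node lowercases header names; a value may be string, string[], or undefined.
  const value = Array.isArray(rawHeader) ? rawHeader[0] : rawHeader;
  if (typeof value !== 'string' || !TENANT_ID_PATTERN.test(value)) return 'unknown';
  return KNOWN_TENANTS.has(value) ? value : 'unknown';
}
```

The context function above could then call resolveTenantId(req.headers['x-tenant-id']) instead of trusting the raw value, so that a spoofed or malformed header can never masquerade as a real tenant in downstream logic.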

To deploy this, we need a Dockerfile and a Kubernetes Deployment.

Dockerfile

FROM node:18-alpine

WORKDIR /app

COPY package*.json ./
RUN npm install

COPY . .

EXPOSE 4000
CMD ["npm", "start"]

gateway-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gql-gateway
  namespace: mlops
  labels:
    app: gql-gateway
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gql-gateway
  template:
    metadata:
      labels:
        app: gql-gateway
    spec:
      containers:
      - name: gateway
        # Replace with your actual registry path
        image: your-registry/mlflow-gateway:1.0.0
        ports:
        - containerPort: 4000
          name: http-gql
        env:
        - name: MLFLOW_API_URL
          value: "http://mlflow-server-svc.mlops.svc.cluster.local:5000/api/2.0/mlflow"
        - name: LOG_LEVEL
          value: "debug"
---
apiVersion: v1
kind: Service
metadata:
  name: gql-gateway-svc
  namespace: mlops
spec:
  selector:
    app: gql-gateway
  ports:
  - protocol: TCP
    port: 4000
    targetPort: http-gql

After building and pushing the image, apply the manifest: kubectl apply -f gateway-deployment.yaml. At this point, any pod in the cluster can access the gateway and see all MLflow data. The system is functional but insecure.

3. Cilium L7 Policies: The Core of Enforcement

This is where the magic happens. We will define CiliumNetworkPolicy resources that control traffic based on Kubernetes identities (ServiceAccount) and L7 properties (HTTP headers).

First, let’s create identities for our tenant workloads. We’ll deploy simple curl pods in each tenant namespace, each with a unique ServiceAccount.

tenant-workloads.yaml

# tenant-a-resources.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: tenant-a-sa
  namespace: tenant-a
---
apiVersion: v1
kind: Pod
metadata:
  name: tenant-a-client
  namespace: tenant-a
  labels:
    app: tenant-a-client
spec:
  serviceAccountName: tenant-a-sa
  containers:
  - name: curl
    image: curlimages/curl
    command: ["sleep", "3600"]
---
# tenant-b-resources.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: tenant-b-sa
  namespace: tenant-b
---
apiVersion: v1
kind: Pod
metadata:
  name: tenant-b-client
  namespace: tenant-b
  labels:
    app: tenant-b-client
spec:
  serviceAccountName: tenant-b-sa
  containers:
  - name: curl
    image: curlimages/curl
    command: ["sleep", "3600"]

Apply this: kubectl apply -f tenant-workloads.yaml.

Now, we define the network policies. The strategy is:

  1. A default-deny policy on the gql-gateway to block all ingress traffic.
  2. A specific L7 policy to allow ingress traffic that meets our tenant criteria.

cilium-policies.yaml

# 01-default-deny.yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: gql-gateway-deny-all
  namespace: mlops
spec:
  endpointSelector:
    matchLabels:
      app: gql-gateway
  ingress: [] # Empty ingress array means deny all ingress traffic
---
# 02-allow-tenant-access.yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: allow-tenant-access-to-gateway
  namespace: mlops
spec:
  endpointSelector:
    matchLabels:
      app: gql-gateway
  ingress:
  - fromEndpoints:
    # Rule for Tenant A
    - matchLabels:
        "k8s:io.kubernetes.pod.namespace": "tenant-a"
        "k8s:app": "tenant-a-client" # We could also match on service account
    toPorts:
    - ports:
      - port: "4000"
        protocol: TCP
      rules:
        http:
        - method: "POST"
          path: "/graphql"
          headers:
          - "X-Tenant-ID: tenant-a"
  - fromEndpoints:
    # Rule for Tenant B
    - matchLabels:
        "k8s:io.kubernetes.pod.namespace": "tenant-b"
    toPorts:
    - ports:
      - port: "4000"
        protocol: TCP
      rules:
        http:
        - method: "POST"
          path: "/graphql"
          headers:
          - "X-Tenant-ID: tenant-b"

Let’s break down the allow-tenant-access-to-gateway policy. It applies to pods with the label app: gql-gateway. It defines two ingress rules:

  • The first fromEndpoints block allows traffic from pods in the tenant-a namespace that also carry the app: tenant-a-client label.
  • The crucial part is the rules.http section. It specifies that for this traffic to be allowed, it must be an HTTP POST to /graphql and must contain the header X-Tenant-ID: tenant-a.
  • The second block mirrors this logic for tenant-b, matching on the namespace alone.

Apply these policies: kubectl apply -f cilium-policies.yaml. Immediately, all traffic to the gateway should be blocked.

We can now verify enforcement.

Verification inside the tenant-a-client pod:

# Get a shell into the tenant-a pod
kubectl exec -it -n tenant-a tenant-a-client -- sh

# Define the GraphQL query
GQL_QUERY='{"query":"query { experiments(filter: \"tenant-a-project\") { name experiment_id } }"}'

# 1. Correct request: Correct tenant ID header. This should succeed.
curl -X POST \
     -H "Content-Type: application/json" \
     -H "X-Tenant-ID: tenant-a" \
     -d "$GQL_QUERY" \
     http://gql-gateway-svc.mlops:4000/graphql

# Expected Output (if experiments exist): {"data":{"experiments":[{"name":"tenant-a-project-1",...}]}}


# 2. Incorrect request: Wrong tenant ID header. This should be blocked by Cilium.
curl -i -X POST \
     -H "Content-Type: application/json" \
     -H "X-Tenant-ID: tenant-b" \
     -d "$GQL_QUERY" \
     http://gql-gateway-svc.mlops:4000/graphql

# Expected Output: HTTP/1.1 403 Forbidden with an "Access denied" body


# 3. Malformed request: No tenant ID header. This should also be blocked.
curl -i -X POST \
     -H "Content-Type: application/json" \
     -d "$GQL_QUERY" \
     http://gql-gateway-svc.mlops:4000/graphql

# Expected Output: HTTP/1.1 403 Forbidden with an "Access denied" body

Note that the TCP connection itself succeeds: the L4 portion of the policy admits traffic from the tenant-a pod on port 4000, so the handshake completes, and it is the Cilium-managed Envoy proxy that rejects the HTTP request with 403 Access denied when it fails to match the L7 rules. (A connection reset or timeout would instead indicate an L3/L4 drop — for example, traffic from a namespace the policy never mentions.) Running hubble observe --namespace mlops --verdict DROPPED, or watching the Hubble UI, shows these requests as L7 policy drops, providing direct evidence that enforcement happens at the network layer.
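The policies above only govern ingress to the gateway. The same decoupling can be extended to the egress side, so that even a compromised gateway pod can only reach cluster DNS and the MLflow backend. A sketch following the same CiliumNetworkPolicy conventions (labels match the deployments above; the DNS rule assumes Cilium's DNS proxy is available — treat this as a starting point, not a hardened policy):

```yaml
# 03-gateway-egress.yaml (illustrative sketch)
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: gql-gateway-egress
  namespace: mlops
spec:
  endpointSelector:
    matchLabels:
      app: gql-gateway
  egress:
  # Allow DNS lookups via kube-dns
  - toEndpoints:
    - matchLabels:
        "k8s:io.kubernetes.pod.namespace": kube-system
        "k8s:k8s-app": kube-dns
    toPorts:
    - ports:
      - port: "53"
        protocol: UDP
      rules:
        dns:
        - matchPattern: "*"
  # Allow upstream calls to the MLflow backend only
  - toEndpoints:
    - matchLabels:
        app: mlflow-server
    toPorts:
    - ports:
      - port: "5000"
        protocol: TCP
```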

4. The Relay Front-end: Consuming the Secured API

The final piece is the client application. Relay is a perfect fit for a data-intensive dashboard. The most relevant piece of code for this discussion is the Relay Environment setup, where we configure the network layer to inject the required X-Tenant-ID header.

RelayEnvironment.js

import {
  Environment,
  Network,
  RecordSource,
  Store,
} from 'relay-runtime';

// This would typically come from an auth context or environment config
const TENANT_ID = 'tenant-a'; 

async function fetchGraphQL(params, variables) {
  const response = await fetch('http://localhost:8080/graphql', { // Assuming a local proxy for dev
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      // This is the critical part that satisfies the Cilium L7 policy
      'X-Tenant-ID': TENANT_ID, 
    },
    body: JSON.stringify({
      query: params.text,
      variables,
    }),
  });

  if (response.status !== 200) {
      console.error(`Network request failed with status ${response.status}`);
      // In a real app, you'd have more robust error handling
      throw new Error('Failed to fetch data from the GraphQL gateway.');
  }

  return await response.json();
}

export default new Environment({
  network: Network.create(fetchGraphQL),
  store: new Store(new RecordSource()),
});

A component using this environment would look something like this:

ExperimentDashboard.js

import React from 'react';
import { graphql, useLazyLoadQuery } from 'react-relay';

const ExperimentsQuery = graphql`
  query ExperimentDashboardQuery($filter: String) {
    experiments(filter: $filter) {
      experiment_id
      name
      lifecycle_stage
    }
  }
`;

function ExperimentDashboard({ tenantProjectFilter }) {
  // The tenant filter is passed as a GraphQL variable.
  // The tenant identity is handled by the network layer.
  const data = useLazyLoadQuery(ExperimentsQuery, { filter: tenantProjectFilter });

  if (!data || !data.experiments) {
    return <div>Loading experiments... or none found.</div>;
  }

  return (
    <div>
      <h1>Experiments for Filter: {tenantProjectFilter}</h1>
      <ul>
        {data.experiments.map(exp => (
          <li key={exp.experiment_id}>
            <strong>{exp.name}</strong> ({exp.lifecycle_stage})
          </li>
        ))}
      </ul>
    </div>
  );
}

When this component renders, Relay’s network layer automatically adds the X-Tenant-ID header. If the dashboard pod carries the tenant-a identity (the namespace and labels matched by the policy’s fromEndpoints selector), the request is allowed by Cilium. If the same code were somehow deployed under tenant-b’s identity, or sent the wrong header, the request would fail at the CNI level before ever reaching the gateway, preventing any data leakage.
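
One practical consequence on the client: a request rejected by the L7 policy never produces a GraphQL error payload — the Envoy proxy answers with 403 Access denied before the gateway is reached — so the fetch layer can distinguish "policy denied" from "gateway bug". A small classifier sketch (status-code mapping only; the category names are our own):

```javascript
// Map an HTTP status from the gateway endpoint to a coarse error category.
// A 403 here typically means the Cilium L7 policy rejected the request
// (e.g. wrong X-Tenant-ID) before it reached the gateway. Names are illustrative.
function classifyGatewayError(status) {
  if (status === 403) return 'policy-denied';
  if (status === 404) return 'not-found';
  if (status >= 500) return 'gateway-error'; // gateway or MLflow backend failure
  if (status >= 400) return 'client-error';
  return 'ok';
}
```

The fetchGraphQL function above could use this instead of treating every non-200 response uniformly, surfacing a clear "you are not authorized for this tenant" message rather than a generic network error.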

Limitations and Future Considerations

While this architecture provides a robust, decoupled security model, it is not without its own set of trade-offs. The reliance on an HTTP header for tenancy assumes that the client workload (the Relay dashboard pod in this case) is trusted to send the correct header. In our controlled environment, where pod identities are tied to namespaces and service accounts, this is an acceptable risk. For a zero-trust environment, this could be hardened by using a service mesh like Istio or Linkerd in conjunction with Cilium to enforce mTLS and use SPIFFE identities, which are cryptographically verifiable.

Furthermore, L7 policy inspection, even when accelerated by eBPF, introduces a marginal amount of latency compared to pure L3/L4 policies. For our use case—an internal observability dashboard—this is negligible. For high-frequency, low-latency APIs, this would require careful performance testing.

The next evolutionary step for this system is to move beyond simple header matching. We are exploring using Cilium’s integration with Open Policy Agent (OPA) to enforce more complex, attribute-based access control rules. This would allow us to define policies like “allow access only if the experiment tag ‘cost-center’ matches the user’s group membership,” which could be retrieved from an external identity provider. This would bring even finer-grained control, still without writing a single line of authorization code in the gateway application itself.
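
To make that direction concrete, the rule described above reduces to a small attribute check. A sketch in plain JavaScript (in an OPA deployment this logic would live in Rego; the tag name, group shape, and decision interface are all assumptions about a future design, not a working integration):

```javascript
// Illustrative ABAC decision: may this caller see this experiment?
// `experiment.tags` follows the MLflow tag shape ({key, value} pairs);
// `callerGroups` is an assumed list of group names from an identity provider.
function isExperimentVisible(experiment, callerGroups) {
  const tags = experiment.tags || [];
  const costCenter = tags.find((t) => t.key === 'cost-center');
  if (!costCenter) return false; // untagged experiments are hidden by default
  return callerGroups.includes(costCenter.value);
}
```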

