The initial proof-of-concept for our Retrieval-Augmented Generation (RAG) service was straightforward. A single ChromaDB collection, a batch ingestion script, and an API that converted user questions into embeddings and performed a similarity search. It worked well enough for a demo. The moment we had to architect it for multiple customers, however, we hit a wall. The core problem was that ChromaDB, in its current form, has no concept of granular, per-document access control. A simple application-level bug could cause catastrophic data leakage between tenants, a risk no production system can afford.
Our existing stack was non-negotiable: user and tenant metadata resides in Firestore for its scalability and ease of use, authentication is handled by a centralized IAM provider issuing standard JWTs, and the entire workload runs on DigitalOcean Droplets for simplicity and cost-effectiveness. The challenge was to bolt a robust, mandatory security layer onto ChromaDB without fundamentally changing how the application interacts with it.
Forking ChromaDB to add native IAM was out of the question. Running a separate ChromaDB instance or even a collection per tenant felt like a path to operational chaos, especially when projecting to thousands of tenants. The cost and management overhead would become untenable. The only viable path was to intercept and rewrite all communication with the database, enforcing tenancy at a layer just before the data store. This led to the design of a mandatory, tenancy-aware proxy service.
The principle is this: no application service talks to ChromaDB directly. All traffic is routed through a lightweight Go proxy. This proxy's sole responsibility is to validate the incoming JWT, extract a tenant_id claim, and inject a mandatory where filter into every single ChromaDB query to enforce data isolation. This architecture centralizes the security logic, making it auditable and difficult to bypass.
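For illustration (the field names, values, and tenant ID here are hypothetical), a query arriving as {"query_texts": ["refund policy"], "where": {"doc_type": "faq"}} from a caller whose token carries tenant_id "acme-corp" leaves the proxy as {"query_texts": ["refund policy"], "where": {"$and": [{"tenant_id": "acme-corp"}, {"doc_type": "faq"}]}}. The rest of this section builds up exactly that behavior.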
The Architectural Foundation
The network topology on DigitalOcean is critical for this to work. We deploy both the ChromaDB instance and our custom IAM proxy as Docker containers on the same Droplet. They communicate over a private Docker bridge network. Crucially, the ChromaDB container’s port (8000) is only exposed to this internal network. The proxy’s port is the only one exposed to the public internet (or to our internal VPC). Any attempt to bypass the proxy and hit ChromaDB directly from another service will fail at the network level.
graph TD
    subgraph "DigitalOcean Droplet"
        direction LR
        subgraph "Docker Bridge Network"
            ProxyService[IAM Proxy :8080] --> ChromaDBService[ChromaDB :8000]
        end
    end
    subgraph "External Services"
        IAMProvider[IAM Provider]
        FirestoreDB[Firestore]
    end
    Client[Client Application] -- Authenticates --> IAMProvider
    IAMProvider -- Issues JWT --> Client
    Client -- HTTPS Request with JWT --> LB[DO Load Balancer]
    LB --> ProxyService
    ProxyService -- Validates JWT Public Key --> IAMProvider
    ProxyService -- Fetches Tenant Metadata --> FirestoreDB
This setup ensures the proxy is a mandatory checkpoint, not an optional convenience.
The Proxy Implementation in Go
We chose Go for its performance, concurrency model, and strong standard library, making it ideal for a network proxy that needs to be fast and reliable.
1. Project Structure and Dependencies
A minimal Go project structure is sufficient.
/iam-chroma-proxy
|-- /cmd
| |-- /main.go # Application entry point
|-- /internal
| |-- /auth # JWT validation logic
| |-- /config # Configuration management
| |-- /database # Firestore client
| |-- /proxy # Core proxy and query rewriting logic
|-- go.mod
|-- go.sum
|-- Dockerfile
|-- config.yaml
The core dependencies are managed via go.mod:
// go.mod
module github.com/your-org/iam-chroma-proxy
go 1.21
require (
	cloud.google.com/go/firestore v1.14.0
	github.com/MicahParks/keyfunc/v2 v2.1.0
	github.com/gin-gonic/gin v1.9.1
	github.com/golang-jwt/jwt/v5 v5.0.0
	github.com/jellydator/validation v1.1.0
	github.com/stretchr/testify v1.8.4
	gopkg.in/yaml.v3 v3.0.1
)
// ... other transitive dependencies
2. Configuration and Initialization
In a real-world project, hardcoding configuration is a recipe for disaster. We manage settings through a config.yaml file loaded at startup.
# config.yaml
server:
port: "8080"
chromadb:
target_url: "http://chromadb:8000" # Internal Docker network hostname
iam:
# The JWKS endpoint of your identity provider (e.g., Auth0, Cognito, etc.)
jwks_url: "https://your-iam-provider.com/.well-known/jwks.json"
# The audience claim expected in the JWT
audience: "https://api.your-service.com"
# The issuer claim expected in the JWT
issuer: "https://your-iam-provider.com/"
firestore:
project_id: "your-gcp-project-id"
# Used for logging and operational context
service:
name: "iam-chroma-proxy"
version: "1.0.0"
The Go code to load and validate this configuration is boilerplate but necessary for production readiness.
// internal/config/config.go
package config
import (
"os"
"gopkg.in/yaml.v3"
)
type Config struct {
Server struct {
Port string `yaml:"port"`
} `yaml:"server"`
ChromaDB struct {
TargetURL string `yaml:"target_url"`
} `yaml:"chromadb"`
IAM struct {
JWKSURL string `yaml:"jwks_url"`
Audience string `yaml:"audience"`
Issuer string `yaml:"issuer"`
} `yaml:"iam"`
Firestore struct {
ProjectID string `yaml:"project_id"`
} `yaml:"firestore"`
}
func Load(path string) (*Config, error) {
var cfg Config
f, err := os.ReadFile(path)
if err != nil {
return nil, err
}
if err := yaml.Unmarshal(f, &cfg); err != nil {
return nil, err
}
// A common mistake is not validating config values.
// Add validation logic here (e.g., check for empty strings, valid URLs).
return &cfg, nil
}
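The comment above about validating values is worth taking seriously. Here is a minimal sketch of what that could look like, using only the standard library; the specific rules, such as requiring absolute URLs, are assumptions about what "valid" means for this service and are not part of the original code:

// internal/config/validate.go (sketch)
package config

import (
	"fmt"
	"net/url"
)

// Validate performs basic sanity checks on a loaded Config.
// The rules here are illustrative; tighten them for your deployment.
func (c *Config) Validate() error {
	if c.Server.Port == "" {
		return fmt.Errorf("server.port must not be empty")
	}
	for name, raw := range map[string]string{
		"chromadb.target_url": c.ChromaDB.TargetURL,
		"iam.jwks_url":        c.IAM.JWKSURL,
	} {
		if u, err := url.Parse(raw); err != nil || !u.IsAbs() {
			return fmt.Errorf("%s must be an absolute URL, got %q", name, raw)
		}
	}
	if c.IAM.Audience == "" || c.IAM.Issuer == "" {
		return fmt.Errorf("iam.audience and iam.issuer must be set")
	}
	if c.Firestore.ProjectID == "" {
		return fmt.Errorf("firestore.project_id must be set")
	}
	return nil
}

Load can then call cfg.Validate() before returning, so a misconfigured proxy fails at startup instead of at the first request.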
3. The Core: JWT Authentication Middleware
This is the first gate. Every request must present a valid bearer token. We use the golang-jwt/jwt/v5 library to parse and verify the token, with the keyfunc library fetching the provider's public keys from its JWKS endpoint for signature verification. A common pitfall is failing to cache the JWKS response, leading to excessive HTTP requests to the IAM provider; a production system must cache the keys and refresh them on a reasonable interval, which the keyfunc options below handle.
// internal/auth/validator.go
package auth
import (
	"fmt"
	"log/slog"
	"net/http"
	"strings"
	"sync"
	"time"

	"github.com/MicahParks/keyfunc/v2"
	"github.com/gin-gonic/gin"
	"github.com/golang-jwt/jwt/v5"
)
// Define custom claims structure to extract tenant_id
type CustomClaims struct {
TenantID string `json:"tenant_id"`
jwt.RegisteredClaims
}
type JWTValidator struct {
jwks *keyfunc.JWKS
once sync.Once
jwksURL string
audience string
issuer string
}
func NewJWTValidator(jwksURL, audience, issuer string) (*JWTValidator, error) {
return &JWTValidator{
jwksURL: jwksURL,
audience: audience,
issuer: issuer,
}, nil
}
// initJWKS initializes the JWKS key function with caching.
func (v *JWTValidator) initJWKS() {
var err error
options := keyfunc.Options{
RefreshInterval: time.Hour,
RefreshTimeout: 10 * time.Second,
RefreshErrorHandler: func(err error) {
slog.Error("JWKS refresh error", "error", err)
},
}
v.jwks, err = keyfunc.Get(v.jwksURL, options)
if err != nil {
// This is a fatal error on startup. The proxy cannot function without the keys.
panic(fmt.Sprintf("Failed to get JWKS: %v", err))
}
}
// AuthMiddleware is a Gin middleware for JWT validation.
func (v *JWTValidator) AuthMiddleware() gin.HandlerFunc {
	// Initialize the JWKS client exactly once, when the middleware is constructed
	// (normally at startup), so a bad JWKS URL fails fast rather than on the first request.
v.once.Do(v.initJWKS)
return func(c *gin.Context) {
authHeader := c.GetHeader("Authorization")
if authHeader == "" {
c.AbortWithStatusJSON(http.StatusUnauthorized, gin.H{"error": "Authorization header required"})
return
}
parts := strings.Split(authHeader, " ")
if len(parts) != 2 || strings.ToLower(parts[0]) != "bearer" {
c.AbortWithStatusJSON(http.StatusUnauthorized, gin.H{"error": "Invalid Authorization header format"})
return
}
tokenString := parts[1]
claims := &CustomClaims{}
token, err := jwt.ParseWithClaims(tokenString, claims, v.jwks.Keyfunc)
if err != nil {
slog.Warn("JWT parsing failed", "error", err)
c.AbortWithStatusJSON(http.StatusUnauthorized, gin.H{"error": "Invalid token"})
return
}
if !token.Valid {
c.AbortWithStatusJSON(http.StatusUnauthorized, gin.H{"error": "Invalid token signature or claims"})
return
}
		// In a real-world project, you MUST validate issuer and audience.
		// This prevents token substitution attacks.
		if !strings.EqualFold(claims.Issuer, v.issuer) {
			c.AbortWithStatusJSON(http.StatusUnauthorized, gin.H{"error": "Invalid token issuer"})
			return
		}
		// jwt/v5 removed the VerifyAudience helper, so compare the aud claim
		// (which may contain multiple values) explicitly.
		audienceOK := false
		for _, aud := range claims.Audience {
			if aud == v.audience {
				audienceOK = true
				break
			}
		}
		if !audienceOK {
			c.AbortWithStatusJSON(http.StatusUnauthorized, gin.H{"error": "Invalid token audience"})
			return
		}
if claims.TenantID == "" {
c.AbortWithStatusJSON(http.StatusForbidden, gin.H{"error": "Token missing required tenant_id claim"})
return
}
// Store the tenant ID in the context for downstream handlers.
c.Set("tenant_id", claims.TenantID)
c.Next()
}
}
The most critical part here is extracting our custom tenant_id claim and injecting it into the request context. This securely passes the tenant's identity to the next layer: the query rewriter.
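One gap worth noting: the architecture diagram has the proxy fetching tenant metadata from Firestore, and the project layout reserves an /internal/database package for it, but that code is not shown above. Below is a minimal sketch of what that package might contain, assuming a hypothetical "tenants" collection keyed by tenant ID with an "active" boolean field; the auth middleware (or a second middleware behind it) could call IsTenantActive before letting a request proceed.

// internal/database/firestore.go (sketch; collection and field names are assumptions)
package database

import (
	"context"
	"fmt"

	"cloud.google.com/go/firestore"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type TenantStore struct {
	client *firestore.Client
}

func NewTenantStore(ctx context.Context, projectID string) (*TenantStore, error) {
	client, err := firestore.NewClient(ctx, projectID)
	if err != nil {
		return nil, fmt.Errorf("creating firestore client: %w", err)
	}
	return &TenantStore{client: client}, nil
}

// IsTenantActive reports whether the tenant document exists and is marked active.
func (s *TenantStore) IsTenantActive(ctx context.Context, tenantID string) (bool, error) {
	snap, err := s.client.Collection("tenants").Doc(tenantID).Get(ctx)
	if status.Code(err) == codes.NotFound {
		return false, nil
	}
	if err != nil {
		return false, fmt.Errorf("fetching tenant %q: %w", tenantID, err)
	}
	active, _ := snap.Data()["active"].(bool)
	return active, nil
}

Because this adds a Firestore read to every request, the result should be cached in memory with a short TTL in any real deployment.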
4. The Query Rewriting Proxy Handler
This is where the magic happens. We define a handler that catches all requests (/api/v1/*path). It reads the incoming request body, which is a JSON payload destined for ChromaDB, and unmarshals it into a generic map[string]interface{} to avoid coupling the proxy to ChromaDB's specific request structures. Then we manipulate this map to inject our security filter before re-serializing it and forwarding it to the real ChromaDB instance.
The logic for modifying the where filter must be robust, because a naive implementation could be easily bypassed. For instance, if a user provides their own where clause like {"$or": [{"owner": "me"}, {"tenant_id": "another-tenant"}]}, simply adding our filter won't work. The correct approach is to enforce an "$and" condition.
// internal/proxy/handler.go
package proxy
import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log/slog"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"

	"github.com/gin-gonic/gin"
)
type ChromaProxy struct {
target *url.URL
}
func NewChromaProxy(targetURL string) (*ChromaProxy, error) {
u, err := url.Parse(targetURL)
if err != nil {
return nil, fmt.Errorf("invalid target URL: %w", err)
}
return &ChromaProxy{target: u}, nil
}
func (p *ChromaProxy) HandleProxy() gin.HandlerFunc {
proxy := httputil.NewSingleHostReverseProxy(p.target)
// We need to modify the request body, so the default director is not enough.
proxy.Director = func(req *http.Request) {
req.Host = p.target.Host
req.URL.Scheme = p.target.Scheme
req.URL.Host = p.target.Host
}
proxy.ModifyResponse = func(resp *http.Response) error {
// Log errors from ChromaDB for easier debugging.
if resp.StatusCode >= 400 {
slog.Warn("Upstream ChromaDB error",
"status_code", resp.StatusCode,
"request_uri", resp.Request.URL.RequestURI(),
)
}
return nil
}
return func(c *gin.Context) {
tenantID, exists := c.Get("tenant_id")
if !exists {
// This should theoretically never happen if the auth middleware is applied.
c.JSON(http.StatusInternalServerError, gin.H{"error": "Tenant ID not found in context"})
return
}
		// We only need to modify requests that carry a 'where' filter in the body.
		// Because this handler is registered on a wildcard route (/api/v1/*path),
		// c.FullPath() returns the wildcard pattern, so match on the actual request
		// path instead. A more robust implementation would enumerate the specific
		// ChromaDB endpoints; for now, we focus on the main query endpoint.
		if c.Request.Method != http.MethodPost || !strings.HasSuffix(c.Request.URL.Path, "/query") {
			proxy.ServeHTTP(c.Writer, c.Request)
			return
		}
bodyBytes, err := io.ReadAll(c.Request.Body)
if err != nil {
c.JSON(http.StatusBadRequest, gin.H{"error": "Failed to read request body"})
return
}
c.Request.Body.Close() // Important to close the original body
var requestPayload map[string]interface{}
if err := json.Unmarshal(bodyBytes, &requestPayload); err != nil {
c.JSON(http.StatusBadRequest, gin.H{"error": "Invalid JSON body"})
return
}
// The core security logic.
err = injectTenantFilter(requestPayload, tenantID.(string))
if err != nil {
c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
return
}
modifiedBody, err := json.Marshal(requestPayload)
if err != nil {
c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to marshal modified body"})
return
}
// Replace the request body with our modified version.
c.Request.Body = io.NopCloser(bytes.NewBuffer(modifiedBody))
c.Request.ContentLength = int64(len(modifiedBody))
c.Request.Header.Set("Content-Length", fmt.Sprint(len(modifiedBody)))
proxy.ServeHTTP(c.Writer, c.Request)
}
}
// injectTenantFilter modifies the ChromaDB query payload to enforce tenancy.
func injectTenantFilter(payload map[string]interface{}, tenantID string) error {
tenantFilter := map[string]interface{}{"tenant_id": tenantID}
whereClause, exists := payload["where"]
if !exists {
// Case 1: No existing 'where' clause. Simply add our tenant filter.
payload["where"] = tenantFilter
return nil
}
whereClauseMap, ok := whereClause.(map[string]interface{})
if !ok {
return fmt.Errorf("'where' clause is not a valid JSON object")
}
// Case 2: A 'where' clause already exists. We must wrap it with an '$and'.
// This prevents a malicious user from using '$or' to query other tenants' data.
newWhereClause := map[string]interface{}{
"$and": []interface{}{
tenantFilter,
whereClauseMap,
},
}
payload["where"] = newWhereClause
slog.Info("Injected tenant filter into ChromaDB query", "tenant_id", tenantID)
return nil
}
A critical piece of defensive programming is to ensure any existing where clause is wrapped in an $and operation with our mandatory tenant filter. This is non-negotiable.
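The same wrapping should also be applied beyond /query. ChromaDB's v1 API accepts a where clause on its get and delete endpoints as well (verify this against the version you deploy), so leaving those routes un-rewritten would reopen the isolation gap for reads and deletions. A sketch of how the path check in the handler could be broadened, with the endpoint suffixes treated as assumptions:

// needsTenantFilter reports whether a request body should be rewritten.
// The endpoint suffixes are assumptions about the deployed ChromaDB API version.
func needsTenantFilter(method, path string) bool {
	if method != http.MethodPost {
		return false
	}
	for _, suffix := range []string{"/query", "/get", "/delete"} {
		if strings.HasSuffix(path, suffix) {
			return true
		}
	}
	return false
}

The handler would then call needsTenantFilter(c.Request.Method, c.Request.URL.Path) in place of its inline check; injectTenantFilter works unchanged, since all three endpoints use the same where shape.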
5. Unit Testing the Security Logic
Trusting this logic without tests is professional malpractice. We must write unit tests that cover the core injectTenantFilter function, including edge cases.
// internal/proxy/handler_test.go
package proxy
import (
"encoding/json"
"testing"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)
func TestInjectTenantFilter(t *testing.T) {
tenantID := "tenant-123"
testCases := []struct {
name string
inputPayload string
expectedPayload string
expectError bool
}{
{
name: "no existing where clause",
inputPayload: `{"query_texts": ["some query"]}`,
expectedPayload: `{
"query_texts": ["some query"],
"where": {"tenant_id": "tenant-123"}
}`,
},
{
name: "with existing where clause",
inputPayload: `{"query_texts": ["some query"], "where": {"status": "active"}}`,
expectedPayload: `{
"query_texts": ["some query"],
"where": {
"$and": [
{"tenant_id": "tenant-123"},
{"status": "active"}
]
}
}`,
},
{
name: "with malicious $or clause",
inputPayload: `{"query_texts": ["some query"], "where": {"$or": [{"owner": "me"}, {"tenant_id": "other-tenant"}]}}`,
expectedPayload: `{
"query_texts": ["some query"],
"where": {
"$and": [
{"tenant_id": "tenant-123"},
{"$or": [{"owner": "me"}, {"tenant_id": "other-tenant"}]}
]
}
}`,
},
{
name: "invalid where clause type",
inputPayload: `{"where": "not-an-object"}`,
expectError: true,
},
}
for _, tc := range testCases {
t.Run(tc.name, func(t *testing.T) {
var payload map[string]interface{}
err := json.Unmarshal([]byte(tc.inputPayload), &payload)
require.NoError(t, err)
err = injectTenantFilter(payload, tenantID)
if tc.expectError {
assert.Error(t, err)
} else {
assert.NoError(t, err)
var expected map[string]interface{}
err = json.Unmarshal([]byte(tc.expectedPayload), &expected)
require.NoError(t, err)
assert.Equal(t, expected, payload)
}
})
}
}
These tests prove that our rewriting logic is sound and handles the cases we care about, including preventing trivial bypasses.
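For completeness, the cmd/main.go entry point listed in the project structure was never shown. Here is a minimal sketch of how the pieces wire together, with the -config flag name chosen to match the docker-compose command below; the rest is an assumption consistent with the packages above, not the original code.

// cmd/main.go (sketch)
package main

import (
	"flag"
	"log"

	"github.com/gin-gonic/gin"

	"github.com/your-org/iam-chroma-proxy/internal/auth"
	"github.com/your-org/iam-chroma-proxy/internal/config"
	"github.com/your-org/iam-chroma-proxy/internal/proxy"
)

func main() {
	configPath := flag.String("config", "config.yaml", "path to the configuration file")
	flag.Parse()

	cfg, err := config.Load(*configPath)
	if err != nil {
		log.Fatalf("loading config: %v", err)
	}

	validator, err := auth.NewJWTValidator(cfg.IAM.JWKSURL, cfg.IAM.Audience, cfg.IAM.Issuer)
	if err != nil {
		log.Fatalf("creating JWT validator: %v", err)
	}

	chromaProxy, err := proxy.NewChromaProxy(cfg.ChromaDB.TargetURL)
	if err != nil {
		log.Fatalf("creating ChromaDB proxy: %v", err)
	}

	router := gin.Default()
	// Every ChromaDB-bound request passes through authentication, then the rewriting proxy.
	router.Any("/api/v1/*path", validator.AuthMiddleware(), chromaProxy.HandleProxy())

	if err := router.Run(":" + cfg.Server.Port); err != nil {
		log.Fatalf("server error: %v", err)
	}
}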
6. Dockerization for DigitalOcean
The final step is packaging the application for deployment. A multi-stage Dockerfile keeps the final image lean.
# Dockerfile
# ---- Build Stage ----
FROM golang:1.21-alpine AS builder
WORKDIR /app
# Copy go.mod and go.sum files to download dependencies
COPY go.mod go.sum ./
RUN go mod download
# Copy the source code
COPY . .
# Build the application
RUN CGO_ENABLED=0 GOOS=linux go build -o /iam-chroma-proxy ./cmd/main.go
# ---- Final Stage ----
FROM alpine:latest
# It's a good practice to run as a non-root user
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
USER appuser
WORKDIR /home/appuser
# Copy the binary from the builder stage
COPY --from=builder /iam-chroma-proxy .
# Copy configuration
COPY config.yaml .
# Expose the server port
EXPOSE 8080
# Run the application
CMD ["./iam-chroma-proxy"]
For local development and production deployment, a docker-compose.yml file ties everything together.
# docker-compose.yml
version: '3.8'
services:
chromadb:
image: chromadb/chroma:latest
# The key is to NOT expose the port to the host machine,
# keeping it within the Docker network.
# ports:
# - "8000:8000" # DO NOT DO THIS IN PRODUCTION
volumes:
- chromadb_data:/chroma/.chroma/
proxy:
build: .
ports:
- "8080:8080" # This is the only publicly exposed port
environment:
# Pass GCP credentials for Firestore securely
GOOGLE_APPLICATION_CREDENTIALS: /run/secrets/gcp_creds.json
secrets:
- gcp_creds.json
depends_on:
- chromadb
command: ["./iam-chroma-proxy", "-config", "config.yaml"]
volumes:
chromadb_data:
secrets:
gcp_creds.json:
file: ./path/to/your/gcp-credentials.json
This configuration, when deployed to a DigitalOcean Droplet with Docker installed, creates the exact isolated network environment required by our architecture.
Limitations and Future Considerations
This proxy architecture solves the immediate problem of multi-tenant data isolation in ChromaDB, but it is not without its own set of trade-offs and concerns. First, it introduces a single point of failure and a potential performance bottleneck. The Go proxy is fast, but it still adds a network hop and processing overhead to every query. For a high-throughput system, this proxy would need to be horizontally scaled behind a load balancer, and its performance would require continuous monitoring.
Second, the current implementation lacks sophisticated observability. Production-grade code would require structured logging with request tracing, and Prometheus metrics for latency, request rates, and error counts per tenant. This is essential for debugging and capacity planning.
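As a concrete example of the per-tenant instrumentation meant here, a small Gin middleware using the Prometheus Go client (github.com/prometheus/client_golang, an extra dependency not in the go.mod above) might look like this; the metric names and label set are illustrative, and per-tenant labels carry their own cardinality cost:

// internal/metrics/middleware.go (sketch)
package metrics

import (
	"strconv"
	"time"

	"github.com/gin-gonic/gin"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	requestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name: "proxy_request_duration_seconds",
		Help: "Latency of proxied ChromaDB requests.",
	}, []string{"tenant_id", "status"})

	requestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "proxy_requests_total",
		Help: "Total proxied ChromaDB requests.",
	}, []string{"tenant_id", "status"})
)

// Middleware records latency and request counts, labelled by tenant and HTTP status.
func Middleware() gin.HandlerFunc {
	return func(c *gin.Context) {
		start := time.Now()
		c.Next()

		tenant, _ := c.Get("tenant_id")
		tenantID, _ := tenant.(string)
		if tenantID == "" {
			tenantID = "unknown"
		}
		status := strconv.Itoa(c.Writer.Status())

		requestDuration.WithLabelValues(tenantID, status).Observe(time.Since(start).Seconds())
		requestsTotal.WithLabelValues(tenantID, status).Inc()
	}
}

The metrics endpoint itself (promhttp.Handler()) should be exposed only on an internal port, never through the public load balancer.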
Third, this solution is fundamentally a workaround for a missing feature in the underlying database. If a future version of ChromaDB introduces native, robust, role-based access control, this entire service becomes technical debt. Any team implementing such a pattern must keep a close eye on the database’s feature roadmap to avoid maintaining a complex component that is no longer necessary.
Finally, we have not addressed administrative access. A support engineer or system administrator might need to query data across all tenants. This requires a separate, highly secured “backdoor” path that can bypass the tenant injection logic based on a special administrative JWT role, a feature that must be designed and implemented with extreme care.