Building a Multi-Tenant MLOps Feature Store Metadata API with Koa and PostgreSQL


The spreadsheet was the breaking point. Multiple data science teams, each with their own feature engineering pipelines, were tracking their feature definitions in a shared collection of CSVs and Google Sheets. It was a textbook case of organizational scaling failure. Feature names collided, data types were inconsistent between notebooks and production, and there was no ownership or access control. A feature critical to the “fraud” team’s model could be accidentally modified by an intern on the “recommendations” team. We needed a centralized, API-driven control plane for our feature metadata, and it had to be built on a foundation of strict data isolation.

Our initial concept was a system that could programmatically define and manage feature sets, where a “feature set” is a logical grouping of features derived from a single data source (e.g., user_profile_features, transaction_summary_features_7d). The critical, non-negotiable requirement was multi-tenancy. Each data science team would be a “tenant” in our system, and they should operate in complete isolation, as if they were the only user.

This led to our technology selection. For the metadata store, we chose PostgreSQL. Not just for its reliability, but specifically for its Row-Level Security (RLS) feature. A common mistake is to implement multi-tenancy logic entirely within the application layer. This is fragile. A single buggy query or a missed WHERE tenant_id = ? clause can leak data across tenants. RLS pushes the security boundary down into the database itself, making it a much more robust solution. We also needed the flexibility of JSONB to store variable metadata without constant schema migrations.

For the API layer, we selected Koa.js. In a world of feature-rich frameworks, Koa’s minimalist, middleware-centric design was a deliberate choice. It forces a clean separation of concerns and its native async/await syntax makes complex I/O flows, like managing database transactions and performing permission checks, highly readable. We needed a lean, high-performance service, not a monolithic application framework. The workload is I/O-bound (database calls), making Node.js an excellent fit.

Database Schema and Security Foundation

The schema is the backbone of any multi-tenant system. Getting it wrong means a painful migration or, worse, a security breach down the line. Every table that contains tenant-specific data must have a tenant_id column. There are no exceptions.

Here is the foundational schema for our metadata store. We use uuid for primary keys to avoid information leakage through sequential IDs and to simplify distributed systems integration later on.

-- Enable pgcrypto for gen_random_uuid() on PostgreSQL 12 and older;
-- PostgreSQL 13+ provides gen_random_uuid() natively.
CREATE EXTENSION IF NOT EXISTS "pgcrypto";

-- Tenants represent isolated organizational units (e.g., data science teams)
CREATE TABLE tenants (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name TEXT NOT NULL UNIQUE,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
COMMENT ON TABLE tenants IS 'Represents an isolated tenant, like a team or department.';

-- Projects are namespaces within a tenant for organizing feature sets
CREATE TABLE projects (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID NOT NULL REFERENCES tenants(id) ON DELETE CASCADE,
    name TEXT NOT NULL,
    description TEXT,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE(tenant_id, name)
);
COMMENT ON TABLE projects IS 'A logical grouping of feature sets within a tenant.';

-- Feature Sets group related features from a single data source
CREATE TABLE feature_sets (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    project_id UUID NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
    tenant_id UUID NOT NULL REFERENCES tenants(id) ON DELETE CASCADE, -- Denormalized for simpler RLS policies
    name TEXT NOT NULL,
    description TEXT,
    entity_column TEXT NOT NULL, -- The primary key of the data source (e.g., user_id)
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE(project_id, name)
);
COMMENT ON TABLE feature_sets IS 'A collection of features from a single source, linked to an entity.';

-- Feature Definitions describe individual features within a set
CREATE TABLE feature_definitions (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    feature_set_id UUID NOT NULL REFERENCES feature_sets(id) ON DELETE CASCADE,
    tenant_id UUID NOT NULL REFERENCES tenants(id) ON DELETE CASCADE, -- Denormalized for RLS
    name TEXT NOT NULL,
    description TEXT,
    value_type TEXT NOT NULL, -- e.g., 'INT64', 'FLOAT', 'STRING', 'BOOL'
    tags JSONB,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE(feature_set_id, name)
);
COMMENT ON TABLE feature_definitions IS 'Defines a single feature''s schema and metadata.';

-- Indexes are critical for performance, especially on foreign keys and tenant_id
CREATE INDEX idx_projects_tenant_id ON projects(tenant_id);
CREATE INDEX idx_feature_sets_project_id ON feature_sets(project_id);
CREATE INDEX idx_feature_sets_tenant_id ON feature_sets(tenant_id);
CREATE INDEX idx_feature_definitions_feature_set_id ON feature_definitions(feature_set_id);
CREATE INDEX idx_feature_definitions_tenant_id ON feature_definitions(tenant_id);

Notice the denormalized tenant_id on feature_sets and feature_definitions. While technically redundant, it dramatically simplifies RLS policies by removing the need for JOINs in the policy definition, which is a significant performance consideration.
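
For contrast, here is a sketch of the policy we avoided: without the denormalized tenant_id, isolating feature_definitions would require a correlated subquery on every row the policy evaluates.

-- Hypothetical alternative (not used here): without the denormalized column,
-- the feature_definitions policy must join up through feature_sets to
-- projects for every row check, and the WITH CHECK clause would have to
-- repeat the same subquery.
CREATE POLICY tenant_isolation_policy ON feature_definitions
    FOR ALL
    USING (
        feature_set_id IN (
            SELECT fs.id
            FROM feature_sets fs
            JOIN projects p ON p.id = fs.project_id
            WHERE p.tenant_id = get_current_tenant_id()
        )
    );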

The following Mermaid diagram visualizes these relationships:

erDiagram
    tenants {
        UUID id PK
        TEXT name
    }
    projects {
        UUID id PK
        UUID tenant_id FK
        TEXT name
    }
    feature_sets {
        UUID id PK
        UUID project_id FK
        UUID tenant_id FK
        TEXT name
        TEXT entity_column
    }
    feature_definitions {
        UUID id PK
        UUID feature_set_id FK
        UUID tenant_id FK
        TEXT name
        TEXT value_type
        JSONB tags
    }

    tenants ||--o{ projects : "has"
    projects ||--o{ feature_sets : "contains"
    feature_sets ||--o{ feature_definitions : "defines"

With the schema in place, we implemented Row-Level Security. This is the core of our isolation strategy.

First, we enable RLS on the relevant tables.

ALTER TABLE projects ENABLE ROW LEVEL SECURITY;
ALTER TABLE feature_sets ENABLE ROW LEVEL SECURITY;
ALTER TABLE feature_definitions ENABLE ROW LEVEL SECURITY;

Next, we create a policy that restricts all access (SELECT, INSERT, UPDATE, DELETE) to rows matching a runtime setting, which our Koa application will populate for each incoming request.

-- A helper function to get the current tenant ID from the session context.
-- Returns NULL if not set, which will cause policies to fail safely (deny).
CREATE OR REPLACE FUNCTION get_current_tenant_id() RETURNS UUID AS $$
BEGIN
    RETURN current_setting('app.current_tenant_id', true)::UUID;
EXCEPTION
    WHEN OTHERS THEN
        RETURN NULL;
END;
$$ LANGUAGE plpgsql STABLE; -- STABLE lets the planner evaluate it once per statement rather than per row

-- Policy for the 'projects' table
CREATE POLICY tenant_isolation_policy ON projects
    FOR ALL
    USING (tenant_id = get_current_tenant_id())
    WITH CHECK (tenant_id = get_current_tenant_id());

-- Policy for the 'feature_sets' table
CREATE POLICY tenant_isolation_policy ON feature_sets
    FOR ALL
    USING (tenant_id = get_current_tenant_id())
    WITH CHECK (tenant_id = get_current_tenant_id());

-- Policy for the 'feature_definitions' table
CREATE POLICY tenant_isolation_policy ON feature_definitions
    FOR ALL
    USING (tenant_id = get_current_tenant_id())
    WITH CHECK (tenant_id = get_current_tenant_id());

The USING clause filters existing rows for SELECT, UPDATE, and DELETE; the WITH CHECK clause validates new rows being inserted or updated. With this in place, any query that runs without app.current_tenant_id correctly set will see and modify no rows at all. One caveat: superusers, roles with BYPASSRLS, and table owners bypass RLS by default, so the API must connect as a dedicated role that owns none of these tables (or the tables must be altered with FORCE ROW LEVEL SECURITY). With that arrangement, the application layer is incapable of cross-tenant data access, by database design.
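
A minimal sketch of that role setup (names and password are illustrative), plus a quick psql sanity check:

-- Illustrative setup: the API connects as feature_store_api, a role that owns
-- none of the tables and therefore cannot bypass the RLS policies.
CREATE ROLE feature_store_api LOGIN PASSWORD 'change-me';
GRANT USAGE ON SCHEMA public TO feature_store_api;
GRANT SELECT ON tenants TO feature_store_api;
GRANT SELECT, INSERT, UPDATE, DELETE ON projects, feature_sets, feature_definitions TO feature_store_api;

-- Alternatively, enforce RLS even for the table owner:
-- ALTER TABLE projects FORCE ROW LEVEL SECURITY;

-- Quick sanity check in psql, connected as feature_store_api:
BEGIN;
SELECT set_config('app.current_tenant_id', '00000000-0000-0000-0000-000000000000', true);
SELECT count(*) FROM projects; -- returns only that tenant's rows (here: none)
ROLLBACK;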

The Koa Application Structure

Our Koa application needed to be structured for maintainability and testing. A real-world project requires more than a single index.js file.

/feature-store-api
|-- /config
|   |-- default.json
|   |-- production.json
|-- /src
|   |-- /api
|   |   |-- /feature_sets
|   |   |   |-- controller.js
|   |   |   |-- router.js
|   |   |   |-- validator.js
|   |   |-- index.js  // Root router
|   |-- /db
|   |   |-- index.js // Database connection pool
|   |-- /middleware
|   |   |-- errorHandler.js
|   |   |-- logger.js
|   |   |-- tenancy.js
|   |-- app.js
|   |-- server.js
|-- /tests
|   |-- /integration
|   |   |-- feature_sets.test.js
|-- package.json
|-- .env

The configuration is managed by node-config, allowing overrides for different environments. The database connection is managed by slonik, a PostgreSQL client that emphasizes type safety and has excellent transaction support.
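
The src/db/index.js module referenced above boils down to exporting a pool. A minimal sketch, assuming a slonik release where createPool is synchronous (newer versions return a Promise) and an illustrative db.connectionString config key:

// src/db/index.js -- minimal sketch. Assumes a slonik version with a
// synchronous createPool (recent releases return a Promise) and an
// illustrative `db.connectionString` config key.
const { createPool } = require('slonik');
const config = require('config');

// Important: this connection role must NOT own the tables,
// or RLS is silently bypassed.
module.exports = createPool(config.get('db.connectionString'));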

Here’s the core app.js that wires everything together:

// src/app.js
const Koa = require('koa');
const bodyParser = require('koa-bodyparser');
const pinoLogger = require('koa-pino-logger');
const { v4: uuidv4 } = require('uuid');

const rootRouter = require('./api');
const errorHandler = require('./middleware/errorHandler');
const tenancyHandler = require('./middleware/tenancy');

const app = new Koa();

// Add a request ID to every incoming request for traceability
app.use(async (ctx, next) => {
    ctx.state.requestId = ctx.get('X-Request-ID') || uuidv4();
    // Mirror it onto the raw request so pino-http reports it as req.id below.
    ctx.req.id = ctx.state.requestId;
    await next();
});

// Structured logging. Essential for production.
app.use(pinoLogger({
    // Redact sensitive headers
    redact: ['req.headers.authorization', 'req.headers["x-api-key"]'],
    // Add requestId to all log lines
    customProps: (req, res) => ({
        requestId: req.id,
    }),
}));

// Global error handler. Catches errors from downstream middleware.
app.use(errorHandler);

app.use(bodyParser());

// The crucial piece: tenancy middleware
app.use(tenancyHandler);

// Load all API routes
app.use(rootRouter.routes()).use(rootRouter.allowedMethods());

// Default response for unhandled routes
app.use(async (ctx) => {
    ctx.status = 404;
    ctx.body = { error: 'Not Found' };
});

module.exports = app;
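
The errorHandler middleware registered above is worth sketching, since it defines the { error: ... } response shape the integration tests later assert on. A minimal version:

// src/middleware/errorHandler.js -- minimal sketch of the global error handler.
module.exports = async (ctx, next) => {
    try {
        await next();
    } catch (err) {
        // ctx.throw() attaches a status; anything else is an unexpected 500.
        ctx.status = err.status || 500;
        ctx.body = {
            // Hide internal error messages for 5xx responses.
            error: ctx.status >= 500 ? 'Internal Server Error' : err.message,
            requestId: ctx.state.requestId,
            // Joi validation details attached via ctx.throw(400, ..., { details })
            ...(err.details ? { details: err.details } : {}),
        };
        // Surface the error on Koa's error event so it gets logged.
        ctx.app.emit('error', err, ctx);
    }
};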

The Tenancy Middleware: Bridging API and Database

This middleware is the linchpin of the entire architecture. It’s responsible for identifying the tenant from the incoming request and setting the app.current_tenant_id context in the PostgreSQL session for the duration of that request’s database transaction.

For simplicity, we’ll use a static mapping from an X-API-Key header to a tenant_id. In a production system, this would involve a database lookup or verification of a JWT.

// src/middleware/tenancy.js
const { sql } = require('slonik');

// In a real system, this would come from a secure configuration
// or a database table mapping API keys to tenants.
const API_KEY_TENANT_MAP = new Map([
    ['api-key-team-fraud', '...uuid for fraud team...'],
    ['api-key-team-recs', '...uuid for recs team...'],
]);

/**
 * Koa middleware to enforce tenancy.
 * 1. Extracts an API key from the header.
 * 2. Maps the key to a tenant ID.
 * 3. Injects a `setTenantContext` function onto the Koa context (`ctx`).
 * 4. This function, when called with a database connection, sets the
 *    `app.current_tenant_id` setting for the current transaction.
 */
module.exports = async (ctx, next) => {
    const apiKey = ctx.get('x-api-key');

    if (!apiKey) {
        ctx.throw(401, 'API key is required.');
    }

    const tenantId = API_KEY_TENANT_MAP.get(apiKey);

    if (!tenantId) {
        // Use a generic message to avoid leaking information about valid keys.
        ctx.throw(403, 'Invalid API key.');
    }

    // Attach the tenantId to the state for use in controllers if needed.
    ctx.state.tenantId = tenantId;

    // A critical design decision: we don't set the context here directly.
    // Instead, we provide a function that controllers must call inside a
    // transaction, so the setting is scoped to that transaction and cleaned up.
    ctx.state.setTenantContext = async (dbConnection) => {
        try {
            // SET LOCAL cannot take bind parameters, so we use set_config();
            // its third argument (true) makes the setting transaction-local.
            await dbConnection.query(sql`
                SELECT set_config('app.current_tenant_id', ${tenantId}, true);
            `);
        } catch (error) {
            // This could happen if the tenantId is not a valid UUID, for example.
            ctx.log.error({ err: error, tenantId }, 'Failed to set tenant context in database');
            ctx.throw(500, 'Internal server error during context setup.');
        }
    };

    await next();
};

The design here is subtle but important. The middleware itself doesn't execute the context-setting query. Instead, it provides a function, ctx.state.setTenantContext, that the controller calls after it has acquired a database connection from the pool and started a transaction. Because set_config(..., true) is transaction-scoped (the equivalent of SET LOCAL), the setting automatically reverts when the transaction ends. This is safer than setting a session-wide variable, which could leak between requests if connection handling is ever mismanaged.

Implementing a Tenant-Aware Endpoint

Now let’s implement the POST /feature-sets endpoint. It combines validation, database transaction management, and the tenancy context.

First, the validator using joi:

// src/api/feature_sets/validator.js
const Joi = require('joi');

const createFeatureSetSchema = Joi.object({
    name: Joi.string().min(3).max(100).required(),
    projectId: Joi.string().uuid().required(),
    description: Joi.string().optional().allow(''),
    entityColumn: Joi.string().required(),
});

module.exports = {
    createFeatureSetSchema,
};

Next, the controller, which contains the core business logic. This is where the RLS policy pays off. Notice that the project-existence check carries no tenant_id filter—the database applies it automatically—and the WITH CHECK clause guards the INSERT.

// src/api/feature_sets/controller.js
const dbPool = require('../../db');
const { sql, UniqueIntegrityConstraintViolationError } = require('slonik');
const { createFeatureSetSchema } = require('./validator');

async function createFeatureSet(ctx) {
    // 1. Validate input payload
    const { error, value } = createFeatureSetSchema.validate(ctx.request.body);
    if (error) {
        ctx.throw(400, 'Validation failed', { details: error.details });
    }

    const { name, projectId, description, entityColumn } = value;
    const { tenantId, setTenantContext } = ctx.state; // Get tenantId from our middleware

    let newFeatureSet;

    try {
        // 2. Run the operation within a database transaction for atomicity.
        await dbPool.transaction(async (transactionConnection) => {
            // 3. Set the tenant context for this specific transaction.
            // This is the command that enables the RLS policy.
            await setTenantContext(transactionConnection);

            // 4. Check if the parent project exists AND belongs to the current tenant.
            // If the project exists but for another tenant, RLS will make this query
            // return 0 rows, causing the check to fail correctly.
            const projectExists = await transactionConnection.exists(sql`
                SELECT 1 FROM projects WHERE id = ${projectId}
            `);

            if (!projectExists) {
                ctx.throw(404, `Project with ID ${projectId} not found.`);
            }

            // 5. Insert the new feature set.
            // The pitfall to avoid: The application code *could* forget to pass `tenantId`.
            // But because of our RLS `WITH CHECK` clause, the database would reject
            // the INSERT if the tenant_id column didn't match get_current_tenant_id().
            // This is defense in depth.
            newFeatureSet = await transactionConnection.one(sql`
                INSERT INTO feature_sets (project_id, tenant_id, name, description, entity_column)
                VALUES (${projectId}, ${tenantId}, ${name}, ${description || null}, ${entityColumn})
                RETURNING id, name, project_id, created_at;
            `);
        });

        // 6. Respond on success.
        ctx.status = 201;
        ctx.body = newFeatureSet;

    } catch (error) {
        // Slonik wraps PostgreSQL unique violations (error code 23505)
        // in its own error class, so check the class rather than error.code.
        if (error instanceof UniqueIntegrityConstraintViolationError) {
            ctx.throw(409, `A feature set with the name "${name}" already exists in this project.`);
        }
        // Re-throw other errors to be caught by the global error handler.
        throw error;
    }
}

module.exports = {
    createFeatureSet,
};
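
Reads follow the same shape. Here is a sketch of a list handler (not wired into the router below) showing that no explicit tenant filter is needed:

// Also in src/api/feature_sets/controller.js -- sketch of a tenant-scoped read.
async function listFeatureSets(ctx) {
    const { setTenantContext } = ctx.state;

    // dbPool.transaction resolves with the handler's return value.
    const featureSets = await dbPool.transaction(async (transactionConnection) => {
        await setTenantContext(transactionConnection);
        // No tenant_id predicate: RLS restricts the result to the current tenant.
        return transactionConnection.any(sql`
            SELECT id, project_id, name, entity_column, created_at
            FROM feature_sets
            ORDER BY created_at DESC;
        `);
    });

    ctx.body = { featureSets };
}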

Finally, the router ties it all together:

// src/api/feature_sets/router.js
const Router = require('@koa/router');
const controller = require('./controller');

const router = new Router({
    prefix: '/feature-sets'
});

router.post('/', controller.createFeatureSet);
// ... other routes (GET, PUT, DELETE) would follow a similar pattern

module.exports = router;
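
The root router in src/api/index.js simply mounts each resource router; a minimal sketch:

// src/api/index.js -- root router sketch: mounts each resource router.
const Router = require('@koa/router');
const featureSetsRouter = require('./feature_sets/router');

const rootRouter = new Router();
rootRouter.use(featureSetsRouter.routes(), featureSetsRouter.allowedMethods());

module.exports = rootRouter;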

Testing the Tenancy Boundary

The true test of this system is proving that one tenant cannot access another’s data. An integration test is the best way to do this. We use supertest to make HTTP requests to our app and jest as the test runner.

// tests/integration/feature_sets.test.js
const request = require('supertest');
const app = require('../../src/app');
const dbPool = require('../../src/db');
const { sql } = require('slonik');

describe('Feature Sets API - Multi-tenancy', () => {
    let fraudTenantId, recsTenantId;
    let fraudProjectId;

    // Use fake API keys from our middleware config
    const FRAUD_API_KEY = 'api-key-team-fraud';
    const RECS_API_KEY = 'api-key-team-recs';

    beforeAll(async () => {
        // Set up tenants and a project for the "fraud" team. These setup queries
        // must run as a role that bypasses RLS (e.g., the table owner). In a real
        // suite, API_KEY_TENANT_MAP would also be seeded with these generated IDs.
        await dbPool.query(sql`DELETE FROM tenants;`);
        const fraudTenant = await dbPool.one(sql`INSERT INTO tenants (name) VALUES ('fraud-team') RETURNING id;`);
        const recsTenant = await dbPool.one(sql`INSERT INTO tenants (name) VALUES ('recs-team') RETURNING id;`);
        fraudTenantId = fraudTenant.id;
        recsTenantId = recsTenant.id;

        const fraudProject = await dbPool.one(sql`
            INSERT INTO projects (tenant_id, name) VALUES (${fraudTenantId}, 'fraud_detection_v1') RETURNING id;
        `);
        fraudProjectId = fraudProject.id;
    });

    afterAll(async () => {
        await dbPool.end();
    });

    it('should allow fraud team to create a feature set in their own project', async () => {
        const response = await request(app.callback())
            .post('/feature-sets')
            .set('x-api-key', FRAUD_API_KEY)
            .send({
                name: 'user_transaction_aggregates',
                projectId: fraudProjectId,
                entityColumn: 'user_id',
            });

        expect(response.status).toBe(201);
        expect(response.body.name).toBe('user_transaction_aggregates');
        expect(response.body.project_id).toBe(fraudProjectId);
    });

    it("should FORBID recs team from creating a feature set in the fraud team's project", async () => {
        // This is the critical test. The recs team is trying to use the fraud team's project ID.
        // Our controller logic should fail with a 404 because the project lookup will
        // return no rows due to the RLS policy.
        const response = await request(app.callback())
            .post('/feature-sets')
            .set('x-api-key', RECS_API_KEY) // Different API Key
            .send({
                name: 'malicious_feature_set',
                projectId: fraudProjectId, // Fraud team's project ID
                entityColumn: 'user_id',
            });

        expect(response.status).toBe(404);
        expect(response.body.error).toContain(`Project with ID ${fraudProjectId} not found.`);
    });
});
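
These tests need a real PostgreSQL instance and must not run in parallel, since they share tables. The relevant package.json scripts might look like this (illustrative):

"scripts": {
    "start": "node src/server.js",
    "test": "NODE_ENV=test jest --runInBand"
}

The --runInBand flag forces Jest to run test files serially, avoiding cross-file interference in the shared database.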

This test explicitly confirms our security model. When the recommendations team tries to create a feature set in the fraud team’s project, the projectExists check in our controller fails. It doesn’t find the project, not because it doesn’t exist, but because the RLS policy filters it out of the query result for that tenant’s database session. The application remains blissfully unaware of the other tenant’s data, which is exactly the behavior we designed for.

This architecture provides a robust control plane for a multi-tenant MLOps platform. The combination of Koa’s clean middleware pattern and PostgreSQL’s powerful Row-Level Security creates a system with defense-in-depth, where security is not just an application-layer concern but a fundamental property of the data store itself.

However, this solution only addresses the metadata and control plane. The high-throughput data plane—the actual serving of feature vectors at low latency—presents a different set of challenges. This typically involves syncing data from offline sources (like Snowflake or BigQuery) into an online store (like Redis, DynamoDB, or a specialized feature store database). The consistency between the metadata managed by this Koa API and the data in the online store becomes a critical distributed systems problem to solve. Furthermore, while API key authentication is sufficient for service-to-service communication, integrating with an enterprise identity provider via OIDC for user-facing management UIs would be a necessary next step for broader adoption. The current RLS implementation also assumes a flat hierarchy within a tenant; implementing more granular role-based access control (e.g., viewer vs. editor) would require extending the policies and application logic.
