Implementing a Dynamic Multi-Tenant GraphQL Layer on AWS Lambda with etcd for Real-Time Policy Propagation


The core operational friction was latency—not of the API, but of our own development cycle. Our multi-tenant GraphQL API, running on AWS Lambda, served dozens of tenants from a single deployment artifact. The architecture was clean, but our configuration management was brittle. Every time a tenant needed a new feature flag enabled, a rate limit adjusted, or a data access policy tweaked, it required a full-stack deployment. This process, including CI/CD pipeline execution, could take upwards of fifteen minutes. This wasn’t just inefficient; it was a business constraint, forcing us to bundle configuration changes with code releases, increasing risk and slowing down our ability to respond to customer needs.

Our initial mandate was simple: reduce the configuration propagation time from minutes to seconds. We needed a mechanism to dynamically alter the behavior of our GraphQL resolvers for any given tenant, live, without a single line of code being redeployed.

The first path we explored was AWS-native services. AWS AppConfig and AWS Systems Manager Parameter Store seemed like obvious choices: managed, scalable, and designed for exactly this purpose. However, from the Lambda’s perspective they are fundamentally poll-based. We could cache configurations, but we’d still face a trade-off between cache TTL (and thus propagation delay) and the cost of frequent API calls to fetch updated parameters. We were aiming for near-instantaneous updates, and a 60-second polling interval felt like a compromise, not a solution.

This led us to re-evaluate the problem. We didn’t just need a key-value store; we needed a distributed coordination service with a robust “watch” capability. Our Lambda functions needed to be notified of a change, not ask for it repeatedly. This is where etcd entered the conversation. In a real-world project, introducing a stateful component like etcd into a serverless architecture is a significant decision. It adds operational overhead. But etcd’s watch API is purpose-built for this exact scenario: reliably broadcasting changes to a fleet of distributed clients. The potential win—sub-second, push-based configuration updates—was compelling enough to warrant a proof-of-concept.

The primary technical challenge became clear: how to efficiently manage connections from ephemeral, stateless Lambda functions to a stateful etcd cluster. A naive implementation would create and tear down a TCP connection on every single invocation, a performance disaster. The solution had to leverage Lambda’s execution context reuse, maintaining a client instance across “warm” invocations while being resilient to cold starts.

Here’s the architecture we settled on:

graph TD
    subgraph AWS Cloud
        API_GW[API Gateway] --> Lambda[Node.js Lambda Function]
        Lambda -->|1. Get Tenant Config| ETCD_Cluster[etcd Cluster on EC2/EKS]
        Lambda -->|2. Query Tenant Data| MongoDB[MongoDB Atlas]
    end

    subgraph Developer/Operator
        CLI[Operator CLI] -- etcdctl put --> ETCD_Cluster
    end

    Client[GraphQL Client] -- HTTPS Request --> API_GW

    style Lambda fill:#FF9900
    style ETCD_Cluster fill:#466bb0,stroke:#333,stroke-width:2px

1. Defining the Tenant Configuration Schema in etcd

Before writing any application code, we established a clear, hierarchical key structure for storing tenant policies. This is critical for maintainability. A flat key space becomes unmanageable quickly.

Our chosen convention was /tenants/{tenant_id}/config. The value stored at this key is a JSON string containing all dynamically configurable parameters for that tenant.
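
One practical payoff of the hierarchical keyspace is that prefix reads become trivial. As a quick illustration (a sketch using the etcd3 Node client, where listConfiguredTenants is a hypothetical helper, not part of our codebase), enumerating every tenant that has a config is a single prefix query:

// Sketch: list every tenant that currently has a config entry under /tenants/.
// `listConfiguredTenants` is a hypothetical helper shown for illustration only.
const { Etcd3 } = require('etcd3');

async function listConfiguredTenants(client) {
    // Returns e.g. { '/tenants/acme-corp/config': '{...json...}', ... }
    const entries = await client.getAll().prefix('/tenants/').strings();
    return Object.keys(entries)
        .map((key) => key.split('/')[2]) // key format: /tenants/{tenant_id}/config
        .filter((id, i, ids) => ids.indexOf(id) === i); // de-duplicate, just in case
}

// Usage (endpoint is a placeholder):
// const tenants = await listConfiguredTenants(new Etcd3({ hosts: ['http://127.0.0.1:2379'] }));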

A typical policy object looks like this:

{
  "tenantId": "acme-corp",
  "status": "ACTIVE",
  "rateLimit": {
    "queriesPerMinute": 5000,
    "enabled": true
  },
  "featureFlags": {
    "betaFeatureX": true,
    "newReportingDashboard": false
  },
  "queryConstraints": {
    "maxDepth": 8,
    "maxComplexity": 100
  },
  "schemaVisibility": {
    "blockedFields": [
      "User.internalNotes",
      "Invoice.auditTrail"
    ]
  }
}

To manage these configurations, we use the standard etcdctl command-line tool. For instance, updating the rate limit for the tenant acme-corp is a short read-modify-write sequence:

# First, get the current config
CURRENT_CONFIG_JSON=$(etcdctl get /tenants/acme-corp/config --print-value-only)

# Use a tool like `jq` to modify it safely
NEW_CONFIG_JSON=$(echo "$CURRENT_CONFIG_JSON" | jq '.rateLimit.queriesPerMinute = 6000')

# Write the updated config back to etcd
etcdctl put /tenants/acme-corp/config "$NEW_CONFIG_JSON"

This simple, scriptable interface forms the basis of our DevOps tooling for tenant management. The put itself is atomic and immediately visible to any client watching that key; the surrounding read-modify-write sequence is not, so two concurrent edits could clobber each other, which a transaction-guarded write (sketched below) avoids.
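
A minimal sketch of such a compare-and-swap update using the etcd3 Node client. updateTenantConfig and mutate are hypothetical names for illustration, not part of our actual tooling:

// Sketch: compare-and-swap update of a tenant config. Only writes if the value
// has not changed since we read it; returns false so the caller can retry.
async function updateTenantConfig(client, tenantId, mutate) {
    const key = `/tenants/${tenantId}/config`;

    const current = await client.get(key).string();
    if (current === null) {
        throw new Error(`No config found for tenant ${tenantId}`);
    }

    const next = JSON.stringify(mutate(JSON.parse(current)));

    const result = await client
        .if(key, 'Value', '==', current)     // guard: value unchanged since the read
        .then(client.put(key).value(next))   // then: write the new config
        .commit();

    return result.succeeded;
}

// Usage (hypothetical):
// await updateTenantConfig(client, 'acme-corp', (cfg) => {
//     cfg.rateLimit.queriesPerMinute = 6000;
//     return cfg;
// });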

2. Building a Resilient etcd Client in AWS Lambda

This is the most critical part of the implementation. The code must be structured to reuse the etcd client and its underlying TCP connection across invocations. In Node.js, this means initializing the client outside the main handler function.

Here is the core structure of our etcd configuration service module. Note the use of an in-memory cache to avoid hitting etcd on every single warm invocation for the same tenant.

config/etcd-service.js

const { Etcd3 } = require('etcd3');
const { LRUCache } = require('lru-cache');

// --- Initialization (runs once per container initialization, i.e., on cold start) ---

// A common mistake is to place this inside the handler.
// By placing it here, the client and cache persist between warm invocations.
let etcdClient;
const tenantConfigCache = new LRUCache({
    max: 100, // Max number of tenant configs to cache
    ttl: 1000 * 60 * 5, // 5-minute TTL shown for illustration; see the propagation note at the end of this module
});

// Fail fast with a clear message if the endpoints were never configured.
if (!process.env.ETCD_ENDPOINTS) {
    throw new Error('ETCD_ENDPOINTS environment variable is not set');
}
const ETCD_ENDPOINTS = process.env.ETCD_ENDPOINTS.split(',');

/**
 * Initializes and returns a singleton Etcd3 client instance.
 * Handles lazy initialization to avoid creating a client if the module is imported
 * but not used.
 */
function getClient() {
    if (!etcdClient) {
        console.log('Initializing new etcd client...');
        try {
            etcdClient = new Etcd3({
                hosts: ETCD_ENDPOINTS,
                // Production-grade configuration should include authentication and TLS
                // auth: {
                //     username: process.env.ETCD_USERNAME,
                //     password: process.env.ETCD_PASSWORD,
                // },
                // credentials: {
                //     rootCertificate: Buffer.from(process.env.ETCD_CA_CERT, 'base64'),
                //     privateKey: Buffer.from(process.env.ETCD_CLIENT_KEY, 'base64'),
                //     certChain: Buffer.from(process.env.ETCD_CLIENT_CERT, 'base64'),
                // }
            });
            console.log('Etcd client initialized successfully.');
        } catch (error) {
            console.error('Failed to create etcd client', error);
            // In a real-world scenario, you might want to throw here to fail the cold start
            // and force a retry, preventing a misconfigured container from running.
            throw new Error('ETCD_CLIENT_INITIALIZATION_FAILED');
        }
    }
    return etcdClient;
}

/**
 * Fetches the configuration for a given tenant ID.
 * It first checks an in-memory LRU cache. If not found or expired,
 * it fetches from etcd and caches the result.
 *
 * @param {string} tenantId - The ID of the tenant.
 * @returns {Promise<object | null>} The parsed tenant configuration object or null if not found.
 */
async function getTenantConfig(tenantId) {
    const cacheKey = `config:${tenantId}`;

    if (tenantConfigCache.has(cacheKey)) {
        console.log(`[Cache HIT] for tenant: ${tenantId}`);
        return tenantConfigCache.get(cacheKey);
    }

    console.log(`[Cache MISS] for tenant: ${tenantId}. Fetching from etcd.`);

    const client = getClient();
    const configKey = `/tenants/${tenantId}/config`;

    try {
        const configValue = await client.get(configKey).string();

        if (configValue === null) {
            console.warn(`No configuration found in etcd for tenant: ${tenantId}`);
            // Cache the "not found" result to prevent hammering etcd for non-existent tenants
            tenantConfigCache.set(cacheKey, null);
            return null;
        }

        const config = JSON.parse(configValue);
        tenantConfigCache.set(cacheKey, config);
        console.log(`Successfully fetched and cached config for tenant: ${tenantId}`);
        return config;

    } catch (error) {
        console.error(`Failed to fetch config for tenant ${tenantId} from etcd`, error);
        // Fallback strategy: On etcd failure, we could return a default config
        // or re-throw to fail the request. Failing fast is often safer.
        throw new Error(`ETCD_FETCH_FAILED: ${error.message}`);
    }
}

// Although we opted for a polling-with-cache approach for simplicity in Lambda,
// a more advanced implementation could use the `watch` functionality.
// This is complex due to Lambda's lifecycle but could be achieved by managing
// the watcher process outside the handler. For this implementation, we decided
// the TTL-based cache was a pragmatic first step.
// We accepted a potential propagation delay of up to the cache TTL in exchange for
// architectural simplicity. The "real-time" aspect is achieved by setting a very low TTL
// (e.g., 5 seconds) or, more practically, by having a mechanism to invalidate the cache.

module.exports = { getTenantConfig };

The pitfall here is error handling. What happens if the etcd cluster is unreachable during a Lambda invocation? Our code throws an error, which will result in a 5xx response to the client. This is the correct “fail-fast” approach. An alternative could be to fall back to a default, system-wide configuration, but that can introduce subtle bugs if a tenant suddenly gets permissions they shouldn’t have.
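
For completeness, here is roughly what the watch-based alternative mentioned in the module's closing comment could look like. This is only a sketch: startConfigWatcher is a hypothetical function, and because Lambda freezes the execution environment between invocations, a watcher only receives events while a request is in flight, which is exactly why we stuck with the TTL cache.

// Sketch: watch-based cache invalidation (hypothetical, not part of our deployed code).
// Reuses the etcd client and LRU cache from config/etcd-service.js.
async function startConfigWatcher(client, tenantConfigCache) {
    // Watch every key under /tenants/ and drop the cached entry when it changes,
    // so the next getTenantConfig() call re-fetches the fresh value.
    const watcher = await client.watch().prefix('/tenants/').create();

    watcher.on('put', (kv) => {
        const key = kv.key.toString();               // e.g. /tenants/acme-corp/config
        const tenantId = key.split('/')[2];
        tenantConfigCache.delete(`config:${tenantId}`);
    });

    watcher.on('delete', (kv) => {
        const tenantId = kv.key.toString().split('/')[2];
        tenantConfigCache.delete(`config:${tenantId}`);
    });

    return watcher; // caller should watcher.cancel() on shutdown
}

In practice, a watcher like this belongs in a long-lived process (a sidecar or a small notifier service) rather than inside a Lambda container.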

3. Integrating Dynamic Policies into the GraphQL Layer

With the ability to fetch tenant configuration, the next step is to enforce these policies within the GraphQL execution pipeline. We used Apollo Server, and its middleware/plugin system is perfect for this. We created a custom plugin that hooks into the request lifecycle.

The plugin performs several key actions:

  1. Extracts the tenantId from the request context (populated by an upstream JWT authorizer).
  2. Fetches the tenant’s configuration using our etcd-service.
  3. Injects the configuration into the GraphQL context object, making it available to all resolvers.
  4. Performs pre-execution checks, like blocking disabled tenants or validating query depth.

graphql/apollo-plugins.js

const { getTenantConfig } = require('../config/etcd-service');
const { GraphQLError } = require('graphql');

const tenancyEnforcementPlugin = {
    async requestDidStart(requestContext) {
        const { contextValue } = requestContext;

        // Assumes an upstream authorizer has validated a token and placed tenantId on the context.
        const tenantId = contextValue.tenantId;
        if (!tenantId) {
            throw new GraphQLError('Missing tenant identifier.', {
                extensions: { code: 'UNAUTHENTICATED' },
            });
        }

        console.log(`Request started for tenant: ${tenantId}`);

        // Fetch config from our service (which uses a cache).
        const tenantConfig = await getTenantConfig(tenantId);

        if (!tenantConfig || tenantConfig.status !== 'ACTIVE') {
            throw new GraphQLError(`Tenant '${tenantId}' is inactive or not configured.`, {
                extensions: { code: 'FORBIDDEN' },
            });
        }
        
        // --- Inject config into context for resolver-level access ---
        contextValue.tenantConfig = tenantConfig;
        
        return {
            async didResolveOperation(context) {
                // This hook runs after the query is parsed and validated, but before execution.
                // It's a great place for checks based on query structure.
                
                const { queryConstraints } = tenantConfig;
                if (!queryConstraints) return;

                // Example: Query Depth Validation
                const queryDepth = getQueryDepth(context.document);
                if (queryDepth > queryConstraints.maxDepth) {
                    throw new GraphQLError(
                        `Query has depth of ${queryDepth}, which exceeds the maximum allowed depth of ${queryConstraints.maxDepth} for this tenant.`,
                        { extensions: { code: 'QUERY_TOO_DEEP' } }
                    );
                }
            },
            
            async executionDidStart(executionRequestContext) {
                // This hook allows modifying the execution itself, which could be used
                // for more advanced field-level access control.
            }
        };
    },
};

/**
 * A utility to calculate the depth of a GraphQL query AST.
 * This is a simplified implementation for demonstration.
 * Production libraries like `graphql-depth-limit` should be used.
 */
function getQueryDepth(document) {
    let maxDepth = 0;
    const selectionSets = document.definitions[0].selectionSet.selections;

    function calculateDepth(selections, currentDepth) {
        maxDepth = Math.max(maxDepth, currentDepth);
        for (const selection of selections) {
            if (selection.selectionSet) {
                calculateDepth(selection.selectionSet.selections, currentDepth + 1);
            }
        }
    }
    
    calculateDepth(selectionSets, 1);
    return maxDepth;
}


module.exports = { tenancyEnforcementPlugin };
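
The empty executionDidStart hook hints at field-level control. One lighter-weight way to honour the schemaVisibility.blockedFields policy from the etcd config is a small resolver wrapper. withFieldVisibility below is a hypothetical helper, not something from our codebase, and blanking the field to null is just one possible behaviour:

// Sketch: resolver wrapper that enforces schemaVisibility.blockedFields.
// `withFieldVisibility` is a hypothetical helper for illustration.
function withFieldVisibility(resolve) {
    return (parent, args, context, info) => {
        const blocked = context.tenantConfig?.schemaVisibility?.blockedFields || [];
        const fieldPath = `${info.parentType.name}.${info.fieldName}`; // e.g. "User.internalNotes"

        if (blocked.includes(fieldPath)) {
            return null; // or throw a GraphQLError, depending on the desired behaviour
        }
        return resolve(parent, args, context, info);
    };
}

// Usage in a resolver map (hypothetical):
// User: {
//     internalNotes: withFieldVisibility((user) => user.internalNotes),
// },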

Now, in our main Lambda handler, we assemble the Apollo Server with this plugin.

index.js (Lambda Handler)

const { ApolloServer } = require('@apollo/server');
const { startServerAndCreateLambdaHandler, handlers } = require('@as-integrations/aws-lambda');
const { tenancyEnforcementPlugin } = require('./graphql/apollo-plugins');
const typeDefs = require('./graphql/schema');
const resolvers = require('./graphql/resolvers');

const server = new ApolloServer({
    typeDefs,
    resolvers,
    plugins: [
        tenancyEnforcementPlugin,
        // Other plugins like logging, error reporting, etc.
    ],
});

// The context function is called for every request.
// It's where we'd extract the tenantId from the authorizer context.
const context = async ({ event }) => {
    // In a real API Gateway setup with a JWT authorizer, claims are on `event.requestContext.authorizer.claims`
    const tenantId = event.requestContext.authorizer?.claims?.tenantId || 'default-tenant'; // Fallback for testing
    return {
        tenantId,
        // This is where our plugin will add `tenantConfig`
    };
};

exports.graphqlHandler = startServerAndCreateLambdaHandler(
    server,
    // Request handler for API Gateway v1 (REST) proxy events; HTTP APIs (v2) would use
    // handlers.createAPIGatewayProxyEventV2RequestHandler() instead.
    handlers.createAPIGatewayProxyEventRequestHandler(),
    {
        context,
    },
);

4. Enforcing Data Isolation in MongoDB

Dynamic configuration is only one half of multi-tenancy. The other, more critical half is strict data isolation. A common mistake is to rely on developers to manually add a tenantId filter to every single database query. This is error-prone and a massive security risk.

We enforce this at the data access layer. Using Mongoose with MongoDB, we created a “tenanted” model factory that automatically applies the tenantId filter to every query (find, findOne, update, delete, etc.) using query middleware.

models/tenanted-model.js

const mongoose = require('mongoose');

/**
 * Creates a Mongoose schema and model that automatically enforces tenancy.
 * It adds a `tenantId` field to the schema and applies middleware to all
 * query and document operations to ensure data isolation.
 *
 * @param {string} modelName - The name of the Mongoose model.
 * @param {mongoose.SchemaDefinition} schemaDefinition - The schema definition object.
 * @returns {mongoose.Model} The tenanted Mongoose model.
 */
function createTenantedModel(modelName, schemaDefinition) {
    const schema = new mongoose.Schema({
        ...schemaDefinition,
        tenantId: {
            type: String,
            required: true,
            index: true,
        },
    }, { timestamps: true });

    // `this` is the Mongoose Query object. The tenantId is smuggled in via the
    // query options (the third argument to find/findOne/update/etc.).
    const applyTenantScope = function(next) {
        const tenantId = this.getOptions().tenantId;

        if (!tenantId) {
            // This is a critical failure. We MUST NOT proceed.
            // Throw an error to prevent cross-tenant data access.
            return next(new Error('FATAL: Attempted to query a tenanted model without a tenantId in context.'));
        }

        this.where({ tenantId });
        next();
    };

    // Apply middleware to all relevant query hooks. Aggregations would need their own
    // `pre('aggregate')` hook and are not covered here.
    const queryHooks = ['find', 'findOne', 'findOneAndDelete', 'findOneAndRemove', 'findOneAndUpdate', 'update', 'updateOne', 'updateMany', 'countDocuments', 'deleteMany', 'deleteOne'];
    queryHooks.forEach(hook => {
        schema.pre(hook, applyTenantScope);
    });

    // Also handle document creation
    schema.pre('save', function(next) {
        // `this` is the document being saved.
        const tenantId = this.tenantId;

        if (!tenantId) {
            // This is a developer error and must be caught.
            // The tenantId should be injected by the service layer.
            return next(new Error('FATAL: Attempted to save a document without a tenantId.'));
        }
        next();
    });

    return mongoose.model(modelName, schema);
}

module.exports = { createTenantedModel };

Now, defining and using a tenanted model becomes safe by default.

models/user.model.js

const { createTenantedModel } = require('./tenanted-model');

const UserSchemaDefinition = {
    name: String,
    email: { type: String, unique: false }, // Note: unique index must be composite with tenantId
    role: String,
};

const User = createTenantedModel('User', UserSchemaDefinition);
// Create a compound index for emails to be unique per tenant
User.schema.index({ email: 1, tenantId: 1 }, { unique: true });

module.exports = User;
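
As a quick sanity check (a hypothetical snippet, not part of the codebase), the middleware rejects any query that omits the tenantId option:

// Sketch: both calls go through the pre-query middleware defined in tenanted-model.js.
const User = require('./user.model');

async function demo(someUserId) {
    // Scoped query: the middleware adds { tenantId: 'acme-corp' } to the filter.
    await User.findOne({ _id: someUserId }, null, { tenantId: 'acme-corp' });

    // Unscoped query: the middleware calls next(new Error(...)) and the promise rejects.
    await User.findOne({ _id: someUserId })
        .catch((err) => console.error(err.message)); // "FATAL: Attempted to query a tenanted model..."
}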

In the GraphQL resolvers, we pass the tenantId from the context into the query options.

graphql/resolvers.js

const User = require('../models/user.model');
const { GraphQLError } = require('graphql');

const resolvers = {
    Query: {
        // Example resolver that is now tenant-aware.
        me: async (_, __, context) => {
            const { userId, tenantId } = context; // tenantId is set by the context function; userId is assumed to come from the same auth claims

            // The tenantId is passed in the options object. Our middleware will catch it.
            const user = await User.findOne({ _id: userId }, null, { tenantId });
            return user;
        },
        // A resolver that uses a feature flag from etcd.
        betaFeatureData: async (_, __, context) => {
            const { tenantConfig, tenantId } = context;

            if (!tenantConfig.featureFlags.betaFeatureX) {
                throw new GraphQLError('This feature is not enabled for your organization.');
            }
            
            // Proceed to fetch data for the beta feature, correctly scoped.
            // const data = await SomeModel.find({}, null, { tenantId });
            return { status: "OK", data: "..." };
        }
    },
};

module.exports = resolvers;

This combination of etcd-driven dynamic configuration and a strict, code-enforced data isolation layer provided the robust multi-tenant architecture we needed. An operator can now run etcdctl put ... to disable a feature for a tenant, and the change reaches that tenant’s traffic as soon as each warm container’s cached entry expires (or immediately after a cold start), with no deployment required.

The final system is highly responsive to operational needs. However, the introduction of an etcd cluster is not without its costs. It is a critical, stateful component that we are now responsible for managing, monitoring, and backing up. Its availability is paramount; if etcd is down, our API cannot fetch configurations and will fail open or closed depending on the chosen error-handling strategy. The current implementation relies on a simple in-memory cache with a TTL, which means configuration changes are not truly instantaneous but are propagated within the TTL window. For true real-time push, a watch-based mechanism would be superior, but it presents significant implementation complexity within the Lambda execution model, potentially requiring a sidecar or external notifier service to manage persistent connections. This architecture represents a trade-off: we gained immense agility in configuration management at the cost of increased operational complexity.

