Implementing a Polyglot Persistence Strategy for Multi-Tenant IAM Using MySQL, Dgraph, and Cassandra


Designing the data persistence layer for a large-scale, multi-tenant Identity and Access Management (IAM) platform presents a fundamental conflict. The system must simultaneously satisfy three disparate requirements: transactional integrity for core identities, low-latency traversal of complex permission hierarchies, and elastic, high-throughput ingestion of immutable audit trails. Attempting to force a single database technology to handle all three workloads is a direct path to crippling performance bottlenecks and operational fragility. In a real-world project, this isn’t an academic exercise; it’s a critical architectural decision that dictates the system’s ultimate scalability and maintainability.

The core technical problem can be decomposed into three distinct data models and access patterns:

  1. Identity Core (The “Who”): Tenants, users, service accounts, and API keys. This data is relational, requires strong consistency, and benefits from transactional guarantees (e.g., creating a user and their initial settings must be an atomic operation). The query patterns are simple lookups by ID or unique email.
  2. Authorization Graph (The “Can Do What”): This layer defines the relationships between identities and resources. A user belongs to multiple teams, a team has roles on several projects, and a project contains resources. Answering “Can this user edit this resource?” requires traversing this complex, many-to-many graph. Latency here is critical, as it’s in the hot path of every authorized API call.
  3. Audit Trail (The “Did What”): Every significant action—login attempts, token issuance, permission changes, resource access—must be logged for security and compliance. This is a write-heavy, append-only workload that generates billions of records. Queries are typically time-range based and must not impact the performance of the live transactional or authorization systems.

Analysis of a Monolithic Relational Approach

The default, and often initial, proposal is to model everything in a single relational database like MySQL. For the Identity Core, it’s a perfect fit. The challenge arises with the Authorization Graph and the Audit Trail.

Permissions can be modeled using adjacency lists or nested sets, but querying hierarchical data in SQL is notoriously inefficient at scale. A typical query to check permissions would involve multiple JOINs or a recursive Common Table Expression (CTE).

-- Example of a recursive CTE to find all roles for a user through team memberships.
-- This becomes prohibitively slow with deep hierarchies and millions of users.
WITH RECURSIVE UserGroups AS (
    -- Base case: Direct group memberships for the user
    SELECT group_id
    FROM user_group_memberships
    WHERE user_id = 'user-uuid-xyz'
    UNION ALL
    -- Recursive step: Parent groups of the groups the user is in
    SELECT g.parent_group_id
    FROM groups g
    INNER JOIN UserGroups ug ON g.id = ug.group_id
    WHERE g.parent_group_id IS NOT NULL
)
SELECT p.permission_name
FROM group_permissions gp
JOIN permissions p ON gp.permission_id = p.id
WHERE gp.group_id IN (SELECT group_id FROM UserGroups)
  AND gp.resource_id = 'resource-uuid-abc';

The pitfall here is that the performance of such queries degrades rapidly as the depth and breadth of the hierarchy increase: each recursion level adds another pass over the membership tables, and the intermediate result set can fan out combinatorially. Query planners have little statistical insight into recursive CTEs, which leads to table scans and high latency under concurrent load.

For the audit trail, a single massive audit_logs table in MySQL becomes an operational nightmare. The table’s indexes would grow to an unmanageable size, and the high write volume would cause constant contention, affecting the performance of the entire database instance, including the critical identity tables. Archiving and partitioning this data manually is complex and error-prone.

Analysis of a Polyglot Persistence Architecture

The alternative is to select the optimal data store for each distinct problem domain. This polyglot approach explicitly trades the simplicity of a single database for specialized performance and scalability.

  • MySQL for the Identity Core: It provides the ACID compliance and transactional integrity needed for managing user and tenant life cycles. Its maturity and robust ecosystem are well-understood.
  • Dgraph for the Authorization Graph: As a native graph database, it’s designed for exactly the kind of deep, recursive traversal required for permission checks. Queries are expressed in DQL (formerly GraphQL+-), which is purpose-built for graph traversal and is significantly more performant and expressive for these tasks than SQL CTEs.
  • Cassandra for the Audit Trail: Its log-structured merge-tree (LSM) architecture is optimized for extremely high write throughput. Its masterless design provides linear scalability and high availability. Data partitioning by a time-bucketed key allows for efficient time-range queries and straightforward data retention policies via Time-To-Live (TTL).

The primary drawback of this approach is the introduction of complexity. We now have three systems to deploy, monitor, and maintain. More importantly, we lose the guarantee of transactional consistency across these systems. A user created in MySQL might not immediately exist in Dgraph. This requires careful design in the application layer to manage eventual consistency.

Final Choice and Rationale

For a system intended to scale to millions of users and billions of audit events, the performance and operational limitations of the monolithic MySQL approach are unacceptable. The scaling ceiling would be hit too quickly, requiring a costly and high-risk migration down the line.

The pragmatic choice is the polyglot persistence architecture. The upfront investment in managing operational complexity is a calculated trade-off. It buys us a system where each component can be scaled independently according to its specific load profile. The audit service can scale its Cassandra cluster to handle a surge in traffic without affecting the latency of permission checks handled by Dgraph. This separation of concerns at the data layer is the key to long-term viability. The challenge of eventual consistency will be addressed through an event-driven synchronization mechanism.

Core Implementation Overview

We will structure our services in Go, a language well-suited for building high-performance network services. The implementation will focus on the data access layer for each component.

System Interaction Flow

The interaction between services and databases must be carefully choreographed, especially for write operations that span multiple systems. Note that the user.created event is published only after the MySQL transaction commits; a crash in the window between COMMIT and PUBLISH silently drops the event, which is why production systems typically back this flow with a transactional outbox or a periodic reconciliation job.

sequenceDiagram
    participant API as API Gateway
    participant IdentitySvc as Identity Service
    participant MySQL
    participant MsgBroker as Message Broker
    participant AuthzSvc as Authorization Service
    participant Dgraph
    participant AuditSvc as Audit Service
    participant Cassandra

    API->>IdentitySvc: POST /users (Create User)
    IdentitySvc->>MySQL: BEGIN TRANSACTION
    IdentitySvc->>MySQL: INSERT INTO users
    MySQL-->>IdentitySvc: Success
    IdentitySvc->>MySQL: COMMIT
    IdentitySvc->>MsgBroker: PUBLISH user.created event
    IdentitySvc-->>API: 201 Created

    MsgBroker-->>AuthzSvc: DELIVER user.created event
    AuthzSvc->>Dgraph: CREATE (u:User {id: ...})
    Dgraph-->>AuthzSvc: Success

    API->>AuditSvc: POST /audit (Log Event)
    AuditSvc->>Cassandra: INSERT INTO audit_logs
    Cassandra-->>AuditSvc: Success
    AuditSvc-->>API: 202 Accepted

1. Configuration and Database Connectivity

A central configuration management system is essential. For this example, we’ll use a simple YAML structure.

config.yaml

mysql:
  dsn: "iam_user:password@tcp(mysql-host:3306)/iam_db?parseTime=true"
  max_open_conns: 100
  max_idle_conns: 25
  conn_max_lifetime_minutes: 5

dgraph:
  alpha_address: "dgraph-alpha-host:9080"

cassandra:
  hosts:
    - "cassandra-node1:9042"
    - "cassandra-node2:9042"
    - "cassandra-node3:9042"
  keyspace: "audit_service"
  consistency: "QUORUM"
  timeout_seconds: 10

pkg/database/connections.go

package database

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	"github.com/dgraph-io/dgo/v210"
	"github.com/dgraph-io/dgo/v210/protos/api"
	"github.com/gocql/gocql"
	_ "github.com/go-sql-driver/mysql"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// Config structs mirror the YAML file above.

type MySQLConfig struct {
	DSN                    string `yaml:"dsn"`
	MaxOpenConns           int    `yaml:"max_open_conns"`
	MaxIdleConns           int    `yaml:"max_idle_conns"`
	ConnMaxLifetimeMinutes int    `yaml:"conn_max_lifetime_minutes"`
}

type DgraphConfig struct {
	AlphaAddress string `yaml:"alpha_address"`
}

type CassandraConfig struct {
	Hosts          []string `yaml:"hosts"`
	Keyspace       string   `yaml:"keyspace"`
	Consistency    string   `yaml:"consistency"`
	TimeoutSeconds int      `yaml:"timeout_seconds"`
}

// NewMySQLConnection creates and configures a new MySQL database pool.
func NewMySQLConnection(cfg MySQLConfig) (*sql.DB, error) {
	db, err := sql.Open("mysql", cfg.DSN)
	if err != nil {
		return nil, fmt.Errorf("failed to open mysql connection: %w", err)
	}

	db.SetMaxOpenConns(cfg.MaxOpenConns)
	db.SetMaxIdleConns(cfg.MaxIdleConns)
	db.SetConnMaxLifetime(time.Duration(cfg.ConnMaxLifetimeMinutes) * time.Minute)

	// Verify connection is alive
	if err := db.Ping(); err != nil {
		db.Close()
		return nil, fmt.Errorf("failed to ping mysql: %w", err)
	}

	log.Println("Successfully connected to MySQL.")
	return db, nil
}

// NewDgraphClient creates a new client for Dgraph.
func NewDgraphClient(cfg DgraphConfig) (*dgo.Dgraph, func() error, error) {
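	// NOTE: insecure transport credentials are used here for brevity;
	// production deployments should use TLS (or mTLS) for the alpha connection.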
	conn, err := grpc.Dial(cfg.AlphaAddress, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return nil, nil, fmt.Errorf("failed to dial dgraph alpha: %w", err)
	}
	
	dgraphClient := dgo.NewDgraphClient(api.NewDgraphClient(conn))
	log.Println("Successfully connected to Dgraph.")
	
	cleanup := func() error {
		return conn.Close()
	}

	return dgraphClient, cleanup, nil
}

// NewCassandraSession creates a new Cassandra session.
func NewCassandraSession(cfg CassandraConfig) (*gocql.Session, error) {
	cluster := gocql.NewCluster(cfg.Hosts...)
	cluster.Keyspace = cfg.Keyspace
	
	// A common mistake is to not set the consistency level explicitly.
	// Relying on defaults can lead to unexpected behavior in a multi-node cluster.
	consistency, err := gocql.ParseConsistencyWrapper(cfg.Consistency)
	if err != nil {
		return nil, fmt.Errorf("invalid cassandra consistency level '%s': %w", cfg.Consistency, err)
	}
	cluster.Consistency = consistency
	cluster.Timeout = time.Duration(cfg.TimeoutSeconds) * time.Second
	cluster.ProtoVersion = 4 // Recommended for modern Cassandra versions

	session, err := cluster.CreateSession()
	if err != nil {
		return nil, fmt.Errorf("failed to create cassandra session: %w", err)
	}

	log.Println("Successfully connected to Cassandra.")
	return session, nil
}
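Wiring the three clients together at startup is then mechanical. A minimal sketch follows; LoadConfig is a hypothetical helper that unmarshals config.yaml into a struct embedding the three config types above, and the module path is a placeholder.

cmd/server/main.go (sketch)

package main

import (
	"log"

	"example.com/iam/pkg/database" // placeholder module path
)

func main() {
	cfg, err := database.LoadConfig("config.yaml") // hypothetical YAML loader
	if err != nil {
		log.Fatalf("loading config: %v", err)
	}

	mysqlDB, err := database.NewMySQLConnection(cfg.MySQL)
	if err != nil {
		log.Fatalf("mysql: %v", err)
	}
	defer mysqlDB.Close()

	dgraph, dgraphClose, err := database.NewDgraphClient(cfg.Dgraph)
	if err != nil {
		log.Fatalf("dgraph: %v", err)
	}
	defer dgraphClose()
	_ = dgraph // handed to the authorization service in a full build

	cassandra, err := database.NewCassandraSession(cfg.Cassandra)
	if err != nil {
		log.Fatalf("cassandra: %v", err)
	}
	defer cassandra.Close()

	// Construct repositories and services from the three handles,
	// then start the HTTP/gRPC listeners.
}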

2. Identity Service: MySQL Schema and Repository

The MySQL schema is straightforward, focusing on normalization and integrity.

schema.sql

CREATE TABLE tenants (
    id VARCHAR(36) PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    UNIQUE(name)
);

CREATE TABLE users (
    id VARCHAR(36) PRIMARY KEY,
    tenant_id VARCHAR(36) NOT NULL,
    email VARCHAR(255) NOT NULL,
    hashed_password VARCHAR(255) NOT NULL,
    is_active BOOLEAN DEFAULT TRUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    UNIQUE(tenant_id, email),
    FOREIGN KEY (tenant_id) REFERENCES tenants(id) ON DELETE CASCADE
);

The Go repository code emphasizes transactional safety.

identity/repository/user_repo.go

package repository

import (
	"context"
	"database/sql"
	"fmt"
	"time"

	"github.com/google/uuid"
)

type User struct {
	ID             string
	TenantID       string
	Email          string
	HashedPassword string
	IsActive       bool
	CreatedAt      time.Time
	UpdatedAt      time.Time
}

// A real-world project would use a more robust interface for transaction management.
// This interface is satisfied by both *sql.DB and *sql.Tx, so repository methods
// can run inside or outside an explicit transaction.
type transactionalDB interface {
	ExecContext(ctx context.Context, query string, args ...interface{}) (sql.Result, error)
	QueryRowContext(ctx context.Context, query string, args ...interface{}) *sql.Row
}

type UserRepository struct {
	db *sql.DB
}

func NewUserRepository(db *sql.DB) *UserRepository {
	return &UserRepository{db: db}
}

// CreateUser creates a new user within a provided transaction.
// By requiring a transaction as an argument, we ensure this operation
// can be composed with other database calls into a single atomic unit.
func (r *UserRepository) CreateUser(ctx context.Context, tx *sql.Tx, email, hashedPassword, tenantID string) (*User, error) {
	user := &User{
		ID:             uuid.New().String(),
		TenantID:       tenantID,
		Email:          email,
		HashedPassword: hashedPassword, // Hashing happens in the service layer; plaintext never reaches the repository.
		IsActive:       true,
	}

	query := `
		INSERT INTO users (id, tenant_id, email, hashed_password, is_active)
		VALUES (?, ?, ?, ?, ?)
	`
	_, err := tx.ExecContext(ctx, query, user.ID, user.TenantID, user.Email, user.HashedPassword, user.IsActive)
	if err != nil {
		// Specific error handling for duplicate entries is crucial.
		// The actual error type depends on the database driver.
		// if mysqlErr, ok := err.(*mysql.MySQLError); ok && mysqlErr.Number == 1062 { ... }
		return nil, fmt.Errorf("failed to insert user: %w", err)
	}
	
	// We don't need to re-fetch the user here since we have all the data,
	// but a real implementation might return the full struct with DB-generated timestamps.
	return user, nil
}
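To make the composition concrete, here is a sketch of a service-layer method that wraps CreateUser in a transaction and publishes the user.created event only after a successful commit. The eventPublisher interface and the module path are assumptions; any broker client would slot in behind it.

identity/service.go (sketch)

package identity

import (
	"context"
	"database/sql"
	"encoding/json"
	"fmt"
	"log"

	"example.com/iam/identity/repository" // placeholder module path
)

// eventPublisher abstracts the message broker (Kafka, NATS, etc.).
type eventPublisher interface {
	Publish(ctx context.Context, topic string, payload []byte) error
}

type Service struct {
	db        *sql.DB
	users     *repository.UserRepository
	publisher eventPublisher
}

func (s *Service) RegisterUser(ctx context.Context, email, hashedPassword, tenantID string) (*repository.User, error) {
	tx, err := s.db.BeginTx(ctx, nil)
	if err != nil {
		return nil, fmt.Errorf("begin tx: %w", err)
	}
	defer tx.Rollback() // no-op once Commit succeeds

	user, err := s.users.CreateUser(ctx, tx, email, hashedPassword, tenantID)
	if err != nil {
		return nil, err
	}
	// Other writes (initial settings, default memberships) join the same
	// transaction here, keeping user creation atomic.

	if err := tx.Commit(); err != nil {
		return nil, fmt.Errorf("commit tx: %w", err)
	}

	// Published after commit: if the process dies in this window the event is
	// lost, which is the gap a transactional outbox would close.
	payload, _ := json.Marshal(map[string]string{
		"user_id":   user.ID,
		"tenant_id": user.TenantID,
	})
	if err := s.publisher.Publish(ctx, "user.created", payload); err != nil {
		log.Printf("publish user.created for %s failed: %v", user.ID, err)
	}
	return user, nil
}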

3. Authorization Service: Dgraph Schema and Client

Dgraph’s schema (in DQL schema syntax) defines the predicates and the types that group them. The @reverse directive materializes inverse edges (team-to-members is simply the reverse of User.memberOf), and @upsert, which requires an index, enables conflict detection when upserting nodes by their external IDs.

authz.schema

# DQL type definitions. Predicates are namespaced by type, matching the
# predicate names used in the permission-check query below.
type User {
    User.id
    User.memberOf
}

type Team {
    Team.id
    Team.name
    Team.hasRole
}

type Role {
    Role.id
    Role.name
    Role.hasPermission
}

type Permission {
    Permission.action
    Permission.onResource
}

type Resource {
    Resource.id
    Resource.type
}

# Predicate definitions.
User.id: string @index(exact) @upsert .
User.memberOf: [uid] @reverse .
Team.id: string @index(exact) @upsert .
Team.name: string @index(term) .
Team.hasRole: [uid] @reverse .
Role.id: string @index(exact) @upsert .
Role.name: string @index(term) .
Role.hasPermission: [uid] .
Permission.action: string @index(exact) .
Permission.onResource: uid .
Resource.id: string @index(exact) @upsert .
Resource.type: string @index(exact) .
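Applying the schema is a one-off Alter operation (idempotent for unchanged predicates, so it can safely run at service startup). A minimal sketch with the dgo client; ApplySchema is a helper name of our choosing:

authz/schema.go (sketch)

package authz

import (
	"context"
	"fmt"

	"github.com/dgraph-io/dgo/v210"
	"github.com/dgraph-io/dgo/v210/protos/api"
)

// ApplySchema pushes the DQL schema string to the cluster.
func ApplySchema(ctx context.Context, dg *dgo.Dgraph, schema string) error {
	if err := dg.Alter(ctx, &api.Operation{Schema: schema}); err != nil {
		return fmt.Errorf("failed to apply dgraph schema: %w", err)
	}
	return nil
}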

The permission check logic in Go leverages Dgraph’s query capabilities.

authz/checker.go

package authz

import (
	"context"
	"encoding/json"
	"fmt"
	"log"

	"github.com/dgraph-io/dgo/v210"
)

type PermissionChecker struct {
	dg *dgo.Dgraph
}

func NewPermissionChecker(dg *dgo.Dgraph) *PermissionChecker {
	return &PermissionChecker{dg: dg}
}

// HasPermission checks if a user has a specific permission on a resource.
// This is the core, latency-sensitive operation of this service.
func (c *PermissionChecker) HasPermission(ctx context.Context, userID, action, resourceID string) (bool, error) {
	// The query uses Dgraph's @cascade directive. It finds a path from the user to the
	// target permission and only returns data if the full path exists. If the response
	// contains a user, it means a valid path was found.
	query := `
		query check($userID: string, $action: string, $resourceID: string) {
			result(func: eq(User.id, $userID)) @cascade {
				User.id
				User.memberOf {
					Team.hasRole {
						Role.hasPermission @filter(eq(Permission.action, $action)) {
							Permission.onResource @filter(eq(Resource.id, $resourceID)) {
								Resource.id
							}
						}
					}
				}
			}
		}
	`
	vars := map[string]string{
		"$userID":     userID,
		"$action":     action,
		"$resourceID": resourceID,
	}

	txn := c.dg.NewReadOnlyTxn()
	defer txn.Discard(ctx)

	resp, err := txn.QueryWithVars(ctx, query, vars)
	if err != nil {
		return false, fmt.Errorf("dgraph query failed: %w", err)
	}

	var result struct {
		Result []map[string]interface{} `json:"result"`
	}

	if err := json.Unmarshal(resp.Json, &result); err != nil {
		return false, fmt.Errorf("failed to unmarshal dgraph response: %w", err)
	}

	// If the result array is not empty, it means a user node was found that satisfies
	// the entire path defined in the query. This is our authorization grant.
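	// Note: this log statement sits on the hot path of every authorized request;
	// in production it should be debug-level or removed.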
	log.Printf("Permission check for user %s: result length is %d", userID, len(result.Result))
	return len(result.Result) > 0, nil
}
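The other half of the Authorization Service is the consumer that applies user.created events to the graph. Below is a minimal sketch using a dgo upsert block, which makes redelivered events safe to replay; the message-broker subscription plumbing is assumed, and UpsertUser is a helper name of our choosing.

authz/consumer.go (sketch)

package authz

import (
	"context"
	"fmt"

	"github.com/dgraph-io/dgo/v210"
	"github.com/dgraph-io/dgo/v210/protos/api"
)

// UpsertUser creates the User node for a user.created event if it does not
// already exist. The upsert block keys on User.id, so redeliveries are idempotent.
func UpsertUser(ctx context.Context, dg *dgo.Dgraph, userID string) error {
	// %q quotes and escapes the ID for safe embedding in the query and N-Quads.
	query := fmt.Sprintf(`query { u as var(func: eq(User.id, %q)) }`, userID)
	mu := &api.Mutation{
		SetNquads: []byte(fmt.Sprintf(
			"uid(u) <User.id> %q .\nuid(u) <dgraph.type> \"User\" .", userID)),
	}
	req := &api.Request{
		Query:     query,
		Mutations: []*api.Mutation{mu},
		CommitNow: true,
	}
	if _, err := dg.NewTxn().Do(ctx, req); err != nil {
		return fmt.Errorf("failed to upsert user %s in dgraph: %w", userID, err)
	}
	return nil
}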

4. Audit Service: Cassandra Schema and Writer

The Cassandra table is designed for writes and time-series queries. The partition key is a composite of tenant_id and a time bucket (e.g., the day) to ensure writes are distributed across the cluster and that queries for a specific tenant within a date range are efficient.

audit.cql

CREATE KEYSPACE IF NOT EXISTS audit_service
WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': '3'};

CREATE TABLE IF NOT EXISTS audit_service.audit_logs (
    tenant_id text,
    date_bucket text, -- e.g., "2023-10-27"
    event_time timestamp,
    event_id timeuuid,
    actor_id text,
    actor_type text, -- 'user' or 'service_account'
    action text,
    resource_id text,
    resource_type text,
    status text, -- 'SUCCESS' or 'FAILURE'
    source_ip inet,
    details map<text, text>,
    PRIMARY KEY ((tenant_id, date_bucket), event_time, event_id)
) WITH CLUSTERING ORDER BY (event_time DESC, event_id DESC)
  AND default_time_to_live = 7776000; -- ~90 days; illustrative, align with the actual retention policy

The Go writer should be asynchronous, typically a bounded worker pool issuing concurrent single-partition writes (multi-partition batches are an anti-pattern in Cassandra, not a throughput optimization). For simplicity, this example shows a single insert.

audit/writer.go

package audit

import (
	"context"
	"fmt"
	"time"

	"github.com/gocql/gocql"
)

type AuditEvent struct {
	TenantID      string
	ActorID       string
	ActorType     string
	Action        string
	ResourceID    string
	ResourceType  string
	Status        string
	SourceIP      string
	Details       map[string]string
}

type EventWriter struct {
	session *gocql.Session
}

func NewEventWriter(session *gocql.Session) *EventWriter {
	return &EventWriter{session: session}
}

// WriteEvent writes a single audit event to Cassandra.
// A production system would use a bounded worker pool of concurrent writers
// for higher throughput; avoid multi-partition logged batches.
func (w *EventWriter) WriteEvent(ctx context.Context, event *AuditEvent) error {
	// A common pitfall is using time.Now() directly. For distributed systems,
	// it's better to use a timestamp from a reliable source or ensure clocks are synced.
	eventTime := time.Now().UTC()
	dateBucket := eventTime.Format("2006-01-02")

	// gocql.TimeUUID() generates a v1 UUID, which is time-based and ideal for clustering keys.
	eventID := gocql.TimeUUID()

	query := `
		INSERT INTO audit_logs (tenant_id, date_bucket, event_time, event_id, actor_id, actor_type, action, resource_id, resource_type, status, source_ip, details)
		VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
	`

	// Using WithContext allows the request to be cancelled if the context is done.
	err := w.session.Query(query,
		event.TenantID,
		dateBucket,
		eventTime,
		eventID,
		event.ActorID,
		event.ActorType,
		event.Action,
		event.ResourceID,
		event.ResourceType,
		event.Status,
		event.SourceIP,
		event.Details,
	).WithContext(ctx).Exec()

	if err != nil {
		// Proper error handling should distinguish between transient and persistent errors.
		// For transient errors, a retry mechanism should be in place.
		return fmt.Errorf("failed to write audit event to cassandra: %w", err)
	}

	return nil
}
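The read path for compliance queries follows the partition layout directly: one partition per tenant per day, iterated bucket by bucket. A sketch of the single-bucket query, living alongside the writer in the audit package (column selection trimmed for brevity; ReadEventsForDay is not part of the original design):

// ReadEventsForDay fetches audit events for one tenant and one date bucket,
// newest first (matching the clustering order). Queries spanning several days
// iterate date buckets client-side, one partition per query.
func (w *EventWriter) ReadEventsForDay(ctx context.Context, tenantID, dateBucket string) ([]AuditEvent, error) {
	query := `
		SELECT actor_id, actor_type, action, resource_id, status
		FROM audit_logs
		WHERE tenant_id = ? AND date_bucket = ?
	`
	iter := w.session.Query(query, tenantID, dateBucket).WithContext(ctx).Iter()

	var events []AuditEvent
	var e AuditEvent
	for iter.Scan(&e.ActorID, &e.ActorType, &e.Action, &e.ResourceID, &e.Status) {
		e.TenantID = tenantID
		events = append(events, e)
	}
	if err := iter.Close(); err != nil {
		return nil, fmt.Errorf("failed to read audit events: %w", err)
	}
	return events, nil
}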

Extensibility and Limitations of the Architecture

This polyglot architecture is inherently more extensible than a monolithic one. Adding a new resource type or a more complex role hierarchy is a schema change in Dgraph that doesn’t require a risky migration on the core MySQL database. The audit schema can be evolved to include more metadata without impacting the other systems. If the audit write-load grows 100x, we can scale the Cassandra cluster independently.

However, this design is not without its limitations. The most significant is the operational overhead. It requires expertise in three different database technologies, each with its own failure modes, monitoring requirements, and backup strategies. The event-driven synchronization between MySQL and Dgraph introduces eventual consistency; there is a brief window of time after user creation where permission checks for that user might fail because their corresponding node has not yet been created in Dgraph. Applications must be designed to handle this possibility gracefully, perhaps by retrying permission checks after a short delay for newly created entities. Finally, the financial cost of running three separate, highly available database clusters is substantially higher than one, a factor that cannot be ignored in any real-world project planning.
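As a concrete example of handling the consistency window, a caller that knows a principal was just created can bound-retry a denied check. This is a sketch under stated assumptions (the module path is a placeholder); a blanket retry on every denial would add latency to legitimate rejections, so gate it on entity age.

package consistency // hypothetical caller-side helper package

import (
	"context"
	"time"

	"example.com/iam/authz" // placeholder module path
)

// CheckWithRetry retries a denied permission check with a short exponential
// backoff, absorbing the MySQL-to-Dgraph replication lag for fresh users.
func CheckWithRetry(ctx context.Context, c *authz.PermissionChecker, userID, action, resourceID string) (bool, error) {
	const attempts = 3
	backoff := 50 * time.Millisecond
	for i := 0; i < attempts; i++ {
		ok, err := c.HasPermission(ctx, userID, action, resourceID)
		if err != nil {
			return false, err // transport errors are not retried here
		}
		if ok {
			return true, nil
		}
		if i == attempts-1 {
			break // no sleep after the final attempt
		}
		select {
		case <-ctx.Done():
			return false, ctx.Err()
		case <-time.After(backoff):
			backoff *= 2 // 50ms, then 100ms, then give up
		}
	}
	return false, nil
}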

