The core challenge is not simply deploying a Retrieval-Augmented Generation (RAG) model for an Android application; it is achieving deep, real-time personalization for millions of users without compromising latency or system stability. A generic RAG pipeline fetches documents based on query semantics, but a truly intelligent system must retrieve and reason over information that is relevant to a specific user, leveraging their immediate context and long-term behavior. This requires a data infrastructure capable of ingesting a high-velocity stream of user interaction events from mobile clients and serving the resulting user features to the LangChain orchestration layer with single-digit millisecond latency.
A naive approach would be to query the application’s primary OLTP database (e.g., PostgreSQL, MySQL) directly from the LangChain backend. This is a common architectural mistake that quickly leads to disaster at scale. The query patterns for feature retrieval are fundamentally different from transactional workloads: they often involve wide rows or lookups across multiple tables, putting immense read pressure on a database tuned for transactional consistency. Coupling the AI serving path to the core business transaction database also creates a dangerous dependency; a spike in RAG queries could degrade the entire application’s performance. This path is operationally brittle and fails to scale.
A more robust solution involves creating a dedicated serving layer for this personalization data—a Feature Store. This decouples the AI workload from the primary application data store. The decision then shifts to selecting the right technology for the online component of this store. The requirements are stringent: extremely high write throughput to handle event streams from millions of Android devices, and ultra-low latency point-reads to fetch a user’s feature vector during the RAG chain’s execution. Relational databases are poorly suited for this. We need a system designed for this specific access pattern.
This leads us to Apache Cassandra. Its log-structured merge-tree (LSM-tree) architecture is optimized for high-throughput writes, and its distributed, masterless design provides linear scalability and fault tolerance. Data modeling in Cassandra, when done correctly, allows for fetching all required features for a given user in a single, fast query by partitioning data effectively.
The final architecture thus takes shape: An event stream from Android clients feeds a processing pipeline that computes and updates user features in a Cassandra cluster. A Node.js/TypeScript service, orchestrating a LangChain RAG pipeline, queries this Cassandra cluster to fetch a user’s feature vector in real-time. This vector is then used to augment the prompt or refine the retrieval query, infusing the LLM’s response with personal context. Rigorous testing of this data-intensive backend is non-negotiable, and Vitest, with its modern feature set and integration capabilities, provides the necessary framework to validate the entire data flow from the database to the LangChain components.
sequenceDiagram
    participant Android App
    participant Event Gateway
    participant Feature Engineering
    participant Cassandra as Online Feature Store
    participant LangChain Service
    participant LLM
    Android App->>+Event Gateway: User Interaction Event (e.g., view_product_X)
    Event Gateway->>-Feature Engineering: Raw Event
    Feature Engineering->>+Cassandra: Computes & writes feature (e.g., last_viewed_category='electronics')
    Cassandra-->>-Feature Engineering: Write Ack
    Note over Android App, LLM: Later, user makes a query...
    Android App->>+LangChain Service: Query: "Any recommendations?", UserID: "user-123"
    LangChain Service->>+Cassandra: READ features for "user-123"
    Cassandra-->>-LangChain Service: Feature Vector: {last_viewed_category: 'electronics', ...}
    LangChain Service->>LangChain Service: Augment Prompt with Features
    LangChain Service->>+LLM: Augmented Prompt: "User recently viewed 'electronics'. Any recommendations?"
    LLM-->>-LangChain Service: Personalized Response
    LangChain Service-->>-Android App: Response
Cassandra Data Model for Low-Latency Feature Retrieval
In Cassandra, schema design is query-driven. To achieve single-digit millisecond reads for a user’s entire feature set, we must model the data to avoid multi-key queries or server-side joins. The most effective pattern for this use case is a wide-row table partitioned by user_id. Each feature becomes a separate row within that partition, clustered by feature_name.
This design allows us to fetch all features for a user with a single SELECT statement: SELECT feature_name, feature_value, updated_at FROM user_features WHERE user_id = ?. This is the single most important decision for ensuring the performance of the personalization layer.
Here is the production-grade CQL for setting up the keyspace and table. In a real-world project, replication factor and strategy would be tuned based on the multi-datacenter deployment topology.
-- Filename: schema.cql
-- A keyspace for our feature store. In production, NetworkTopologyStrategy is essential
-- for multi-DC replication and fault tolerance. For this example, SimpleStrategy is sufficient.
CREATE KEYSPACE IF NOT EXISTS feature_store
WITH replication = {
'class': 'SimpleStrategy',
'replication_factor': '1'
};
USE feature_store;
-- This table stores user features in a wide-row format.
-- The partition key is user_id, ensuring all features for a single user are co-located on one node (and its replicas)
-- for extremely fast retrieval.
-- The clustering key, feature_name, orders the features within the partition, though for our
-- primary query (fetching all features), this ordering is less critical than its role in defining the cell's uniqueness.
CREATE TABLE IF NOT EXISTS user_features (
user_id uuid, -- The unique identifier for the user. Partition Key.
feature_name text, -- The name of the feature (e.g., 'last_viewed_category', 'session_duration_avg'). Clustering Key.
feature_value text, -- The value of the feature, stored as text for flexibility. Type casting happens in the application layer.
updated_at timestamp, -- The timestamp of the last update, crucial for TTL, debugging, and feature freshness analysis.
PRIMARY KEY (user_id, feature_name)
) WITH CLUSTERING ORDER BY (feature_name ASC)
AND comment = 'Stores real-time user features for personalizing LLM responses.'
AND gc_grace_seconds = 864000; -- Default is 10 days. Tune based on deletion patterns and repair schedules.
-- A common pitfall is not considering data TTL. For session-based features, a TTL can be set on insert
-- to automatically purge stale data and manage storage growth.
-- Example insert with TTL:
-- INSERT INTO user_features (user_id, feature_name, feature_value, updated_at) VALUES (uuid(), 'last_search_term', 'cassandra', toTimestamp(now())) USING TTL 3600;
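Before moving to the read side, it helps to see the write path the feature-engineering pipeline would use against this table. The following is a minimal sketch, not part of the schema above: the upsertFeature helper, its file location, and the one-hour default TTL are illustrative assumptions. It relies only on behavior already discussed: prepared statements and Cassandra’s upsert semantics.
src/features/feature.writer.ts
import { Client, types } from 'cassandra-driver';

// The TTL is bound as a parameter so session-scoped features expire automatically.
const UPSERT_FEATURE_QUERY =
  'INSERT INTO user_features (user_id, feature_name, feature_value, updated_at) VALUES (?, ?, ?, ?) USING TTL ?';

// Writes (upserts) a single feature for a user. INSERT is an upsert in Cassandra,
// so no read-before-write is needed on this hot path.
export async function upsertFeature(
  client: Client,
  userId: string,
  featureName: string,
  featureValue: string,
  ttlSeconds = 3600 // Assumed default: expire session-scoped features after one hour.
): Promise<void> {
  await client.execute(
    UPSERT_FEATURE_QUERY,
    [types.Uuid.fromString(userId), featureName, featureValue, new Date(), ttlSeconds],
    { prepare: true }
  );
}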
Backend Service: LangChain Orchestration and Feature Retrieval
The core of the backend is a TypeScript service using LangChain.js. We will create a custom Runnable that encapsulates the logic for fetching features from Cassandra and injecting them into the prompt context. This approach adheres to LangChain’s declarative and composable paradigm.
The service requires a robust Cassandra driver client. We will configure it with appropriate connection pooling, load balancing, and retry policies, which are critical for production stability.
src/config/cassandra.client.ts
import { Client, auth, policies, types } from 'cassandra-driver';
import { Logger } from '../utils/logger'; // A simple logger utility
const CASSANDRA_CONTACT_POINTS = process.env.CASSANDRA_CONTACT_POINTS?.split(',') || ['127.0.0.1'];
const CASSANDRA_DATACENTER = process.env.CASSANDRA_DATACENTER || 'datacenter1';
const CASSANDRA_KEYSPACE = process.env.CASSANDRA_KEYSPACE || 'feature_store';
const logger = new Logger('CassandraClient');
// Production-grade configuration is essential for stability.
// A common mistake is using default settings which are not tuned for high throughput.
const client = new Client({
contactPoints: CASSANDRA_CONTACT_POINTS,
localDataCenter: CASSANDRA_DATACENTER,
keyspace: CASSANDRA_KEYSPACE,
authProvider: new auth.PlainTextAuthProvider(
process.env.CASSANDRA_USER || 'cassandra',
process.env.CASSANDRA_PASSWORD || 'cassandra'
),
policies: {
// DCAwareRoundRobinPolicy is critical for multi-DC setups to ensure requests are routed to the local DC.
loadBalancing: new policies.loadBalancing.DCAwareRoundRobinPolicy(CASSANDRA_DATACENTER),
// ExponentialReconnectionPolicy prevents hammering the DB during an outage.
reconnection: new policies.reconnection.ExponentialReconnectionPolicy(1000, 10 * 60 * 1000),
// The driver's default RetryPolicy is generally safe; for specific idempotent queries, a more aggressive policy could be used.
retry: new policies.retry.RetryPolicy(),
},
queryOptions: {
consistency: types.consistencies.localOne, // LOCAL_ONE is often sufficient for feature reads and prioritizes latency.
prepare: true, // Preparing queries is a major performance optimization.
},
});
client.on('log', (level, className, message) => {
if (level === 'error' || level === 'warning') {
logger.log(`${level.toUpperCase()}: ${className} - ${message}`);
}
});
export const connectCassandra = async () => {
try {
await client.connect();
logger.log(`Successfully connected to Cassandra cluster at ${CASSANDRA_CONTACT_POINTS.join(',')}`);
} catch (error) {
logger.log(`Failed to connect to Cassandra: ${error}`);
process.exit(1); // Fail fast if the database connection fails on startup.
}
};
export default client;
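One operational detail worth pairing with connectCassandra is a graceful shutdown path, so pooled connections drain cleanly when the process stops. A minimal sketch follows; the signal wiring and entrypoint location are assumptions rather than part of the service above.
// Illustrative sketch (e.g., in the service entrypoint): close the driver's
// connection pool on shutdown signals so in-flight queries can complete.
import client from './config/cassandra.client';

const shutdown = async (signal: string) => {
  console.log(`Received ${signal}, closing Cassandra connections...`);
  await client.shutdown(); // cassandra-driver drains and closes all pooled connections.
  process.exit(0);
};

process.on('SIGTERM', () => void shutdown('SIGTERM'));
process.on('SIGINT', () => void shutdown('SIGINT'));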
With the client configured, we can build the feature retrieval component.
src/features/feature.service.ts
import { Client, types } from 'cassandra-driver';
import cassandraClient from '../config/cassandra.client';
import { Logger } from '../utils/logger';
export interface UserFeature {
name: string;
value: string;
updatedAt: Date;
}
const logger = new Logger('FeatureService');
export class FeatureService {
private client: Client;
// The CQL text is constant; the actual statement preparation happens in the driver
// when { prepare: true } is passed to execute(). The driver prepares the statement
// once per node and caches it for reuse, so preparation is a one-time cost.
private readonly selectFeaturesQuery = 'SELECT feature_name, feature_value, updated_at FROM user_features WHERE user_id = ?';
constructor(clientInstance: Client = cassandraClient) {
this.client = clientInstance;
}
public async getFeaturesByUserId(userId: string): Promise<UserFeature[]> {
try {
// The UUID must be correctly formatted. The driver can handle this conversion.
const userUuid = types.Uuid.fromString(userId);
const result = await this.client.execute(this.selectFeaturesQuery, [userUuid], { prepare: true });
if (!result.rows || result.rows.length === 0) {
return [];
}
return result.rows.map(row => ({
name: row.feature_name,
value: row.feature_value,
updatedAt: row.updated_at,
}));
} catch (error) {
logger.log(`Error fetching features for user ${userId}: ${error}`);
// In a production system, you might return an empty array or throw a specific error
// that can be handled upstream (e.g., to fallback to a non-personalized response).
// Propagating errors blindly can crash the entire request chain.
return [];
}
}
}
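Because this lookup sits on the request’s critical path, some teams wrap it in a hard latency budget and fall back to a non-personalized prompt when the budget is blown. The sketch below shows one way to do that with Promise.race; the helper name and the 50 ms budget are illustrative assumptions, not part of the service above.
// Illustrative: cap the feature lookup at a fixed latency budget. If Cassandra
// is slow, we degrade to a non-personalized response rather than stall the chain.
import { FeatureService, UserFeature } from './feature.service';

export async function getFeaturesWithBudget(
  service: FeatureService,
  userId: string,
  budgetMs = 50 // Assumed budget; tune against the observed p99 read latency.
): Promise<UserFeature[]> {
  const timeout = new Promise<UserFeature[]>(resolve =>
    setTimeout(() => resolve([]), budgetMs)
  );
  // Promise.race settles with whichever finishes first: the query or the empty fallback.
  return Promise.race([service.getFeaturesByUserId(userId), timeout]);
}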
Now, we integrate this service into a LangChain chain.
src/rag/personalized.chain.ts
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { ChatOpenAI } from "@langchain/openai";
import { RunnableLambda, RunnableSequence } from "@langchain/core/runnables";
import { StringOutputParser } from "@langchain/core/output_parsers";
import { FeatureService, UserFeature } from "../features/feature.service";
// Assume OpenAI API key is set in environment variables
const llm = new ChatOpenAI({ modelName: "gpt-4-turbo-preview", temperature: 0.3 });
const featureService = new FeatureService();
// This function formats the fetched features into a string for the prompt.
// A common mistake is just dumping a JSON blob. A clean, human-readable format
// often yields better results from the LLM.
const formatFeaturesForPrompt = (features: UserFeature[]): string => {
if (features.length === 0) {
return "No specific user context available.";
}
const featureString = features
.map(f => `- ${f.name}: ${f.value}`)
.join('\n');
return `Here is some context about the user:\n${featureString}`;
};
// This runnable takes the chain input ({ user_id, ... }) and returns the formatted
// feature context string. Wrapping the async function in RunnableLambda keeps it
// composable within LCEL.
const personalizationRunnable = RunnableLambda.from(
  async (input: { user_id: string }): Promise<string> => {
    const features = await featureService.getFeaturesByUserId(input.user_id);
    return formatFeaturesForPrompt(features);
  }
);
const promptTemplate = ChatPromptTemplate.fromMessages([
["system", "You are an expert assistant. Use the following user context to provide a personalized and helpful answer.\n\n{features}"],
["human", "{question}"],
]);
// The full chain combines personalization, prompt formatting, the LLM call, and output parsing.
export const personalizedRAGChain = RunnableSequence.from([
{
// The initial call to the chain expects { question: string, user_id: string }.
// We pass this through to the personalization runnable.
features: personalizationRunnable,
question: (input: { question: string, user_id: string }) => input.question,
},
promptTemplate,
llm,
new StringOutputParser(),
]);
/*
// Example Usage:
async function run() {
const result = await personalizedRAGChain.invoke({
question: "What should I look at next?",
user_id: "a7d0e4a0-1b2c-4d3e-8f5a-6b7c8d9e0f10" // A valid UUID
});
console.log(result);
}
*/
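To make the chain reachable from the Android client, it needs to sit behind an HTTP endpoint. A minimal sketch using Express follows; the framework choice, route shape, port, and error fallback are assumptions, and any HTTP server would work equally well.
src/server.ts
// Illustrative sketch: expose the chain over HTTP for the Android client.
import express from 'express';
import { personalizedRAGChain } from './rag/personalized.chain';
import { connectCassandra } from './config/cassandra.client';

const app = express();
app.use(express.json());

app.post('/v1/chat', async (req, res) => {
  const { question, user_id } = req.body;
  if (!question || !user_id) {
    return res.status(400).json({ error: 'question and user_id are required' });
  }
  try {
    const answer = await personalizedRAGChain.invoke({ question, user_id });
    res.json({ answer });
  } catch (err) {
    // Degrade gracefully: a failed chain should not surface a stack trace to mobile clients.
    res.status(502).json({ error: 'Unable to generate a response right now.' });
  }
});

connectCassandra().then(() => app.listen(3000));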
Testing the Data Pipeline with Vitest
Testing code that interacts with a database is crucial. A unit test can mock the database client, but this doesn’t verify the actual queries or the data mapping. For a data-intensive service like this, integration testing is mandatory. We will use Vitest to write both unit and integration tests. The integration test will use a Docker container running Cassandra, managed via a library like testcontainers, to ensure our service works against a real Cassandra instance.
src/features/feature.service.test.ts
import { describe, it, expect, vi, beforeEach, type Mock } from 'vitest';
import { FeatureService } from './feature.service';
import { Client, types } from 'cassandra-driver';
// --- Unit Test with Mocking ---
describe('FeatureService - Unit Tests', () => {
let mockCassandraClient: Client;
let featureService: FeatureService;
beforeEach(() => {
// Using vi.fn() for deep mocking of the Cassandra client.
mockCassandraClient = {
execute: vi.fn(),
} as unknown as Client;
featureService = new FeatureService(mockCassandraClient);
});
it('should return an array of features for a valid user ID', async () => {
const userId = 'a7d0e4a0-1b2c-4d3e-8f5a-6b7c8d9e0f10';
const mockDbResponse = {
rows: [
{ feature_name: 'last_viewed_category', feature_value: 'electronics', updated_at: new Date() },
{ feature_name: 'login_count', feature_value: '42', updated_at: new Date() },
],
};
(mockCassandraClient.execute as Mock).mockResolvedValue(mockDbResponse);
const features = await featureService.getFeaturesByUserId(userId);
expect(mockCassandraClient.execute).toHaveBeenCalledWith(
'SELECT feature_name, feature_value, updated_at FROM user_features WHERE user_id = ?',
[types.Uuid.fromString(userId)],
{ prepare: true }
);
expect(features).toHaveLength(2);
expect(features[0].name).toBe('last_viewed_category');
expect(features[1].value).toBe('42');
});
it('should return an empty array if the user has no features', async () => {
const userId = 'b8e1f5b1-2c3d-4e4f-9a6b-7c8d9e0f1a21';
const mockDbResponse = { rows: [] };
(mockCassandraClient.execute as Mock).mockResolvedValue(mockDbResponse);
const features = await featureService.getFeaturesByUserId(userId);
expect(features).toEqual([]);
});
it('should return an empty array and log an error on database failure', async () => {
const userId = 'c9f2a6c2-3d4e-4f5a-8b7c-8d9e0f1b32c2';
const dbError = new Error('Cassandra connection timeout');
(mockCassandraClient.execute as Mock).mockRejectedValue(dbError);
// Mock console.log to check if error is logged
const consoleSpy = vi.spyOn(console, 'log').mockImplementation(() => {});
const features = await featureService.getFeaturesByUserId(userId);
expect(features).toEqual([]);
expect(consoleSpy).toHaveBeenCalledWith(expect.stringContaining(`Error fetching features for user ${userId}: ${dbError}`));
consoleSpy.mockRestore();
});
});
This unit test is useful, but the integration test provides true confidence. Setting up testcontainers requires Docker and can be complex, but for this article, we’ll outline the test logic, assuming a helper function setupCassandraContainer handles the Docker interaction; a sketch of such a helper is shown below.
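For reference, such a helper might look like the following sketch, built on the testcontainers npm package. The image tag, wait strategy, and return shape are assumptions to be adapted to your environment.
src/test/setup.cassandra.ts
// Illustrative sketch of the Testcontainers helper the integration test assumes.
import { GenericContainer, Wait, StartedTestContainer } from 'testcontainers';
import { Client } from 'cassandra-driver';

export async function setupCassandraContainer() {
  const container: StartedTestContainer = await new GenericContainer('cassandra:4.1')
    .withExposedPorts(9042)
    // Block until Cassandra logs that it is accepting CQL clients.
    .withWaitStrategy(Wait.forLogMessage(/Starting listening for CQL clients/))
    .start();

  const client = new Client({
    contactPoints: [`${container.getHost()}:${container.getMappedPort(9042)}`],
    localDataCenter: 'datacenter1',
  });
  await client.connect();

  return {
    client,
    stopContainer: async () => {
      await client.shutdown();
      await container.stop();
    },
  };
}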
src/rag/personalized.chain.integration.test.ts
import { describe, it, expect, beforeAll, afterAll, vi } from 'vitest';
import { Client, types } from 'cassandra-driver';
import { personalizedRAGChain } from './personalized.chain';
// In a real project, this would be a utility that uses a library like 'testcontainers'.
// It would start a Cassandra Docker container and return a configured client.
// For this example, we will assume this function exists and works.
// const { client, stopContainer } = await setupCassandraContainer();
// We will mock this setup for demonstration purposes.
const setupIntegrationTest = async () => {
// This is a placeholder for a real Testcontainers setup.
const testClient = new Client({
contactPoints: ['127.0.0.1'], // Assuming a local Cassandra for testing
localDataCenter: 'datacenter1',
keyspace: 'feature_store',
});
await testClient.connect();
// Ensure the keyspace and table exist for the test run.
await testClient.execute("CREATE KEYSPACE IF NOT EXISTS feature_store WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};");
await testClient.execute(`
CREATE TABLE IF NOT EXISTS feature_store.user_features (
user_id uuid,
feature_name text,
feature_value text,
updated_at timestamp,
PRIMARY KEY (user_id, feature_name)
);
`);
// Clean up data before test.
await testClient.execute('TRUNCATE feature_store.user_features;');
return {
client: testClient,
stop: async () => await testClient.shutdown(),
};
};
describe('PersonalizedRAGChain - Integration Test', () => {
let testClient: Client;
let stopContainer: () => Promise<void>;
// Mock the LLM call to avoid actual API costs and non-determinism during tests.
// We are testing the data pipeline, not the LLM's reasoning ability.
// Note: vi.mock is hoisted to the top of the file by Vitest. The factory swaps
// ChatOpenAI for a real Runnable that simply echoes the rendered prompt back,
// so assertions can inspect exactly how the prompt was augmented.
vi.mock('@langchain/openai', async () => {
  const { RunnableLambda } = await import('@langchain/core/runnables');
  return {
    ChatOpenAI: vi.fn(() =>
      RunnableLambda.from(async (prompt: any) =>
        prompt.toChatMessages().map((m: any) => String(m.content)).join('\n')
      )
    ),
  };
});
beforeAll(async () => {
// This timeout is important for tests involving container startup.
vi.setConfig({ testTimeout: 30000 });
const { client, stop } = await setupIntegrationTest();
testClient = client;
stopContainer = stop;
}, 30000);
afterAll(async () => {
if (stopContainer) {
await stopContainer();
}
});
it('should correctly fetch user features from Cassandra and augment the prompt', async () => {
const userId = 'c3a3e6f0-5d6e-4f7a-8b9c-0a1b2c3d4e5f';
const userUuid = types.Uuid.fromString(userId);
// Seed the test database with data for this specific test case.
const insertQuery = 'INSERT INTO feature_store.user_features (user_id, feature_name, feature_value, updated_at) VALUES (?, ?, ?, ?)';
await testClient.execute(insertQuery, [userUuid, 'last_purchase', 'laptop', new Date()], { prepare: true });
await testClient.execute(insertQuery, [userUuid, 'subscription_tier', 'premium', new Date()], { prepare: true });
const result = await personalizedRAGChain.invoke({
question: 'Any deals for me?',
user_id: userId,
});
// The mocked LLM echoes the full rendered prompt, so the result reveals the input it received.
// The key assertions verify the Cassandra features landed in the prompt alongside the question.
expect(result).toContain('last_purchase: laptop');
expect(result).toContain('subscription_tier: premium');
expect(result).toContain('Any deals for me?');
});
it('should handle users with no features gracefully', async () => {
const userIdWithoutFeatures = 'd4b4f711-6e7f-4a8b-9c0d-1b2c3d4e5f6a'; // A valid UUID with no seeded features.
const result = await personalizedRAGChain.invoke({
question: 'What is new?',
user_id: userIdWithoutFeatures,
});
expect(result).toContain('No specific user context available.');
});
});
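Finally, it pays to run the unit and integration suites separately so fast unit tests are not gated on Docker. One possible wiring, using an environment variable and Vitest’s include/exclude options (the file-naming convention is an assumption carried over from the examples above):
vitest.config.ts
// Illustrative: run *.test.ts by default, and pick up *.integration.test.ts
// only when the INTEGRATION env var is set.
import { defineConfig } from 'vitest/config';

const integration = process.env.INTEGRATION === 'true';

export default defineConfig({
  test: {
    include: integration
      ? ['src/**/*.integration.test.ts']
      : ['src/**/*.test.ts'],
    exclude: integration ? [] : ['src/**/*.integration.test.ts'],
    // Container startup can be slow; give integration runs a generous timeout.
    testTimeout: integration ? 60_000 : 5_000,
  },
});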
The current architecture successfully establishes a scalable, low-latency pipeline for personalizing RAG responses. It effectively decouples the read-heavy AI serving load from transactional systems by introducing a Cassandra-backed online feature store. The implementation correctly models data for fast retrieval and uses LangChain’s composability to create a clean, testable service.
However, this implementation represents only the online serving component of a complete feature store. A production-grade system would also require an offline store (e.g., a data lake with Parquet files) for generating features via batch processing, a feature registry for discoverability and governance, and robust mechanisms to ensure consistency between online and offline feature values. The current text-based feature_value is flexible but lacks type safety; a real system might use a structured serialization format such as Avro or Protobuf. Furthermore, Cassandra’s eventual consistency model, while excellent for availability and performance, may not be suitable for features that require transactional updates or strong consistency, forcing architectural trade-offs for different types of personalization data.