Implementing a Tiered Data Storage Architecture for an Embeddable RAG System


The initial attempt to build our embeddable RAG (Retrieval-Augmented Generation) engine within a Java desktop application relied on a single SQLite database. The thinking was straightforward: keep the deployment simple, a single file on disk. We stored document metadata in standard relational tables, the raw text chunks in a TEXT column, and tried to manage vector embeddings in a BLOB column. This approach failed catastrophically as soon as we loaded the first few thousand documents. Relational queries on metadata became sluggish due to page bloating from the large text and blob fields, and performing any kind of vector similarity search required loading the entire vector set into memory, which was a non-starter. A fundamental architectural rethink was necessary, moving away from a one-size-fits-all database to a polyglot persistence model tailored to the specific access patterns of a RAG workload.

Our revised architecture identifies three distinct types of data, each with its own optimal storage strategy:

  1. Structured Metadata & Application State: Document sources, ingestion timestamps, user query history, and processing flags. This is classic relational data, requiring transactional integrity and complex querying capabilities. SQLite managed via JPA/Hibernate remains the best-in-class choice for this in an embeddable context.
  2. Vector Embeddings: High-dimensional floating-point vectors requiring specialized indexing (like HNSW) for fast approximate nearest neighbor search. A dedicated vector database is non-negotiable. We selected ChromaDB for its simplicity and ability to be run as a local, self-contained server, aligning with our “embeddable” ethos.
  3. Raw Document Chunks: The actual text content corresponding to each vector. After a vector search identifies relevant document IDs, we need to retrieve this text with minimal latency. Storing it in SQLite was inefficient. A simple, high-throughput key-value store is the perfect tool for this job. LevelDB, with its Log-Structured Merge-Tree design, offers excellent write performance and fast key-based reads, making it ideal for this retrieval pattern.

This tiered approach forms a hybrid storage engine. The central challenge becomes orchestrating these three disparate systems to maintain data consistency and provide a unified data access layer for the application.
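
The remainder of this article builds that unified layer piece by piece. As a rough orientation, its surface looks something like the sketch below (names are illustrative, not a final API; the concrete HybridRAGDataService developed later plays this role):

// Hypothetical facade for the unified data access layer; illustrative only.
// RetrievedChunk is a placeholder for whatever result type the application needs.
public interface RagDataAccess {

    // Persists metadata, raw text, and the embedding for one chunk; returns the generated id.
    Long addDocument(String sourceId, int chunkIndex, String content);

    // Vector search followed by raw-content lookup; returns the top-k chunks.
    List<RetrievedChunk> retrieveContext(String queryText, int topK);
}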

Orchestrating the Polyglot Persistence Layer

The core of the implementation is a HybridRAGDataService that encapsulates all interactions with the three underlying data stores. Its primary responsibility is to ensure that an operation, such as adding a new document, is executed correctly across all tiers. The most significant risk in a real-world project is the absence of distributed transaction support spanning JPA, LevelDB, and the ChromaDB HTTP API. Our strategy is to designate a primary store (SQLite/JPA) and treat the others as secondary: if a write to a secondary store fails, compensating logic must roll back the primary store’s change.

Here is the ingestion flow for a new document:

sequenceDiagram
    participant App as Application
    participant Service as HybridRAGDataService
    participant JPA as JPA (SQLite)
    participant LevelDB as RawContentStore
    participant Chroma as VectorStore (ChromaDB)

    App->>Service: addDocument(content, metadata)
    Service->>JPA: 1. Persist DocumentMetadata
    JPA-->>Service: Return generated ID (e.g., 42)
    Service->>Service: 2. Generate embedding for content
    alt Successful Ingestion
        Service->>LevelDB: 3. Put(key=42, value=content)
        LevelDB-->>Service: OK
        Service->>Chroma: 4. Add(vector, metadata={"doc_id": "42"})
        Chroma-->>Service: OK
        Service-->>App: DocumentAddedSuccessfully
    else LevelDB Write Fails
        Service->>LevelDB: 3. Put(key=42, value=content)
        LevelDB-->>Service: FAILED
        Service->>JPA: 5a. Rollback: Delete DocumentMetadata(id=42)
        JPA-->>Service: OK
        Service-->>App: IngestionFailedException
    else ChromaDB Add Fails
        Service->>Chroma: 4. Add(vector, metadata={"doc_id": "42"})
        Chroma-->>Service: FAILED
        Service->>JPA: 5b. Rollback: Delete DocumentMetadata(id=42)
        JPA-->>Service: OK
        Service->>LevelDB: 5c. Rollback: Delete(key=42)
        LevelDB-->>Service: OK
        Service-->>App: IngestionFailedException
    end

The corresponding query flow leverages each store for what it does best:

sequenceDiagram
    participant App as Application
    participant Service as HybridRAGDataService
    participant Chroma as VectorStore (ChromaDB)
    participant LevelDB as RawContentStore
    participant JPA as JPA (SQLite)

    App->>Service: retrieveContext(queryText)
    Service->>Service: 1. Generate embedding for queryText
    Service->>Chroma: 2. Query for top-k nearest vectors
    Chroma-->>Service: Return list of doc_ids ["7", "101", "34"]
    Service->>LevelDB: 3. Batch Get(keys=["7", "101", "34"])
    LevelDB-->>Service: Return map of {id -> raw_content}
    Service->>JPA: 4. (Optional) Batch FindByIds(["7", "101", "34"])
    JPA-->>Service: Return map of {id -> metadata}
    Service->>Service: 5. Combine content and metadata
    Service-->>App: Return RichContext objects

Project Foundation and Dependencies

We use Maven to manage the project. The pom.xml must include dependencies for JPA/Hibernate with the SQLite community dialect, the SQLite JDBC driver, the LevelDB JNI wrapper, and the langchain4j Chroma integration that acts as our ChromaDB client.

<project ...>
    <properties>
        <maven.compiler.source>17</maven.compiler.source>
        <maven.compiler.target>17</maven.compiler.target>
        <hibernate.version>6.2.7.Final</hibernate.version>
        <sqlite.driver.version>3.42.0.0</sqlite.driver.version>
    </properties>

    <dependencies>
        <!-- JPA and Hibernate for SQLite -->
        <dependency>
            <groupId>org.hibernate.orm</groupId>
            <artifactId>hibernate-core</artifactId>
            <version>${hibernate.version}</version>
        </dependency>
        <dependency>
            <groupId>org.xerial</groupId>
            <artifactId>sqlite-jdbc</artifactId>
            <version>${sqlite.driver.version}</version>
        </dependency>
        <!-- Provides org.hibernate.community.dialect.SQLiteDialect used in persistence.xml -->
        <dependency>
            <groupId>org.hibernate.orm</groupId>
            <artifactId>hibernate-community-dialects</artifactId>
            <version>${hibernate.version}</version>
        </dependency>
        <dependency>
            <groupId>org.glassfish.jaxb</groupId>
            <artifactId>jaxb-runtime</artifactId>
            <version>4.0.3</version>
        </dependency>

        <!-- LevelDB JNI Wrapper -->
        <dependency>
            <groupId>org.fusesource.leveldbjni</groupId>
            <artifactId>leveldbjni-all</artifactId>
            <version>1.8</version>
        </dependency>

        <!-- ChromaDB Java Client -->
        <dependency>
            <groupId>dev.langchain4j</groupId>
            <artifactId>langchain4j-chroma</artifactId>
            <version>0.24.0</version>
        </dependency>
        
        <!-- A simple embedding model client -->
        <dependency>
            <groupId>dev.langchain4j</groupId>
            <artifactId>langchain4j-open-ai</artifactId>
            <version>0.24.0</version>
        </dependency>

        <!-- Logging -->
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-simple</artifactId>
            <version>2.0.7</version>
        </dependency>
    </dependencies>
</project>

A common pitfall in modern Java is the removal of JAXB from the JDK. We must explicitly add it for Hibernate to function correctly.

Tier 1: Relational Metadata with JPA and SQLite

The first layer is the “source of truth” for document existence. We define a DocumentMetadata entity. Its primary key will serve as the unique identifier across all three storage systems.

package com.example.rag.storage;

import jakarta.persistence.*;
import java.time.Instant;

@Entity
@Table(name = "document_metadata")
public class DocumentMetadata {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @Column(name = "source_id", nullable = false, unique = true)
    private String sourceId;

    @Column(name = "chunk_index", nullable = false)
    private int chunkIndex;

    @Column(name = "char_count", nullable = false)
    private int charCount;

    @Column(name = "ingested_at", nullable = false)
    private Instant ingestedAt;

    // Standard getters and setters
    
    public Long getId() { return id; }
    public void setId(Long id) { this.id = id; }
    public String getSourceId() { return sourceId; }
    public void setSourceId(String sourceId) { this.sourceId = sourceId; }
    public int getChunkIndex() { return chunkIndex; }
    public void setChunkIndex(int chunkIndex) { this.chunkIndex = chunkIndex; }
    public int getCharCount() { return charCount; }
    public void setCharCount(int charCount) { this.charCount = charCount; }
    public Instant getIngestedAt() { return ingestedAt; }
    public void setIngestedAt(Instant ingestedAt) { this.ingestedAt = ingestedAt; }

    @PrePersist
    public void onPrePersist() {
        this.ingestedAt = Instant.now();
    }
}

The JPA configuration is handled via persistence.xml placed in src/main/resources/META-INF. We configure it to connect to a local file-based SQLite database and automatically create/update the schema.

<persistence xmlns="https://jakarta.ee/xml/ns/persistence"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="https://jakarta.ee/xml/ns/persistence https://jakarta.ee/xml/ns/persistence/persistence_3_0.xsd"
             version="3.0">
    <persistence-unit name="rag-pu" transaction-type="RESOURCE_LOCAL">
        <provider>org.hibernate.jpa.HibernatePersistenceProvider</provider>
        <class>com.example.rag.storage.DocumentMetadata</class>
        <properties>
            <property name="jakarta.persistence.jdbc.driver" value="org.sqlite.JDBC"/>
            <property name="jakarta.persistence.jdbc.url" value="jdbc:sqlite:rag_metadata.db"/>
            <property name="jakarta.persistence.jdbc.user" value=""/>
            <property name="jakarta.persistence.jdbc.password" value=""/>
            <property name="hibernate.dialect" value="org.hibernate.community.dialect.SQLiteDialect"/>
            <property name="hibernate.hbm2ddl.auto" value="update"/>
            <property name="hibernate.show_sql" value="false"/>
            <property name="hibernate.format_sql" value="true"/>
        </properties>
    </persistence-unit>
</persistence>
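
With this file in place, the persistence unit name is the only thing the application needs at startup. A minimal bootstrap (repeated in the test at the end of the article) looks like this:

// Bootstraps Hibernate against the "rag-pu" unit defined above; the SQLite
// file rag_metadata.db is created in the working directory on first use.
EntityManagerFactory emf = Persistence.createEntityManagerFactory("rag-pu");
// ... application lifetime ...
emf.close(); // releases the SQLite connection on shutdown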

The repository class for this entity handles the transaction boilerplate.

package com.example.rag.storage;

import jakarta.persistence.EntityManager;
import jakarta.persistence.EntityManagerFactory;
import jakarta.persistence.EntityTransaction;

import java.util.function.Function;

public class MetadataRepository {

    private final EntityManagerFactory emf;

    public MetadataRepository(EntityManagerFactory emf) {
        this.emf = emf;
    }

    public <T> T executeInTransaction(Function<EntityManager, T> action) {
        EntityManager em = emf.createEntityManager();
        EntityTransaction tx = em.getTransaction();
        try {
            tx.begin();
            T result = action.apply(em);
            tx.commit();
            return result;
        } catch (RuntimeException e) {
            if (tx != null && tx.isActive()) {
                tx.rollback();
            }
            throw e;
        } finally {
            em.close();
        }
    }
    
    // Deliberately not an overload of executeInTransaction: with an implicitly typed
    // lambda the compiler cannot choose between a Function and a Consumer parameter,
    // so the void-returning variant gets its own name.
    public void runInTransaction(java.util.function.Consumer<EntityManager> action) {
        executeInTransaction(em -> {
            action.accept(em);
            return null;
        });
    }
}
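
As a quick sanity check of the repository (illustrative usage only, not part of the service yet), persisting a row and reading it back looks like this:

// Illustrative usage of MetadataRepository; assumes an EntityManagerFactory named emf.
MetadataRepository repo = new MetadataRepository(emf);

DocumentMetadata meta = new DocumentMetadata();
meta.setSourceId("example_source");
meta.setChunkIndex(0);
meta.setCharCount(128);

// With IDENTITY generation, the id is populated by the insert issued during persist.
DocumentMetadata saved = repo.executeInTransaction(em -> {
    em.persist(meta);
    return meta;
});

DocumentMetadata reloaded = repo.executeInTransaction(em ->
        em.find(DocumentMetadata.class, saved.getId()));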

Tier 2: High-Speed Raw Text Access with LevelDB

The second layer is a RawContentStore that wraps LevelDB. Its interface is minimal: put, get, delete. Lifecycle management (open, close) is critical to prevent database corruption.

package com.example.rag.storage;

import org.iq80.leveldb.DB;
import org.iq80.leveldb.DBException;
import org.iq80.leveldb.Options;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

import static org.fusesource.leveldbjni.JniDBFactory.factory; // JNI factory from the leveldbjni-all artifact declared in the pom

public class RawContentStore implements AutoCloseable {

    private static final Logger logger = LoggerFactory.getLogger(RawContentStore.class);
    private final DB levelDb;

    public RawContentStore(String path) {
        Options options = new Options();
        options.createIfMissing(true);
        try {
            this.levelDb = factory.open(new File(path), options);
            logger.info("LevelDB raw content store opened at {}", path);
        } catch (IOException e) {
            throw new RuntimeException("Failed to open LevelDB at " + path, e);
        }
    }

    public void put(Long key, String value) {
        if (key == null || value == null) {
            throw new IllegalArgumentException("Key and value cannot be null");
        }
        try {
            levelDb.put(
                String.valueOf(key).getBytes(StandardCharsets.UTF_8),
                value.getBytes(StandardCharsets.UTF_8)
            );
        } catch (DBException e) {
            logger.error("Failed to write to LevelDB for key: {}", key, e);
            throw new StorageException("LevelDB put operation failed", e);
        }
    }

    public Optional<String> get(Long key) {
        if (key == null) {
            return Optional.empty();
        }
        try {
            byte[] valueBytes = levelDb.get(String.valueOf(key).getBytes(StandardCharsets.UTF_8));
            return (valueBytes != null)
                ? Optional.of(new String(valueBytes, StandardCharsets.UTF_8))
                : Optional.empty();
        } catch (DBException e) {
            logger.error("Failed to read from LevelDB for key: {}", key, e);
            throw new StorageException("LevelDB get operation failed", e);
        }
    }
    
    public void delete(Long key) {
        if (key == null) return;
        try {
            levelDb.delete(String.valueOf(key).getBytes(StandardCharsets.UTF_8));
        } catch (DBException e) {
            logger.error("Failed to delete from LevelDB for key: {}", key, e);
            // In a production system, you might schedule a retry instead of throwing
            throw new StorageException("LevelDB delete operation failed", e);
        }
    }

    @Override
    public void close() {
        try {
            levelDb.close();
            logger.info("LevelDB raw content store closed.");
        } catch (IOException e) {
            logger.error("Error closing LevelDB", e);
        }
    }
    
    // Custom exception for storage layer
    public static class StorageException extends RuntimeException {
        public StorageException(String message, Throwable cause) {
            super(message, cause);
        }
    }
}

A custom StorageException lets the service layer distinguish persistence failures from other runtime errors. The keys are the Long IDs generated by SQLite, stored as the UTF-8 bytes of their decimal string representation.
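
The query sequence diagram above shows a batched read, but LevelDB has no native multi-get, so a batch lookup is simply a loop over get. A hypothetical helper along these lines could be added to RawContentStore (getAll is not part of the class shown above; it needs java.util.List, java.util.Map, and java.util.LinkedHashMap imports):

    // Hypothetical batch lookup: sequential gets that skip missing keys.
    public Map<Long, String> getAll(List<Long> keys) {
        Map<Long, String> results = new LinkedHashMap<>();
        for (Long key : keys) {
            get(key).ifPresent(value -> results.put(key, value));
        }
        return results;
    }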

Tier 3: Semantic Search with ChromaDB

The vector store component interfaces with a ChromaDB instance. For an embeddable system, we can launch the Chroma server as a subprocess or connect to a local instance started manually (chroma run --path /path/to/db). The langchain4j-chroma library provides a convenient client.
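
If the application should own the server's lifecycle instead of relying on a manually started instance, the launch can be sketched with a ProcessBuilder. This assumes the chroma CLI is installed and on the PATH; readiness polling and error handling are omitted:

// Sketch: launch a local ChromaDB server as a child process
// (assumes the "chroma" CLI is installed and on the PATH).
ProcessBuilder pb = new ProcessBuilder("chroma", "run", "--path", "/path/to/db");
pb.redirectErrorStream(true);
pb.redirectOutput(ProcessBuilder.Redirect.INHERIT);
Process chromaProcess = pb.start(); // IOException if the binary cannot be found

// Make sure the server is stopped when the JVM exits.
Runtime.getRuntime().addShutdownHook(new Thread(chromaProcess::destroy));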

package com.example.rag.storage;

import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.store.embedding.EmbeddingMatch;
import dev.langchain4j.store.embedding.EmbeddingStore;
import dev.langchain4j.store.embedding.chroma.ChromaEmbeddingStore;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.List;

public class VectorStore {

    private static final Logger logger = LoggerFactory.getLogger(VectorStore.class);
    public static final String DOC_ID_METADATA_KEY = "doc_id";
    
    private final EmbeddingStore<TextSegment> embeddingStore;

    public VectorStore(String collectionName, String baseUrl, int timeoutSeconds) {
        this.embeddingStore = ChromaEmbeddingStore.builder()
                .baseUrl(baseUrl)
                .collectionName(collectionName)
                .timeout(java.time.Duration.ofSeconds(timeoutSeconds))
                .build();
        logger.info("ChromaDB vector store connected to {} for collection '{}'", baseUrl, collectionName);
    }
    
    public String add(Long documentId, Embedding embedding, String originalText) {
        TextSegment segment = TextSegment.from(originalText);
        // We embed the documentId into the metadata. This is the crucial link back to our other stores.
        segment.metadata().add(DOC_ID_METADATA_KEY, String.valueOf(documentId));
        
        try {
            return embeddingStore.add(embedding, segment);
        } catch (Exception e) {
            logger.error("Failed to add embedding to ChromaDB for doc ID: {}", documentId, e);
            throw new StorageException("ChromaDB add operation failed", e);
        }
    }
    
    public List<EmbeddingMatch<TextSegment>> findRelevant(Embedding queryEmbedding, int maxResults) {
        try {
            return embeddingStore.findRelevant(queryEmbedding, maxResults);
        } catch (Exception e) {
            logger.error("Failed to query ChromaDB", e);
            throw new StorageException("ChromaDB query operation failed", e);
        }
    }

    // Custom exception for storage layer, can be the same as RawContentStore's
    public static class StorageException extends RuntimeException {
        public StorageException(String message, Throwable cause) {
            super(message, cause);
        }
    }
}

The most critical design decision here is embedding the documentId from SQLite into the ChromaDB metadata. This allows us to retrieve the primary key after a vector search and then use it to fetch the full data from JPA and LevelDB.

The Hybrid Service: Tying It All Together

The HybridRAGDataService orchestrates these three components. Its methods implement the logic from the sequence diagrams, including the crucial compensation logic for handling failures.

package com.example.rag.service;

import com.example.rag.storage.*;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingMatch;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

public class HybridRAGDataService {

    private static final Logger logger = LoggerFactory.getLogger(HybridRAGDataService.class);

    private final MetadataRepository metadataRepo;
    private final RawContentStore contentStore;
    private final VectorStore vectorStore;
    private final EmbeddingModel embeddingModel;

    public HybridRAGDataService(
            MetadataRepository metadataRepo,
            RawContentStore contentStore,
            VectorStore vectorStore,
            EmbeddingModel embeddingModel) {
        this.metadataRepo = metadataRepo;
        this.contentStore = contentStore;
        this.vectorStore = vectorStore;
        this.embeddingModel = embeddingModel;
    }

    public DocumentMetadata addDocument(String sourceId, int chunkIndex, String content) {
        logger.info("Starting ingestion for sourceId: {}, chunkIndex: {}", sourceId, chunkIndex);
        
        // Step 1: Persist metadata to get a generated ID. This is our anchor.
        DocumentMetadata metadata = new DocumentMetadata();
        metadata.setSourceId(sourceId);
        metadata.setChunkIndex(chunkIndex);
        metadata.setCharCount(content.length());
        
        DocumentMetadata persistedMetadata = metadataRepo.executeInTransaction(em -> {
            em.persist(metadata);
            return metadata;
        });
        
        Long docId = persistedMetadata.getId();
        logger.debug("Persisted metadata for docId: {}", docId);

        try {
            // Step 2: Store raw content in LevelDB
            contentStore.put(docId, content);
            logger.debug("Stored raw content in LevelDB for docId: {}", docId);

            // Step 3: Generate embedding and store in ChromaDB
            Embedding embedding = embeddingModel.embed(content).content();
            vectorStore.add(docId, embedding, content);
            logger.debug("Stored vector in ChromaDB for docId: {}", docId);

            return persistedMetadata;

        } catch (Exception e) {
            logger.error("Ingestion failed for docId: {}. Initiating rollback.", docId, e);
            // Compensation logic: Rollback previous steps
            // A more robust implementation might use a state machine or a job queue
            // to handle retries for transient errors.
            try {
                contentStore.delete(docId); // Attempt to delete from LevelDB
            } catch(Exception cleanupEx) {
                logger.error("Rollback failed for LevelDB content on docId: {}. Manual cleanup may be required.", docId, cleanupEx);
            }
            try {
                metadataRepo.runInTransaction(em -> {
                    DocumentMetadata toDelete = em.find(DocumentMetadata.class, docId);
                    if (toDelete != null) {
                        em.remove(toDelete);
                    }
                });
            } catch(Exception cleanupEx) {
                 logger.error("CRITICAL: Rollback failed for metadata on docId: {}. Database is in an inconsistent state.", docId, cleanupEx);
            }
            throw new IngestionException("Failed to ingest document, rollback initiated.", e);
        }
    }

    public List<QueryResult> search(String query, int topK) {
        Embedding queryEmbedding = embeddingModel.embed(query).content();
        List<EmbeddingMatch<TextSegment>> matches = vectorStore.findRelevant(queryEmbedding, topK);

        return matches.stream()
            .map(match -> {
                String docIdStr = match.embedded().metadata().get(VectorStore.DOC_ID_METADATA_KEY);
                Long docId = Long.parseLong(docIdStr);
                
                // Fetch the raw content from LevelDB. This should be very fast.
                Optional<String> content = contentStore.get(docId);
                
                return new QueryResult(
                    docId,
                    match.score(),
                    content.orElse(""),
                    match.embedded().text() // Chroma can also store the text, but we rely on LevelDB as primary source
                );
            })
            .collect(Collectors.toList());
    }

    public static class QueryResult {
        public final Long documentId;
        public final double score;
        public final String content;
        public final String matchedSegment; // from Chroma
        // Constructor, getters, etc.
        public QueryResult(Long documentId, double score, String content, String matchedSegment) {
            this.documentId = documentId;
            this.score = score;
            this.content = content;
            this.matchedSegment = matchedSegment;
        }
    }

    public static class IngestionException extends RuntimeException {
        public IngestionException(String message, Throwable cause) {
            super(message, cause);
        }
    }
}

The error handling in addDocument is the most important part of this service. A failure in any step after the initial database insert triggers a series of compensating actions to clean up the other stores. A production-grade system would need to be even more robust, potentially flagging records for a separate cleanup process if a rollback itself fails, to avoid leaving orphaned data.
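
One way to harden the failure path, as hinted above, is to record failed compensations instead of merely logging them, so a background job can retry the deletes later. A minimal sketch of such a marker entity follows; the name PendingCleanup and its columns are illustrative, it would be registered in persistence.xml like DocumentMetadata and written from the catch blocks in addDocument:

// Illustrative marker entity for failed rollbacks; a scheduled task would
// periodically retry the corresponding deletes and remove the row on success.
@Entity
@Table(name = "pending_cleanup")
public class PendingCleanup {

    @Id
    @Column(name = "doc_id")
    private Long docId;                  // the orphaned document id

    @Column(name = "failed_store", nullable = false)
    private String failedStore;          // e.g. "LEVELDB" or "CHROMA"

    @Column(name = "recorded_at", nullable = false)
    private Instant recordedAt;

    protected PendingCleanup() { }       // required by JPA

    public PendingCleanup(Long docId, String failedStore) {
        this.docId = docId;
        this.failedStore = failedStore;
        this.recordedAt = Instant.now();
    }
}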

Testing the Implementation

A unit or integration test can validate the entire flow. This test would initialize all three storage layers, add a document, and then perform a search to ensure the correct data is retrieved.

// Example usage in a main method or test (imports omitted: jakarta.persistence.*,
// java.nio.file.*, java.util.List, and the langchain4j EmbeddingModel/OpenAiEmbeddingModel classes).
public static void main(String[] args) throws IOException {
    // Setup - In a real app, this would be managed by a DI framework
    EntityManagerFactory emf = Persistence.createEntityManagerFactory("rag-pu");
    MetadataRepository metadataRepo = new MetadataRepository(emf);
    
    // Use temporary directories for testing
    Path tempDir = Files.createTempDirectory("rag_test");
    RawContentStore contentStore = new RawContentStore(tempDir.resolve("leveldb").toString());
    VectorStore vectorStore = new VectorStore("test_collection", "http://localhost:8000", 60);

    // Mock or real embedding model
    EmbeddingModel embeddingModel = OpenAiEmbeddingModel.builder()
        .apiKey(System.getenv("OPENAI_API_KEY"))
        .modelName("text-embedding-3-small")
        .build();

    HybridRAGDataService service = new HybridRAGDataService(metadataRepo, contentStore, vectorStore, embeddingModel);

    // Run test
    try {
        String content1 = "The Catcher in the Rye is a novel by J. D. Salinger.";
        service.addDocument("salinger_bio_1", 0, content1);
        
        String content2 = "SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine.";
        service.addDocument("tech_doc_1", 0, content2);

        System.out.println("Ingestion successful.");

        List<HybridRAGDataService.QueryResult> results = service.search("structured query language", 1);
        
        System.out.println("Search Results:");
        results.forEach(r -> {
            System.out.printf("ID: %d, Score: %.4f, Content: %s%n", r.documentId, r.score, r.content);
        });

    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        // Cleanup
        contentStore.close();
        emf.close();
    }
}

Before running this, one must have a ChromaDB server running locally (chroma run). The test demonstrates the successful orchestration: data is written across three systems, and a query correctly identifies the relevant document ID from the vector store and retrieves the corresponding raw content from the key-value store.
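
For tests that should run offline, the OpenAI client can be swapped for a local embedding model. A sketch, assuming the dev.langchain4j:langchain4j-embeddings-all-minilm-l6-v2 artifact (same version as the other langchain4j modules) is added to the pom:

// Sketch: in-process embedding model for offline tests; a drop-in replacement for
// the OpenAI model above. Its vectors are 384-dimensional, so use a fresh Chroma
// collection rather than one that already holds OpenAI embeddings.
EmbeddingModel embeddingModel = new AllMiniLmL6V2EmbeddingModel();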

The primary limitation of this architecture is its explicit design for single-node, embeddable deployments. The manual compensation logic for transaction failures introduces complexity and potential for inconsistency if a rollback action fails. It cannot scale horizontally without significant redesign; for a distributed system, one would likely converge on a managed database solution like PostgreSQL with pgvector, or fully distributed services for each data type. Furthermore, the lifecycle management of the ChromaDB process needs to be robustly handled by the parent Java application, including startup, shutdown, and monitoring, which adds operational overhead. This architecture, however, represents a highly optimized solution for its specific niche: creating powerful, self-contained AI applications that do not require external database infrastructure.

