The mandate was clear: integrate semantic search and question-answering capabilities into our existing product suite. The technical constraint, however, was rigid. Our entire data infrastructure is built on MariaDB, and introducing a new, dedicated vector database like Pinecone or Weaviate was vetoed due to operational overhead, budget constraints, and the unsolved complexities of real-time data synchronization. The challenge became an architectural one: could we force a traditional relational database to perform acceptably as a vector store for a Retrieval-Augmented Generation (RAG) pipeline? This is the build log of that effort—a story of compromises, workarounds, and pragmatic engineering.
Our initial concept was a self-contained ASP.NET Core service that would serve as the brain for this new capability. The architecture had to be clean and decouple the core concerns: model inference, data persistence, and orchestration.
graph TD
    subgraph "ASP.NET Core RAG Service"
        A[API Controller] --> B{Orchestration Service};
        B --> C[Embedding Service];
        B --> D[Vector Search Service];
        C --> E[Hugging Face Model];
        D --> F[(MariaDB)];
    end
    User --> A;
    E -- ONNX Runtime --> C;
    style F fill:#f9f,stroke:#333,stroke-width:2px
The flow is standard for a RAG pipeline:
- An ingestion process (not pictured) would process source documents, chunk them, and use the Embedding Service to generate vectors.
- These text chunks, their metadata, and the generated vectors would be stored in MariaDB.
- At query time, a user’s question would hit the API Controller.
- The Orchestration Service would first send the query to the Embedding Service to get its vector representation.
- It would then pass this query vector to the Vector Search Service, which would find the most similar document vectors within MariaDB.
- Finally, the orchestrator would retrieve the corresponding text chunks, construct a prompt containing the original question and the retrieved context, and prepare it for a generative model.
The real engineering challenge resides entirely within the Vector Search Service and its interaction with MariaDB. Everything else is relatively straightforward application development.
Phase 1: MariaDB Schema and Data Persistence
Before any search could happen, we had to solve the problem of storing vectors in MariaDB. After some debate, we settled on a VARBINARY column. It’s more space-efficient than storing a JSON array of floats and avoids character set conversion issues that could corrupt the raw vector data.
Our schema for document chunks looked like this:
CREATE TABLE `document_chunks` (
`id` BIGINT NOT NULL AUTO_INCREMENT,
`document_source_id` VARCHAR(255) NOT NULL,
`content` TEXT NOT NULL,
`embedding` VARBINARY(6144) NOT NULL, -- A 384-dim float vector needs 384 * 4 = 1536 bytes; 6144 leaves ample room for larger embedding dimensions.
`created_at` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
INDEX `idx_document_source_id` (`document_source_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
In a real-world project, this table would be more complex, with foreign keys and rich metadata columns for filtering. For this build, the focus is on the vector. The embedding column stores the raw bytes of a 384-dimensional floating-point vector, which is a common output size for sentence-transformer models like all-MiniLM-L6-v2.
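Writing and reading that column means converting between the model’s float[] output and those raw bytes. Below is a minimal helper sketch; the VectorSerializer name is ours, not a library API, and it assumes the host’s native little-endian float layout on both the ingestion and query paths.
// Helpers/VectorSerializer.cs (illustrative helper; the class name is ours)
public static class VectorSerializer
{
    // Packs a float vector into the raw bytes stored in the VARBINARY column.
    public static byte[] ToBytes(float[] vector)
    {
        var bytes = new byte[vector.Length * sizeof(float)];
        Buffer.BlockCopy(vector, 0, bytes, 0, bytes.Length);
        return bytes;
    }

    // Unpacks the VARBINARY bytes back into a float vector (e.g., 1536 bytes -> 384 floats).
    public static float[] ToFloats(byte[] bytes)
    {
        var vector = new float[bytes.Length / sizeof(float)];
        Buffer.BlockCopy(bytes, 0, vector, 0, bytes.Length);
        return vector;
    }
}
The ingestion path calls ToBytes before the INSERT, and any application-side similarity calculation calls ToFloats after reading the column.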
Phase 2: The ASP.NET Core Service Foundation
We began with a standard ASP.NET Core 8 Web API project. The core dependencies were Microsoft.ML.OnnxRuntime for running the embedding model and Dapper for data access. We chose Dapper over Entity Framework Core for this specific task because we anticipated writing highly optimized, and possibly complex, SQL for the vector search, where EF Core’s abstractions could become a hindrance.
The setup in Program.cs involves registering our services and, crucially, managing the lifecycle of the ONNX InferenceSession. This session is thread-safe and expensive to initialize, making it a perfect candidate for a singleton.
// Program.cs
using RAGWithMariaDB.Services;
var builder = WebApplication.CreateBuilder(args);
// Configuration
builder.Services.AddControllers();
builder.Services.AddEndpointsApiExplorer();
builder.Services.AddSwaggerGen();
// --- Service Registration ---
// Load model path from configuration
var modelPath = builder.Configuration["EmbeddingModel:Path"];
if (string.IsNullOrEmpty(modelPath) || !File.Exists(modelPath))
{
throw new FileNotFoundException("Embedding model not found at the specified path.", modelPath);
}
// Register the embedding service (which owns the long-lived ONNX InferenceSession) as a singleton
builder.Services.AddSingleton<EmbeddingService>(sp =>
{
var logger = sp.GetRequiredService<ILogger<EmbeddingService>>();
return new EmbeddingService(modelPath, logger);
});
// Register data access and orchestration services
builder.Services.AddScoped<IDocumentRepository>(sp =>
{
var connectionString = sp.GetRequiredService<IConfiguration>().GetConnectionString("MariaDB");
return new DocumentRepository(connectionString!);
});
builder.Services.AddScoped<RAGOrchestrator>();
var app = builder.Build();
if (app.Environment.IsDevelopment())
{
app.UseSwagger();
app.UseSwaggerUI();
}
app.UseHttpsRedirection();
app.UseAuthorization();
app.MapControllers();
app.Run();
The configuration in appsettings.json points to the local path of our ONNX model.
// appsettings.json
{
"Logging": {
"LogLevel": {
"Default": "Information",
"Microsoft.AspNetCore": "Warning"
}
},
"AllowedHosts": "*",
"ConnectionStrings": {
"MariaDB": "Server=localhost;Port=3306;Database=rag_db;Uid=root;Pwd=your_password;"
},
"EmbeddingModel": {
"Path": "./Model/all-MiniLM-L6-v2.onnx"
}
}
Phase 3: Embedding Generation
The EmbeddingService is the bridge to the Hugging Face model. We download a pre-trained sentence-transformer model, convert it to ONNX format, and package it with our application. This service is responsible for loading the model, running input text through the required tokenization steps (which, for many sentence-transformers, are handled by libraries we must replicate or use), and executing the inference. For brevity, we’ll assume a simplified tokenizer or a model that requires minimal preprocessing. The core logic revolves around OnnxRuntime.
// Services/EmbeddingService.cs
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
public class EmbeddingService : IDisposable
{
private readonly InferenceSession _session;
private readonly ILogger<EmbeddingService> _logger;
public EmbeddingService(string modelPath, ILogger<EmbeddingService> logger)
{
_logger = logger;
try
{
// Prefer the CUDA execution provider, but fall back to CPU.
// Note: AppendExecutionProvider_CUDA throws if the GPU package or CUDA runtime
// is not present, so the fallback must be handled explicitly rather than assumed.
var sessionOptions = new SessionOptions();
try
{
sessionOptions.AppendExecutionProvider_CUDA();
}
catch (Exception)
{
_logger.LogWarning("CUDA execution provider unavailable; falling back to CPU execution.");
}
_session = new InferenceSession(modelPath, sessionOptions);
_logger.LogInformation("ONNX InferenceSession created successfully for model: {ModelPath}", modelPath);
}
catch (Exception ex)
{
_logger.LogError(ex, "Failed to create ONNX InferenceSession.");
throw;
}
}
// A real implementation requires a proper tokenizer.
// This is a placeholder for the logic that converts text to model input IDs.
private long[] Tokenize(string text)
{
// This is a gross oversimplification. In a production system, you would use
// a library like HuggingFace's Tokenizers library (or a .NET port) that matches
// the exact tokenizer configuration of the model (e.g., WordPiece for BERT-based models).
// For demonstration, we create a dummy tensor of a fixed size.
// Let's assume a max sequence length of 128.
var random = new Random();
return Enumerable.Range(0, 128).Select(_ => (long)random.Next(1000, 30000)).ToArray();
}
public float[] GenerateEmbedding(string text)
{
var inputIds = Tokenize(text);
var attentionMask = Enumerable.Repeat(1L, inputIds.Length).ToArray();
var inputIdsTensor = new DenseTensor<long>(inputIds, new[] { 1, inputIds.Length });
var attentionMaskTensor = new DenseTensor<long>(attentionMask, new[] { 1, attentionMask.Length });
// Model inputs usually require specific names, e.g., 'input_ids', 'attention_mask'
var inputs = new List<NamedOnnxValue>
{
NamedOnnxValue.CreateFromTensor("input_ids", inputIdsTensor),
NamedOnnxValue.CreateFromTensor("attention_mask", attentionMaskTensor)
};
try
{
using var results = _session.Run(inputs);
// The output is typically a list of tensors. We are interested in the first one,
// which often represents the last hidden state or pooler output.
var outputTensor = results.FirstOrDefault()?.AsTensor<float>();
if (outputTensor == null)
{
throw new InvalidOperationException("Model output is null or not a float tensor.");
}
// The output tensor might have dimensions like [1, SequenceLength, EmbeddingDim].
// Sentence-transformers typically perform a mean pooling operation. Let's simulate that.
// A well-converted model might already have this pooling step included.
// For this example, we'll assume the output is [1, 384] and we just need to get the vector.
return outputTensor.ToArray();
}
catch (Exception ex)
{
_logger.LogError(ex, "Failed during model inference for text: {Text}", text.Length > 50 ? text[..50] : text);
throw;
}
}
public void Dispose()
{
_session?.Dispose();
}
}
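As the comments in GenerateEmbedding note, a cleanly exported model may already include the pooling step. If it instead returns the raw last hidden state with shape [1, SequenceLength, 384], the mean pooling that sentence-transformers apply has to be replicated in application code. A minimal sketch of such a helper (our own, not part of OnnxRuntime) that could live inside EmbeddingService, using the attention mask to ignore padding:
// Illustrative mean pooling over a [1, seqLen, dim] last-hidden-state tensor.
// Assumes the attention mask uses 1 for real tokens and 0 for padding.
private static float[] MeanPool(Tensor<float> lastHiddenState, long[] attentionMask, int dim)
{
    var pooled = new float[dim];
    var tokenCount = 0;
    for (int t = 0; t < attentionMask.Length; t++)
    {
        if (attentionMask[t] == 0) continue; // skip padding positions
        tokenCount++;
        for (int d = 0; d < dim; d++)
        {
            pooled[d] += lastHiddenState[0, t, d];
        }
    }
    if (tokenCount == 0) return pooled;
    for (int d = 0; d < dim; d++)
    {
        pooled[d] /= tokenCount;
    }
    return pooled;
}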
A common mistake here is underestimating the complexity of tokenization. It’s not just splitting words. It involves vocabulary lookups, special tokens, padding, and truncation. Getting this wrong leads to nonsensical embeddings and catastrophic failure of the entire system.
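To make the gap concrete, here is a heavily simplified sketch of the greedy longest-match lookup at the heart of WordPiece. It is illustrative only: it ignores Unicode normalization, special tokens like [CLS] and [SEP], padding, and truncation, all of which must match the model’s exact tokenizer configuration.
// Illustrative greedy longest-match WordPiece lookup. Not production code.
private static IEnumerable<long> WordPieceIds(string text, IReadOnlyDictionary<string, long> vocab, long unknownId)
{
    foreach (var word in text.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries))
    {
        var start = 0;
        while (start < word.Length)
        {
            var end = word.Length;
            long? match = null;
            while (end > start)
            {
                // Continuation pieces are prefixed with "##" in BERT-style vocabularies.
                var piece = (start > 0 ? "##" : "") + word[start..end];
                if (vocab.TryGetValue(piece, out var id)) { match = id; break; }
                end--;
            }
            if (match is null) { yield return unknownId; break; } // no piece matched: emit [UNK] and give up on this word
            yield return match.Value;
            start = end;
        }
    }
}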
Phase 4: The Brute-Force Search - A Painful First Step
With data in the database, our first attempt at searching was the most direct: a brute-force calculation of cosine similarity. The plan was to create a MariaDB stored procedure that would take a query vector as a parameter, iterate through every single row in document_chunks, compute the similarity, and return the top N results.
First, we need helper functions to compute dot product and magnitude.
-- MariaDB Stored Functions for Vector Math
-- This assumes a fixed vector length of 384 dimensions (1536 bytes)
DELIMITER $$
CREATE FUNCTION `DOT_PRODUCT`(vec1 VARBINARY(6144), vec2 VARBINARY(6144))
RETURNS FLOAT
DETERMINISTIC
BEGIN
DECLARE dot_product FLOAT DEFAULT 0;
DECLARE i INT DEFAULT 1;
-- A 384-dim float vector. Each float is 4 bytes.
DECLARE max_offset INT DEFAULT 1533; -- 384 * 4 - 4 + 1
WHILE i <= max_offset DO
-- CAST(SUBSTRING(...) AS FLOAT) is not standard SQL and won't work.
-- We must use CONV and some bitwise manipulation to convert 4 bytes to a float.
-- This is highly complex and error-prone in pure SQL.
-- A C/C++ UDF is the proper way to do this for performance.
-- Let's pretend for a moment a simpler version exists for demonstration.
-- A REALISTIC but less performant approach is to store vectors as JSON arrays of numbers.
-- For this article, we'll write a pseudo-function to illustrate the logic.
-- DO NOT USE THIS SQL IN PRODUCTION AS-IS.
-- The logic would be:
-- val1 = bytes_to_float(SUBSTRING(vec1, i, 4));
-- val2 = bytes_to_float(SUBSTRING(vec2, i, 4));
-- dot_product = dot_product + (val1 * val2);
SET i = i + 4;
END WHILE;
-- THIS IS A MOCK. MariaDB lacks native functions for this.
-- A UDF written in C would be required.
-- For now, we will return a random value to allow the procedure to be created.
RETURN RAND();
END$$
-- Similarly, a MAGNITUDE function would be needed.
CREATE FUNCTION `VECTOR_MAGNITUDE`(vec VARBINARY(6144))
RETURNS FLOAT
DETERMINISTIC
BEGIN
-- Logic: SQRT(DOT_PRODUCT(vec, vec))
RETURN RAND();
END$$
CREATE FUNCTION `COSINE_SIMILARITY`(vec1 VARBINARY(6144), vec2 VARBINARY(6144))
RETURNS FLOAT
DETERMINISTIC
BEGIN
-- Logic: DOT_PRODUCT(vec1, vec2) / (VECTOR_MAGNITUDE(vec1) * VECTOR_MAGNITUDE(vec2))
-- Due to the limitations above, this is also a mock.
RETURN RAND();
END$$
DELIMITER ;
The pitfall here is immediately obvious. Pure SQL is horrifically unsuited for the low-level byte manipulation required to convert parts of a VARBINARY blob into floats for calculation. The correct way to implement this in MariaDB/MySQL is with a User-Defined Function (UDF) written in C/C++. This adds significant deployment and maintenance complexity, which was precisely what we were trying to avoid.
As a pragmatic compromise, we decided to pull the vectors into the application layer and perform the calculation there. This is terribly inefficient as it requires transferring all vectors over the network, but it served as a functional baseline.
// Services/VectorSearchService_BruteForce.cs
public class VectorSearchService
{
private readonly IDocumentRepository _repository;
public VectorSearchService(IDocumentRepository repository)
{
_repository = repository;
}
public async Task<IEnumerable<SearchResult>> FindSimilarAsync(float[] queryVector, int topK)
{
var allChunks = await _repository.GetAllChunksAsync(); // Inefficient!
var results = new List<(long Id, double Similarity)>();
foreach (var chunk in allChunks)
{
// chunk.Embedding is assumed to be deserialized by the repository from the VARBINARY bytes back into a float[].
var similarity = CalculateCosineSimilarity(queryVector, chunk.Embedding);
results.Add((chunk.Id, similarity));
}
return results.OrderByDescending(r => r.Similarity)
.Take(topK)
.Select(r => new SearchResult { ChunkId = r.Id, Score = r.Similarity });
}
// Standard cosine similarity calculation
private double CalculateCosineSimilarity(float[] vecA, float[] vecB)
{
double dotProduct = 0.0;
double magA = 0.0;
double magB = 0.0;
for (int i = 0; i < vecA.Length; i++)
{
dotProduct += vecA[i] * vecB[i];
magA += Math.Pow(vecA[i], 2);
magB += Math.Pow(vecB[i], 2);
}
if (magA == 0 || magB == 0) return 0;
return dotProduct / (Math.Sqrt(magA) * Math.Sqrt(magB));
}
}
This worked. For a corpus of 1,000 documents, it was slow. For 10,000, it was unacceptable. The latency was measured in seconds, and memory usage on the app server spiked dramatically. This approach was a dead end for any real-world project.
Phase 5: A Pragmatic Index - Approximate Nearest Neighbor with IVF
We had to get smarter. The core problem is the O(N) complexity of the brute-force search. We needed a way to prune the search space. This is the domain of Approximate Nearest Neighbor (ANN) algorithms. Since we couldn’t use a pre-built library like HNSW, we implemented a simplified version of the Inverted File Index (IVF) algorithm using only standard MariaDB tables and SQL.
The IVF concept is simple:
- Clustering: Use an algorithm like K-Means to group all document vectors into k clusters. The center of each cluster is called a “centroid.”
- Indexing: For each document vector, determine which centroid it is closest to and store this association.
- Searching: At query time, first find the n_probe closest centroids to the query vector. Then, only perform the expensive brute-force search on the document vectors that belong to those few clusters.
This reduces the search space from N to roughly (N / k) * n_probe. For example, with N = 100,000, k = 100, and n_probe = 5, each query scans roughly 5,000 vectors instead of all 100,000.
Here is the architecture we implemented:
New Tables:
CREATE TABLE `vector_centroids` (
`id` INT NOT NULL AUTO_INCREMENT,
`centroid_vector` VARBINARY(6144) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB;
-- A mapping table between chunks and their nearest centroid
CREATE TABLE `document_chunk_clusters` (
`document_chunk_id` BIGINT NOT NULL,
`centroid_id` INT NOT NULL,
PRIMARY KEY (`document_chunk_id`),
INDEX `idx_centroid_id` (`centroid_id`),
CONSTRAINT `fk_chunk_id` FOREIGN KEY (`document_chunk_id`) REFERENCES `document_chunks` (`id`) ON DELETE CASCADE,
CONSTRAINT `fk_centroid_id` FOREIGN KEY (`centroid_id`) REFERENCES `vector_centroids` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB;
Implementation Steps:
Offline Training: We wrote a separate utility console application. It would:
- Load all existing document vectors from MariaDB.
- Use a .NET ML library (like Accord.NET or a simple custom K-Means implementation) to perform clustering and generate k centroids (e.g., k = 100); a minimal K-Means sketch follows after these steps.
- Save these centroids to the vector_centroids table.
- For each document vector, find its nearest centroid and populate the document_chunk_clusters mapping table.
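The clustering step above does not need a heavyweight dependency. Below is a minimal sketch of a custom K-Means (Lloyd’s algorithm) for the utility app; the method name, random initialization, and fixed iteration count are our own simplifications, and for cosine similarity the vectors would typically be L2-normalized before clustering.
// Utility-app sketch: naive K-Means (Lloyd's algorithm) over the document vectors.
// Illustrative only: random initialization, fixed iteration count, squared-Euclidean assignment.
public static (float[][] Centroids, int[] Assignments) KMeans(float[][] vectors, int k, int iterations = 20)
{
    var rng = new Random(42);
    // Initialize centroids by sampling k vectors from the data.
    var centroids = vectors.OrderBy(_ => rng.Next()).Take(k).Select(v => (float[])v.Clone()).ToArray();
    var assignments = new int[vectors.Length];

    for (int iter = 0; iter < iterations; iter++)
    {
        // Assignment step: nearest centroid by squared Euclidean distance.
        for (int i = 0; i < vectors.Length; i++)
        {
            var best = 0;
            var bestDist = double.MaxValue;
            for (int c = 0; c < centroids.Length; c++)
            {
                double dist = 0;
                for (int d = 0; d < vectors[i].Length; d++)
                {
                    var diff = vectors[i][d] - centroids[c][d];
                    dist += diff * diff;
                }
                if (dist < bestDist) { bestDist = dist; best = c; }
            }
            assignments[i] = best;
        }

        // Update step: recompute each centroid as the mean of its assigned vectors.
        for (int c = 0; c < centroids.Length; c++)
        {
            var members = Enumerable.Range(0, vectors.Length).Where(i => assignments[i] == c).ToList();
            if (members.Count == 0) continue; // keep the old centroid for empty clusters
            var dim = centroids[c].Length;
            var mean = new float[dim];
            foreach (var i in members)
                for (int d = 0; d < dim; d++)
                    mean[d] += vectors[i][d];
            for (int d = 0; d < dim; d++)
                mean[d] /= members.Count;
            centroids[c] = mean;
        }
    }
    return (centroids, assignments);
}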
Online Search Procedure: We created a new stored procedure in MariaDB to orchestrate the two-step search. This avoids multiple round trips between the application and the database.
DELIMITER $$
CREATE PROCEDURE `SearchIVF`(
IN query_vector VARBINARY(6144),
IN top_k INT,
IN n_probe INT
)
BEGIN
-- Step 1: Find the n_probe nearest centroids to the query vector.
-- This step is still brute-force, but over a much smaller table (k centroids).
-- Again, this relies on a functional COSINE_SIMILARITY UDF. For this example,
-- we'll assume the logic is executed in the application layer, which passes in the
-- IDs of the best centroids.
-- A more realistic procedure would take a list of centroid IDs as a parameter.
-- Let's redefine the procedure signature.
-- Let's assume we call this from the app after finding the best centroids.
-- The app passes a comma-separated list of centroid IDs.
-- The application logic would look like this:
-- 1. Fetch all centroids from the DB (a small number, e.g., 100).
-- 2. Calculate similarity between the query vector and each centroid in-memory.
-- 3. Select the top n_probe centroid IDs.
-- 4. Call the *new* stored procedure below with these IDs.
END $$
DELIMITER ;
DELIMITER $$
CREATE PROCEDURE `SearchChunksInClusters`(
IN query_vector_bytes VARBINARY(6144),
IN top_k INT,
IN centroid_ids_csv VARCHAR(1024)
)
BEGIN
-- The real workhorse.
-- Again, requires a proper C/C++ UDF for COSINE_SIMILARITY for performance.
-- The following query illustrates the logic.
SELECT
dc.id,
dc.content,
dc.document_source_id,
-- This is the critical part that MUST be a high-performance UDF
COSINE_SIMILARITY(dc.embedding, query_vector_bytes) AS similarity
FROM document_chunks dc
JOIN document_chunk_clusters dcc ON dc.id = dcc.document_chunk_id
-- FIND_IN_SET is a slow way to check membership in a CSV list.
-- A temporary table or passing a JSON array would be better in modern MariaDB/MySQL.
WHERE FIND_IN_SET(dcc.centroid_id, centroid_ids_csv)
-- Order by the calculated similarity and take the top results.
ORDER BY similarity DESC
LIMIT top_k;
END$$
DELIMITER ;
Our VectorSearchService was refactored to implement this two-step logic. It now held the centroids in a short-lived cache, found the best n_probe centroids in-memory, and then called the SearchChunksInClusters stored procedure with the relevant cluster IDs.
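A sketch of that refactored service (registered as VectorSearchServiceIVF) is below. It assumes the MySqlConnector ADO.NET provider alongside Dapper, plus a few helper types of our own: ICentroidCache for the cached centroids, the VectorSerializer from earlier, and a SearchResult whose properties line up with the column aliases returned by the procedure. None of these names come from a library.
// Services/VectorSearchServiceIVF.cs (sketch; ICentroidCache and VectorSerializer are our own helpers)
using System.Data;
using Dapper;
using MySqlConnector;

public interface ICentroidCache
{
    // Returns the (id, vector) pairs from vector_centroids, cached for a short period.
    Task<IReadOnlyList<(int Id, float[] Vector)>> GetAllAsync();
}

public class VectorSearchServiceIVF
{
    private readonly string _connectionString;
    private readonly ICentroidCache _centroidCache;

    public VectorSearchServiceIVF(string connectionString, ICentroidCache centroidCache)
    {
        _connectionString = connectionString;
        _centroidCache = centroidCache;
    }

    public async Task<IEnumerable<SearchResult>> FindSimilarAsync(float[] queryVector, int topK, int nProbe = 5)
    {
        // Step 1: score the ~100 centroids in-memory and keep the n_probe best ones.
        var centroids = await _centroidCache.GetAllAsync();
        var bestCentroidIds = centroids
            .OrderByDescending(c => CosineSimilarity(queryVector, c.Vector))
            .Take(nProbe)
            .Select(c => c.Id);

        // Step 2: let MariaDB scan only the chunks that belong to those clusters.
        await using var connection = new MySqlConnection(_connectionString);
        return await connection.QueryAsync<SearchResult>(
            "SearchChunksInClusters",
            new
            {
                query_vector_bytes = VectorSerializer.ToBytes(queryVector),
                top_k = topK,
                centroid_ids_csv = string.Join(",", bestCentroidIds)
            },
            commandType: CommandType.StoredProcedure);
    }

    private static double CosineSimilarity(float[] a, float[] b)
    {
        double dot = 0, magA = 0, magB = 0;
        for (var i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            magA += a[i] * a[i];
            magB += b[i] * b[i];
        }
        return (magA == 0 || magB == 0) ? 0 : dot / (Math.Sqrt(magA) * Math.Sqrt(magB));
    }
}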
The performance improvement was dramatic. A search that previously took 5-10 seconds now completed in under 200 milliseconds for a corpus of 100,000 documents. It’s not the sub-50ms performance you’d get from a dedicated Rust-based vector search engine, but it was well within our acceptable latency budget.
The final orchestration logic in RAGOrchestrator.cs ties everything together.
// Services/RAGOrchestrator.cs
public class RAGOrchestrator
{
private readonly EmbeddingService _embeddingService;
private readonly VectorSearchServiceIVF _searchService; // The improved service
private readonly ILogger<RAGOrchestrator> _logger;
public RAGOrchestrator(EmbeddingService embeddingService, VectorSearchServiceIVF searchService, ILogger<RAGOrchestrator> logger)
{
_embeddingService = embeddingService;
_searchService = searchService;
_logger = logger;
}
public async Task<string> GenerateResponseAsync(string query)
{
_logger.LogInformation("Generating RAG response for query: {Query}", query);
// 1. Embed the query
var queryVector = _embeddingService.GenerateEmbedding(query);
// 2. Search for relevant context
var searchResults = await _searchService.FindSimilarAsync(queryVector, topK: 5);
// In a real system, you'd fetch the content for these chunk IDs.
var context = string.Join("\n---\n", searchResults.Select(r => r.Content));
// 3. Build the prompt for a generative model
var prompt = $@"
Context:
{context}
Question: {query}
Answer based only on the context provided:
";
_logger.LogInformation("Generated prompt for LLM.");
// 4. Send to a generative LLM and return the response
// For this example, we'll just return the generated prompt.
return prompt;
}
}
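For completeness, the API Controller from the architecture diagram is just a thin wrapper over the orchestrator. A minimal sketch follows; the route and request record are our own, not something dictated by the framework.
// Controllers/QueryController.cs (minimal sketch; route and record names are ours)
using Microsoft.AspNetCore.Mvc;
using RAGWithMariaDB.Services;

[ApiController]
[Route("api/[controller]")]
public class QueryController : ControllerBase
{
    private readonly RAGOrchestrator _orchestrator;

    public QueryController(RAGOrchestrator orchestrator) => _orchestrator = orchestrator;

    public record QueryRequest(string Question);

    [HttpPost]
    public async Task<IActionResult> Post([FromBody] QueryRequest request)
    {
        if (string.IsNullOrWhiteSpace(request.Question))
        {
            return BadRequest("Question must not be empty.");
        }
        var answer = await _orchestrator.GenerateResponseAsync(request.Question);
        return Ok(new { answer });
    }
}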
The solution, while architecturally more complex than just using an off-the-shelf vector DB, successfully met the project’s constraints. It demonstrates a key principle of software architecture: the “best” solution is often not the most technologically pure, but the one that delivers the required functionality within the given operational, financial, and political boundaries.
The limitations of this approach are clear. The IVF index quality is highly dependent on the initial K-Means clustering, which must be periodically re-run as the data distribution changes. It offers a trade-off between recall and speed (controlled by n_probe) but lacks the more sophisticated graph-based traversal of modern ANN algorithms like HNSW. Scaling this beyond a single, powerful MariaDB instance would require sharding logic managed at the application layer, which is a significant undertaking. This solution is a pragmatic patch, not a silver bullet, and its applicability is confined to medium-scale use cases where the cost of introducing a new database technology is prohibitive.