Standard vector-based retrieval systems for Retrieval-Augmented Generation (RAG) pipelines often fail to capture the structural context inherent in a corpus of documents. A query for “performance tuning for the asset pipeline” might retrieve semantically similar text chunks from entirely different projects or documentation versions, leading to a large language model generating a plausible but incorrect answer. The core technical pain point is that vector similarity alone is context-blind; it understands “what” but not “where” or “how it’s connected.” Our initial attempts using a pure vector database confirmed this limitation, forcing a re-evaluation of the retrieval architecture.
The proposed solution was to fuse semantic search with graph-based relational context. This requires a data store capable of efficiently handling both vector indexes and graph traversals within a single query. ArangoDB was selected for its native multi-model capabilities, allowing us to represent documents and their embeddings alongside a knowledge graph of their interconnections. For ML model serving, BentoML was chosen to decouple the embedding model’s lifecycle from our main Ruby on Rails application, a non-negotiable for production stability. Rails serves as the robust orchestrator, managing the multi-step retrieval process: embedding the user query, executing the hybrid query against ArangoDB, and preparing the context for the LLM.
This document details the build process, focusing on the critical implementation details of each component.
The Decoupled Embedding Service with BentoML
In a real-world project, embedding the ML inference runtime directly into the monolithic Rails application is a significant anti-pattern. It couples deployment cycles, bloats the application’s memory footprint, and makes it impossible to scale the web and ML components independently. BentoML provides a clean separation of concerns.
The service definition is straightforward. It loads a pre-trained sentence-transformer model and exposes a single endpoint to generate embeddings.
service.py:
import bentoml
import numpy as np
from sentence_transformers import SentenceTransformer

# A common mistake is to use a model name that requires a download every time the
# container starts. It's better to download it once and reference the local path,
# or use BentoML's model management to package it.
MODEL_NAME = "all-MiniLM-L6-v2"
MODEL_DIMENSION = 384  # Hardcoded for this model to ensure consistency.


@bentoml.service(
    resources={"cpu": "1"},
    traffic={"timeout": 60},
)
class EmbeddingService:
    def __init__(self) -> None:
        """
        Service initialization. This runs once when the BentoML server starts.
        """
        try:
            self.model = SentenceTransformer(MODEL_NAME)
            print(f"Embedding model '{MODEL_NAME}' loaded successfully.")
        except Exception as e:
            # Proper logging is critical for diagnosing startup failures in production.
            print(f"FATAL: Failed to load sentence transformer model: {e}")
            raise

    @bentoml.api
    def embed(self, texts: list[str]) -> np.ndarray:
        """
        Generates embeddings for a batch of texts.
        The API expects a JSON object with a "texts" key holding a list of strings.
        """
        if not texts or not isinstance(texts, list):
            # Input validation is crucial for a robust API.
            # Here we might raise a specific BentoML exception for bad requests.
            raise bentoml.exceptions.BentoMLException("Input must be a non-empty list of strings.")
        try:
            # The core logic. Note the `normalize_embeddings=True`, which is important
            # for using cosine similarity later on, as it makes it equivalent to a dot product.
            embeddings = self.model.encode(
                texts,
                convert_to_numpy=True,
                normalize_embeddings=True
            )
            return embeddings.astype(np.float32)
        except Exception as e:
            # Generic error handling for unexpected issues during inference.
            print(f"ERROR: Exception during embedding process: {e}")
            raise bentoml.exceptions.BentoMLException("Internal server error during embedding.")
The bentofile.yaml defines the service’s dependencies and packaging instructions. In a production setup, this file is the source of truth for building a containerized image of the service.
bentofile.yaml:
service: "service:EmbeddingService"
labels:
owner: data-science-team
project: context-aware-rag
include:
- "*.py"
python:
packages:
- sentence-transformers
- numpy
- torch # sentence-transformers has a dependency on torch
models: [] # For this example, the model is downloaded at runtime. In production, use `bentoml models pull` and reference it here.
To run this service locally for development: bentoml serve service.py:EmbeddingService. It exposes a Swagger UI for testing, but a simple curl request demonstrates its function:
curl -X POST \
  -H "Content-Type: application/json" \
  --data '{"texts": ["This is a test sentence.", "This is another one."]}' \
  http://127.0.0.1:3000/embed
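For a quick check from Python instead of curl, BentoML ships an HTTP client; a minimal sketch, assuming the default local port:

import bentoml

# Call the locally served EmbeddingService; `embed` mirrors the @bentoml.api method name.
client = bentoml.SyncHTTPClient("http://127.0.0.1:3000")
vectors = client.embed(texts=["This is a test sentence."])
print(len(vectors), len(vectors[0]))  # 1 vector of 384 dimensions for all-MiniLM-L6-v2
client.close()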
This setup gives us a scalable, independent microservice for our embedding needs, which the Rails application can consume via a simple HTTP client.
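When it is time to go beyond local development, the same bentofile drives packaging. A two-step sketch (the image tag is whatever `bentoml build` prints; embedding_service:latest is an assumption):

bentoml build
bentoml containerize embedding_service:latest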
ArangoDB Schema for Hybrid Data
The power of this architecture lies in ArangoDB’s schema design. We need two collections and one ArangoSearch view; creating the collections is sketched just after this list.

- text_chunks (document collection): stores the actual text content along with its pre-computed vector embedding.
- relationships (edge collection): stores the connections between chunks, for example parent_document, see_also, authored_by.
- chunk_vector_view (ArangoSearch view): indexes the embedding field and stores frequently read attributes for fast lookups. Note that ArangoSearch provides inverted indexes; exact vector scoring happens in AQL, and true approximate nearest neighbor (ANN) search would require ArangoDB’s newer dedicated vector index.
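The collections themselves are one-time setup, e.g. from arangosh (a minimal sketch; names match the list above):

// One-time setup: a document collection for chunks, an edge collection for links.
db._create("text_chunks");
db._createEdgeCollection("relationships");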
Here is a sample document structure for a chunk in the text_chunks collection:
{
  "_key": "chunk_abc_123",
  "document_id": "doc_abc",
  "text": "The Rails asset pipeline provides a framework to concatenate and minify...",
  "chunk_seq": 1,
  "embedding": [0.0123, -0.0456, ..., 0.0789]
}
An edge in the relationships collection connecting two chunks might look like this:
{
  "_from": "text_chunks/chunk_abc_123",
  "_to": "text_chunks/chunk_abc_124",
  "label": "is_followed_by"
}
The ArangoSearch View configuration is critical. It defines how the embedding field is indexed.
// This snippet would be run once during initial setup (arangosh / JavaScript API, not AQL)
db._createView("chunk_vector_view", "arangosearch", {
  "links": {
    "text_chunks": {
      "fields": {
        "embedding": {
          "analyzers": [ "identity" ] // identity: index the raw vector values as-is
        }
      },
      "storedValues": ["text", "document_id"] // Store these fields in the view for faster lookups
    }
  },
  "primarySort": [
    { "field": "document_id", "direction": "asc" } // Optional; sorting on the raw embedding array itself would not be meaningful
  ],
  "consolidationIntervalMsec": 10000,
  "commitIntervalMsec": 5000
});
This is a simplified view definition. A production setup requires a deeper dive into consolidationPolicy and other performance-tuning parameters.
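Before wiring this into Rails, a quick smoke test from arangosh verifies that chunks in the view can be scored at all. COSINE_SIMILARITY is exact, brute-force scoring, the same approach the retrieval query below uses; the query vector here is purely illustrative:

// Rank every chunk in the view against a made-up 384-dimensional query vector.
db._query(`
  FOR doc IN chunk_vector_view
    LET score = COSINE_SIMILARITY(doc.embedding, @v)
    SORT score DESC
    LIMIT 3
    RETURN { key: doc._key, score: score }
`, { v: Array(384).fill(0.01) }).toArray();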
Ingestion Pipeline: A Ruby Rake Task
With the schema defined, we need to populate it. A Rake task in our Rails application is a suitable tool for a batch ingestion process. This task reads source documents, chunks them, calls the BentoML service for embeddings, and persists the data and relationships to ArangoDB.
lib/tasks/ingest.rake:
require 'faraday'
require 'json'

namespace :ingest do
  desc "Ingest documents into ArangoDB with embeddings from BentoML"
  task :process_corpus, [:path] => :environment do |_task, args|
    # --- Configuration ---
    # In a real app, these come from Rails credentials or ENV vars.
    # Locals rather than constants, so the task can be re-run without warnings.
    bentoml_endpoint = 'http://localhost:3000/embed'
    arango_coordinator = 'http://localhost:8529'
    arango_database = '_system'
    arango_user = 'root'
    arango_password = 'your_password' # Replace with actual password

    # --- Setup HTTP and ArangoDB Clients ---
    # Use a persistent connection for performance
    bento_client = Faraday.new(url: bentoml_endpoint) do |faraday|
      faraday.request :json
      faraday.response :json, content_type: /\bjson$/
      faraday.adapter Faraday.default_adapter
      faraday.options.timeout = 120 # Increase timeout for large batches
    end

    Arango.configure do |config|
      config.hosts = arango_coordinator
      config.database = arango_database
      config.username = arango_user
      config.password = arango_password
    end
    db = Arango.current_database
    chunks_collection = db.collection('text_chunks')
    relationships_collection = db.collection('relationships')

    # --- Core Logic ---
    puts "Starting ingestion from path: #{args[:path]}"
    # This is a placeholder for actual document parsing logic.
    # It should yield structured data with content and relationships.
    documents_to_process = parse_source_documents(args[:path])

    documents_to_process.each_slice(32) do |batch| # Process in batches
      puts "Processing batch of #{batch.size} documents..."
      texts_to_embed = batch.flat_map { |doc| doc[:chunks].map { |c| c[:text] } }

      # 1. Get embeddings from BentoML
      begin
        response = bento_client.post do |req|
          # BentoML's service API expects parameters keyed by name.
          req.body = { texts: texts_to_embed }
        end
        unless response.success?
          puts "ERROR: BentoML service returned status #{response.status}. Body: #{response.body}"
          next # Skip this batch
        end
        embeddings = response.body
      rescue Faraday::ConnectionFailed => e
        puts "FATAL: Could not connect to BentoML service at #{bentoml_endpoint}. #{e.message}"
        break
      end

      # 2. Prepare documents and edges for ArangoDB
      arango_docs = []
      arango_edges = []
      embedding_index = 0
      batch.each do |doc|
        chunk_keys = []
        doc[:chunks].each do |chunk|
          chunk_key = "#{doc[:id]}_#{chunk[:seq]}"
          chunk_keys << chunk_key
          arango_docs << {
            _key: chunk_key,
            document_id: doc[:id],
            text: chunk[:text],
            chunk_seq: chunk[:seq],
            embedding: embeddings[embedding_index]
          }
          embedding_index += 1
        end
        # Create sequential relationship edges
        (0...chunk_keys.size - 1).each do |i|
          arango_edges << {
            _from: "text_chunks/#{chunk_keys[i]}",
            _to: "text_chunks/#{chunk_keys[i + 1]}",
            label: "is_followed_by"
          }
        end
      end

      # 3. Import into ArangoDB
      # The `import` method is far more performant than inserting one by one.
      chunks_collection.import(arango_docs)
      relationships_collection.import(arango_edges)
      puts "Successfully imported #{arango_docs.size} chunks and #{arango_edges.size} edges."
    end
  end

  # Dummy method for demonstration. Real implementation would be complex.
  def parse_source_documents(path)
    # e.g., read markdown files, split them into chunks, and identify links
    # to form 'see_also' relationships.
    [{
      id: "doc_1",
      chunks: [
        { seq: 0, text: "Introduction to Rails." },
        { seq: 1, text: "Models in Rails represent data." }
      ]
    }]
  end
end
This task demonstrates a production-oriented pattern: batching requests, using performant bulk import APIs, and basic error handling for network services.
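Running it is a standard Rake invocation (the corpus path is illustrative):

bin/rails "ingest:process_corpus[/path/to/corpus]"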
The Hybrid Retrieval Service in Rails
This is where the components converge. A Rails service object encapsulates the logic for performing the hybrid graph-vector query.
graph TD
    A[User Query] --> B{Rails Controller};
    B --> C[BentoML Embedding Service];
    C --> D{Query Vector};
    B -- Query Text & Vector --> E[RetrievalService];
    E --> F[ArangoDB];
    F -- Hybrid AQL Query --> G[Contextual Chunks];
    E --> B;
    B --> H[Formatted Response];
The RetrievalService has one primary public method, retrieve. It takes a query string and returns a ranked list of relevant text chunks.
app/services/retrieval_service.rb:
# frozen_string_literal: true

class RetrievalService
  # A common mistake is to instantiate clients on every call.
  # We should use a shared, thread-safe client instance.
  BENTO_CLIENT = Faraday.new(url: ENV.fetch('BENTOML_ENDPOINT')) do |f|
    f.request :json
    f.response :json, content_type: /\bjson$/
    f.adapter :net_http_persistent # Use a persistent connection adapter
  end
  ARANGO_DB = Arango.current_database

  # Custom error classes for better handling upstream
  class BentoServiceError < StandardError; end
  class ArangoQueryError < StandardError; end

  def self.retrieve(query_text:, limit: 10, graph_depth: 2)
    new(query_text: query_text, limit: limit, graph_depth: graph_depth).execute
  end

  def initialize(query_text:, limit:, graph_depth:)
    @query_text = query_text
    @limit = limit
    @graph_depth = graph_depth
    @query_vector = nil
  end

  def execute
    # 1. Get query embedding
    @query_vector = fetch_embedding(@query_text)
    # 2. Execute the hybrid query
    results = execute_hybrid_aql_query
    # 3. Format and return
    format_results(results)
  end

  private

  def fetch_embedding(text)
    # BentoML's service API expects parameters keyed by name.
    response = BENTO_CLIENT.post('embed', { texts: [text] })
    raise BentoServiceError, "Failed to get embedding: #{response.body}" unless response.success?

    response.body.first
  rescue Faraday::Error => e
    Rails.logger.error "BentoML connection error: #{e.message}"
    raise BentoServiceError, "Could not connect to embedding service."
  end

  # The heart of the implementation
  def execute_hybrid_aql_query
    # This AQL query is the core innovation. It performs two actions:
    # 1. A vector search to find semantically similar seed chunks.
    # 2. A graph traversal starting from those seed chunks to find related context.
    # The results from both are combined, deduplicated, and ranked.
    aql = <<-AQL
      // Part 1: Vector search to find initial seed chunks.
      // COSINE_SIMILARITY is exact (brute-force) scoring; at larger scale,
      // ArangoDB's dedicated vector index and ANN functions would replace it.
      LET vector_matches = (
        FOR doc IN chunk_vector_view
          LET score = COSINE_SIMILARITY(doc.embedding, @query_vector)
          SORT score DESC
          LIMIT @limit
          RETURN { doc: doc, score: score, type: "vector" }
      )

      // Part 2: Graph traversal from the top vector matches, directly over the
      // relationships edge collection (so no named graph is required).
      LET graph_matches = (
        FOR seed IN SLICE(vector_matches, 0, 3) // Take top 3 as seeds for traversal
          FOR vertex, edge IN 1..@graph_depth ANY seed.doc._id relationships
            OPTIONS { order: "bfs", uniqueVertices: "global" } // "global" requires breadth-first order
            FILTER vertex._id != seed.doc._id // Don't include the seed itself
            // More complex filtering on edge labels or vertex properties could go here
            LIMIT @limit
            RETURN { doc: vertex, score: 0.5 * seed.score, type: "graph" } // De-prioritize graph results slightly
      )

      // Part 3: Combine, deduplicate per chunk (keeping the best score), and rank
      FOR item IN APPEND(vector_matches, graph_matches)
        COLLECT id = item.doc._id INTO hits = item
        LET best = (FOR h IN hits SORT h.score DESC RETURN h)[0]
        SORT best.score DESC
        LIMIT @limit
        RETURN {
          text: best.doc.text,
          document_id: best.doc.document_id,
          score: best.score,
          retrieval_type: best.type
        }
    AQL

    # In a production environment, bind variables are essential to prevent AQL injection.
    bind_vars = {
      query_vector: @query_vector,
      limit: @limit,
      graph_depth: @graph_depth
    }
    cursor = ARANGO_DB.query(aql, bind_vars: bind_vars, count: true)
    cursor.to_a
  rescue Arango::Error => e
    Rails.logger.error "ArangoDB query failed: #{e.message}. AQL: #{aql}"
    raise ArangoQueryError, "Failed to execute retrieval query."
  end

  def format_results(arango_results)
    # Simple formatting. Could involve further business logic.
    arango_results
  end
end
The AQL query is the most complex part. It first scans the chunk_vector_view, scoring every chunk against the query vector to find the top N most semantically similar documents. Then, using the top few of those results as starting points (seed nodes), it performs a graph traversal to find connected nodes up to a certain depth. This pulls in context that is structurally related but might not have been a top hit in the pure vector search. The final result set is a combination of both, deduplicated and ranked.
A controller would then use this service. app/controllers/api/v1/search_controller.rb:
class Api::V1::SearchController < ApplicationController
  def query
    query_text = params.require(:q)
    results = RetrievalService.retrieve(query_text: query_text)
    render json: { data: results }, status: :ok
  rescue RetrievalService::BentoServiceError, RetrievalService::ArangoQueryError => e
    render json: { error: e.message }, status: :service_unavailable
  rescue ActionController::ParameterMissing => e
    render json: { error: e.message }, status: :bad_request
  end
end
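The route wiring is the usual namespacing; this exact mapping is an assumption made to match the controller above:

# config/routes.rb
namespace :api do
  namespace :v1 do
    get 'search', to: 'search#query'
  end
end

A query is then a plain GET, e.g. curl "http://localhost:3001/api/v1/search?q=asset+pipeline+tuning" (Rails on port 3001 here, since the BentoML service already occupies 3000 locally).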
The current implementation has limitations. The AQL query’s performance depends heavily on the graph’s structure and the chosen traversal depth; for highly connected graphs, it could become slow. The scoring logic, which simply halves the score for graph-retrieved items, is naive and would need to be replaced by a more sophisticated ranking algorithm, potentially involving a dedicated reranker model served by BentoML. Furthermore, the ingestion process is batch-oriented and doesn’t account for real-time updates to the source documents, which would necessitate a more complex event-driven architecture, possibly leveraging ArangoDB’s streaming capabilities or a CDC pipeline. The cost profile of a large ArangoDB cluster capable of holding both graph indexes and vector data in memory must be carefully modeled against separate, specialized solutions.