Constructing a Hybrid Graph and Vector Retrieval System with ArangoDB, BentoML, and a Ruby on Rails Orchestrator


Standard vector-based retrieval systems for Retrieval-Augmented Generation (RAG) pipelines often fail to capture the structural context inherent in a corpus of documents. A query for “performance tuning for the asset pipeline” might retrieve semantically similar text chunks from entirely different projects or documentation versions, leading to a large language model generating a plausible but incorrect answer. The core technical pain point is that vector similarity alone is context-blind; it understands “what” but not “where” or “how it’s connected.” Our initial attempts using a pure vector database confirmed this limitation, forcing a re-evaluation of the retrieval architecture.

The proposed solution was to fuse semantic search with graph-based relational context. This requires a data store capable of efficiently handling both vector indexes and graph traversals within a single query. ArangoDB was selected for its native multi-model capabilities, allowing us to represent documents and their embeddings alongside a knowledge graph of their interconnections. For ML model serving, BentoML was chosen to decouple the embedding model’s lifecycle from our main Ruby on Rails application, a non-negotiable for production stability. Rails serves as the robust orchestrator, managing the multi-step retrieval process: embedding the user query, executing the hybrid query against ArangoDB, and preparing the context for the LLM.

This document details the build process, focusing on the critical implementation details of each component.

The Decoupled Embedding Service with BentoML

In a real-world project, embedding the ML inference runtime directly into the monolithic Rails application is a significant anti-pattern. It couples deployment cycles, bloats the application’s memory footprint, and makes scaling the ML and web components independently impossible. BentoML provides a clean separation of concerns.

The service definition is straightforward. It loads a pre-trained sentence-transformer model and exposes a single endpoint to generate embeddings.

service.py:

import bentoml
import numpy as np
from sentence_transformers import SentenceTransformer

# A common mistake is to use a model name that requires a download every time the
# container starts. It's better to download it once and reference the local path,
# or use BentoML's model management to package it.
MODEL_NAME = "all-MiniLM-L6-v2"
MODEL_DIMENSION = 384 # Hardcoded for this model to ensure consistency.

@bentoml.service(
    resources={"cpu": "1"},
    traffic={"timeout": 60},
)
class EmbeddingService:
    def __init__(self) -> None:
        """
        Service initialization. This runs once when the BentoML server starts.
        """
        try:
            self.model = SentenceTransformer(MODEL_NAME)
            print(f"Embedding model '{MODEL_NAME}' loaded successfully.")
        except Exception as e:
            # Proper logging is critical for diagnosing startup failures in production.
            print(f"FATAL: Failed to load sentence transformer model: {e}")
            raise

    @bentoml.api
    def embed(self, texts: list[str]) -> np.ndarray:
        """
        Generates embeddings for a batch of texts.
        The API expects a JSON list of strings.
        """
        if not texts or not isinstance(texts, list):
            # Input validation is crucial for a robust API. InvalidArgument maps
            # to a client-error response rather than a generic server error.
            raise bentoml.exceptions.InvalidArgument("Input must be a non-empty list of strings.")

        try:
            # The core logic. Note the `normalize_embeddings=True` which is important
            # for using cosine similarity later on, as it makes it equivalent to a dot product.
            embeddings = self.model.encode(
                texts,
                convert_to_numpy=True,
                normalize_embeddings=True
            )
            return embeddings.astype(np.float32)

        except Exception as e:
            # Generic error handling for unexpected issues during inference.
            print(f"ERROR: Exception during embedding process: {e}")
            raise bentoml.exceptions.BentoMLException("Internal server error during embedding.")

The bentofile.yaml defines the service’s dependencies and packaging instructions. In a production setup, this file is the source of truth for building a containerized image of the service.

bentofile.yaml:

service: "service:EmbeddingService"
labels:
  owner: data-science-team
  project: context-aware-rag
include:
  - "*.py"
python:
  packages:
    - sentence-transformers
    - numpy
    - torch # sentence-transformers has a dependency on torch
models: [] # For this example, the model is downloaded at runtime. In production, use `bentoml models pull` and reference it here.

To run this service locally for development: bentoml serve service.py:EmbeddingService. It exposes a Swagger UI for testing, but a simple curl request demonstrates its function. Note that BentoML maps the top-level keys of the JSON body to the API method's keyword arguments, so the list of texts is wrapped under a texts key:

curl -X POST \
  -H "Content-Type: application/json" \
  --data '{"texts": ["This is a test sentence.", "This is another one."]}' \
  http://127.0.0.1:3000/embed

This setup gives us a scalable, independent microservice for our embedding needs, which the Rails application can consume via a simple HTTP client.
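
For a quick sanity check from the Ruby side, a minimal Faraday client works. This is a sketch, assuming the faraday gem and the service running locally on port 3000:

require 'faraday'

conn = Faraday.new(url: 'http://127.0.0.1:3000') do |f|
  f.request :json                           # Encode request bodies as JSON
  f.response :json, content_type: /\bjson$/ # Parse JSON responses
end

# The texts are wrapped under the "texts" key to match the API's parameter name.
response = conn.post('/embed', { texts: ['This is a test sentence.'] })
embedding = response.body.first # => a 384-element array of floats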

ArangoDB Schema for Hybrid Data

The power of this architecture lies in ArangoDB’s schema design. We need two collections and one vector index.

  1. text_chunks (Document Collection): Stores the actual text content along with its pre-computed vector embedding.
  2. relationships (Edge Collection): Stores the connections between chunks, for example, parent_document, see_also, authored_by.
  3. Vector index on text_chunks.embedding: Provides the inverted file (IVF) index necessary for fast approximate nearest neighbor (ANN) vector search. Note that ArangoSearch views index text, not vectors; vector search requires the dedicated vector index type, available from ArangoDB 3.12.4 and, at the time of writing, enabled with the --experimental-vector-index server option.

Here is a sample document structure for a chunk in the text_chunks collection:

{
  "_key": "chunk_abc_123",
  "document_id": "doc_abc",
  "text": "The Rails asset pipeline provides a framework to concatenate and minify...",
  "chunk_seq": 1,
  "embedding": [0.0123, -0.0456, ..., 0.0789]
}

An edge in the relationships collection connecting two chunks might look like this:

{
  "_from": "text_chunks/chunk_abc_123",
  "_to": "text_chunks/chunk_abc_124",
  "label": "is_followed_by"
}

The vector index definition is critical. It defines how the embedding field is indexed for ANN search.

// Run once in arangosh during initial setup. The IVF index is trained on the
// vectors already present, so create it only after the collection has data.
db.text_chunks.ensureIndex({
  name: "embedding_ivf",
  type: "vector",
  fields: ["embedding"],
  params: {
    metric: "cosine",  // Matches the normalized embeddings produced by the service
    dimension: 384,    // Must equal MODEL_DIMENSION in service.py
    nLists: 100        // Number of IVF clusters; a common heuristic is ~sqrt(document count)
  }
});

This is a minimal index definition. A production setup requires a deeper dive into nLists, the query-time nProbe option, and the recall-versus-latency trade-off they control.
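
Because the index dimension is fixed at creation time, it is worth verifying before ingestion that the embedding service actually returns vectors of the expected size. A small sketch, reusing a Faraday client like the one shown earlier (the helper name is hypothetical):

def verify_embedding_dimension!(client, expected: 384)
  # `client` is a Faraday connection configured for the BentoML service.
  probe = client.post('/embed', { texts: ['dimension probe'] }).body.first
  unless probe.is_a?(Array) && probe.size == expected
    raise "Embedding dimension mismatch: got #{probe&.size}, expected #{expected}"
  end
end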

Ingestion Pipeline: A Ruby Rake Task

With the schema defined, we need to populate it. A Rake task in our Rails application is a suitable tool for a batch ingestion process. This task reads source documents, chunks them, calls the BentoML service for embeddings, and persists the data and relationships to ArangoDB.

lib/tasks/ingest.rake:

require 'faraday'
require 'json'

namespace :ingest do
  desc "Ingest documents into ArangoDB with embeddings from BentoML"
  task :process_corpus, [:path] => :environment do |_task, args|
    # --- Configuration ---
    # In a real app, these come from Rails credentials or ENV vars.
    BENTOML_ENDPOINT = 'http://localhost:3000/embed'.freeze
    ARANGO_COORDINATOR = 'http://localhost:8529'.freeze
    ARANGO_DATABASE = '_system'.freeze
    ARANGO_USER = 'root'.freeze
    ARANGO_PASSWORD = 'your_password'.freeze # Replace with actual password

    # --- Setup HTTP and ArangoDB Clients ---
    # Use a persistent connection for performance
    bento_client = Faraday.new(url: BENTOML_ENDPOINT) do |faraday|
      faraday.request :json
      faraday.response :json, content_type: /\bjson$/
      faraday.adapter Faraday.default_adapter
      faraday.options.timeout = 120 # Increase timeout for large batches
    end

    Arango.configure do |config|
      config.hosts = ARANGO_COORDINATOR
      config.database = ARANGO_DATABASE
      config.username = ARANGO_USER
      config.password = ARANGO_PASSWORD
    end
    db = Arango.current_database
    chunks_collection = db.collection('text_chunks')
    relationships_collection = db.collection('relationships')

    # --- Core Logic ---
    puts "Starting ingestion from path: #{args[:path]}"
    # This is a placeholder for actual document parsing logic.
    # It should yield structured data with content and relationships.
    documents_to_process = parse_source_documents(args[:path])

    documents_to_process.each_slice(32) do |batch| # Process in batches
      puts "Processing batch of #{batch.size} documents..."
      texts_to_embed = batch.flat_map { |doc| doc[:chunks].map { |c| c[:text] } }

      # 1. Get embeddings from BentoML
      begin
        response = bento_client.post do |req|
          # BentoML expects the texts under a key matching the API's parameter name.
          req.body = { texts: texts_to_embed }
        end
        unless response.success?
          puts "ERROR: BentoML service returned status #{response.status}. Body: #{response.body}"
          next # Skip this batch
        end
        embeddings = response.body
      rescue Faraday::ConnectionFailed => e
        puts "FATAL: Could not connect to BentoML service at #{BENTOML_ENDPOINT}. #{e.message}"
        break
      end

      # 2. Prepare documents and edges for ArangoDB
      arango_docs = []
      arango_edges = []
      embedding_index = 0

      batch.each do |doc|
        chunk_keys = []
        doc[:chunks].each do |chunk|
          chunk_key = "#{doc[:id]}_#{chunk[:seq]}"
          chunk_keys << chunk_key
          arango_docs << {
            _key: chunk_key,
            document_id: doc[:id],
            text: chunk[:text],
            chunk_seq: chunk[:seq],
            embedding: embeddings[embedding_index]
          }
          embedding_index += 1
        end

        # Create sequential relationship edges
        (0...chunk_keys.size - 1).each do |i|
          arango_edges << {
            _from: "text_chunks/#{chunk_keys[i]}",
            _to: "text_chunks/#{chunk_keys[i+1]}",
            label: "is_followed_by"
          }
        end
      end

      # 3. Import into ArangoDB
      # The `import` method is far more performant than inserting one by one.
      chunks_collection.import(arango_docs)
      relationships_collection.import(arango_edges)
      puts "Successfully imported #{arango_docs.size} chunks and #{arango_edges.size} edges."
    end
  end

  # Dummy method for demonstration. Real implementation would be complex.
  def parse_source_documents(path)
    # e.g., read markdown files, split them into chunks, and identify links
    # to form 'see_also' relationships.
    [{
      id: "doc_1",
      chunks: [
        { seq: 0, text: "Introduction to Rails." },
        { seq: 1, text: "Models in Rails represent data." }
      ]
    }]
  end
end

This task demonstrates a production-oriented pattern: batching requests, using performant bulk import APIs, and basic error handling for network services.
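
One gap worth closing for production use: the task gives up on a batch after a single failed call. A small retry helper with exponential backoff hardens the BentoML request against transient failures. This is a sketch; the helper name, attempt count, and delays are illustrative:

# Hypothetical helper: wrap the BentoML POST in bounded retries with backoff.
def with_retries(max_attempts: 3, base_delay: 0.5)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue Faraday::TimeoutError, Faraday::ConnectionFailed => e
    raise if attempts >= max_attempts
    delay = base_delay * (2**(attempts - 1)) # 0.5s, 1.0s, 2.0s, ...
    puts "WARN: BentoML call failed (#{e.class}), retrying in #{delay}s..."
    sleep delay
    retry
  end
end

# Usage inside the batch loop:
# response = with_retries { bento_client.post { |req| req.body = { texts: texts_to_embed } } }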

The Hybrid Retrieval Service in Rails

This is where the components converge. A Rails service object encapsulates the logic for performing the hybrid graph-vector query. The request flow looks like this:

graph TD
    A[User Query] --> B{Rails Controller};
    B --> C[BentoML Embedding Service];
    C --> D{Query Vector};
    B -- Query Text & Vector --> E[RetrievalService];
    E --> F[ArangoDB];
    F -- Hybrid AQL Query --> G[Contextual Chunks];
    E --> B;
    B --> H[Formatted Response];

The RetrievalService has one primary public method, retrieve. It takes a query string and returns a ranked list of relevant text chunks.

app/services/retrieval_service.rb:

# frozen_string_literal: true

class RetrievalService
  # A common mistake is to instantiate clients on every call.
  # We should use a shared, thread-safe client instance.
  BENTO_CLIENT = Faraday.new(url: ENV.fetch('BENTOML_ENDPOINT')) do |f|
    f.request :json
    f.response :json, content_type: /\bjson$/
    f.adapter :net_http_persistent # Persistent connections (requires the faraday-net_http_persistent gem)
  end

  ARANGO_DB = Arango.current_database

  # Custom error classes for better handling upstream
  class BentoServiceError < StandardError; end
  class ArangoQueryError < StandardError; end

  def self.retrieve(query_text:, limit: 10, graph_depth: 2)
    new(query_text: query_text, limit: limit, graph_depth: graph_depth).execute
  end

  def initialize(query_text:, limit:, graph_depth:)
    @query_text = query_text
    @limit = limit
    @graph_depth = graph_depth
    @query_vector = nil
  end

  def execute
    # 1. Get query embedding
    @query_vector = fetch_embedding(@query_text)

    # 2. Execute the hybrid query
    results = execute_hybrid_aql_query

    # 3. Format and return
    format_results(results)
  end

  private

  def fetch_embedding(text)
    # The text is wrapped under the "texts" key to match the BentoML API's parameter name.
    response = BENTO_CLIENT.post('embed', { texts: [text] })
    raise BentoServiceError, "Failed to get embedding: #{response.body}" unless response.success?
    response.body.first
  rescue Faraday::Error => e
    Rails.logger.error "BentoML connection error: #{e.message}"
    raise BentoServiceError, "Could not connect to embedding service."
  end

  # The heart of the implementation
  def execute_hybrid_aql_query
    # This AQL query is the core of the hybrid approach. It performs two actions:
    # 1. A vector search to find semantically similar seed nodes.
    # 2. A graph traversal starting from those seed nodes to find related context.
    # The results from both are combined, deduplicated, and ranked.
    # APPROX_NEAR_COSINE requires ArangoDB 3.12.4+ with the experimental vector index.
    aql = <<-AQL
      // Part 1: ANN vector search (backed by the IVF vector index) for seed chunks
      LET vector_matches = (
        FOR doc IN text_chunks
          LET score = APPROX_NEAR_COSINE(doc.embedding, @query_vector)
          SORT score DESC
          LIMIT @limit
          RETURN { doc: doc, score: score, type: "vector" }
      )

      // Part 2: Graph traversal from the top vector matches, directly over the
      // `relationships` edge collection (no named graph required)
      LET graph_matches = (
        FOR seed IN SLICE(vector_matches, 0, 3) // Take top 3 as seeds for traversal
          FOR vertex, edge IN 1..@graph_depth ANY seed.doc._id relationships
            OPTIONS { uniqueVertices: "global", order: "bfs" } // "global" requires breadth-first
            FILTER vertex._key != seed.doc._key // Don't include the seed itself
            // More complex filtering on edge labels or vertex properties can go here
            LIMIT @limit
            RETURN { doc: vertex, score: 0.5 * seed.score, type: "graph" } // De-prioritize graph results slightly
      )

      // Part 3: Combine, deduplicate by chunk key (keeping the best score), and rank
      FOR item IN APPEND(vector_matches, graph_matches)
        COLLECT key = item.doc._key INTO grouped = item
        LET best = (FOR g IN grouped SORT g.score DESC LIMIT 1 RETURN g)[0]
        SORT best.score DESC
        LIMIT @limit
        RETURN {
          text: best.doc.text,
          document_id: best.doc.document_id,
          score: best.score,
          retrieval_type: best.type
        }
    AQL

    # In a production environment, bind variables are essential to prevent AQL injection.
    bind_vars = {
      query_vector: @query_vector,
      limit: @limit,
      graph_depth: @graph_depth
    }

    cursor = ARANGO_DB.query(aql, bind_vars: bind_vars, count: true)
    cursor.to_a
  rescue Arango::Error => e
    Rails.logger.error "ArangoDB query failed: #{e.message}. AQL: #{aql}"
    raise ArangoQueryError, "Failed to execute retrieval query."
  end

  def format_results(arango_results)
    # Simple formatting. Could involve further business logic.
    arango_results
  end
end

The AQL query is the most complex part. It first uses the vector index on text_chunks to find the top N chunks most semantically similar to the user’s query. Then, using the top few of those results as seed nodes, it performs a traversal over the relationships edge collection to find connected chunks up to a configurable depth. This pulls in context that is structurally related but might not have been a top hit in the pure vector search. The final result set combines both, deduplicated by chunk key and ranked by score.
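
A quick way to exercise the service during development is from a Rails console (the query text, scores, and returned values here are illustrative):

# rails console
results = RetrievalService.retrieve(query_text: "performance tuning for the asset pipeline", limit: 5)
results.first
# => { "text"           => "The Rails asset pipeline provides a framework to concatenate and minify...",
#      "document_id"    => "doc_abc",
#      "score"          => 0.91,      # illustrative similarity score
#      "retrieval_type" => "vector" }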

A controller would then use this service:
app/controllers/api/v1/search_controller.rb:

class Api::V1::SearchController < ApplicationController
  def query
    query_text = params.require(:q)
    results = RetrievalService.retrieve(query_text: query_text)
    render json: { data: results }, status: :ok
  rescue RetrievalService::BentoServiceError, RetrievalService::ArangoQueryError => e
    render json: { error: e.message }, status: :service_unavailable
  rescue ActionController::ParameterMissing => e
    render json: { error: e.message }, status: :bad_request
  end
end

The current implementation has limitations. The AQL query’s performance depends heavily on the graph’s structure and the chosen traversal depth; for highly connected graphs, it could become slow. The scoring logic, which simply halves the score for graph-retrieved items, is naive and would need to be replaced by a more sophisticated ranking algorithm, potentially involving a dedicated reranker model served by BentoML. Furthermore, the ingestion process is batch-oriented and doesn’t account for real-time updates to the source documents, which would necessitate a more complex event-driven architecture, possibly leveraging ArangoDB’s streaming capabilities or a CDC pipeline. The cost profile of a large ArangoDB cluster capable of holding both graph indexes and vector data in memory must be carefully modeled against separate, specialized solutions.
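
As one concrete direction for the scoring problem, the two result lists could be fused with Reciprocal Rank Fusion (RRF) instead of the fixed 0.5 multiplier. A minimal sketch in Ruby (the method name is hypothetical, k = 60 is the conventional RRF constant, and it assumes the AQL projection is extended to also return each chunk’s _key):

# Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per item, so
# chunks ranked highly in either the vector or the graph list float to the
# top without calibrating raw scores between the two retrieval paths.
def fuse_with_rrf(vector_results, graph_results, k: 60)
  scores = Hash.new(0.0)
  docs = {}
  [vector_results, graph_results].each do |list|
    list.each_with_index do |item, rank|
      id = item[:key] # assumes the chunk _key is included in the AQL RETURN
      scores[id] += 1.0 / (k + rank + 1)
      docs[id] ||= item
    end
  end
  scores.sort_by { |_id, score| -score }
        .map { |id, score| docs[id].merge(fused_score: score) }
end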

