The standard k-Nearest Neighbor (k-NN) search against a vector database provides a powerful baseline for retrieval-augmented generation (RAG) systems. However, its effectiveness often plateaus. The core issue is that raw vector similarity, typically cosine similarity or L2 distance, captures semantic closeness but frequently fails to account for nuanced relevance, query intent, or the diversity of results. A typical OpenSearch k-NN query might return the top 50 document chunks, but the most contextually valuable chunk for the language model could be buried at rank 20, or worse, the top 5 results might be semantically redundant. This high-recall, variable-precision problem directly degrades the quality of the final generated output.
Our initial attempt to solve this involved a monolithic Go service. The plan was to fetch a large candidate set from OpenSearch and then implement a more sophisticated reranking algorithm directly in Go. This approach was quickly abandoned. The algorithms we needed to explore involved complex numerical and statistical operations—pairwise distance matrices, diversity metrics based on clustering, or signal decay functions—that are trivial to implement with libraries like Python’s SciPy and NumPy but would require a massive, error-prone effort to build from scratch in Go. This realization forced a pivot to a polyglot microservice architecture, a decision that introduced its own set of trade-offs around latency, complexity, and operational overhead.
The architecture we settled on consists of three primary components:
- OpenSearch: The vector store for the initial, high-recall retrieval phase.
- Go Orchestrator Service: A high-performance API layer that handles external requests, queries OpenSearch for the initial candidate set, and orchestrates the call to the reranking service.
- Python/SciPy Reranker Service: A specialized gRPC service that accepts the candidate set and applies a computationally intensive algorithm to reorder them based on a more sophisticated relevance model.
Communication between the Go and Python services is handled via gRPC for performance and type safety, which is non-negotiable in a production-grade polyglot system.
```mermaid
sequenceDiagram
    participant FE as Frontend (React/Zustand)
    participant GO as Go Orchestrator API
    participant OS as OpenSearch
    participant PY as Python/SciPy Reranker
    FE->>+GO: POST /api/search (query vector)
    GO->>+OS: k-NN search for top 100 candidates
    OS-->>-GO: 100 candidate documents & vectors
    GO->>+PY: gRPC Rerank(query_vector, candidate_vectors)
    PY-->>-GO: Top 5 reranked document IDs
    GO-->>-FE: Final ranked results
```
This design isolates the specialized numerical computation from the core API logic, allowing each component to be developed and scaled independently. In a real-world project, this separation is critical for maintainability. The Go team shouldn’t need to understand the intricacies of statistical distance metrics, and the data science team shouldn’t need to worry about Go’s concurrency patterns.
1. OpenSearch Index Configuration
The foundation of the system is a properly configured OpenSearch index. For k-NN search, the mapping must define a field with the `knn_vector` type. The choice of engine (`nmslib`, `faiss`) and parameters (`space_type`, `m`, `ef_construction`) is critical and depends on the specific trade-offs between indexing speed, query latency, and accuracy. For this implementation, we use the default `nmslib` engine.
A common pitfall here is neglecting to store the source text along with the vector. While you could store vectors and text in separate places, co-locating them simplifies the retrieval logic immensely, avoiding a second lookup to fetch document content after the vector search is complete.
Here is the index mapping:
{
"settings": {
"index.knn": true,
"index.knn.algo_param.ef_search": 100
},
"mappings": {
"properties": {
"embedding": {
"type": "knn_vector",
"dimension": 768,
"method": {
"name": "hnsw",
"space_type": "cosinesimil",
"engine": "nmslib",
"parameters": {
"ef_construction": 128,
"m": 24
}
}
},
"text": {
"type": "text"
},
"doc_id": {
"type": "keyword"
}
}
}
}
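Creating the index with this mapping is straightforward with the official client's indices API. The following is a minimal sketch, assuming the mapping JSON above is available as a string; the file name is illustrative.

```go
// pkg/opensearch/setup.go (illustrative file name)
package opensearch

import (
	"context"
	"fmt"
	"strings"

	"github.com/opensearch-project/opensearch-go/v2"
	"github.com/opensearch-project/opensearch-go/v2/opensearchapi"
)

// CreateIndex creates the k-NN index using the raw JSON mapping shown above.
func CreateIndex(ctx context.Context, osClient *opensearch.Client, indexName, mapping string) error {
	req := opensearchapi.IndicesCreateRequest{
		Index: indexName,
		Body:  strings.NewReader(mapping),
	}
	res, err := req.Do(ctx, osClient)
	if err != nil {
		return fmt.Errorf("failed to execute create index request: %w", err)
	}
	defer res.Body.Close()
	if res.IsError() {
		return fmt.Errorf("index creation error: %s", res.String())
	}
	return nil
}
```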
We can populate this index using a simple Go client. This is a one-off task for setup, but in a real system, this would be part of a larger data ingestion pipeline.
// pkg/opensearch/client.go
package opensearch
import (
"bytes"
"context"
"encoding/json"
"fmt"
"log/slog"
"time"
"github.com/opensearch-project/opensearch-go/v2"
"github.com/opensearch-project/opensearch-go/v2/opensearchapi"
)
type Document struct {
DocID string `json:"doc_id"`
Text string `json:"text"`
Embedding []float32 `json:"embedding"`
}
// IndexDocument indexes a single document into OpenSearch.
func IndexDocument(ctx context.Context, osClient *opensearch.Client, indexName string, doc Document) error {
body, err := json.Marshal(doc)
if err != nil {
return fmt.Errorf("failed to marshal document: %w", err)
}
req := opensearchapi.IndexRequest{
Index: indexName,
DocumentID: doc.DocID,
Body: bytes.NewReader(body),
Refresh: "true",
}
res, err := req.Do(ctx, osClient)
if err != nil {
return fmt.Errorf("failed to execute index request: %w", err)
}
defer res.Body.Close()
if res.IsError() {
return fmt.Errorf("indexing error: %s", res.String())
}
slog.Info("Successfully indexed document", "doc_id", doc.DocID)
return nil
}
This is a simplified indexing function. A production system would batch requests for efficiency and include more robust error handling and retry logic.
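As a sketch of what that batching could look like, the function below uses the `_bulk` endpoint, which accepts newline-delimited pairs of action metadata and document source. It is meant to live in the same `pkg/opensearch` package as `IndexDocument` and reuses its imports; chunking of very large batches, retries, and per-item error inspection are deliberately omitted.

```go
// BulkIndexDocuments indexes a batch of documents in a single _bulk request.
// A minimal sketch: production code should also chunk very large batches,
// retry transient failures, and inspect per-item errors in the response body.
func BulkIndexDocuments(ctx context.Context, osClient *opensearch.Client, indexName string, docs []Document) error {
	var buf bytes.Buffer
	for _, doc := range docs {
		// Each document is a pair of NDJSON lines: action metadata, then source.
		meta, err := json.Marshal(map[string]map[string]string{
			"index": {"_index": indexName, "_id": doc.DocID},
		})
		if err != nil {
			return fmt.Errorf("failed to marshal bulk metadata: %w", err)
		}
		src, err := json.Marshal(doc)
		if err != nil {
			return fmt.Errorf("failed to marshal document %s: %w", doc.DocID, err)
		}
		buf.Write(meta)
		buf.WriteByte('\n')
		buf.Write(src)
		buf.WriteByte('\n')
	}
	req := opensearchapi.BulkRequest{Body: bytes.NewReader(buf.Bytes())}
	res, err := req.Do(ctx, osClient)
	if err != nil {
		return fmt.Errorf("failed to execute bulk request: %w", err)
	}
	defer res.Body.Close()
	if res.IsError() {
		return fmt.Errorf("bulk indexing error: %s", res.String())
	}
	return nil
}
```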
2. The SciPy Reranking Microservice
This is where the unique logic resides. The service exposes a single gRPC method, `Rerank`, which takes a query vector and a list of candidate document vectors. Its job is to re-score and re-order these candidates.
First, the protocol buffer definition (`reranker.proto`):
// protos/reranker.proto
syntax = "proto3";
package reranker;
option go_package = "gen/reranker";
message Vector {
repeated float values = 1;
}
message RerankRequest {
Vector query_vector = 1;
map<string, Vector> candidate_vectors = 2; // Key is doc_id
}
message RerankedDocument {
string doc_id = 1;
float score = 2;
}
message RerankResponse {
repeated RerankedDocument documents = 1;
}
service Reranker {
rpc Rerank(RerankRequest) returns (RerankResponse) {}
}
The design choice to use a `map<string, Vector>` for `candidate_vectors` is intentional. It ensures we can correlate the vectors back to their original document IDs without relying on list indices, which is a common source of bugs.
The Python server implementation leverages `scipy` and `numpy`. Our reranking algorithm is a classic, illustrative one: Maximal Marginal Relevance (MMR). MMR aims to reduce redundancy in the results by selecting documents that are both relevant to the query and dissimilar to documents already selected.
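Formally, at each step MMR picks the next document $d_i$ from the remaining candidates $R \setminus S$ by balancing relevance against redundancy:

$$
\text{MMR} = \arg\max_{d_i \in R \setminus S} \Big[ \lambda \, \text{sim}(d_i, q) - (1 - \lambda) \max_{d_j \in S} \text{sim}(d_i, d_j) \Big]
$$

where $q$ is the query vector, $S$ is the set of documents already selected, and $\lambda \in [0, 1]$ controls the relevance/diversity trade-off ($\lambda = 1$ degenerates to plain similarity ranking). This is exactly the greedy loop the server below implements.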
# reranker_server.py
import grpc
import numpy as np
from concurrent import futures
from scipy.spatial.distance import cdist
import reranker_pb2
import reranker_pb2_grpc
class RerankerService(reranker_pb2_grpc.RerankerServicer):
def Rerank(self, request, context):
"""
Reranks candidate documents using a Maximal Marginal Relevance (MMR) approach.
"""
query_vector = np.array(request.query_vector.values, dtype=np.float32).reshape(1, -1)
        # A common mistake is failing to handle empty candidate sets gracefully.
if not request.candidate_vectors:
return reranker_pb2.RerankResponse(documents=[])
candidate_ids = list(request.candidate_vectors.keys())
candidate_vectors_list = [np.array(v.values, dtype=np.float32) for v in request.candidate_vectors.values()]
candidate_vectors = np.vstack(candidate_vectors_list)
# In a real-world project, lambda would be a tunable parameter.
lambda_param = 0.5
num_results = 5 # Return top 5
# Calculate similarity between query and all candidates
query_candidate_similarity = 1 - cdist(query_vector, candidate_vectors, 'cosine')
query_candidate_similarity = query_candidate_similarity.flatten()
# Calculate similarity between all pairs of candidates
candidate_similarity_matrix = 1 - cdist(candidate_vectors, candidate_vectors, 'cosine')
remaining_indices = list(range(len(candidate_ids)))
selected_indices = []
# Greedily select the best documents
if not remaining_indices:
return reranker_pb2.RerankResponse(documents=[])
# First document is the one most similar to the query
        first_selection_idx = int(np.argmax(query_candidate_similarity))  # plain int for list bookkeeping
selected_indices.append(first_selection_idx)
remaining_indices.remove(first_selection_idx)
# Iteratively select remaining documents based on MMR score
while len(selected_indices) < num_results and remaining_indices:
mmr_scores = []
for i in remaining_indices:
similarity_to_query = query_candidate_similarity[i]
# Find max similarity to already selected documents
max_similarity_to_selected = 0
if selected_indices:
# The pitfall here is performance. For large candidate sets, this inner loop is costly.
# We are accessing rows of a potentially large matrix.
max_similarity_to_selected = np.max(candidate_similarity_matrix[i, selected_indices])
mmr_score = lambda_param * similarity_to_query - (1 - lambda_param) * max_similarity_to_selected
mmr_scores.append((mmr_score, i))
if not mmr_scores:
break
best_mmr_idx = max(mmr_scores, key=lambda x: x[0])[1]
selected_indices.append(best_mmr_idx)
remaining_indices.remove(best_mmr_idx)
# Prepare response
response = reranker_pb2.RerankResponse()
final_docs = []
for idx in selected_indices:
final_docs.append(reranker_pb2.RerankedDocument(
doc_id=candidate_ids[idx],
score=float(query_candidate_similarity[idx]) # Score is original similarity to query
))
response.documents.extend(final_docs)
return response
def serve():
server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
reranker_pb2_grpc.add_RerankerServicer_to_server(RerankerService(), server)
server.add_insecure_port('[::]:50051')
print("Starting reranker server on port 50051...")
server.start()
server.wait_for_termination()
if __name__ == '__main__':
serve()
The code is heavily commented because the logic is non-trivial. It demonstrates how SciPy’s `cdist` can efficiently compute the required cosine similarity matrices, forming the core of the MMR calculation. This logic would have been far more verbose, and likely less performant, if implemented manually in Go.
3. The Go Orchestration Service
This service is the system’s brain. It exposes a RESTful API, performs the initial data retrieval from OpenSearch, and calls the Python gRPC service for the final ranking.
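Before the handler itself, here is a minimal sketch of how the dependencies might be wired together in `main`. The addresses, port, and index name are illustrative, and the insecure gRPC channel is for local development only; production deployments would use TLS and proper configuration.

```go
// cmd/server/main.go
package main

import (
	"log"
	"net/http"

	"gen/reranker" // Generated gRPC code

	"github.com/opensearch-project/opensearch-go/v2"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// OpenSearch client for the retrieval phase.
	osClient, err := opensearch.NewClient(opensearch.Config{
		Addresses: []string{"http://localhost:9200"},
	})
	if err != nil {
		log.Fatalf("failed to create OpenSearch client: %v", err)
	}

	// gRPC connection to the Python reranker service.
	conn, err := grpc.Dial("localhost:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("failed to dial reranker service: %v", err)
	}
	defer conn.Close()

	handler := &SearchHandler{
		osClient:       osClient,
		rerankerClient: reranker.NewRerankerClient(conn),
		osIndexName:    "documents", // illustrative index name
	}

	mux := http.NewServeMux()
	mux.Handle("/api/search", handler)
	log.Println("orchestrator listening on :8080")
	log.Fatal(http.ListenAndServe(":8080", mux))
}
```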
The core logic resides in the HTTP handler.
// cmd/server/handler.go
package main
import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"log/slog"
	"net/http"
	"time"

	"gen/reranker" // Generated gRPC code

	"github.com/opensearch-project/opensearch-go/v2"
	"github.com/opensearch-project/opensearch-go/v2/opensearchapi"
	store "example.com/rag/pkg/opensearch" // local package from section 1; module path is illustrative
)
type SearchRequest struct {
QueryVector []float32 `json:"query_vector"`
TopK int `json:"top_k"`
}
type SearchHandler struct {
osClient *opensearch.Client
rerankerClient reranker.RerankerClient
osIndexName string
}
func (h *SearchHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodPost {
http.Error(w, "Only POST method is allowed", http.StatusMethodNotAllowed)
return
}
var req SearchRequest
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
http.Error(w, "Invalid request body", http.StatusBadRequest)
return
}
// A common mistake is not setting timeouts for downstream calls.
// This context will propagate to both OpenSearch and gRPC calls.
ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
defer cancel()
// 1. Retrieval Phase from OpenSearch (high recall)
// We fetch a much larger set (e.g., 100) than the user requested (e.g., 5).
const candidateSetSize = 100
initialHits, err := h.searchOpenSearch(ctx, req.QueryVector, candidateSetSize)
if err != nil {
slog.Error("OpenSearch query failed", "error", err)
http.Error(w, "Failed to retrieve initial candidates", http.StatusInternalServerError)
return
}
if len(initialHits) == 0 {
w.Header().Set("Content-Type", "application/json")
w.Write([]byte("[]")) // Return empty list, not an error
return
}
// 2. Reranking Phase via gRPC (high precision)
rerankReq := h.prepareRerankRequest(req.QueryVector, initialHits)
rerankResp, err := h.rerankerClient.Rerank(ctx, rerankReq)
if err != nil {
slog.Error("Reranker gRPC call failed", "error", err)
http.Error(w, "Failed during result reranking", http.StatusInternalServerError)
return
}
// 3. Final Response Assembly
// We must fetch the full document content for the final, reranked IDs.
// This demonstrates a trade-off: sending full text over gRPC is inefficient.
// Sending only IDs is better, but requires a final fetch from the source map.
finalResults := h.assembleFinalResults(rerankResp.Documents, initialHits)
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(finalResults)
}
// searchOpenSearch performs the k-NN query.
func (h *SearchHandler) searchOpenSearch(ctx context.Context, vector []float32, size int) (map[string]store.Document, error) {
	// ... implementation for OpenSearch k-NN query ...
	// This function builds the JSON k-NN query and parses the response into
	// a map[string]store.Document for easy lookup by doc_id.
	// A sketch of one possible implementation appears at the end of this section.
	return make(map[string]store.Document), nil // Placeholder
}
// prepareRerankRequest converts OpenSearch hits to the gRPC request format.
func (h *SearchHandler) prepareRerankRequest(queryVector []float32, hits map[string]store.Document) *reranker.RerankRequest {
candidateVectors := make(map[string]*reranker.Vector)
for id, doc := range hits {
candidateVectors[id] = &reranker.Vector{Values: doc.Embedding}
}
return &reranker.RerankRequest{
QueryVector: &reranker.Vector{Values: queryVector},
CandidateVectors: candidateVectors,
}
}
// assembleFinalResults constructs the final ordered list of documents.
func (h *SearchHandler) assembleFinalResults(rerankedDocs []*reranker.RerankedDocument, initialHits map[string]store.Document) []store.Document {
	var results []store.Document
for _, rerankedDoc := range rerankedDocs {
if doc, ok := initialHits[rerankedDoc.DocId]; ok {
results = append(results, doc)
}
}
return results
}
The handler demonstrates a robust flow: a context with timeout, a multi-stage pipeline (retrieve, then rerank), and careful data transformation between stages. The `searchOpenSearch` implementation is stubbed out above for brevity; it would use the official Go client to build and execute a k-NN query, and a possible sketch follows below. The `assembleFinalResults` function highlights a key architectural detail: we only pass IDs and vectors to the reranker, not the full document text, to minimize network payload. The final result is then re-assembled in the Go service.
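This sketch fills in the placeholder under stated assumptions: the standard OpenSearch `knn` query DSL, the usual `hits.hits` response envelope, and the `bytes`, `fmt`, and `opensearchapi` imports shown earlier. Error handling is kept deliberately thin.

```go
func (h *SearchHandler) searchOpenSearch(ctx context.Context, vector []float32, size int) (map[string]store.Document, error) {
	// Standard OpenSearch k-NN query DSL: the knn clause wraps the vector field.
	query := map[string]any{
		"size": size,
		"query": map[string]any{
			"knn": map[string]any{
				"embedding": map[string]any{
					"vector": vector,
					"k":      size,
				},
			},
		},
	}
	body, err := json.Marshal(query)
	if err != nil {
		return nil, fmt.Errorf("failed to marshal k-NN query: %w", err)
	}
	req := opensearchapi.SearchRequest{
		Index: []string{h.osIndexName},
		Body:  bytes.NewReader(body),
	}
	res, err := req.Do(ctx, h.osClient)
	if err != nil {
		return nil, fmt.Errorf("search request failed: %w", err)
	}
	defer res.Body.Close()
	if res.IsError() {
		return nil, fmt.Errorf("search error: %s", res.String())
	}
	// Decode only the fields we need from the hits.hits envelope.
	var parsed struct {
		Hits struct {
			Hits []struct {
				Source store.Document `json:"_source"`
			} `json:"hits"`
		} `json:"hits"`
	}
	if err := json.NewDecoder(res.Body).Decode(&parsed); err != nil {
		return nil, fmt.Errorf("failed to decode search response: %w", err)
	}
	out := make(map[string]store.Document, len(parsed.Hits.Hits))
	for _, hit := range parsed.Hits.Hits {
		out[hit.Source.DocID] = hit.Source
	}
	return out, nil
}
```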
4. Frontend State Management with Zustand
On the frontend, the user experience for a potentially slow, multi-stage search is critical. We cannot simply show a loading spinner for the entire duration. Zustand’s simplicity is a major asset here for managing the complex state transitions: `idle -> loading_initial -> loading_rerank -> success | error`.
A basic Zustand store for this flow:
// store.js
import { create } from 'zustand';
export const useSearchStore = create((set) => ({
// State machine: 'idle', 'loading', 'success', 'error'
status: 'idle',
results: [],
error: null,
// The main action to trigger the search flow
performSearch: async (queryVector) => {
set({ status: 'loading', results: [], error: null });
try {
const response = await fetch('/api/search', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ query_vector: queryVector, top_k: 5 }),
});
if (!response.ok) {
throw new Error(`API Error: ${response.statusText}`);
}
const data = await response.json();
set({ status: 'success', results: data });
} catch (err) {
// A common mistake is not providing user-friendly error messages.
set({ status: 'error', error: err.message });
}
},
}));
The corresponding React component can then reactively render the UI based on the store’s `status`.
// SearchResults.jsx
import React from 'react';
import { useSearchStore } from './store';
function SearchResults() {
const status = useSearchStore((state) => state.status);
const results = useSearchStore((state) => state.results);
const error = useSearchStore((state) => state.error);
if (status === 'loading') {
return <div>Loading sophisticated results...</div>;
}
if (status === 'error') {
return <div style={{ color: 'red' }}>Error: {error}</div>;
}
if (status === 'success' && results.length === 0) {
return <div>No results found.</div>;
}
return (
<ul>
{results.map((doc) => (
<li key={doc.doc_id}>
<strong>Document ID: {doc.doc_id}</strong>
<p>{doc.text}</p>
</li>
))}
</ul>
);
}

export default SearchResults;
While this example uses a single `loading` state, a more advanced implementation could involve the Go API streaming results. For instance, it could return the top 10 initial OpenSearch results immediately, then push the final reranked results over a WebSocket. Zustand would be perfectly capable of managing this progressive data enhancement, providing a much better user experience.
Limitations and Future Paths
The primary limitation of this architecture is the added latency from the network hop to the Python service and the computational cost of the reranking algorithm itself. A synchronous API call, as implemented here, puts the entire round-trip time on the critical path of the user request. For systems requiring sub-100ms response times, this would be unacceptable. A potential optimization is to make the reranking step asynchronous. The API could immediately return the top-k results from OpenSearch and then notify the client via WebSocket or polling when the superior, reranked results are ready.
Furthermore, the Python gRPC service is a single point of failure and a performance bottleneck. In a production environment, it would need to be replicated and placed behind a load balancer. Careful monitoring of its CPU usage would be essential, as the `cdist` and matrix operations can be resource-intensive, especially with large candidate sets (>1000) or high-dimensional vectors.
Finally, the MMR algorithm is just one example. The true power of this architecture is its flexibility. The Python service can be hot-swapped with different models—from simple heuristic-based ones to complex neural network-based rerankers—without ever touching the robust, high-performance Go orchestration layer. This separation of concerns is the architecture’s main strength, but it comes at the cost of operational complexity that must be managed.