The core business logic resides in a mature, stable Ruby on Rails monolith. This application manages product inventory, and a new requirement emerged: automatically categorize new products using a machine learning model. The data science team operates exclusively in the Python ecosystem and, for regulatory reasons, requires a verifiable and reproducible link between the exact dataset, the model artifact, and the prediction served to the user. Our initial, naive attempts to bridge this gap failed spectacularly. Shelling out to a Python script from Rails was a performance and security disaster, and Ruby-based ML libraries lacked the ecosystem support we needed, producing inferior models. We were faced with a classic heterogeneous-system problem, compounded by MLOps requirements and by the UX needs of warehouse staff who work on tablets with unreliable network connectivity.
This forced a fundamental rethink, leading to a decoupled, service-oriented architecture. The goal was to allow the Python and Rails stacks to evolve independently while creating a resilient, offline-capable user interface. The technical pain point was not just serving a prediction, but doing so in a way that was reproducible, performant, and robust against network failures.
Our final architecture consists of four distinct components:
- A DVC-managed Python Inference Service: A lightweight FastAPI service responsible for loading a specific model version and serving predictions. DVC guarantees that the model artifact is version-controlled in lockstep with the code and data.
- A Ruby on Rails Backend-for-Frontend (BFF): An API endpoint within the existing monolith that acts as a secure facade. It handles user authentication and communicates with the Python service over the internal network.
- A React/Ant Design Frontend Component: A modern UI component, embedded in a classic Rails view, providing a rich and responsive user experience for the prediction task.
- A Service Worker for Offline Caching: This browser-level proxy intercepts network requests, caching both the UI assets and API responses, making the entire feature functional even when the user goes offline.
Here is a high-level overview of the request flow when the system is online.
sequenceDiagram
    participant User
    participant Browser as Browser (React/AntD)
    participant SW as Service Worker
    participant Rails as Rails Monolith (BFF)
    participant Python as Python Service (Inference)
    participant DVC as DVC Storage (S3/GCS)
    User->>Browser: Interacts with product page
    Browser->>SW: Fetch prediction for product_id: 123
    SW->>Rails: GET /api/v1/products/123/prediction
    Note right of SW: Network is available, pass through
    Rails->>Python: POST /predict (product_data)
    Note over Python: On startup, pulled model from DVC Storage
    Python-->>Rails: { "category": "electronics" }
    Rails-->>SW: 200 OK with JSON payload
    SW->>Browser: Returns response
    Note left of SW: Caches the response for future offline use
    Browser->>User: Displays "electronics" category
The true value is realized when the network fails. The Service Worker intercepts the request and serves the data from its cache, providing an uninterrupted experience.
The DVC-Managed Python Inference Service
The cornerstone of reproducibility is Data Version Control (DVC). It allows us to treat data and models as first-class citizens in our Git workflow.
First, the project structure for the Python service:
/inference_service
├── .dvc/
├── data/
│   ├── raw_products.csv.dvc   # DVC pointer file for the training data
│   └── product_model.pkl      # Pipeline output (gitignored; tracked via dvc.lock)
├── scripts/
│   └── train.py               # Training script referenced by dvc.yaml
├── app/
│   ├── main.py                # FastAPI application
│   ├── models.py              # Pydantic models
│   └── ml_loader.py           # Logic to load the model
├── tests/
│   └── test_api.py
├── .gitignore
├── Dockerfile
├── dvc.yaml                   # DVC pipeline definition
├── dvc.lock                   # Pins exact hashes of pipeline inputs and outputs
└── requirements.txt
The dvc.yaml file defines the ML pipeline and ensures that anyone checking out the repository can reproduce the model artifact.
# dvc.yaml
stages:
  train:
    cmd: python scripts/train.py --data-path data/raw_products.csv --output-path data/product_model.pkl
    deps:
      - scripts/train.py
      - data/raw_products.csv
    outs:
      - data/product_model.pkl
When dvc repro is run, it executes the training script and tracks the output product_model.pkl with DVC. The artifact itself lives in remote storage (such as S3); Git only stores dvc.lock and the small .dvc pointer files, which pin the exact hashes of the data and model.
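The day-to-day workflow is then a handful of commands. The remote name and bucket below are illustrative, and the remote only needs to be configured once:

# One-time: register remote storage for artifacts (bucket name is illustrative)
dvc remote add -d storage s3://acme-ml-artifacts/product-classifier

# Re-run the pipeline; hashes of the outputs are recorded in dvc.lock
dvc repro

# Upload the artifact to the remote and commit the pointers to Git
dvc push
git add dvc.yaml dvc.lock
git commit -m "Retrain product classifier"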
The FastAPI service is designed for one job: inference.
# app/models.py
from pydantic import BaseModel
from typing import List

class ProductFeatures(BaseModel):
    product_id: int
    description: str
    dimensions: List[float]

class PredictionResponse(BaseModel):
    product_id: int
    category: str
    confidence: float
# app/ml_loader.py
import pickle
import logging
from pathlib import Path
from functools import lru_cache

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

MODEL_PATH = Path("./data/product_model.pkl")

# In a real-world project, this would be more complex, handling different model types.
# The lru_cache ensures we only load the model from disk once.
@lru_cache(maxsize=1)
def load_model():
    """
    Loads the ML model from the path defined by MODEL_PATH.
    This function is cached to prevent reloading on every request.
    """
    if not MODEL_PATH.exists():
        logger.error(f"Model file not found at {MODEL_PATH}")
        raise FileNotFoundError(f"Model file not found at {MODEL_PATH}")
    try:
        with open(MODEL_PATH, "rb") as f:
            model = pickle.load(f)
        logger.info("ML model loaded successfully.")
        return model
    except Exception as e:
        logger.error(f"Failed to load model: {e}", exc_info=True)
        raise
The main application file ties everything together. We use dependency injection for the model to make testing easier.
# app/main.py
import logging
from fastapi import FastAPI, Depends, HTTPException
from app.models import ProductFeatures, PredictionResponse
from app.ml_loader import load_model

app = FastAPI(title="Product Classification Service")
logger = logging.getLogger(__name__)

# This startup event is a good place for initial checks, but model loading
# is deferred to the first request thanks to lru_cache for faster starts.
@app.on_event("startup")
async def startup_event():
    logger.info("Inference service is starting up.")
    # You could perform a test load here if you want to fail fast:
    # try:
    #     load_model()
    # except Exception:
    #     logger.critical("CRITICAL: Model could not be loaded on startup. Exiting.")
    #     # Re-raising in a k8s environment causes the pod to crash-loop,
    #     # which is the desired behavior for a critical failure.
    #     raise

@app.post("/predict", response_model=PredictionResponse)
def predict(features: ProductFeatures, model=Depends(load_model)):
    """
    Accepts product features and returns a classification.
    The model is injected as a dependency.
    """
    try:
        # In a real model, you'd preprocess features into a numpy array or similar
        model_input = [features.description]  # Simplified for demonstration
        # The model's predict() and predict_proba() methods are hypothetical
        prediction = model.predict(model_input)[0]
        confidence = model.predict_proba(model_input).max()
        return PredictionResponse(
            product_id=features.product_id,
            category=str(prediction),
            confidence=float(confidence)
        )
    except FileNotFoundError:
        # This case is handled by the loader, but as a safeguard:
        logger.error("Model file not found during prediction request.")
        raise HTTPException(status_code=503, detail="Model is not available")
    except Exception as e:
        logger.error(f"Prediction failed for product {features.product_id}: {e}", exc_info=True)
        raise HTTPException(status_code=500, detail="Internal server error during prediction")

@app.get("/health")
def health_check():
    """Simple health check endpoint."""
    return {"status": "ok"}
The final piece is the Dockerfile. A key step is running dvc pull to fetch the model artifact from remote storage during the build, which ensures the Docker image is self-contained with the exact model version pinned in dvc.lock.
# Dockerfile
FROM python:3.9-slim
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Install DVC with the S3 backend (use [gcs], [azure], etc. for other remotes)
RUN pip install --no-cache-dir "dvc[s3]"
# Copy application code
COPY . .
# -- IMPORTANT STEP --
# Pull the DVC-tracked data. This fetches the model from remote storage.
# The credentials for the remote storage (e.g., AWS keys) must be available
# at build time, e.g., via build secrets or an assumed role.
RUN dvc pull -v
# Expose the port the app runs on
EXPOSE 8000
# Run the application
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
The Rails Backend-for-Frontend (BFF)
The Rails monolith should not expose the Python service directly to the internet. Instead, we create a BFF endpoint that acts as a proxy. This provides a single point of authentication and control, prevents CORS issues, and abstracts the implementation details of the ML service from the frontend.
In config/routes.rb:
# config/routes.rb
namespace :api do
namespace :v1 do
resources :products, only: [] do
get :prediction, to: 'product_predictions#show'
end
end
end
The controller handles the logic. A common mistake is to make blocking HTTP calls without proper error handling; in a real-world project we must account for timeouts, network failures, and non-200 responses from the downstream service. The example below uses HTTParty for brevity, but a gem like Faraday with middleware is an equally robust approach (a sketch follows the controller).
# app/controllers/api/v1/product_predictions_controller.rb
class Api::V1::ProductPredictionsController < ApplicationController
# Assume some form of authentication, e.g., Devise or Doorkeeper
before_action :authenticate_user!
before_action :set_product
# A simple service object to encapsulate the logic of calling the ML service.
# This keeps the controller lean.
class InferenceService
include HTTParty # Or Faraday for more advanced features like middleware
base_uri ENV.fetch('INFERENCE_SERVICE_URL', 'http://localhost:8000')
format :json
# In production, timeouts are critical.
default_timeout 5 # seconds
def self.predict(product)
body = {
product_id: product.id,
description: product.description,
dimensions: [product.width, product.height, product.depth]
}.to_json
headers = { 'Content-Type' => 'application/json' }
begin
response = post('/predict', body: body, headers: headers)
handle_response(response)
rescue Net::ReadTimeout, Net::OpenTimeout => e
# Log the error properly
Rails.logger.error("Inference service timeout for product #{product.id}: #{e.message}")
{ error: 'Prediction service timed out', status: :gateway_timeout }
rescue StandardError => e
Rails.logger.error("Inference service connection error for product #{product.id}: #{e.message}")
{ error: 'Prediction service is unavailable', status: :service_unavailable }
end
end
# NOTE: `private` does not apply to class methods, so mark this one explicitly.
private_class_method def self.handle_response(response)
if response.success?
response.parsed_response
else
Rails.logger.warn("Inference service returned error for product: #{response.code} - #{response.body}")
{ error: "Prediction failed with status #{response.code}", status: response.code }
end
end
end
def show
result = InferenceService.predict(@product)
if result[:error]
render json: { error: result[:error] }, status: result[:status] || :internal_server_error
else
render json: result, status: :ok
end
end
private
def set_product
@product = Product.find_by(id: params[:product_id])
render json: { error: 'Product not found' }, status: :not_found unless @product
end
end
The controller itself stays lean: the InferenceService encapsulates the messy details of the HTTP call, including timeouts and error handling, which is crucial for system stability.
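For teams that prefer Faraday, the same service object can be built on a shared connection with JSON and retry middleware. A rough sketch, assuming the faraday and faraday-retry gems are bundled; the class name and option values are illustrative:

# app/services/inference_client.rb (illustrative Faraday variant)
class InferenceClient
  def self.connection
    @connection ||= Faraday.new(url: ENV.fetch('INFERENCE_SERVICE_URL', 'http://localhost:8000')) do |f|
      f.request :json                                                    # serialize request bodies as JSON
      f.request :retry, max: 2, interval: 0.2, retry_statuses: [502, 503]
      f.response :json                                                   # parse JSON responses
      f.options.timeout = 5                                              # read timeout, seconds
      f.options.open_timeout = 2                                         # connection timeout, seconds
    end
  end

  def self.predict(product)
    response = connection.post('/predict', {
      product_id: product.id,
      description: product.description,
      dimensions: [product.width, product.height, product.depth]
    })
    response.success? ? response.body : { error: "Prediction failed with status #{response.status}" }
  rescue Faraday::TimeoutError, Faraday::ConnectionFailed => e
    Rails.logger.error("Inference service unreachable for product #{product.id}: #{e.message}")
    { error: 'Prediction service is unavailable' }
  end
end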
The React/Ant Design Frontend with Service Worker
On the client side, we embed a React component into a Rails ERB view. This component will fetch data from the Rails BFF endpoint and use Ant Design for a polished UI.
The component manages its own state: loading, success (data), and error.
// app/javascript/components/ProductPredictor.jsx
import React, { useState, useEffect } from 'react';
import { Button, Spin, Alert, Descriptions, Tag } from 'antd';
import axios from 'axios';
const ProductPredictor = ({ productId }) => {
const [prediction, setPrediction] = useState(null);
const [isLoading, setIsLoading] = useState(false);
const [error, setError] = useState(null);
const [isFromCache, setIsFromCache] = useState(false);
const fetchPrediction = async () => {
setIsLoading(true);
setError(null);
setPrediction(null);
try {
const response = await axios.get(`/api/v1/products/${productId}/prediction`);
setPrediction(response.data);
// A custom header or property can be added by the Service Worker to indicate a cached response.
// This is an advanced pattern for providing feedback to the user.
if (response.headers['x-from-sw-cache']) {
setIsFromCache(true);
}
} catch (err) {
const errorMessage = err.response?.data?.error || 'Failed to fetch prediction.';
setError(errorMessage);
} finally {
setIsLoading(false);
}
};
useEffect(() => {
// Automatically fetch prediction on component mount.
fetchPrediction();
}, [productId]);
const renderContent = () => {
if (isLoading) {
return <Spin tip="Analyzing product..." />;
}
if (error) {
return <Alert message="Error" description={error} type="error" showIcon />;
}
if (prediction) {
return (
<div>
{isFromCache && <Alert message="Displaying cached data. You appear to be offline." type="warning" banner />}
<Descriptions title="Prediction Result" bordered>
<Descriptions.Item label="Predicted Category">
<Tag color="blue">{prediction.category}</Tag>
</Descriptions.Item>
<Descriptions.Item label="Confidence">
{(prediction.confidence * 100).toFixed(2)}%
</Descriptions.Item>
</Descriptions>
</div>
);
}
return null;
};
return (
<div style={{ padding: '24px', backgroundColor: '#fff' }}>
{renderContent()}
<Button onClick={fetchPrediction} disabled={isLoading} style={{ marginTop: '16px' }}>
Re-analyze Product
</Button>
</div>
);
};
export default ProductPredictor;
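The component still needs a mount point in the Rails view. A minimal sketch of the pack that mounts it, assuming the ERB template renders <div id="product-predictor" data-product-id="<%= @product.id %>"> and using the classic ReactDOM.render API; the element id and pack name are illustrative:

// app/javascript/packs/product_predictor.js (illustrative)
import React from 'react';
import ReactDOM from 'react-dom';
import ProductPredictor from '../components/ProductPredictor';

document.addEventListener('DOMContentLoaded', () => {
  const node = document.getElementById('product-predictor');
  if (node) {
    // Read the product id that the ERB view exposed via a data attribute
    const productId = parseInt(node.dataset.productId, 10);
    ReactDOM.render(<ProductPredictor productId={productId} />, node);
  }
});

The pack is then included on the page with javascript_pack_tag in the ERB view.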
Now for the Service Worker. We can use Google's Workbox library to greatly simplify the implementation. In our webpack configuration (config/webpack/environment.js for a standard Rails setup), we add the workbox-webpack-plugin.
// config/webpack/environment.js
const { environment } = require('@rails/webpacker')
const { GenerateSW } = require('workbox-webpack-plugin');
environment.plugins.append('Workbox', new GenerateSW({
clientsClaim: true,
skipWaiting: true,
runtimeCaching: [
{
// Cache the prediction API calls.
// NetworkFirst: try network, if it fails, use the cache.
// This is good for data that changes but can be stale.
urlPattern: new RegExp('/api/v1/products/\\d+/prediction'),
handler: 'NetworkFirst',
options: {
cacheName: 'api-predictions-cache',
expiration: {
maxEntries: 50,
maxAgeSeconds: 24 * 60 * 60, // 1 day
},
networkTimeoutSeconds: 4, // Fail fast if network is slow
},
},
{
// Cache application assets (JS, CSS)
// StaleWhileRevalidate: serve from cache for speed, then update in the background.
urlPattern: /\.(?:js|css|png|jpg|jpeg|svg)$/,
handler: 'StaleWhileRevalidate',
options: {
cacheName: 'asset-cache',
},
},
],
}));
module.exports = environment
Finally, we register the service worker in our main JavaScript entry point. Note that GenerateSW writes the worker into webpack's output directory (public/packs under Webpacker), so registering /service-worker.js assumes that file is served from the application root; otherwise adjust swDest, or serve it with a Service-Worker-Allowed header so its scope covers your pages.
// app/javascript/packs/application.js
// ... other imports
if ('serviceWorker' in navigator) {
window.addEventListener('load', () => {
navigator.serviceWorker.register('/service-worker.js').then(registration => {
console.log('SW registered: ', registration);
}).catch(registrationError => {
console.log('SW registration failed: ', registrationError);
});
});
}
With this in place, the Service Worker is installed the first time a user visits the page, and it caches the API response. If the user then disconnects from the network and revisits the page, the Service Worker intercepts the axios GET request; because the network is unreachable, the NetworkFirst strategy falls back to the last successfully fetched prediction in its cache, and the user sees data instead of a browser error page.
This architecture successfully decoupled the ML lifecycle from the main application. The data science team can update models, push new versions through DVC, and deploy the Python service independently. The Rails application remains stable, only needing to know about the API contract of its BFF facade. The frontend delivers a modern, resilient experience that is critical for users in challenging network environments.
The current implementation, however, has limitations. The NetworkFirst caching strategy means that if a new model is deployed, an offline user continues to see stale predictions until they reconnect. For scenarios that need fresher data, a server push (via the Push API) or a Background Sync registration could prompt the service worker to refresh its cache once connectivity returns. Furthermore, the communication between Rails and Python is synchronous REST; for high-volume prediction workloads, switching to an asynchronous model with message queues, or to a more performant protocol such as gRPC, would be the next logical optimization. The current design prioritizes reproducibility and offline resilience over low-latency, real-time inference, a trade-off that was appropriate for this specific business context.