Implementing a Fault-Tolerant Replicated State Machine Using Raft and Embedded SQLite


The project required a fault-tolerant data store with SQL capabilities for a set of services running in a resource-constrained edge environment. The primary constraint was operational simplicity. Deploying and managing a full-fledged distributed database cluster like CockroachDB or even a PostgreSQL cluster with streaming replication was deemed too complex and resource-intensive. Initial explorations with Dgraph were promising due to its native distributed nature, but its graph model and GraphQL API were an awkward fit for our simple relational data structures, and its operational footprint was still heavier than desired.

The core need was simple: replicate a small set of relational tables across three to five nodes and survive the failure of any single node without data loss for committed transactions. This led to the concept of treating a database as a replicated state machine. The state is the database file itself, and the state transitions are the SQL commands (INSERT, UPDATE, DELETE). The Raft consensus algorithm is the canonical solution for managing a replicated log of such commands.
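
Concretely, the replicated log is an ordered sequence of serialized write statements. Using the JSON encoding introduced later, a few committed entries might look like this (values illustrative):

{"sql": "INSERT INTO key_value (key, value) VALUES (?, ?)", "args": ["region", "eu-west-1"], "ts": 1698402600000000001}
{"sql": "UPDATE key_value SET value = ? WHERE key = ?", "args": ["eu-central-1", "region"], "ts": 1698402600000000002}

Any node that applies these entries in order, starting from the same initial database, arrives at an identical database file.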

The final architectural decision was to build a thin layer over an embedded SQLite database, using HashiCorp’s Raft implementation to replicate write operations. This approach trades the feature-richness of a mature distributed database for extreme operational simplicity and a minimal resource footprint. Observability was not an afterthought; structured logging shipped to a central Loki instance was built-in from the start to diagnose the inevitable complexities of a distributed consensus system.

The Core Architecture

The system consists of a cluster of identical nodes. Each node runs a Go application that manages three core components: an HTTP server for client interaction, a Raft consensus module, and an SQLite database.

graph TD
    subgraph Cluster
        Node1(Node 1 - Leader)
        Node2(Node 2 - Follower)
        Node3(Node 3 - Follower)
    end

    subgraph Node 1
        HTTP1[HTTP API]
        Raft1[Raft Module]
        SQLite1[SQLite DB]
    end

    subgraph Node 2
        HTTP2[HTTP API]
        Raft2[Raft Module]
        SQLite2[SQLite DB]
    end

    subgraph Node 3
        HTTP3[HTTP API]
        Raft3[Raft Module]
        SQLite3[SQLite DB]
    end
    
    Frontend[Client Frontend] -- Write (SQL) --> HTTP1
    HTTP1 -- Propose Log --> Raft1
    Raft1 -- Replicate Log --> Raft2
    Raft1 -- Replicate Log --> Raft3
    
    Raft2 -- Acknowledge --> Raft1
    Raft3 -- Acknowledge --> Raft1
    
    Raft1 -- Commit & Apply --> SQLite1
    Raft2 -- Commit & Apply --> SQLite2
    Raft3 -- Commit & Apply --> SQLite3

    Frontend -- Read (SQL) --> HTTP2
    HTTP2 -- Query Local --> SQLite2

    subgraph Observability
        Loki
        Promtail
    end

    Node1 -- Structured Logs --> Promtail
    Node2 -- Structured Logs --> Promtail
    Node3 -- Structured Logs --> Promtail
    Promtail -- Scrape & Ship --> Loki

A write request (INSERT, UPDATE, DELETE) sent to any node is automatically forwarded to the current leader. The leader proposes the SQL command as a new entry in the Raft log. Once a majority of nodes acknowledge the entry, it is considered committed. At that point, each node applies the command to its local SQLite database instance. Read requests (SELECT), by default, are served directly by the local SQLite database of whichever node receives the request, accepting a degree of potential staleness.
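
As a concrete example, a client write and the acknowledgment it receives once the entry is committed might look like this (the request and response shapes match the handlers shown later):

POST /execute
{"sql": "INSERT INTO key_value (key, value) VALUES (?, ?)", "args": ["sensor-42", "online"]}

HTTP/1.1 200 OK
{"result": 1}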

The Finite State Machine Implementation

The integration point between the Raft library and the application logic is the raft.FSM interface. This interface dictates how the application’s state machine consumes committed log entries and how it provides snapshots to bring slow or new nodes up to speed.

Our FSM’s state is the SQLite database file itself.

// fsm.go

package main

import (
	"database/sql"
	"encoding/json"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"sync"
	"time"

	"github.comcom/hashicorp/raft"
	_ "github.com/mattn/go-sqlite3"
)

type command struct {
	SQL       string `json:"sql"`
	Args      []any  `json:"args"`
	Timestamp int64  `json:"ts"`
}

// fsm is the finite state machine that applies Raft logs to an SQLite database.
type fsm struct {
	mu   sync.RWMutex // Write-locked by Apply/Snapshot/Restore; read-locked by local queries
	db   *sql.DB
	path string // Path to the SQLite file
}

func newFSM(path string) (*fsm, error) {
	db, err := sql.Open("sqlite3", fmt.Sprintf("%s?_journal_mode=WAL", path))
	if err != nil {
		return nil, fmt.Errorf("failed to open sqlite db: %w", err)
	}

	// In a real-world project, schema migrations would be managed here.
	// For simplicity, we create the table if it doesn't exist.
	_, err = db.Exec("CREATE TABLE IF NOT EXISTS key_value (key TEXT PRIMARY KEY, value TEXT)")
	if err != nil {
		db.Close()
		return nil, fmt.Errorf("failed to create initial table: %w", err)
	}

	return &fsm{
		db:   db,
		path: path,
	}, nil
}

// Apply applies a Raft log entry to the SQLite database.
// This is the core of the FSM, where state transitions happen.
func (f *fsm) Apply(log *raft.Log) interface{} {
	f.mu.Lock()
	defer f.mu.Unlock()

	var cmd command
	if err := json.Unmarshal(log.Data, &cmd); err != nil {
		logger.Error("failed to unmarshal command", "error", err, "data", string(log.Data))
		// In a production system, this kind of error is critical and may indicate
		// a bug or data corruption. Panicking might be a valid strategy to
		// force a restart and potential state recovery.
		panic(fmt.Sprintf("cannot unmarshal command: %s", err))
	}
	
	// A common mistake is to not use transactions. Each log application should
	// be atomic.
	tx, err := f.db.Begin()
	if err != nil {
		logger.Error("failed to begin transaction", "error", err)
		return err // Returning an error signals failure to the Raft layer.
	}
	defer tx.Rollback() // No-op once Commit succeeds; rolls back on any early return

	res, err := tx.Exec(cmd.SQL, cmd.Args...)
	if err != nil {
		logger.Error("failed to execute sql command", "sql", cmd.SQL, "args", cmd.Args, "error", err)
		return err
	}
	
	if err := tx.Commit(); err != nil {
		logger.Error("failed to commit transaction", "error", err)
		return err
	}
	
	rowsAffected, _ := res.RowsAffected()
	return rowsAffected // Return value can be used for client responses if needed
}

// Snapshot creates a snapshot of the current state. For SQLite, this means
// creating a copy of the database file. This is a critical but potentially
// slow operation.
func (f *fsm) Snapshot() (raft.FSMSnapshot, error) {
	f.mu.Lock()
	defer f.mu.Unlock()

	// The pitfall here is consistency. In WAL mode, recently committed data
	// may still live in the -wal file rather than the main database file, so
	// a raw copy of the main file alone can miss it. Checkpointing first
	// folds the WAL back into the database; a proper backup API would be
	// better still. Note also that Persist runs after this lock is released,
	// so the copy can still race with later writes.
	if _, err := f.db.Exec("PRAGMA wal_checkpoint(TRUNCATE)"); err != nil {
		return nil, fmt.Errorf("failed to checkpoint wal before snapshot: %w", err)
	}

	file, err := os.Open(f.path)
	if err != nil {
		return nil, fmt.Errorf("failed to open db for snapshot: %w", err)
	}

	logger.Info("creating fsm snapshot")
	return &fsmSnapshot{reader: file}, nil
}

// Restore is called to restore the FSM from a snapshot. This replaces the
// local state with the state from the snapshot.
func (f *fsm) Restore(snapshot io.ReadCloser) error {
	f.mu.Lock()
	defer f.mu.Unlock()
	defer snapshot.Close()
	
	// Close existing DB connection before replacing the file
	if err := f.db.Close(); err != nil {
		return fmt.Errorf("failed to close existing db before restore: %w", err)
	}

	// Create a temporary file to write the snapshot to
	tempPath := f.path + ".restore." + fmt.Sprintf("%d", time.Now().UnixNano())
	file, err := os.Create(tempPath)
	if err != nil {
		return fmt.Errorf("failed to create temp file for restore: %w", err)
	}
	defer os.Remove(tempPath) // No-op once the file has been renamed into place

	if _, err := io.Copy(file, snapshot); err != nil {
		file.Close()
		return fmt.Errorf("failed to copy snapshot to temp file: %w", err)
	}
	// Close before renaming; renaming an open file is not portable.
	if err := file.Close(); err != nil {
		return fmt.Errorf("failed to close temp restore file: %w", err)
	}

	// Remove stale WAL/SHM files so SQLite doesn't pair them with the new file.
	os.Remove(f.path + "-wal")
	os.Remove(f.path + "-shm")

	// Atomically move the restored file into place
	if err := os.Rename(tempPath, f.path); err != nil {
		return fmt.Errorf("failed to rename restored db file: %w", err)
	}

	// Re-open the database with the new file
	newDb, err := sql.Open("sqlite3", fmt.Sprintf("%s?_journal_mode=WAL", f.path))
	if err != nil {
		return fmt.Errorf("failed to reopen sqlite db after restore: %w", err)
	}
	f.db = newDb

	logger.Info("successfully restored fsm from snapshot")
	return nil
}

// fsmSnapshot implements the raft.FSMSnapshot interface.
type fsmSnapshot struct {
	reader *os.File
}

func (s *fsmSnapshot) Persist(sink raft.SnapshotSink) error {
	defer s.reader.Close()
	
	if _, err := io.Copy(sink, s.reader); err != nil {
		sink.Cancel()
		return fmt.Errorf("failed to write snapshot to sink: %w", err)
	}

	return sink.Close()
}

func (s *fsmSnapshot) Release() {
	// The reader is closed in Persist. Nothing more to do.
}

The Apply method is synchronous and blocking. Its performance directly impacts the replication throughput of the entire cluster. The use of a transaction (db.Begin() followed by tx.Commit()) ensures that each log entry is applied atomically. The Snapshot and Restore methods are simple file operations. A real-world project would need a more robust snapshotting mechanism, perhaps using SQLite's online backup API to avoid contention and ensure consistency without locking the entire FSM.
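
As a sketch of a gentler alternative to the raw file copy, SQLite 3.27+ supports VACUUM INTO, which writes a transactionally consistent (and compacted) copy of the database to a new file. The helper below is illustrative only (snapshotWithVacuum is not part of the code above); its output file would then be handed to an fsmSnapshot for Persist:

// snapshot_vacuum.go (sketch, assumes SQLite >= 3.27)

func (f *fsm) snapshotWithVacuum() (string, error) {
	dest := f.path + ".snap"
	// VACUUM INTO refuses to overwrite an existing file.
	os.Remove(dest)
	// Writes a consistent copy, including any content still in the WAL.
	if _, err := f.db.Exec("VACUUM INTO ?", dest); err != nil {
		return "", fmt.Errorf("vacuum into failed: %w", err)
	}
	return dest, nil
}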

Node and Cluster Setup

Each node requires configuration for its own identity (NodeID), its network address, and the location for its Raft log and SQLite data. The bootstrap process is critical: on the very first start of the cluster, one node must be designated to bootstrap itself as the leader.

// server.go

package main

import (
	"fmt"
	"log/slog"
	"net"
	"net/http"
	"os"
	"path/filepath"
	"time"

	"github.com/hashicorp/raft"
	raftboltdb "github.com/hashicorp/raft-boltdb"
)

var logger *slog.Logger

type Server struct {
	NodeID      string
	HTTPAddr    string
	RaftAddr    string
	RaftDir     string
	raft        *raft.Raft
	fsm         *fsm
	enableDebug bool
}

func NewServer(nodeID, httpAddr, raftAddr, raftDir string, enableDebug bool) *Server {
	return &Server{
		NodeID:      nodeID,
		HTTPAddr:    httpAddr,
		RaftAddr:    raftAddr,
		RaftDir:     raftDir,
		enableDebug: enableDebug,
	}
}

func (s *Server) Start(bootstrap bool) error {
	// Setup Raft configuration.
	config := raft.DefaultConfig()
	config.LocalID = raft.ServerID(s.NodeID)
	// These match the library defaults (1s) and are fine for a LAN demo.
	// In production, tune them based on observed network latency.
	config.HeartbeatTimeout = 1000 * time.Millisecond
	config.ElectionTimeout = 1000 * time.Millisecond

	// Setup Raft network transport.
	addr, err := net.ResolveTCPAddr("tcp", s.RaftAddr)
	if err != nil {
		return fmt.Errorf("failed to resolve raft addr: %w", err)
	}
	transport, err := raft.NewTCPTransport(s.RaftAddr, addr, 3, 10*time.Second, os.Stderr)
	if err != nil {
		return fmt.Errorf("failed to create raft transport: %w", err)
	}

	// Create snapshot store.
	snapshots, err := raft.NewFileSnapshotStore(s.RaftDir, 2, os.Stderr)
	if err != nil {
		return fmt.Errorf("failed to create snapshot store: %w", err)
	}

	// Create log store and stable store.
	logStore, err := raftboltdb.NewBoltStore(filepath.Join(s.RaftDir, "raft.db"))
	if err != nil {
		return fmt.Errorf("failed to create bolt store: %w", err)
	}

	// Create the FSM.
	s.fsm, err = newFSM(filepath.Join(s.RaftDir, "app.db"))
	if err != nil {
		return fmt.Errorf("failed to create fsm: %w", err)
	}

	// Instantiate the Raft system.
	ra, err := raft.NewRaft(config, s.fsm, logStore, logStore, snapshots, transport)
	if err != nil {
		return fmt.Errorf("failed to create raft instance: %w", err)
	}
	s.raft = ra

	if bootstrap {
		logger.Info("bootstrapping cluster")
		configuration := raft.Configuration{
			Servers: []raft.Server{
				{
					ID:      config.LocalID,
					Address: transport.LocalAddr(),
				},
			},
		}
		// BootstrapCluster returns a future; on restart of an
		// already-bootstrapped node it fails, which is safe to log and ignore.
		if err := s.raft.BootstrapCluster(configuration).Error(); err != nil {
			logger.Warn("bootstrap failed (possibly already bootstrapped)", "error", err)
		}
	}
	
	logger.Info("server started", "node_id", s.NodeID, "http_addr", s.HTTPAddr, "raft_addr", s.RaftAddr)
	return s.startHTTPServer()
}

func (s *Server) startHTTPServer() error {
	mux := http.NewServeMux()
	mux.HandleFunc("/execute", s.handleExecute)
	mux.HandleFunc("/query", s.handleQuery)
	mux.HandleFunc("/join", s.handleJoin)
	
	// A simple static file server for the frontend
	fs := http.FileServer(http.Dir("./frontend/"))
    mux.Handle("/", fs)

	return http.ListenAndServe(s.HTTPAddr, mux)
}

// ... handlers defined below ...

The server initialization code wires together all the components: the BoltDB-backed log store, the file-based snapshot store, the TCP transport, and our SQLite FSM. The bootstrap flag is crucial. Only the first node in a new cluster should be started with this flag. Subsequent nodes must be explicitly joined to the existing cluster.
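
For example, using the command-line flags defined later in main.go, the first node of a fresh cluster would be started with -bootstrap and the rest without it (the binary name is illustrative; each node runs in its own terminal):

./raft-sqlite -id node1 -http 127.0.0.1:8001 -raft 127.0.0.1:9001 -dir /tmp/raft/node1 -bootstrap
./raft-sqlite -id node2 -http 127.0.0.1:8002 -raft 127.0.0.1:9002 -dir /tmp/raft/node2
./raft-sqlite -id node3 -http 127.0.0.1:8003 -raft 127.0.0.1:9003 -dir /tmp/raft/node3

Nodes 2 and 3 then join via the /join endpoint described next.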

Handling Client Requests and Cluster Membership

The HTTP server exposes endpoints for writes, reads, and cluster management. A key responsibility of the write endpoint (/execute) is to handle requests that arrive at a follower node. Such requests must be transparently forwarded to the leader.

// handlers.go

func (s *Server) handleExecute(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		http.Error(w, "invalid method", http.StatusMethodNotAllowed)
		return
	}

	// If we are not the leader, forward the request.
	if s.raft.State() != raft.Leader {
		leaderAddr := s.raft.Leader()
		if leaderAddr == "" {
			http.Error(w, "no leader elected", http.StatusServiceUnavailable)
			return
		}

		// In a real project, the client-facing HTTP address differs from the
		// Raft address, so a translation step is needed: a discovery service,
		// or a fixed convention (e.g. HTTP on port N, Raft on port N+1).
		// Here we cheat and pass the Raft address through unchanged.
		leaderHTTPAddr := string(leaderAddr)
		s.forwardRequest(w, r, leaderHTTPAddr)
		return
	}

	var reqBody struct {
		SQL  string `json:"sql"`
		Args []any  `json:"args"`
	}

	if err := json.NewDecoder(r.Body).Decode(&reqBody); err != nil {
		http.Error(w, "bad request body", http.StatusBadRequest)
		return
	}

	cmd, err := json.Marshal(command{
		SQL:       reqBody.SQL,
		Args:      reqBody.Args,
		Timestamp: time.Now().UnixNano(),
	})
	if err != nil {
		http.Error(w, "failed to serialize command", http.StatusInternalServerError)
		return
	}

	applyFuture := s.raft.Apply(cmd, 500*time.Millisecond)
	if err := applyFuture.Error(); err != nil {
		logger.Error("failed to apply raft log", "error", err)
		http.Error(w, "failed to apply command", http.StatusInternalServerError)
		return
	}

	// The response from Apply() could be returned to the client.
	response := applyFuture.Response()
	w.WriteHeader(http.StatusOK)
	json.NewEncoder(w).Encode(map[string]interface{}{"result": response})
}

func (s *Server) handleQuery(w http.ResponseWriter, r *http.Request) {
	// Queries are served from the local FSM. This provides high read availability
	// at the cost of potential staleness (stale reads).
	var reqBody struct {
		SQL  string `json:"sql"`
		Args []any  `json:"args"`
	}
	if err := json.NewDecoder(r.Body).Decode(&reqBody); err != nil {
		http.Error(w, "bad request body", http.StatusBadRequest)
		return
	}

	// A read lock suffices: queries may run concurrently with one another,
	// but not with Apply, Snapshot, or Restore.
	s.fsm.mu.RLock()
	defer s.fsm.mu.RUnlock()

	rows, err := s.fsm.db.Query(reqBody.SQL, reqBody.Args...)
	if err != nil {
		http.Error(w, fmt.Sprintf("query failed: %v", err), http.StatusInternalServerError)
		return
	}
	defer rows.Close()

	columns, _ := rows.Columns()
	results := make([]map[string]interface{}, 0)
	for rows.Next() {
		row := make(map[string]interface{})
		values := make([]interface{}, len(columns))
		scanArgs := make([]interface{}, len(columns))
		for i := range values {
			scanArgs[i] = &values[i]
		}
		if err := rows.Scan(scanArgs...); err != nil {
			http.Error(w, fmt.Sprintf("scan failed: %v", err), http.StatusInternalServerError)
			return
		}
		for i, col := range columns {
			row[col] = values[i]
		}
		results = append(results, row)
	}

	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(results)
}

func (s *Server) handleJoin(w http.ResponseWriter, r *http.Request) {
	var reqBody struct {
		NodeID   string `json:"nodeId"`
		RaftAddr string `json:"raftAddr"`
	}
	if err := json.NewDecoder(r.Body).Decode(&reqBody); err != nil {
		http.Error(w, "bad request body", http.StatusBadRequest)
		return
	}

	if s.raft.State() != raft.Leader {
		http.Error(w, "not the leader", http.StatusServiceUnavailable)
		return
	}
	
	configFuture := s.raft.GetConfiguration()
	if err := configFuture.Error(); err != nil {
		logger.Error("failed to get raft configuration", "error", err)
		http.Error(w, "failed to get configuration", http.StatusInternalServerError)
		return
	}

	for _, srv := range configFuture.Configuration().Servers {
		if srv.ID == raft.ServerID(reqBody.NodeID) {
			logger.Info("node already member of cluster", "node_id", reqBody.NodeID)
			w.WriteHeader(http.StatusOK)
			return
		}
	}

	addVoterFuture := s.raft.AddVoter(raft.ServerID(reqBody.NodeID), raft.ServerAddress(reqBody.RaftAddr), 0, 0)
	if err := addVoterFuture.Error(); err != nil {
		logger.Error("failed to add voter", "error", err, "node_id", reqBody.NodeID)
		http.Error(w, fmt.Sprintf("failed to add voter: %v", err), http.StatusInternalServerError)
		return
	}

	logger.Info("node joined cluster successfully", "node_id", reqBody.NodeID)
	w.WriteHeader(http.StatusOK)
}

// forwardRequest is a placeholder for proper reverse-proxy logic.
func (s *Server) forwardRequest(w http.ResponseWriter, r *http.Request, leaderAddr string) {
	// A production implementation would use httputil.ReverseProxy and would
	// resolve the leader's HTTP address from its Raft address (via port
	// mapping or service discovery). This simplified version just redirects.
	http.Redirect(w, r, "http://"+leaderAddr+r.RequestURI, http.StatusTemporaryRedirect)
}

The /join handler allows a new node to be added to the cluster. A request containing the new node’s ID and Raft address must be sent to the current leader. The leader then uses AddVoter to add the node to the Raft configuration. The new node will initially be a follower and will contact the leader to receive a snapshot and catch up on the log.
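
On the joining node's side, this is a plain HTTP POST. A minimal sketch (requestJoin and its joinAddr parameter are illustrative additions, not part of the code above; it needs "bytes" in addition to the imports already shown):

func requestJoin(joinAddr, nodeID, raftAddr string) error {
	body, err := json.Marshal(map[string]string{
		"nodeId":   nodeID,
		"raftAddr": raftAddr,
	})
	if err != nil {
		return err
	}
	resp, err := http.Post("http://"+joinAddr+"/join", "application/json", bytes.NewReader(body))
	if err != nil {
		return fmt.Errorf("join request failed: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("join rejected: %s", resp.Status)
	}
	return nil
}

Because the /join handler rejects requests on non-leaders with a 503, a robust client would retry this call against each known member until one succeeds.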

Integrating Observability with Loki

Debugging distributed consensus is notoriously difficult. Without proper logging, it’s nearly impossible to understand state transitions, leader elections, or replication failures. We use Go’s structured logger (slog) to output JSON-formatted logs.

// main.go

package main

import (
	"flag"
	"log/slog"
	"os"
)

func main() {
	nodeID := flag.String("id", "node1", "Node ID")
	httpAddr := flag.String("http", "127.0.0.1:8001", "HTTP bind address")
	raftAddr := flag.String("raft", "127.0.0.1:9001", "Raft bind address")
	raftDir := flag.String("dir", "/tmp/raft/node1", "Raft data directory")
	bootstrap := flag.Bool("bootstrap", false, "Bootstrap the cluster")
	enableDebug := flag.Bool("debug", false, "Enable debug logging")
	flag.Parse()

	logLevel := slog.LevelInfo
	if *enableDebug {
		logLevel = slog.LevelDebug
	}
	
	// Structured logging is non-negotiable for distributed systems.
	logger = slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: logLevel})).
		With("node_id", *nodeID)

	if err := os.MkdirAll(*raftDir, 0700); err != nil {
		logger.Error("failed to create raft directory", "error", err, "path", *raftDir)
		os.Exit(1)
	}

	server := NewServer(*nodeID, *httpAddr, *raftAddr, *raftDir, *enableDebug)
	if err := server.Start(*bootstrap); err != nil {
		logger.Error("server failed to start", "error", err)
		os.Exit(1)
	}
}

This setup produces logs like:
{"time":"2023-10-27T10:30:00.123Z","level":"INFO","msg":"server started","node_id":"node1","http_addr":"127.0.0.1:8001","raft_addr":"127.0.0.1:9001"}

These logs can be collected by an agent like Promtail and shipped to Loki. Since the application writes to stdout, the output must be redirected to a file (or scraped from the container runtime's log directory) for Promtail to pick it up. A simple Promtail configuration would be:

# promtail-config.yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: raft-sqlite
    static_configs:
      - targets:
          - localhost
        labels:
          job: raft-sqlite
          __path__: /path/to/your/app/logs/*.log

With logs in Loki, we can perform powerful queries using LogQL to diagnose issues. For example, to track leadership changes (this assumes the hashicorp/raft library's internal logs are also emitted as JSON, which can be arranged by assigning a JSON-formatted hclog logger to raft.Config.Logger):

{job="raft-sqlite"} | json | msg="entering leader state"

This allows us to correlate events across different nodes and build dashboards in Grafana to visualize cluster health, such as leader stability and log application rates.
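
For example, a metric-style query that counts error-level lines per node makes a struggling member stand out immediately:

sum by (node_id) (count_over_time({job="raft-sqlite"} | json | level="ERROR" [5m]))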

The Frontend Client

A minimal frontend demonstrates the system’s fault tolerance. It provides a simple UI to send INSERT and SELECT queries to the cluster.

<!-- frontend/index.html -->
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Distributed SQLite Client</title>
    <style> body { font-family: sans-serif; } </style>
</head>
<body>
    <h1>Distributed SQLite over Raft</h1>
    <div>
        <label for="node">Target Node API:</label>
        <input type="text" id="node" value="http://127.0.0.1:8001">
    </div>
    <hr>
    <div>
        <h3>Write Data (INSERT/UPDATE)</h3>
        <input type="text" id="sql-write" size="80" value="INSERT INTO key_value (key, value) VALUES (?, ?) ON CONFLICT(key) DO UPDATE SET value=excluded.value">
        <br>
        <input type="text" id="arg1" placeholder="key">
        <input type="text" id="arg2" placeholder="value">
        <button id="submit-write">Execute Write</button>
    </div>
    <hr>
    <div>
        <h3>Read Data (SELECT)</h3>
        <input type="text" id="sql-read" size="80" value="SELECT * FROM key_value ORDER BY key">
        <button id="submit-read">Execute Read</button>
        <h3>Results:</h3>
        <pre id="results"></pre>
    </div>
    <script>
        document.getElementById('submit-write').addEventListener('click', async () => {
            const api = document.getElementById('node').value;
            const sql = document.getElementById('sql-write').value;
            const args = [
                document.getElementById('arg1').value,
                document.getElementById('arg2').value
            ];

            try {
                const response = await fetch(`${api}/execute`, {
                    method: 'POST',
                    headers: {'Content-Type': 'application/json'},
                    body: JSON.stringify({sql, args})
                });
                if (!response.ok) {
                    throw new Error(`HTTP error! status: ${response.status}`);
                }
                alert('Write successful!');
                document.getElementById('submit-read').click(); // Refresh data
            } catch (e) {
                alert('Write failed: ' + e.message);
            }
        });

        document.getElementById('submit-read').addEventListener('click', async () => {
            const api = document.getElementById('node').value;
            const sql = document.getElementById('sql-read').value;
            const resultsEl = document.getElementById('results');
            resultsEl.textContent = 'Loading...';

            try {
                const response = await fetch(`${api}/query`, {
                    method: 'POST',
                    headers: {'Content-Type': 'application/json'},
                    body: JSON.stringify({sql, args: []})
                });
                if (!response.ok) {
                    throw new Error(`HTTP error! status: ${response.status}`);
                }
                const data = await response.json();
                resultsEl.textContent = JSON.stringify(data, null, 2);
            } catch (e) {
                resultsEl.textContent = 'Read failed: ' + e.message;
            }
        });
    </script>
</body>
</html>

By running three nodes and killing the process of the leader, a user can observe that after a brief pause (the election timeout), write requests to the remaining nodes succeed as a new leader is elected. The application remains available.

Limitations and Future Paths

This implementation is a proof-of-concept, not a production-ready distributed database. Its limitations define the boundary of its applicability. The primary performance bottleneck is the single-threaded application of the Raft log to the SQLite database. Throughput is limited by the fsync performance of the underlying disk for a single writer. Batching commands within a single Raft log entry could provide some amortization.
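
A sketch of that batching idea, reusing the command type from fsm.go (batchCommand and applyBatch are illustrative additions, not part of the implementation above):

type batchCommand struct {
	Commands []command `json:"commands"`
}

func (f *fsm) applyBatch(b batchCommand) error {
	f.mu.Lock()
	defer f.mu.Unlock()

	tx, err := f.db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback() // No-op once Commit succeeds

	// All statements in the batch share one transaction, and therefore one fsync.
	for _, c := range b.Commands {
		if _, err := tx.Exec(c.SQL, c.Args...); err != nil {
			return fmt.Errorf("batch statement failed: %w", err)
		}
	}
	return tx.Commit()
}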

Snapshotting by copying the entire database file is inefficient and can cause significant I/O spikes. A more advanced implementation should leverage SQLite’s online backup API to stream the snapshot without blocking writes.

The read model offers only eventual consistency from follower nodes. Achieving linearizable reads would require a more complex mechanism, such as routing all reads through the leader or implementing a read protocol on top of Raft that confirms a follower’s log is up-to-date before serving a read.
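
hashicorp/raft's Barrier call makes the leader-side variant easy to sketch (handleConsistentQuery is an illustrative addition; note that leadership can in principle be lost between the barrier and the read, so stricter guarantees need further checks):

func (s *Server) handleConsistentQuery(w http.ResponseWriter, r *http.Request) {
	if s.raft.State() != raft.Leader {
		http.Error(w, "not the leader", http.StatusServiceUnavailable)
		return
	}
	// Barrier blocks until all entries committed before this call have been
	// applied to the FSM, so the local SQLite state reflects every
	// acknowledged write.
	if err := s.raft.Barrier(5 * time.Second).Error(); err != nil {
		http.Error(w, "barrier failed", http.StatusServiceUnavailable)
		return
	}
	s.handleQuery(w, r) // Serve from the now up-to-date local state
}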

Finally, cluster management, particularly schema changes (DDL), is not addressed. Applying schema migrations across a distributed state machine is a complex problem that requires careful coordination to ensure all nodes transition to the new schema atomically.

