Orchestrating Dynamic Lua Runtimes on a GCP-Based Docker Swarm Cluster


The core technical challenge was a requirement for a lightweight, multi-tenant logic execution platform where business rules could be updated and deployed in seconds, not minutes. Full container rebuilds and redeployments for every minor logic tweak were operationally unacceptable, creating significant friction for the data science and rules engine teams. The existing CI/CD process, optimized for compiled Go services, took 5-7 minutes per deployment. We needed a system that could achieve near-instantaneous logic hot-swapping across a distributed cluster without compromising the stability of the core runtime environment.

Our initial concept was to decouple the stable, underlying execution engine from the volatile, user-defined business logic. This led to an architecture centered around a “logic-agnostic” containerized service. This service would act as a host, capable of dynamically loading, sandboxing, and executing scripts provided by users. The choice of orchestration, scripting language, and management tooling became the critical decision points. We evaluated Kubernetes but deemed it overly complex and resource-intensive for this particular use case, which demanded a minimal footprint on each node. The goal was simplicity and speed, both in terms of runtime performance and operational overhead.

The final technology stack was a deliberate choice driven by pragmatism. Docker Swarm was selected for orchestration due to its simplicity, low resource overhead, and tight integration with the Docker CLI we already used. For the underlying infrastructure, Google Cloud Platform (GCP) provided reliable Compute Engine instances and Google Cloud Storage (GCS) for artifact management. The most critical choice was the scripting language. We needed something extremely lightweight, fast to initialize, and easy to sandbox. Lua, particularly with LuaJIT, was the obvious candidate. Its minimal VM, C-like performance, and excellent embedding capabilities via C-bindings made it superior to heavier runtimes like V8 (JavaScript) or a full Python interpreter for this specific need. Finally, for the management interface, a command-line tool was required. We chose Dart to build a native, self-contained CLI. This avoided forcing our users to install a language runtime (like Node.js or Python) on their local machines and provided a path to a future Flutter-based GUI dashboard.

The architecture comprises three main components:

  1. The Executor Service: A Go application packaged in a Docker container. It embeds a Lua VM, exposes a limited set of host functions to the Lua environment, and executes scripts. This service is deployed as a scaled service on Docker Swarm.
  2. The Plugin Manager: A central Go service that monitors a GCS bucket for new or updated Lua scripts. Upon detecting a change, it uses the Docker Swarm API to trigger a rolling update on the Executor Service, replacing the old logic with the new.
  3. The Dart CLI: A command-line tool for developers to upload Lua scripts to GCS, list active scripts, and check the status of the executor services.
The flow between these components is shown in the following Mermaid diagram:

graph TD
    subgraph DEV[Developer Workstation]
        A[Dart CLI]
    end

    subgraph GCP[GCP]
        B[GCS Bucket: /lua-plugins]
        C[Compute Engine Instances]
        D[Docker Swarm Manager]
        E[Docker Swarm Workers]
    end

    subgraph SWARM[Docker Swarm Cluster on GCP]
        F[Plugin Manager Service]
        G[Executor Service Task 1]
        H[Executor Service Task N]
    end

    subgraph TASK[Executor Task]
        I[Go Runtime] --> J[Embedded Lua VM]
        J --> K[User Logic: script.lua]
    end

    A -- "plugin:upload my_logic.lua" --> B
    B -- GCS Pub/Sub Notification --> F
    F -- "docker service update" --> D
    D -- Propagates Config Update --> E
    E -- Pulls New Config --> G
    E -- Pulls New Config --> H

    style GCP fill:#f9f,stroke:#333,stroke-width:2px
    style DEV fill:#ccf,stroke:#333,stroke-width:2px

The Go-based Lua Executor Service

The heart of the system is the executor. A real-world project requires this component to be robust, secure, and observable. We used the gopher-lua library to embed the Lua VM: it is a pure-Go implementation of Lua 5.1, which trades LuaJIT-level performance for a cgo-free build and a much simpler embedding story.

The pitfall here is simply running any script thrown at it. A secure implementation requires a carefully constructed sandbox. This involves creating a new Lua state for each execution, disabling dangerous standard library functions (os.execute, io.open, dofile, etc.), and injecting only the necessary host functions.

Here is the core of the executor’s implementation. Note the context handling for timeouts and cancellations, structured logging, and the explicit sandboxing.

executor/main.go:

package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"

	"github.com/sirupsen/logrus"
	lua "github.com/yuin/gopher-lua"
)

const (
	// In a real system, this would be read from a Docker Config file.
	luaScriptPath = "/config/logic.lua"
	executionTimeout = 2 * time.Second
)

func main() {
	logrus.SetFormatter(&logrus.JSONFormatter{})
	logrus.Info("Starting Lua executor service...")

	scriptBytes, err := os.ReadFile(luaScriptPath)
	if err != nil {
		logrus.WithError(err).Fatalf("Failed to read Lua script from %s", luaScriptPath)
	}
	scriptContent := string(scriptBytes)
	logrus.Infof("Successfully loaded script from %s", luaScriptPath)

	// In a real application, this would be an HTTP server or a message queue consumer.
	// For this example, we'll just execute the script in a loop to simulate work.
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()

	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGINT, syscall.SIGTERM)

	for {
		select {
		case <-ticker.C:
			// Each execution gets a fresh context and Lua state.
			// cancel is called directly rather than deferred: a defer
			// inside this loop would only run when main returns,
			// leaking a timer on every tick.
			ctx, cancel := context.WithTimeout(context.Background(), executionTimeout)
			executeLuaScript(ctx, scriptContent)
			cancel()

		case <-stop:
			logrus.Info("Shutting down executor service.")
			return
		}
	}
}

// executeLuaScript creates a sandboxed Lua VM, executes the script, and logs the result.
func executeLuaScript(ctx context.Context, script string) {
	L := lua.NewState()
	defer L.Close()

	// Apply sandbox
	setupSandbox(L)

	// Inject custom host functions
	registerHostFunctions(L)

	// Set up a context for the Lua execution itself
	// This allows us to kill a script that runs for too long.
	L.SetContext(ctx)

	// Simulate some input data for the script
	inputTable := L.NewTable()
	inputTable.RawSetString("user_id", lua.LString("user-12345"))
	inputTable.RawSetString("request_id", lua.LString(fmt.Sprintf("%d", time.Now().UnixNano())))
	L.SetGlobal("input", inputTable)

	logFields := logrus.Fields{
		"lua_execution_id": time.Now().UnixNano(),
	}

	if err := L.DoString(script); err != nil {
		logrus.WithFields(logFields).WithError(err).Error("Lua script execution failed")
		return
	}

	// Extracting results from the Lua state
	result := L.GetGlobal("result")
	if tbl, ok := result.(*lua.LTable); ok {
		outputValue := tbl.RawGetString("output")
		logFields["lua_result"] = outputValue.String()
		logrus.WithFields(logFields).Info("Lua script executed successfully")
	} else {
		logrus.WithFields(logFields).Warn("Lua script did not return a 'result' table")
	}
}

// setupSandbox disables potentially harmful Lua standard library functions.
// A common mistake is to not be thorough enough here.
func setupSandbox(L *lua.LState) {
	// A whitelist is safer than a blacklist.
	// Capture the original `os` module before replacing the global,
	// otherwise the lookups below would hit the new, empty table.
	origOS, _ := L.GetGlobal("os").(*lua.LTable)
	osTable := L.NewTable()
	if origOS != nil {
		// os.date and os.time are generally safe to re-enable.
		L.SetField(osTable, "date", origOS.RawGetString("date"))
		L.SetField(osTable, "time", origOS.RawGetString("time"))
	}
	L.SetGlobal("os", osTable)

	// Replace `io` with a table that exposes only a logging shim.
	ioTable := L.NewTable()
	L.SetGlobal("io", ioTable)
	// Allow io.write for stdout logging, but redirect it through our Go logger.
	L.SetField(ioTable, "write", L.NewFunction(luaHostLog))

	// Completely remove dangerous functions.
	L.SetGlobal("dofile", lua.LNil)
	L.SetGlobal("loadfile", lua.LNil)
	L.SetGlobal("loadstring", lua.LNil)
	L.SetGlobal("require", lua.LNil) // Disabling require forces self-contained scripts
}


// registerHostFunctions exposes Go functions to the Lua environment.
func registerHostFunctions(L *lua.LState) {
	// Expose a structured logging function.
	L.SetGlobal("host_log", L.NewFunction(luaHostLog))
}

// luaHostLog is a Go function callable from Lua.
// It allows Lua scripts to log through the main application's logger.
func luaHostLog(L *lua.LState) int {
	// Called as host_log(level, msg), or via the io.write shim with a
	// single message argument; in that case the level defaults to "info".
	level := "info"
	msg := L.ToString(1)
	if L.GetTop() >= 2 {
		level = L.ToString(1)
		msg = L.ToString(2)
	}

	entry := logrus.WithField("source", "lua_script")

	switch level {
	case "info":
		entry.Info(msg)
	case "warn":
		entry.Warn(msg)
	case "error":
		entry.Error(msg)
	default:
		entry.Info(msg)
	}

	return 0 // Number of return values
}
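
For reference, here is the kind of script a user would deploy. The example below is hypothetical, but it follows the contract the executor establishes above: an injected global `input` table, the `host_log(level, msg)` host function, and a global `result` table that the Go side reads back.

```lua
-- my_logic.lua (hypothetical example)
-- Read the injected `input` table, log through the host, and publish
-- the decision in the global `result` table the executor inspects.
host_log("info", "processing request " .. input.request_id)

local decision = "deny"
if string.find(input.user_id, "^user%-") then
  decision = "allow"
end

result = { output = decision }
```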

The corresponding Dockerfile is minimal, leveraging a multi-stage build to keep the final image small.

executor/Dockerfile:

# ---- Build Stage ----
FROM golang:1.19-alpine AS builder

WORKDIR /app

COPY go.mod ./
COPY go.sum ./
RUN go mod download

COPY . .

# Build the binary with optimizations
RUN CGO_ENABLED=0 GOOS=linux go build -ldflags='-w -s' -o /app/executor ./

# ---- Final Stage ----
FROM alpine:latest

RUN apk --no-cache add ca-certificates

WORKDIR /root/

# Copy only the compiled binary from the builder stage
COPY --from=builder /app/executor .

# This directory will be used to mount the Docker Config
RUN mkdir -p /config

# This binary is the only thing that runs
CMD ["./executor"]

Docker Swarm Service and Configuration Management

Docker Swarm orchestrates the executor. The key mechanism for updating the Lua logic is Docker Configs. We create a config object from the Lua script file and mount it into the service’s containers. When we need to update the logic, we create a new config and perform a docker service update, telling the service to use the new config. Swarm handles the rolling update gracefully.
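
For intuition, the equivalent manual flow with the Docker CLI looks like the sketch below; the config names and the `mystack_executor` service name are illustrative, and the Plugin Manager performs the same steps programmatically through the API.

```bash
# Create a new config object from the updated script.
docker config create logic-config-v2 ./scripts/new_logic.lua

# Swap configs on the running service; Swarm rolls tasks per update_config.
docker service update \
  --config-rm logic-config-v1 \
  --config-add source=logic-config-v2,target=/config/logic.lua \
  mystack_executor
```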

docker-compose.yml for deploying the stack on a Swarm cluster:

version: '3.8'

services:
  executor:
    image: my-registry/lua-executor:1.0.0
    deploy:
      replicas: 5
      update_config:
        parallelism: 1
        delay: 10s
        order: start-first
      restart_policy:
        condition: on-failure
    configs:
      - source: initial_logic
        target: /config/logic.lua

  plugin_manager:
    image: my-registry/plugin-manager:1.0.0
    volumes:
      # Mount the Docker socket to allow the manager to interact with the Swarm API
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - GCS_BUCKET_NAME=my-company-lua-plugins
      - GCP_PROJECT_ID=my-gcp-project
      # Assumes GCP authentication is handled via a service account on the node
    deploy:
      placement:
        constraints: [node.role == manager] # Run only on manager nodes

configs:
  initial_logic:
    file: ./scripts/default.lua
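
Deploying the stack is a single command; note that the stack name chosen here becomes the prefix of the service name the Plugin Manager later inspects (e.g. `mystack_executor`):

```bash
docker stack deploy -c docker-compose.yml mystack
```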

The Plugin Manager is the automation piece. It listens for GCS events (e.g., via Pub/Sub notifications configured on the bucket) and uses the Docker Go SDK to orchestrate the update.
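
Wiring up those bucket notifications is a one-time setup step. A minimal sketch with `gsutil`, assuming a pre-created Pub/Sub topic named `lua-plugin-updates`:

```bash
gsutil notification create -t lua-plugin-updates -f json \
    -e OBJECT_FINALIZE gs://my-company-lua-plugins
```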

plugin-manager/main.go (simplified):

package main

import (
	"context"
	"fmt"
	"io"
	"log"
	"os"
	"time"

	"cloud.google.com/go/storage"
	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/swarm"
	"github.com/docker/docker/client"
	"github.com/google/uuid"
)

// This would be a message handler for a Pub/Sub subscription in a real system.
// For simplicity, we poll the bucket.
func main() {
	bucketName := os.Getenv("GCS_BUCKET_NAME")
	if bucketName == "" {
		log.Fatal("GCS_BUCKET_NAME environment variable not set")
	}

	ctx := context.Background()
	dockerCli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		log.Fatalf("Failed to create Docker client: %v", err)
	}

	storageClient, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatalf("Failed to create GCS client: %v", err)
	}
	
	// In a production system, use GCS event notifications instead of polling.
	log.Println("Plugin manager started. Polling GCS for changes...")
	checkForUpdates(ctx, dockerCli, storageClient, bucketName) // Initial check
	for range time.Tick(30 * time.Second) {
		checkForUpdates(ctx, dockerCli, storageClient, bucketName)
	}
}

// A simple state to track the latest version we've deployed.
var latestDeployedVersion string

func checkForUpdates(ctx context.Context, dockerCli *client.Client, storageClient *storage.Client, bucketName string) {
	objectName := "latest.lua" // Convention: the latest script is always named this.
	obj := storageClient.Bucket(bucketName).Object(objectName)
	attrs, err := obj.Attrs(ctx)
	if err != nil {
		log.Printf("Could not get object attributes: %v", err)
		return
	}

	// Use GCS object generation number as a version identifier.
	currentVersion := fmt.Sprintf("%d", attrs.Generation)
	if currentVersion == latestDeployedVersion {
		// No new version. Do nothing.
		return
	}

	log.Printf("New script version detected: %s. Applying update.", currentVersion)

	reader, err := obj.NewReader(ctx)
	if err != nil {
		log.Printf("Failed to read new script from GCS: %v", err)
		return
	}
	defer reader.Close()

	scriptData, err := io.ReadAll(reader)
	if err != nil {
		log.Printf("Failed to read script data: %v", err)
		return
	}
	
	err = updateExecutorService(ctx, dockerCli, scriptData)
	if err != nil {
		log.Printf("Failed to update executor service: %v", err)
	} else {
		log.Printf("Successfully triggered service update with version %s", currentVersion)
		latestDeployedVersion = currentVersion
	}
}


func updateExecutorService(ctx context.Context, cli *client.Client, newScript []byte) error {
	serviceName := "mystack_executor" // Name depends on your stack name.

	// Step 1: Create a new Docker Config with the new script content.
	configName := fmt.Sprintf("logic-config-%s", uuid.New().String())
	configSpec := swarm.ConfigSpec{
		Annotations: swarm.Annotations{Name: configName},
		Data:        newScript,
	}
	configCreateResponse, err := cli.ConfigCreate(ctx, configSpec)
	if err != nil {
		return fmt.Errorf("failed to create new config: %w", err)
	}
	log.Printf("Created new config: %s", configName)


	// Step 2: Inspect the existing service to get its current spec.
	service, _, err := cli.ServiceInspectWithRaw(ctx, serviceName, types.ServiceInspectOptions{})
	if err != nil {
		return fmt.Errorf("failed to inspect service: %w", err)
	}

	// Step 3: Modify the service spec to use the new config.
	// The critical part is to remove the old config and add the new one.
	newConfigs := []*swarm.ConfigReference{}
	for _, conf := range service.Spec.TaskTemplate.ContainerSpec.Configs {
		// Keep all other configs, but discard the one we're replacing.
		if conf.File.Name != "/config/logic.lua" {
			newConfigs = append(newConfigs, conf)
		}
	}
	newConfigs = append(newConfigs, &swarm.ConfigReference{
		File: &swarm.FileTarget{
			Name: "/config/logic.lua",
			UID:  "0",
			GID:  "0",
			Mode: 0444,
		},
		ConfigID:   configCreateResponse.ID,
		ConfigName: configName,
	})
	
	service.Spec.TaskTemplate.ContainerSpec.Configs = newConfigs
	
	// The version from the inspect call must be passed back unchanged;
	// Swarm uses it for optimistic concurrency and rejects stale updates.
	updateOpts := types.ServiceUpdateOptions{}

	// Step 4: Apply the update. Swarm will handle the rolling deployment.
	_, err = cli.ServiceUpdate(ctx, service.ID, service.Version, service.Spec, updateOpts)
	if err != nil {
		return fmt.Errorf("failed to update service: %w", err)
	}
	
	// Ideally, we'd also have a cleanup job to remove old, unused configs.
	// This is omitted for brevity.

	return nil
}
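
That cleanup could be as small as the following sketch. The helper is hypothetical (add "strings" to the import block); it leans on the fact that Swarm refuses to remove a config still attached to a running task.

```go
// cleanupOldConfigs removes stale logic configs, keeping the one in use.
func cleanupOldConfigs(ctx context.Context, cli *client.Client, inUseConfigID string) {
	configs, err := cli.ConfigList(ctx, types.ConfigListOptions{})
	if err != nil {
		log.Printf("Failed to list configs: %v", err)
		return
	}
	for _, c := range configs {
		if strings.HasPrefix(c.Spec.Name, "logic-config-") && c.ID != inUseConfigID {
			// Removal fails while any task still uses the config, so this
			// is safe to call right after triggering a rolling update.
			if err := cli.ConfigRemove(ctx, c.ID); err != nil {
				log.Printf("Could not remove config %s: %v", c.Spec.Name, err)
			}
		}
	}
}
```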

The Dart Management CLI

The CLI provides the human interface. It’s a simple Dart console application that interacts with the GCS bucket. Its primary job is to make uploading scripts easy and atomic. The standard practice is to upload the new script to a temporary name, and only upon successful upload, perform a GCS object copy operation to rename it to latest.lua. This prevents the Plugin Manager from picking up a partially uploaded file.

cli/bin/logic_cli.dart:

```dart
import 'dart:io';
import 'package:args/args.dart';
import 'package:googleapis/storage/v1.dart' as storage;
import 'package:googleapis_auth/auth_io.dart' as auth;

// In a real app, this would be configured externally.
const bucketName = 'my-company-lua-plugins';
const finalObjectName = 'latest.lua';

Future<void> main(List<String> arguments) async {
  final parser = ArgParser()
    ..addCommand('upload', ArgParser()..addOption('file', abbr: 'f', mandatory: true));

  try {
    final results = parser.parse(arguments);
    if (results.command?.name == 'upload') {
      final filePath = results.command!['file'] as String;
      // Sketch: assumes application-default credentials are available.
      final client = await auth.clientViaApplicationDefaultCredentials(
          scopes: [storage.StorageApi.devstorageReadWriteScope]);
      final api = storage.StorageApi(client);
      final file = File(filePath);
      // Upload under a temporary name, then copy to latest.lua; object
      // replacement in GCS is atomic, so readers never see a partial file.
      final tempName = 'upload-tmp-${DateTime.now().millisecondsSinceEpoch}.lua';
      await api.objects.insert(storage.Object()..name = tempName, bucketName,
          uploadMedia: storage.Media(file.openRead(), file.lengthSync()));
      await api.objects.copy(
          storage.Object(), bucketName, tempName, bucketName, finalObjectName);
      await api.objects.delete(bucketName, tempName);
      client.close();
      stdout.writeln('Uploaded $filePath as $finalObjectName');
    }
  } on FormatException catch (e) {
    stderr.writeln(e.message);
    exitCode = 64;
  }
}
```
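
To produce the self-contained native binary promised earlier, the CLI can be AOT-compiled; the output name and script path below are illustrative:

```bash
dart compile exe bin/logic_cli.dart -o logic_cli
./logic_cli upload -f my_logic.lua
```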

