Our Kubernetes operators were becoming liabilities. What started as clean, single-purpose controllers grew into monolithic Go binaries where complex business logic was inextricably tangled with Kubernetes API calls. The Reconcile loop for our custom database provisioner, for example, had evolved into a thousand-line labyrinth of nested if statements, state-tracking annotations, and fragile retry mechanisms. Every minor change to the provisioning workflow—adding a pre-flight check, modifying a backup policy—required a full recompile and redeploy of the operator, a process fraught with risk. The core logic was opaque to anyone but the original authors. This was unsustainable.
The initial concept was to sever this coupling. The operator itself should be a dumb, robust engine concerned only with interacting with the Kubernetes API. The actual workflow logic—the state machine defining the lifecycle of a resource—should be defined and executed separately. This separation would allow us to update the business logic without touching the operator’s core binary. We needed a formal state machine definition for clarity and a sandboxed, portable runtime for execution.
This led us to a specific architectural choice: a Go-based operator using wazero to run a WebAssembly (WASM) module. This WASM module would contain our business logic, defined declaratively as an XState state machine. The operator’s Reconcile loop would become a simple orchestrator: fetch the current state from the Custom Resource’s status, pass it to the WASM module for a decision, receive back a set of actions, execute those actions against the Kubernetes API, and finally, persist the new state back to the status.
In a real-world project, this isn’t just about using shiny new technology. It’s a calculated bet on long-term maintainability. XState provides a formal, visualizable, and version-controllable definition for our workflows. WASM provides a secure, language-agnostic sandbox that decouples the lifecycle of our business logic from the lifecycle of our infrastructure code.
The Custom Resource Definition: The Contract
Everything starts with the API. Our StatefulWorkflow CRD is the contract between the user and our system. The spec is intentionally minimal, containing only the initial input for the workflow. The real action happens in the status block, which serves as the persistent storage for our state machine’s current state.
# config/crd/bases/workflows.example.com_statefulworkflows.yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
name: statefulworkflows.workflows.example.com
spec:
group: workflows.example.com
names:
kind: StatefulWorkflow
listKind: StatefulWorkflowList
plural: statefulworkflows
singular: statefulworkflow
scope: Namespaced
versions:
- name: v1
served: true
storage: true
schema:
openAPIV3Schema:
type: object
properties:
apiVersion:
type: string
kind:
type: string
metadata:
type: object
spec:
type: object
properties:
# User-defined input for the workflow.
# Passed to the state machine as context.
input:
type: object
x-kubernetes-preserve-unknown-fields: true
# Reference to the ConfigMap containing the WASM module.
moduleConfigMap:
type: string
status:
type: object
properties:
# Stores the current state of the XState machine as a JSON string.
# The operator uses this to hydrate the machine on each reconciliation.
state:
type: string
# A human-readable condition summary.
conditions:
type: array
items:
type: object
properties:
type:
type: string
status:
type: string
lastTransitionTime:
type: string
reason:
type: string
message:
type: string
subresources:
status: {}
The key element here is status.state. It’s a simple string, but it will hold the entire serialized JSON state of our XState machine, including its current state value, context, and any child machine states. This makes each reconciliation loop stateless from the operator’s perspective; all necessary context is loaded directly from the CRD.
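For completeness, here is a minimal sketch of the Go API types the controller works with; the file path, kubebuilder markers, and field names are assumptions chosen to line up with the CRD above, not the project's actual code.

// api/v1/statefulworkflow_types.go -- a sketch; names and markers are assumptions.
package v1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
)

type StatefulWorkflowSpec struct {
	// Input is passed to the state machine as its initial context.
	// +kubebuilder:pruning:PreserveUnknownFields
	Input *runtime.RawExtension `json:"input,omitempty"`

	// ModuleConfigMap names the ConfigMap that holds the compiled machine.wasm.
	ModuleConfigMap string `json:"moduleConfigMap,omitempty"`
}

type StatefulWorkflowStatus struct {
	// State holds the serialized XState state (value and context) as a JSON string.
	State string `json:"state,omitempty"`

	Conditions []metav1.Condition `json:"conditions,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
type StatefulWorkflow struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   StatefulWorkflowSpec   `json:"spec,omitempty"`
	Status StatefulWorkflowStatus `json:"status,omitempty"`
}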
The State Machine: Defining Logic in TypeScript
We chose TypeScript for defining the XState machine due to its strong typing and ecosystem. This code does not know about Kubernetes. It defines a generic database provisioning workflow, emitting abstract “action” identifiers that the Go operator will later interpret.
Here’s a simplified version of our provisioning state machine. It handles creating a PVC, a Deployment, and then runs a health check.
// src/machine.ts
import { createMachine, assign } from 'xstate';
// Define the context (the "memory") of our state machine.
interface WorkflowContext {
pvcName?: string;
deploymentName?: string;
errorMessage?: string;
retries: number;
}
// Define the events that can be sent to the machine.
type WorkflowEvent =
| { type: 'PROVISION' }
| { type: 'K8S_RESOURCE_CREATED'; resourceName: string }
| { type: 'HEALTH_CHECK_PASSED' }
| { type: 'FAILURE'; message: string };
// These are the abstract actions the Go operator needs to execute.
// The machine tells the operator *what* to do, not *how*.
type WorkflowAction =
| { type: 'CREATE_PVC'; params: { size: string } }
| { type: 'CREATE_DEPLOYMENT'; params: { imageName: string } }
| { type: 'RUN_HEALTH_CHECK'; params: { deploymentName: string } }
| { type: 'DELETE_RESOURCES' };
// The machine definition itself.
export const workflowMachine = createMachine<WorkflowContext, WorkflowEvent>({
id: 'database-provisioner',
initial: 'idle',
context: {
retries: 0,
},
states: {
idle: {
on: { PROVISION: 'creatingPvc' },
},
creatingPvc: {
meta: {
// This 'actions' block is our contract with the Go host.
// On entering this state, the operator should perform these actions.
actions: [{ type: 'CREATE_PVC', params: { size: '1Gi' } }],
},
on: {
K8S_RESOURCE_CREATED: {
target: 'creatingDeployment',
actions: assign({ pvcName: (_, event) => event.resourceName }),
},
FAILURE: 'failed',
},
},
creatingDeployment: {
meta: {
actions: (context) => [
{
type: 'CREATE_DEPLOYMENT',
params: { imageName: 'postgres:13' },
},
],
},
on: {
K8S_RESOURCE_CREATED: {
target: 'healthChecking',
actions: assign({ deploymentName: (_, event) => event.resourceName }),
},
FAILURE: 'failed',
},
},
healthChecking: {
meta: {
actions: (context) => [
{
type: 'RUN_HEALTH_CHECK',
params: { deploymentName: context.deploymentName! },
},
],
},
on: {
HEALTH_CHECK_PASSED: 'ready',
FAILURE: {
target: 'failed',
actions: assign({
errorMessage: (_, event) => event.message,
retries: (context) => context.retries + 1
})
}
},
},
ready: {
type: 'final',
},
failed: {
meta: {
// In a real system, this might trigger cleanup actions.
actions: [{type: 'DELETE_RESOURCES'}]
},
type: 'final'
},
},
});
The crucial part is the meta.actions field. This is a custom property we use to declare the side effects that should happen when a state is entered. When the Go operator transitions the machine into creatingPvc, it will inspect newState.meta['database-provisioner.creatingPvc'].actions to discover it needs to execute a CREATE_PVC action.
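For concreteness, this is roughly the payload the host ends up with after that transition. It is simplified and illustrative; the field names come from the entrypoint shown in the next section, and the values are examples only.

{
  "newState": { "value": "creatingPvc", "context": { "retries": 0 } },
  "actions": [ { "type": "CREATE_PVC", "params": { "size": "1Gi" } } ]
}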
Compiling to WebAssembly
Getting the TypeScript code into a self-contained WASM module is a critical step. A common mistake is to just compile TypeScript to JavaScript and assume any WASM runtime can execute it. We need a full JavaScript engine compiled to WASM. We used javy, a toolchain that bundles JavaScript code with the QuickJS engine into a single, compact WASI-compatible WASM file.
First, we write a simple entrypoint that exposes a function the Go host can call. This function will take the current state and an event, run the XState logic, and return the new state and any required actions.
// src/index.ts
import { workflowMachine } from './machine';
import { State } from 'xstate';
// This is the exported function our Go operator will call.
// It must be named `handle_event` for the C ABI binding.
export function handle_event(currentStateJson: string, eventJson: string): string {
try {
const event = JSON.parse(eventJson);
// Re-hydrate the state machine from the persisted JSON state.
const previousState = State.create(JSON.parse(currentStateJson));
const service = {
send: (event) => {
// In a real implementation, you'd use an interpreter here.
// For simplicity, we manually transition.
return workflowMachine.transition(previousState, event);
}
};
const newState = service.send(event);
// Find the actions to be executed for the new state.
let actions: any[] = [];
for (const stateId of newState.configuration) {
if (stateId.meta?.actions) {
const resolvedActions = typeof stateId.meta.actions === 'function'
? stateId.meta.actions(newState.context)
: stateId.meta.actions;
actions.push(...resolvedActions);
}
}
const result = {
newState: newState.toJSON(),
actions: actions,
};
return JSON.stringify(result);
} catch (e: any) {
// Propagate errors back to the Go host.
const errorResult = {
error: e.toString()
};
return JSON.stringify(errorResult);
}
}
Then we build it. The process involves bundling all dependencies (xstate, our machine definition) into a single JS file and then passing that to javy.
# 1. Bundle TypeScript into a single JavaScript file
npx esbuild src/index.ts --bundle --outfile=dist/bundle.js --format=esm
# 2. Compile the bundled JS into a WASM module using Javy
javy compile dist/bundle.js -o dist/machine.wasm
The output, machine.wasm, is the portable artifact we can now embed in our operator.
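During development it can be convenient to bake the artifact straight into the binary before ConfigMap loading is wired up. A minimal sketch using go:embed (the file name and location are assumptions); it could populate the reconciler's wasmBytes field shown later.

// internal/controller/embed.go -- a development-only sketch, not the ConfigMap path.
package controller

import _ "embed"

// Embedding couples the module to the binary again, so this is only useful for
// local runs and tests; production loads the module from spec.moduleConfigMap.
//
//go:embed machine.wasm
var embeddedWasm []byte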
The Go Operator: The WASM Host and Kubernetes Client
This is where everything comes together. The operator’s controller is responsible for loading the WASM binary from a ConfigMap, instantiating it with wazero, and driving the state machine through its lifecycle.
The Reconciler Structure
// internal/controller/statefulworkflow_controller.go
package controller
import (
// ... imports
"context"
"encoding/json"
"fmt"
"time"
"github.com/tetratelabs/wazero"
"github.com/tetratelabs/wazero/imports/wasi_snapshot_preview1"
corev1 "k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/api/errors"
"k8s.io/apimachinery/pkg/runtime"
ctrl "sigs.k8s.io/controller-runtime"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/log"
workflowsv1 "my-operator/api/v1"
)
type StatefulWorkflowReconciler struct {
client.Client
Scheme *runtime.Scheme
wasmBytes []byte // In a real app, this would be loaded dynamically
}
// A struct to unmarshal the result from our WASM module
type WasmResult struct {
NewState json.RawMessage `json:"newState"`
Actions []WasmAction `json:"actions"`
Error string `json:"error"`
}
type WasmAction struct {
Type string `json:"type"`
Params map[string]interface{} `json:"params"`
}
func (r *StatefulWorkflowReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
logger := log.FromContext(ctx)
var workflow workflowsv1.StatefulWorkflow
if err := r.Get(ctx, req.NamespacedName, &workflow); err != nil {
if errors.IsNotFound(err) {
return ctrl.Result{}, nil
}
logger.Error(err, "unable to fetch StatefulWorkflow")
return ctrl.Result{}, err
}
// 1. Initialize State Machine on first run
currentState := workflow.Status.State
event := `{"type": "PROVISION"}` // Default initial event
if currentState == "" {
// This is the very first reconciliation for this resource.
// We get the initial state from the machine.
logger.Info("Initializing new workflow state machine")
initialState, err := r.getInitialState(ctx)
if err != nil {
logger.Error(err, "failed to get initial state from WASM")
return ctrl.Result{}, err // returning the error requeues with backoff
}
currentState = initialState
} else {
// For subsequent reconciles, we construct an event based on observed cluster state.
// This logic is omitted for brevity, but it's where the operator would
// check if a PVC was created and then create a `K8S_RESOURCE_CREATED` event.
// For this example, we assume it's always the same event for simplicity.
event = determineNextEvent(ctx, r.Client, workflow)
}
// 2. Call WASM to get the next state and actions
result, err := r.callWasm(ctx, currentState, event)
if err != nil {
logger.Error(err, "WASM execution failed")
// Update status with error and retry
return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}
if result.Error != "" {
logger.Error(fmt.Errorf("%s", result.Error), "error from inside WASM module")
// Update status with the business logic error
return ctrl.Result{}, nil // Do not requeue, it's a terminal state
}
// 3. Execute actions returned by the state machine
for _, action := range result.Actions {
logger.Info("Executing action", "type", action.Type, "params", action.Params)
if err := r.executeAction(ctx, &workflow, action); err != nil {
// If an action fails, we need to inform the state machine.
// We craft a FAILURE event and run the reconciliation again.
failureEvent := fmt.Sprintf(`{"type": "FAILURE", "message": "%s"}`, err.Error())
failureResult, wasmErr := r.callWasm(ctx, currentState, failureEvent)
if wasmErr != nil {
// This is a critical failure path
logger.Error(wasmErr, "failed to send FAILURE event to WASM")
return ctrl.Result{RequeueAfter: time.Minute}, nil
}
workflow.Status.State = string(failureResult.NewState)
// ... update status and return
}
}
// 4. Persist the new state back to the CRD status
workflow.Status.State = string(result.NewState)
if err := r.Status().Update(ctx, &workflow); err != nil {
logger.Error(err, "unable to update StatefulWorkflow status")
return ctrl.Result{Requeue: true}, err
}
return ctrl.Result{}, nil
}
The determineNextEvent function is where the magic of reconciliation happens. It observes the state of the world (e.g., checks if the PVC exists) and translates that observation into an event for the state machine.
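What that looks like depends entirely on the workflow, but a minimal sketch might resemble the following. The resource naming convention and readiness checks here are assumptions for illustration; a real implementation would also consult workflow.Status.State to decide which observation is relevant.

// internal/controller/events.go -- a sketch; naming conventions are assumptions.
package controller

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"

	workflowsv1 "my-operator/api/v1"
)

func determineNextEvent(ctx context.Context, c client.Client, workflow workflowsv1.StatefulWorkflow) string {
	// Most-advanced observation first: an existing Deployment implies the PVC step is done.
	var deploy appsv1.Deployment
	if err := c.Get(ctx, client.ObjectKey{Namespace: workflow.Namespace, Name: workflow.Name + "-db"}, &deploy); err == nil {
		if deploy.Status.AvailableReplicas > 0 {
			return `{"type": "HEALTH_CHECK_PASSED"}`
		}
		return fmt.Sprintf(`{"type": "K8S_RESOURCE_CREATED", "resourceName": %q}`, deploy.Name)
	}

	// Next, report a PVC that has appeared since the last reconcile.
	var pvc corev1.PersistentVolumeClaim
	if err := c.Get(ctx, client.ObjectKey{Namespace: workflow.Namespace, Name: workflow.Name + "-data"}, &pvc); err == nil {
		return fmt.Sprintf(`{"type": "K8S_RESOURCE_CREATED", "resourceName": %q}`, pvc.Name)
	}

	// Nothing observed yet: kick off provisioning.
	return `{"type": "PROVISION"}`
}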
Interacting with WASM via wazero
The callWasm function is the bridge between Go and the WASM module.
// internal/controller/wasm_runtime.go
// In a production setup, the wazero runtime would be initialized once and reused.
func (r *StatefulWorkflowReconciler) callWasm(ctx context.Context, currentState, event string) (*WasmResult, error) {
// For simplicity, we re-initialize the runtime on each call. Don't do this in production.
// A real implementation would cache the compiled module.
runtimeConfig := wazero.NewRuntimeConfig().WithMemoryLimitPages(128) // 8 MiB
rt := wazero.NewRuntimeWithConfig(ctx, runtimeConfig)
defer rt.Close(ctx)
wasi_snapshot_preview1.MustInstantiate(ctx, rt)
// Load the WASM module bytes.
// In a real operator, this would be loaded from the ConfigMap specified in workflow.spec.moduleConfigMap
wasmBytes, err := loadWasmModuleFromConfigMap(ctx, r.Client, "default", "workflow-module")
if err != nil {
return nil, fmt.Errorf("could not load wasm module: %w", err)
}
mod, err := rt.Instantiate(ctx, wasmBytes)
if err != nil {
return nil, fmt.Errorf("failed to instantiate wasm module: %w", err)
}
handleEvent := mod.ExportedFunction("handle_event")
// Allocate memory in the WASM module for the input strings.
currentStatePtr, currentStateLen, err := writeStringToWasmMemory(ctx, mod, currentState)
if err != nil { return nil, err }
eventPtr, eventLen, err := writeStringToWasmMemory(ctx, mod, event)
if err != nil { return nil, err }
// Call the exported function: handle_event(currentStatePtr, currentStateLen, eventPtr, eventLen)
// Javy expects pointers and lengths.
results, err := handleEvent.Call(ctx, currentStatePtr, currentStateLen, eventPtr, eventLen)
if err != nil {
return nil, fmt.Errorf("wasm function call failed: %w", err)
}
// The result is a pointer to the returned string data in WASM memory.
resultPtr := results[0]
// The length is not returned directly, so we need to find the null terminator.
// This is a simplification; robust communication requires passing length back.
resultBytes, ok := mod.Memory().Read(uint32(resultPtr), mod.Memory().Size()-uint32(resultPtr))
if !ok {
return nil, fmt.Errorf("failed to read result from wasm memory")
}
// Find the end of the string (null terminator)
end := 0
for end < len(resultBytes) && resultBytes[end] != 0 {
end++
}
var wasmResult WasmResult
if err := json.Unmarshal(resultBytes[:end], &wasmResult); err != nil {
return nil, fmt.Errorf("failed to unmarshal wasm result: %w. Raw: %s", err, string(resultBytes[:end]))
}
return &wasmResult, nil
}
// A helper function to write a Go string into the WASM module's linear memory.
func writeStringToWasmMemory(ctx context.Context, mod api.Module, value string) (uint64, uint64, error) {
ptr, err := mod.ExportedFunction("malloc").Call(ctx, uint64(len(value)))
if err != nil {
return 0, 0, err
}
if !mod.Memory().Write(uint32(ptr[0]), []byte(value)) {
return 0, 0, fmt.Errorf("Memory.Write failed")
}
return ptr[0], uint64(len(value)), nil
}
A common pitfall is managing memory between the host (Go) and the guest (WASM). Here, we must explicitly allocate memory inside the WASM module using a function like malloc (which must be exported by the WASM module) and then write our string data into that memory. The communication is low-level, passing pointers and lengths. This adds complexity, but it is how data crosses the sandbox boundary.
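The loadWasmModuleFromConfigMap helper referenced in callWasm is not shown above. A plausible sketch, assuming the module is stored under a binaryData key called machine.wasm, could look like this:

// internal/controller/module_loader.go -- a sketch; the "machine.wasm" key is an assumption.
package controller

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func loadWasmModuleFromConfigMap(ctx context.Context, c client.Client, namespace, name string) ([]byte, error) {
	var cm corev1.ConfigMap
	if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name}, &cm); err != nil {
		return nil, fmt.Errorf("fetching ConfigMap %s/%s: %w", namespace, name, err)
	}
	// The module is binary, so it must live under binaryData rather than data.
	wasm, ok := cm.BinaryData["machine.wasm"]
	if !ok {
		return nil, fmt.Errorf("ConfigMap %s/%s has no binaryData key %q", namespace, name, "machine.wasm")
	}
	return wasm, nil
}

Keep in mind that ConfigMaps are capped at roughly 1 MiB, so a Javy-produced module bundling QuickJS may not fit; an OCI artifact or a mounted volume is a common fallback.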
The Action Executor
Finally, the executeAction function translates the abstract actions from the state machine into concrete Kubernetes API calls.
// internal/controller/action_executor.go
func (r *StatefulWorkflowReconciler) executeAction(ctx context.Context, workflow *workflowsv1.StatefulWorkflow, action WasmAction) error {
switch action.Type {
case "CREATE_PVC":
// ... logic to create a PersistentVolumeClaim
// Use parameters from action.Params
logger := log.FromContext(ctx)
logger.Info("Executing CREATE_PVC action")
// ... build PVC object and r.Client.Create(ctx, pvc)
return nil
case "CREATE_DEPLOYMENT":
// ... logic to create a Deployment
logger := log.FromContext(ctx)
logger.Info("Executing CREATE_DEPLOYMENT action")
// ... build Deployment object and r.Client.Create(ctx, deployment)
return nil
case "RUN_HEALTH_CHECK":
// ... logic to check deployment status
logger := log.FromContext(ctx)
logger.Info("Executing RUN_HEALTH_CHECK action")
return nil // Return error if health check fails
default:
return fmt.Errorf("unknown action type: %s", action.Type)
}
}
This switch statement is the only part of the operator that needs to know the specific details of the workflow. The core reconciliation loop is entirely generic.
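As an illustration, the CREATE_PVC branch might delegate to a helper like the one below. The naming convention, default size, and use of an owner reference are assumptions for the sketch, not part of the original workflow.

// internal/controller/pvc_action.go -- a sketch of what the CREATE_PVC branch could call.
package controller

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	workflowsv1 "my-operator/api/v1"
)

func (r *StatefulWorkflowReconciler) createPVC(ctx context.Context, workflow *workflowsv1.StatefulWorkflow, params map[string]interface{}) error {
	// The size comes from the state machine's action params; validate it, since it
	// crossed the WASM boundary as untyped JSON.
	size, _ := params["size"].(string)
	if size == "" {
		size = "1Gi"
	}
	qty, err := resource.ParseQuantity(size)
	if err != nil {
		return fmt.Errorf("invalid PVC size %q: %w", size, err)
	}

	pvc := &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{
			Name:      workflow.Name + "-data",
			Namespace: workflow.Namespace,
		},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			// On older k8s.io/api versions this field is corev1.ResourceRequirements.
			Resources: corev1.VolumeResourceRequirements{
				Requests: corev1.ResourceList{corev1.ResourceStorage: qty},
			},
		},
	}

	// Owning the PVC ties its lifetime to the StatefulWorkflow for garbage collection.
	if err := controllerutil.SetControllerReference(workflow, pvc, r.Scheme); err != nil {
		return err
	}
	if err := r.Create(ctx, pvc); err != nil && !errors.IsAlreadyExists(err) {
		return err
	}
	return nil
}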
The Final Architecture
The resulting system is a clean separation of concerns.
graph TD
  subgraph User
    A[kubectl apply -f workflow.yaml]
  end
  subgraph Kubernetes Control Plane
    B(API Server) -- watches --> C{Operator Pod}
    B -- CRD CRUD --> D[etcd]
  end
  subgraph Operator Pod
    C -- Reconcile Loop --> E{Controller}
    E -- GET/UPDATE --> B
    E -- 1. Load State --> F[CRD Status.state]
    F -- currentStateJson --> G[wazero Runtime]
    E -- 2. Determine Event --> H[Event JSON]
    H -- eventJson --> G
    subgraph WASM Sandbox
      G -- calls handle_event() --> I[XState Machine Logic]
    end
    I -- returns (newState, actions) --> G
    G -- resultJson --> E
    E -- 3. Execute Actions --> J{Action Executor}
    J -- K8s API Calls --> B
    E -- 4. Persist New State --> F
  end
  A --> B
This architecture is not without its trade-offs. The communication overhead between Go and WASM adds latency to each reconciliation. For workflows that need to react in milliseconds, this might not be suitable. Debugging is also more complex, requiring correlation between operator logs and potential diagnostics from within the WASM sandbox. The JS-to-WASM toolchain is another dependency that must be managed.
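Some of that latency is avoidable. Compilation is the expensive part, and wazero can compile the module once and instantiate it cheaply on each reconcile. A sketch of that caching (invalidation when the module changes is left out) might look like this:

// internal/controller/wasm_cache.go -- a sketch: compile once, instantiate per call.
package controller

import (
	"context"
	"sync"

	"github.com/tetratelabs/wazero"
	"github.com/tetratelabs/wazero/imports/wasi_snapshot_preview1"
)

type moduleCache struct {
	mu       sync.Mutex
	runtime  wazero.Runtime
	compiled wazero.CompiledModule
}

// compiledModule compiles wasmBytes on first use and reuses the result afterwards.
func (c *moduleCache) compiledModule(ctx context.Context, wasmBytes []byte) (wazero.Runtime, wazero.CompiledModule, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.compiled != nil {
		return c.runtime, c.compiled, nil
	}
	rt := wazero.NewRuntimeWithConfig(ctx, wazero.NewRuntimeConfig().WithMemoryLimitPages(128))
	wasi_snapshot_preview1.MustInstantiate(ctx, rt)
	compiled, err := rt.CompileModule(ctx, wasmBytes)
	if err != nil {
		rt.Close(ctx)
		return nil, nil, err
	}
	c.runtime, c.compiled = rt, compiled
	return rt, compiled, nil
}

callWasm would then call rt.InstantiateModule(ctx, compiled, wazero.NewModuleConfig()) and close only the module instance after each call, keeping the runtime and compiled module alive between reconciles.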
However, the gain in maintainability is significant. We can now update our database provisioning workflow by simply updating a ConfigMap with a new machine.wasm and rolling the operator pods. The core Go code, the most critical and sensitive part of the system, remains untouched. The business logic is now a portable, verifiable artifact, completely decoupled from its execution environment.