Integrating WebAuthn-Based Human Approvals into a Spinnaker MLOps Pipeline via a NATS Event Bus


The compliance requirement for our MLOps pipelines was non-negotiable: deploying a new financial risk model to production required explicit, cryptographically verifiable approval from both a Lead Data Scientist and a Compliance Officer. Our existing Spinnaker pipelines were fully automated, a source of pride for the engineering team, but a source of terror for our auditors. The initial impulse to handle this with a manual judgment stage in Spinnaker, followed by a user clicking “Continue” in the UI, was immediately discarded. It lacked phishing resistance, provided a weak audit trail, and made multi-party approval awkward to orchestrate. We needed an externalized, robust, and auditable approval mechanism that could be cleanly integrated without turning our pipeline definitions into a convoluted mess.

Our initial concept was to decouple the approval workflow entirely from the deployment orchestrator. The Spinnaker pipeline would pause at a designated gate, emit a request for approval into a message bus, and then wait for an external system to signal completion. This external system would handle the entire multi-party, multi-factor authentication ceremony and, upon successful validation, send a signal back to resume the pipeline. This approach prevents Spinnaker from needing to know the complex details of identity and authentication, adhering to the principle of separation of concerns.

The technology selection was driven by production-grade pragmatism.

  • Spinnaker: The incumbent CD tool. We decided to leverage its native webhook and waitForWebhook stage types, which are designed for this kind of asynchronous, out-of-process integration.
  • NATS: We needed a lightweight, high-performance messaging system to act as the control plane’s event bus. We considered Kafka, but its operational complexity (historically a ZooKeeper dependency, plus partitions and consumer groups) was overkill for our needs. We required simple, low-latency pub-sub for event notifications. NATS, with its “dial-tone” simplicity, was a good fit.
  • WebAuthn (FIDO2): For the approval mechanism itself, TOTP or push notifications were insufficient. The key requirement was phishing resistance and a non-repudiable audit log. WebAuthn, using hardware security keys, provides a public-key cryptography-based challenge-response flow that proves user presence and binds the signature to a specific origin, satisfying our security constraints.

The final architecture is a choreographed sequence of events across these systems.

sequenceDiagram
    participant Spinnaker
    participant NATS
    participant ApprovalService as Approval Service (Go)
    participant UserBrowser as Approver's Browser
    participant HardwareKey as Hardware Key

    Spinnaker->>NATS: 1. Publishes `mlops.approval.request` event (with pipelineId, callbackUrl)
    NATS-->>ApprovalService: 2. Delivers event to service
    UserBrowser->>ApprovalService: 3. User navigates to approval UI
    UserBrowser->>ApprovalService: 4. Initiates login
    ApprovalService->>UserBrowser: 5. Sends WebAuthn Challenge
    UserBrowser->>HardwareKey: 6. Browser relays challenge to key
    HardwareKey-->>UserBrowser: 7. User touches key, which signs challenge
    UserBrowser-->>ApprovalService: 8. Sends signed response
    ApprovalService->>ApprovalService: 9. Verifies signature & records approval
    Note over ApprovalService: If all required approvals are collected...
    ApprovalService->>Spinnaker: 10. POSTs to Spinnaker's callbackUrl
    Spinnaker->>Spinnaker: 11. Pipeline proceeds to next stage

The core of the implementation is a standalone Go service, the ApprovalService, which orchestrates the WebAuthn ceremony and communicates over NATS.

The Approval Service Implementation

This service is responsible for three main tasks: listening for approval requests from NATS, managing the WebAuthn user registration and authentication flow via an HTTP API, and notifying Spinnaker upon successful multi-party approval.

Let’s start with the NATS message structure and the listener. The payload from Spinnaker must contain enough context for the service to act.

// pkg/events/approval_request.go

package events

// ApprovalRequestEvent is the payload sent from a Spinnaker pipeline
// to request a manual, WebAuthn-verified approval.
type ApprovalRequestEvent struct {
	// Unique identifier for the pipeline execution, used for correlation.
	PipelineExecutionID string `json:"pipelineExecutionId"`

	// Spinnaker-generated URL to call back to when the approval is complete.
	CallbackURL string `json:"callbackUrl"`

	// Human-readable name of the model being deployed.
	ModelName string `json:"modelName"`

	// Version tag of the model.
	ModelVersion string `json:"modelVersion"`

	// A list of roles that are required to approve this deployment.
	RequiredApproverRoles []string `json:"requiredApproverRoles"`
}

// ApprovalResponseEvent is the payload sent back to the Spinnaker webhook.
type ApprovalResponseEvent struct {
	Status      string   `json:"status"` // "APPROVED" or "REJECTED"
	ApprovedBy  []string `json:"approvedBy"`
	Comments    string   `json:"comments,omitempty"`
}
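Since the event arrives from outside the service, it is worth rejecting malformed payloads before they reach the store. The following is a minimal sketch of how the NATS handler might decode and validate a message body; `decodeApprovalRequest` and its specific validation rules are assumptions for illustration, not part of the original service.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/url"
)

// ApprovalRequestEvent repeats the struct above so the example is self-contained.
type ApprovalRequestEvent struct {
	PipelineExecutionID   string   `json:"pipelineExecutionId"`
	CallbackURL           string   `json:"callbackUrl"`
	ModelName             string   `json:"modelName"`
	ModelVersion          string   `json:"modelVersion"`
	RequiredApproverRoles []string `json:"requiredApproverRoles"`
}

// decodeApprovalRequest (hypothetical helper) parses a message body and
// rejects events the service could not act on.
func decodeApprovalRequest(data []byte) (*ApprovalRequestEvent, error) {
	var ev ApprovalRequestEvent
	if err := json.Unmarshal(data, &ev); err != nil {
		return nil, fmt.Errorf("malformed approval request: %w", err)
	}
	if ev.PipelineExecutionID == "" {
		return nil, errors.New("pipelineExecutionId is required")
	}
	u, err := url.Parse(ev.CallbackURL)
	if err != nil || (u.Scheme != "http" && u.Scheme != "https") || u.Host == "" {
		return nil, errors.New("callbackUrl must be an absolute http(s) URL")
	}
	if len(ev.RequiredApproverRoles) == 0 {
		return nil, errors.New("at least one approver role is required")
	}
	return &ev, nil
}

func main() {
	// Example payload; the host name and IDs are made up.
	msg := []byte(`{
		"pipelineExecutionId": "01EXAMPLE",
		"callbackUrl": "https://spinnaker.example.com/callback/01EXAMPLE",
		"modelName": "risk-model",
		"modelVersion": "v1.2.3",
		"requiredApproverRoles": ["Lead Data Scientist", "Compliance Officer"]
	}`)
	ev, err := decodeApprovalRequest(msg)
	if err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Printf("pending approval for %s %s, roles %v\n", ev.ModelName, ev.ModelVersion, ev.RequiredApproverRoles)
}
```

In production the callback URL check would probably also be restricted to known Spinnaker hosts, so that a forged event cannot make the service POST to an arbitrary address.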

The service’s main function sets up the NATS subscription and the HTTP server. A common mistake is to tightly couple business logic with the transport layer. Here, we decouple them by having the NATS handler simply place requests into an in-memory store, which the HTTP layer can then query.

// cmd/approvalsvc/main.go

package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"

	"github.com/nats-io/nats.go"

	"approval-service/internal/app"
	"approval-service/internal/config"
	// Aliased because its package name would otherwise collide with nats.go.
	natshandler "approval-service/internal/nats"
	"approval-service/internal/web"
)

func main() {
	// In a real-world project, configuration would come from a file or env vars.
	cfg := config.Load()

	// Connect to NATS server.
	nc, err := nats.Connect(cfg.NATS.URL, nats.UserInfo(cfg.NATS.User, cfg.NATS.Password))
	if err != nil {
		log.Fatalf("Failed to connect to NATS: %v", err)
	}
	defer nc.Drain()
	log.Println("Connected to NATS server")

	// The application core holds the state of pending approvals.
	// This would be a persistent store like Redis or Postgres in production.
	approvalStore := app.NewInMemoryApprovalStore()

	// The NATS listener processes incoming requests and updates the store.
	// The queue group ensures each request is handled by only one replica.
	handler := natshandler.NewHandler(approvalStore)
	sub, err := nc.QueueSubscribe(cfg.NATS.RequestSubject, "approval-workers", handler.HandleApprovalRequest)
	if err != nil {
		log.Fatalf("Failed to subscribe to NATS subject: %v", err)
	}
	defer sub.Unsubscribe()

	// The WebAuthn API handler manages the user interaction.
	// The arguments match NewHandler(cfg, store) as defined in internal/web.
	webHandler := web.NewHandler(cfg, approvalStore)
	server := &http.Server{
		Addr:    ":" + cfg.Server.Port,
		Handler: webHandler.Routes(),
	}

	// Graceful shutdown logic.
	go func() {
		log.Printf("Starting HTTP server on port %s", cfg.Server.Port)
		if err := server.ListenAndServe(); err != http.ErrServerClosed {
			log.Fatalf("HTTP server ListenAndServe: %v", err)
		}
	}()

	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGINT, syscall.SIGTERM)
	<-stop

	log.Println("Shutting down gracefully...")
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	if err := server.Shutdown(ctx); err != nil {
		log.Printf("HTTP server shutdown error: %v", err)
	}
	log.Println("Server gracefully stopped")
}
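The listing above relies on `app.NewInMemoryApprovalStore()` without showing it. A minimal sketch of what such a store could look like follows, with the `AddApproval` and `CheckAndFinalize` signatures inferred from how the web handler calls them. The `AddRequest` method, the field names, and the rule that each role may be approved only once are assumptions made for this illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// pendingApproval tracks one pipeline execution awaiting sign-off.
type pendingApproval struct {
	required   map[string]bool   // roles that must approve
	approvedBy map[string]string // role -> approver name
	finalized  bool              // set once the callback has been triggered
}

// InMemoryApprovalStore is a mutex-guarded map; state is lost on restart.
type InMemoryApprovalStore struct {
	mu      sync.Mutex
	pending map[string]*pendingApproval
}

func NewInMemoryApprovalStore() *InMemoryApprovalStore {
	return &InMemoryApprovalStore{pending: make(map[string]*pendingApproval)}
}

// AddRequest registers a new pending approval (hypothetical method name).
func (s *InMemoryApprovalStore) AddRequest(pipelineID string, roles []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	p := &pendingApproval{required: map[string]bool{}, approvedBy: map[string]string{}}
	for _, r := range roles {
		p.required[r] = true
	}
	s.pending[pipelineID] = p
}

// AddApproval records one approver's sign-off for a required role.
func (s *InMemoryApprovalStore) AddApproval(pipelineID, approver, role string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	p, ok := s.pending[pipelineID]
	switch {
	case !ok:
		return errors.New("unknown pipeline execution")
	case p.finalized:
		return errors.New("approval already finalized")
	case !p.required[role]:
		return fmt.Errorf("role %q is not required for this approval", role)
	}
	if prev, done := p.approvedBy[role]; done {
		return fmt.Errorf("role %q already approved by %s", role, prev)
	}
	p.approvedBy[role] = approver
	return nil
}

// CheckAndFinalize reports true exactly once, when every required role has
// approved, so the Spinnaker callback is not triggered twice.
func (s *InMemoryApprovalStore) CheckAndFinalize(pipelineID string) (bool, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	p, ok := s.pending[pipelineID]
	if !ok {
		return false, errors.New("unknown pipeline execution")
	}
	if p.finalized || len(p.approvedBy) < len(p.required) {
		return false, nil
	}
	p.finalized = true
	return true, nil
}

func main() {
	store := NewInMemoryApprovalStore()
	store.AddRequest("exec-1", []string{"Lead Data Scientist", "Compliance Officer"})

	fmt.Println(store.AddApproval("exec-1", "alice", "Lead Data Scientist"))
	fmt.Println(store.CheckAndFinalize("exec-1"))
	fmt.Println(store.AddApproval("exec-1", "bob", "Compliance Officer"))
	fmt.Println(store.CheckAndFinalize("exec-1"))
}
```

Doing the completeness check and the `finalized` flip under one lock is the important design choice here: two approvers finishing at the same moment cannot both observe a complete set and both fire the callback. A persistent store would need the same guarantee, for example through a transaction.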

The web package contains the HTTP handlers for the WebAuthn ceremony. We use a library like go-webauthn/webauthn to handle the protocol’s complexity. The critical part is storing user credentials (public keys) and tracking sessions correctly.

// internal/web/handler.go

package web

import (
	"encoding/json"
	"log"
	"net/http"

	// The protocol package is only needed once the finish handlers parse
	// responses; importing it unused would not compile.
	"github.com/go-webauthn/webauthn/webauthn"
	"github.com/gorilla/mux"

	"approval-service/internal/app"
	"approval-service/internal/config"
)

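// In-memory user store for simplicity.
// In production, this must be a persistent database. A plain map is also not
// safe for concurrent handlers; guard it with a sync.RWMutex if you keep it.
var userStore = make(map[string]*app.User)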

type Handler struct {
	webAuthn *webauthn.WebAuthn
	store    app.ApprovalStore
	// sessionStore would manage WebAuthn session data
}

func NewHandler(cfg *config.Config, store app.ApprovalStore) *Handler {
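	wconfig := &webauthn.Config{
		RPDisplayName: "MLOps Approval Service",
		RPID:          cfg.WebAuthn.RPID, // e.g., "localhost" or "approvals.mycompany.com"
		// Recent go-webauthn releases take a list of allowed origins (RPOrigins);
		// older releases used a single RPOrigin string instead.
		RPOrigins: []string{cfg.WebAuthn.RPOrigin}, // e.g., "http://localhost:8080"
	}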
	webAuthn, err := webauthn.New(wconfig)
	if err != nil {
		log.Fatalf("Failed to create WebAuthn instance: %v", err)
	}
	return &Handler{webAuthn: webAuthn, store: store}
}

func (h *Handler) Routes() http.Handler {
	r := mux.NewRouter()
	// Endpoints for registering a new WebAuthn credential
	r.HandleFunc("/register/begin/{username}", h.beginRegistration).Methods("GET")
	r.HandleFunc("/register/finish/{username}", h.finishRegistration).Methods("POST")
	
	// Endpoints for the approval (login) ceremony.
	// beginApproval and getPendingApprovals are omitted from this listing.
	r.HandleFunc("/approve/begin/{username}", h.beginApproval).Methods("GET")
	r.HandleFunc("/approve/finish/{username}", h.finishApproval).Methods("POST")

	// Endpoint for the frontend to get pending approvals
	r.HandleFunc("/pending-approvals", h.getPendingApprovals).Methods("GET")
	return r
}

func (h *Handler) beginRegistration(w http.ResponseWriter, r *http.Request) {
	vars := mux.Vars(r)
	username := vars["username"]
	
	// A real implementation would fetch user from a proper user directory.
	user, ok := userStore[username]
	if !ok {
		// For demo purposes, we create a user on the fly.
		user = app.NewUser(username, "Lead Data Scientist")
		userStore[username] = user
	}

	options, sessionData, err := h.webAuthn.BeginRegistration(user)
	if err != nil {
		http.Error(w, "Failed to begin registration", http.StatusInternalServerError)
		return
	}
	
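	// Store sessionData server-side. A real app uses a session store like Redis.
	// This listing does not persist it yet, so finishRegistration cannot
	// validate the response until a session store is wired in.
	// sessionStore.Save("registration", sessionData)
	_ = sessionData // keeps the listing compiling until the session store exists

	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(options)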
}

func (h *Handler) finishRegistration(w http.ResponseWriter, r *http.Request) {
    // ... implementation to parse response, validate with session data ...
    // ... and finally save the new credential to the user object ...
    // user.AddCredential(credential)
}


func (h *Handler) finishApproval(w http.ResponseWriter, r *http.Request) {
	// ... WebAuthn verification logic ...
	// 1. Parse the request from the browser
	// 2. Retrieve the session data
	// 3. Call h.webAuthn.ValidateLogin(...)
	// 4. If successful, proceed to record the approval

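	// Placeholder for successful WebAuthn validation. Once step 3 is implemented,
	// the username must be the account whose credential was actually verified.
	username := mux.Vars(r)["username"]
	pipelineID := r.URL.Query().Get("pipelineId") // passed from UI
	if pipelineID == "" {
		http.Error(w, "Missing pipelineId", http.StatusBadRequest)
		return
	}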
	
	user, ok := userStore[username]
	if !ok {
		http.Error(w, "User not found", http.StatusNotFound)
		return
	}

	// This is the core logic: record the approval in our state store.
	err := h.store.AddApproval(pipelineID, user.Name, user.Role)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	// Check if all approvals are now met for this pipeline.
	isComplete, err := h.store.CheckAndFinalize(pipelineID)
	if err != nil {
		http.Error(w, "Failed to check approval status", http.StatusInternalServerError)
		return
	}

	if isComplete {
		// If complete, trigger the callback to Spinnaker.
		go h.notifySpinnaker(pipelineID)
	}

	w.WriteHeader(http.StatusOK)
	w.Write([]byte("Approval recorded successfully"))
}

func (h *Handler) notifySpinnaker(pipelineID string) {
	// ... fetch the pending approval from the store to get callback URL ...
	// ... construct the approval response payload ...
	// ... make an HTTP POST request to the Spinnaker callback URL ...
	// A pitfall here is network reliability. Add retries with exponential backoff.
}

Spinnaker Pipeline Configuration

The Spinnaker side involves two key stages within the deployment pipeline.

  1. Request Approval (Webhook Stage): This stage sends the initial request. It’s configured to fire a POST to a small intermediary service or directly to an endpoint on the ApprovalService that pushes the event to NATS. The payload is dynamically constructed using Spinnaker’s pipeline expressions, which are based on the Spring Expression Language (SpEL).
{
  "name": "Request Security Approval",
  "type": "webhook",
  "failPipeline": true,
  "waitForCompletion": false,
  "url": "http://approval-service.internal:8080/request-approval",
  "customHeaders": {
    "Content-Type": "application/json"
  },
  "payload": {
    "pipelineExecutionId": "${execution.id}",
    "callbackUrl": "${webhook.callbackUrl}",
    "modelName": "${trigger['properties']['modelName']}",
    "modelVersion": "${trigger['properties']['modelVersion']}",
    "requiredApproverRoles": [
      "Lead Data Scientist",
      "Compliance Officer"
    ]
  }
}

A key detail here is ${webhook.callbackUrl}. This SpEL expression provides the unique URL that Spinnaker generates for the waitForWebhook stage. We must pass this to our external service.

  2. Wait for Approval (WaitForWebhook Stage): This stage pauses the pipeline execution. It creates a webhook endpoint and only proceeds when that endpoint receives a valid POST request.
{
  "name": "Wait for Approval Callback",
  "type": "waitForWebhook",
  "webhook": {
    "source": "approval-service",
    "type": "approval-callback"
  },
  "expectedStatus": [
    "SUCCEEDED"
  ],
  "propagateStatusCode": true
}

When our Go service calls the callbackUrl, it must include a payload that Spinnaker can interpret. The waitForWebhook stage can extract values from the incoming payload into the pipeline context. We post a body like {"status": "SUCCEEDED"}, which matches the stage’s expectedStatus, so the stage succeeds and unblocks the pipeline. Note that this is a different vocabulary from the APPROVED/REJECTED values in ApprovalResponseEvent, so the service has to translate its internal outcome into the status string the stage expects.

Frontend Client for WebAuthn

The user interaction requires a small amount of JavaScript in the browser to call the WebAuthn APIs (navigator.credentials.create and .get) and communicate with our service’s backend. This code would live on the UI of our Approval Service.

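// A simplified client-side script for the approval UI.

// Decode a base64url string (as sent by the server) into a Uint8Array.
function base64urlToBytes(value) {
    const base64 = value.replace(/-/g, '+').replace(/_/g, '/');
    const padded = base64.padEnd(Math.ceil(base64.length / 4) * 4, '=');
    return Uint8Array.from(atob(padded), c => c.charCodeAt(0));
}

// Encode an ArrayBuffer as an unpadded base64url string.
function bytesToBase64url(buffer) {
    const binary = String.fromCharCode(...new Uint8Array(buffer));
    return btoa(binary).replace(/\+/g, '-').replace(/\//g, '_').replace(/=+$/, '');
}

async function approveDeployment(username, pipelineId) {
    const query = `?pipelineId=${encodeURIComponent(pipelineId)}`;
    const user = encodeURIComponent(username);

    // 1. Fetch the challenge from our backend
    const responseBegin = await fetch(`/approve/begin/${user}${query}`);
    if (!responseBegin.ok) {
        alert('Could not start the approval. Please check logs.');
        return;
    }
    const body = await responseBegin.json();
    // Some server libraries wrap the options in a "publicKey" field; accept both shapes.
    const credentialRequestOptions = body.publicKey ?? body;

    // The PublicKeyCredentialRequestOptions object needs buffer fields decoded from base64url
    credentialRequestOptions.challenge = base64urlToBytes(credentialRequestOptions.challenge);
    (credentialRequestOptions.allowCredentials ?? []).forEach(c => {
        c.id = base64urlToBytes(c.id);
    });

    let assertion;
    try {
        // 2. Trigger the browser/OS security prompt
        assertion = await navigator.credentials.get({
            publicKey: credentialRequestOptions
        });
    } catch (err) {
        console.error("WebAuthn ceremony failed:", err);
        // Display error to the user
        return;
    }

    // 3. Send the signed assertion back to the server for verification,
    // with every ArrayBuffer field encoded as base64url
    const r = assertion.response;
    const assertionForServer = {
        id: assertion.id,
        rawId: bytesToBase64url(assertion.rawId),
        type: assertion.type,
        response: {
            authenticatorData: bytesToBase64url(r.authenticatorData),
            clientDataJSON: bytesToBase64url(r.clientDataJSON),
            signature: bytesToBase64url(r.signature),
            userHandle: r.userHandle ? bytesToBase64url(r.userHandle) : null,
        },
    };

    const responseFinish = await fetch(`/approve/finish/${user}${query}`, {
        method: 'POST',
        headers: {
            'Content-Type': 'application/json',
        },
        body: JSON.stringify(assertionForServer),
    });

    if (responseFinish.ok) {
        // Other approvers may still be pending, so do not promise that the pipeline proceeds.
        alert('Approval recorded.');
        // UI can now refresh to show the updated status
    } else {
        alert('Approval failed. Please check logs.');
    }
}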

This complete loop (Spinnaker to NATS, NATS to the Go service, the Go service running the WebAuthn ceremony, and finally the callback to Spinnaker) gives us a deployment gate in which every approval is backed by a signature from a hardware key and the orchestrator never has to understand authentication. The main pitfall in this design is the state management within the ApprovalService. Our in-memory store is fine for a demonstration, but in production it is a single point of failure: a restart loses every pending approval and leaves the corresponding pipelines waiting. Replacing it with a transactional database or a distributed cache like Redis is a mandatory next step. Furthermore, the correlation mechanism relies on the pipelineExecutionId. Care must be taken to ensure this ID is properly indexed in the persistent store to avoid performance degradation as the number of pending approvals grows. Finally, while the system works, the user experience for the approvers could be improved; a future iteration could push notifications to them (e.g., via Slack or email) with a direct link to the approval page, rather than requiring them to poll it manually.

