Implementing a Custom Kubernetes Controller on AWS EKS for Mobile CI/CD with MySQL-Backed State Management


The mobile CI/CD pipeline was a perpetual bottleneck. Our setup, a collection of Jenkins controllers orchestrating static EC2 Mac instances, was fragile and inefficient. A single failed build could poison an agent’s workspace, blocking all subsequent jobs until someone manually intervened. Scaling was a nightmare of pre-provisioning expensive Mac instances for peak load, which sat idle most of the time. We were paying a premium for underutilization and unreliability. Queues would form, developers would complain, and release velocity suffered. The core problem was treating our build agents as persistent, stateful pets instead of disposable, stateless cattle.

The obvious path forward was to containerize and orchestrate everything on Kubernetes. Android builds were simple; they could run in standard Linux pods on our existing EKS cluster. The intractable problem, as always, was iOS. Apple’s licensing requires iOS builds to run on macOS, and macOS doesn’t containerize the way Linux does. Our initial concept was to abstract this constraint away. Could we make a physical or virtualized Mac machine appear to the cluster as just another ephemeral, schedulable resource? This led us to the Kubernetes Operator pattern. We decided to build a custom controller that would manage the entire lifecycle of a mobile build, treating macOS nodes as a specialized, finite resource pool.

This controller would be the brains of the operation, governed by a Custom Resource Definition (CRD) called MobileBuild. A developer would simply declare their intent by applying a MobileBuild YAML, and the controller would handle the rest: queuing, scheduling onto an appropriate agent (Linux for Android, macOS for iOS), monitoring execution, and persisting the results. For durable, queryable records of every build—essential for auditing and metrics—we opted for an external AWS RDS for MySQL instance, rather than cluttering the CRD’s status field or relying on ephemeral Kubernetes events. The controller’s reconciliation logic would be complex and stateful, making Test-Driven Development (TDD) a non-negotiable discipline to ensure its correctness and stability.

Phase 1: The MobileBuild CRD and TDD Foundation

Before writing a single line of reconciliation logic, we defined the API. Using Kubebuilder, we scaffolded our new project and defined the MobileBuild CRD.

# Scaffolding the project
kubebuilder init --domain build.my-company.com --repo my-company.com/mobile-build-controller
kubebuilder create api --group build --version v1alpha1 --kind MobileBuild

The MobileBuildSpec defines the desired state—what a developer wants to build. The MobileBuildStatus reflects the current, observed state.

api/v1alpha1/mobilebuild_types.go:

package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// BuildType defines the type of mobile build
type BuildType string

const (
	BuildTypeIOS     BuildType = "iOS"
	BuildTypeAndroid BuildType = "Android"
)

// BuildPhase defines the current state of the build
type BuildPhase string

const (
	PhasePending   BuildPhase = "Pending"
	PhaseQueued    BuildPhase = "Queued"
	PhaseRunning   BuildPhase = "Running"
	PhaseSucceeded BuildPhase = "Succeeded"
	PhaseFailed    BuildPhase = "Failed"
)

// MobileBuildSpec defines the desired state of MobileBuild
type MobileBuildSpec struct {
	// Type of the build, e.g., "iOS" or "Android"
	// +kubebuilder:validation:Enum=iOS;Android
	Type BuildType `json:"type"`

	// GitRepositoryURL is the HTTPS URL of the git repository to clone.
	// +kubebuilder:validation:Required
	GitRepositoryURL string `json:"gitRepositoryURL"`

	// GitRef is the branch, tag, or commit hash to check out.
	// +kubebuilder:validation:Required
	GitRef string `json:"gitRef"`

	// BuildCommand is the shell command to execute for the build.
	// +kubebuilder:validation:Required
	BuildCommand string `json:"buildCommand"`
}

// MobileBuildStatus defines the observed state of MobileBuild
type MobileBuildStatus struct {
	// Phase is the current phase of the build.
	Phase BuildPhase `json:"phase,omitempty"`

	// Message provides more details about the current status.
	Message string `json:"message,omitempty"`

	// StartTime is the time the build started processing.
	StartTime *metav1.Time `json:"startTime,omitempty"`

	// CompletionTime is the time the build was completed.
	CompletionTime *metav1.Time `json:"completionTime,omitempty"`

	// BuildPodName is the name of the Kubernetes Pod running the build.
	BuildPodName string `json:"buildPodName,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:printcolumn:name="Type",type="string",JSONPath=".spec.type"
// +kubebuilder:printcolumn:name="Status",type="string",JSONPath=".status.phase"
// +kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"

// MobileBuild is the Schema for the mobilebuilds API
type MobileBuild struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   MobileBuildSpec   `json:"spec,omitempty"`
	Status MobileBuildStatus `json:"status,omitempty"`
}

//+kubebuilder:object:root=true

// MobileBuildList contains a list of MobileBuild
type MobileBuildList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata,omitempty"`
	Items           []MobileBuild `json:"items"`
}

func init() {
	SchemeBuilder.Register(&MobileBuild{}, &MobileBuildList{})
}

With the API defined, we established the TDD loop for the controller’s reconciler. The key is envtest, which spins up a temporary, lightweight etcd and Kubernetes API server for integration tests, avoiding the need for a full-blown cluster.
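
For reference, the envtest bootstrap lives in the scaffolded suite_test.go. The sketch below is condensed from what Kubebuilder generates; exact paths and options may differ slightly in your scaffold.

controllers/suite_test.go (condensed):

package controllers

import (
	"context"
	"path/filepath"
	"testing"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
	"k8s.io/client-go/kubernetes/scheme"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/envtest"

	buildv1alpha1 "my-company.com/mobile-build-controller/api/v1alpha1"
)

var (
	k8sClient client.Client
	testEnv   *envtest.Environment
	ctx       context.Context
	cancel    context.CancelFunc
)

func TestControllers(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "Controller Suite")
}

var _ = BeforeSuite(func() {
	ctx, cancel = context.WithCancel(context.TODO())

	// Point envtest at the generated CRD manifests so MobileBuild is a known type.
	testEnv = &envtest.Environment{
		CRDDirectoryPaths:     []string{filepath.Join("..", "config", "crd", "bases")},
		ErrorIfCRDPathMissing: true,
	}
	cfg, err := testEnv.Start()
	Expect(err).NotTo(HaveOccurred())

	Expect(buildv1alpha1.AddToScheme(scheme.Scheme)).To(Succeed())

	k8sClient, err = client.New(cfg, client.Options{Scheme: scheme.Scheme})
	Expect(err).NotTo(HaveOccurred())

	// Run the reconciler against the envtest API server, just as it would run in-cluster.
	k8sManager, err := ctrl.NewManager(cfg, ctrl.Options{Scheme: scheme.Scheme})
	Expect(err).NotTo(HaveOccurred())
	Expect((&MobileBuildReconciler{
		Client: k8sManager.GetClient(),
		Scheme: k8sManager.GetScheme(),
	}).SetupWithManager(k8sManager)).To(Succeed())

	go func() {
		defer GinkgoRecover()
		Expect(k8sManager.Start(ctx)).To(Succeed())
	}()
})

var _ = AfterSuite(func() {
	cancel()
	Expect(testEnv.Stop()).To(Succeed())
})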

Our first test case asserts the most basic behavior: when a new MobileBuild resource is created, the controller should see it and update its status to Pending.

controllers/mobilebuild_controller_test.go:

package controllers

import (
	"context"
	"time"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
	buildv1alpha1 "my-company.com/mobile-build-controller/api/v1alpha1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

var _ = Describe("MobileBuild Controller", func() {
	const (
		BuildName      = "test-build"
		BuildNamespace = "default"
		timeout        = time.Second * 10
		interval       = time.Millisecond * 250
	)

	Context("When creating a new MobileBuild resource", func() {
		It("Should update the status to Pending", func() {
			ctx := context.Background()
			build := &buildv1alpha1.MobileBuild{
				TypeMeta: metav1.TypeMeta{
					APIVersion: "build.my-company.com/v1alpha1",
					Kind:       "MobileBuild",
				},
				ObjectMeta: metav1.ObjectMeta{
					Name:      BuildName,
					Namespace: BuildNamespace,
				},
				Spec: buildv1alpha1.MobileBuildSpec{
					Type:             buildv1alpha1.BuildTypeAndroid,
					GitRepositoryURL: "https://github.com/user/repo.git",
					GitRef:           "main",
					BuildCommand:     "./gradlew assembleRelease",
				},
			}
			Expect(k8sClient.Create(ctx, build)).Should(Succeed())

			buildLookupKey := types.NamespacedName{Name: BuildName, Namespace: BuildNamespace}
			createdBuild := &buildv1alpha1.MobileBuild{}

			// We'll need to retry getting this object since it might take a second for the controller to update it.
			Eventually(func() bool {
				err := k8sClient.Get(ctx, buildLookupKey, createdBuild)
				if err != nil {
					return false
				}
				return createdBuild.Status.Phase == buildv1alpha1.PhasePending
			}, timeout, interval).Should(BeTrue())

			// Clean up the resource
			Expect(k8sClient.Delete(ctx, createdBuild)).Should(Succeed())
		})
	})
})

To make this test pass, the Reconcile method needs its initial logic.

controllers/mobilebuild_controller.go:

func (r *MobileBuildReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := log.FromContext(ctx)

	var mobileBuild buildv1alpha1.MobileBuild
	if err := r.Get(ctx, req.NamespacedName, &mobileBuild); err != nil {
		// Ignore not-found errors, since they can't be fixed by an immediate requeue.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// If the status is empty, it's a new build. Initialize it.
	if mobileBuild.Status.Phase == "" {
		log.Info("New build detected, setting status to Pending")
		mobileBuild.Status.Phase = buildv1alpha1.PhasePending
		mobileBuild.Status.Message = "Build has been accepted by the controller."
		if err := r.Status().Update(ctx, &mobileBuild); err != nil {
			log.Error(err, "Failed to update MobileBuild status")
			return ctrl.Result{}, err
		}
		// Requeue immediately after status update to proceed to the next state
		return ctrl.Result{Requeue: true}, nil
	}
	
	// ... more logic will go here
	return ctrl.Result{}, nil
}
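
For completeness, here is roughly how the reconciler is wired into the manager (a minimal sketch; corev1 is k8s.io/api/core/v1). Owning the Pods we later create means their status changes re-queue the parent MobileBuild, which complements the periodic requeue used in Phase 4’s monitoring.

// SetupWithManager registers the reconciler with the manager.
func (r *MobileBuildReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&buildv1alpha1.MobileBuild{}).
		// Changes to build pods we create (linked via owner references) re-trigger reconciliation.
		Owns(&corev1.Pod{}).
		Complete(r)
}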

This simple loop—write a failing test, write the minimum code to make it pass—formed the bedrock for building out the controller’s complex behavior.

Phase 2: Persisting State with MySQL

Relying solely on the custom resource for state is risky. If a MobileBuild object is accidentally deleted, its history goes with it. For audit and long-term analytics, we needed a durable store. We provisioned an RDS for MySQL instance and defined a simple schema.

schema.sql:

CREATE TABLE `build_records` (
  `id` INT NOT NULL AUTO_INCREMENT,
  `uid` VARCHAR(255) NOT NULL UNIQUE,
  `namespace` VARCHAR(255) NOT NULL,
  `name` VARCHAR(255) NOT NULL,
  `build_type` VARCHAR(50) NOT NULL,
  `git_repo_url` VARCHAR(500) NOT NULL,
  `git_ref` VARCHAR(255) NOT NULL,
  `status` VARCHAR(50) NOT NULL,
  `message` TEXT,
  `created_at` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `updated_at` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  INDEX `idx_status` (`status`),
  INDEX `idx_created_at` (`created_at`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

The controller needed a robust way to communicate with this database. We created a data access layer package, ensuring connection pooling and secure credential management via Kubernetes secrets.

internal/database/client.go:

package database

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
	buildv1alpha1 "my-company.com/mobile-build-controller/api/v1alpha1"
)

type Client struct {
	DB *sql.DB
}

// NewClient creates a new database client. DSN is the Data Source Name.
func NewClient(dsn string) (*Client, error) {
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		return nil, fmt.Errorf("failed to open database connection: %w", err)
	}

	db.SetConnMaxLifetime(time.Minute * 3)
	db.SetMaxOpenConns(10)
	db.SetMaxIdleConns(10)

	// Ping the database to verify the connection is alive.
	if err := db.Ping(); err != nil {
		db.Close()
		return nil, fmt.Errorf("failed to ping database: %w", err)
	}

	log.Println("Successfully connected to the database")
	return &Client{DB: db}, nil
}

// CreateOrUpdateBuildRecord inserts a new build record or updates an existing one based on UID.
func (c *Client) CreateOrUpdateBuildRecord(ctx context.Context, build *buildv1alpha1.MobileBuild) error {
	query := `
		INSERT INTO build_records (uid, namespace, name, build_type, git_repo_url, git_ref, status, message)
		VALUES (?, ?, ?, ?, ?, ?, ?, ?)
		ON DUPLICATE KEY UPDATE
		status = VALUES(status),
		message = VALUES(message);
	`
	_, err := c.DB.ExecContext(ctx, query,
		string(build.UID),
		build.Namespace,
		build.Name,
		string(build.Spec.Type),
		build.Spec.GitRepositoryURL,
		build.Spec.GitRef,
		string(build.Status.Phase),
		build.Status.Message,
	)

	if err != nil {
		return fmt.Errorf("failed to execute CreateOrUpdateBuildRecord query: %w", err)
	}
	return nil
}

The controller’s reconciler struct was updated to include this client, and the reconciliation logic was extended to call it. A common pitfall here is managing the database connection lifecycle: we initialize the connection pool once in main.go and pass the client instance to the reconciler when registering it with the manager, ensuring we don’t open new connections on every reconcile loop.

main.go:

// ... imports
func main() {
    // ... setup code for flags, logger, etc.

    // Get database credentials from environment variables, populated by a secret
    dbUser := os.Getenv("DB_USER")
    dbPassword := os.Getenv("DB_PASSWORD")
    dbHost := os.Getenv("DB_HOST")
    dbName := os.Getenv("DB_NAME")
    dsn := fmt.Sprintf("%s:%s@tcp(%s)/%s?parseTime=true", dbUser, dbPassword, dbHost, dbName)

    dbClient, err := database.NewClient(dsn)
    if err != nil {
        setupLog.Error(err, "unable to create database client")
        os.Exit(1)
    }
    defer dbClient.DB.Close()

    if err = (&controllers.MobileBuildReconciler{
        Client:   mgr.GetClient(),
        Scheme:   mgr.GetScheme(),
        DBClient: dbClient, // Pass the client to the reconciler
    }).SetupWithManager(mgr); err != nil {
        // ... error handling
    }

    // ... start manager
}
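
For reference, the reconciler struct that main.go wires up looks roughly like this. It is a sketch: the DBClient field is our own addition, and the BuildRecorder interface it uses is introduced just below.

// MobileBuildReconciler reconciles MobileBuild objects and mirrors their state to MySQL.
type MobileBuildReconciler struct {
	client.Client
	Scheme *runtime.Scheme
	// DBClient is typed as a narrow interface so unit tests can substitute an
	// in-memory fake; the concrete *database.Client satisfies it.
	DBClient BuildRecorder
}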

Now, the Reconcile method persists the Pending state to MySQL. TDD required us to mock the database interface or use a test container to validate this interaction without connecting to a real database during unit tests.
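
One way to do that, sketched here with illustrative names (BuildRecorder, fakeRecorder), is to have the reconciler depend on a one-method interface that the real *database.Client already satisfies:

// BuildRecorder is the minimal persistence surface the reconciler needs.
type BuildRecorder interface {
	CreateOrUpdateBuildRecord(ctx context.Context, build *buildv1alpha1.MobileBuild) error
}

// fakeRecorder records the last persisted phase per build UID, for assertions in unit tests.
type fakeRecorder struct {
	records map[string]buildv1alpha1.BuildPhase
}

func (f *fakeRecorder) CreateOrUpdateBuildRecord(ctx context.Context, build *buildv1alpha1.MobileBuild) error {
	if f.records == nil {
		f.records = map[string]buildv1alpha1.BuildPhase{}
	}
	f.records[string(build.UID)] = build.Status.Phase
	return nil
}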

Phase 3: The Scheduling and Execution Logic

This is the core of the controller. When a MobileBuild is in the Pending phase, the scheduler must decide what to do.

A mermaid diagram helps visualize the flow:

graph TD
    A[User applies MobileBuild YAML] --> B{Kubernetes API Server};
    B --> C[MobileBuild Controller Watches];
    C -- Reconcile Loop --> D{Build Status?};
    D -- "" (New) --> E[Update Status to Pending];
    E --> F[Write to MySQL];
    F --> D;
    D -- "Pending" --> G{Build Type?};
    G -- "Android" --> H[Create Linux Build Job];
    G -- "iOS" --> I[Find Available macOS Node];
    I -- Available --> J[Create macOS Build Pod on Node];
    I -- Not Available --> K[Update Status to Queued];
    K --> F;
    H --> L[Monitor Job Status];
    J --> L;
    L -- Succeeded/Failed --> M[Update Final Status];
    M --> F;

Android builds were straightforward. The controller constructs a Kubernetes batch/v1.Job that uses a container image pre-loaded with the Android SDK.

// In Reconcile method, inside the "Pending" case
case buildv1alpha1.PhasePending:
    log.Info("Processing build in Pending phase")
    switch mobileBuild.Spec.Type {
    case buildv1alpha1.BuildTypeAndroid:
        return r.reconcileAndroidBuild(ctx, &mobileBuild)
    case buildv1alpha1.BuildTypeIOS:
        return r.reconcileIOSBuild(ctx, &mobileBuild)
    }

The reconcileAndroidBuild function would create a Job manifest in Go and apply it.
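
A trimmed-down sketch of that function, under the same conventions as the iOS path shown next (batchv1 is k8s.io/api/batch/v1; the agent image name is a placeholder, and updateStatusAndDB is the helper shown later):

func (r *MobileBuildReconciler) reconcileAndroidBuild(ctx context.Context, build *buildv1alpha1.MobileBuild) (ctrl.Result, error) {
	backoffLimit := int32(0) // a failed build should fail the MobileBuild, not retry silently
	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			Name:      fmt.Sprintf("%s-android-build", build.Name),
			Namespace: build.Namespace,
			// Garbage-collect the Job when the MobileBuild is deleted.
			OwnerReferences: []metav1.OwnerReference{
				*metav1.NewControllerRef(build, buildv1alpha1.GroupVersion.WithKind("MobileBuild")),
			},
		},
		Spec: batchv1.JobSpec{
			BackoffLimit: &backoffLimit,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:    "build-container",
						Image:   "my-custom-registry/android-build-agent:latest", // placeholder image with the Android SDK and Gradle
						Command: []string{"/bin/sh", "-c"},
						Args: []string{fmt.Sprintf("git clone %s app && cd app && git checkout %s && %s",
							build.Spec.GitRepositoryURL, build.Spec.GitRef, build.Spec.BuildCommand)},
					}},
				},
			},
		},
	}
	if err := r.Create(ctx, job); err != nil {
		return ctrl.Result{}, err
	}

	build.Status.Phase = buildv1alpha1.PhaseRunning
	build.Status.Message = "Android build job created."
	build.Status.StartTime = &metav1.Time{Time: time.Now()}
	return ctrl.Result{}, r.updateStatusAndDB(ctx, build)
}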

iOS builds presented the real challenge. We had a pool of EC2 Mac instances registered as nodes in our EKS cluster, each labeled with build-agent-type=macos. The controller’s first task was to find an available node: one that didn’t already have an active build pod running on it, a state we tracked with another label, build-status=available.

A critical mistake we almost made was to have the controller provision the node itself. This would have tightly coupled the build logic with infrastructure logic. Instead, we decided the controller’s only responsibility is to find and use a node from the available pool. A separate process (initially manual, later automated by Cluster Autoscaler with custom node groups) would manage the pool size.

controllers/mobilebuild_controller.go:

func (r *MobileBuildReconciler) reconcileIOSBuild(ctx context.Context, build *buildv1alpha1.MobileBuild) (ctrl.Result, error) {
	log := log.FromContext(ctx)

	// 1. Find an available macOS node
	nodeList := &corev1.NodeList{}
	labels := map[string]string{
		"build-agent-type": "macos",
		"build-status":     "available",
	}
	if err := r.List(ctx, nodeList, client.MatchingLabels(labels)); err != nil {
		log.Error(err, "Failed to list macOS nodes")
		return ctrl.Result{}, err
	}

	if len(nodeList.Items) == 0 {
		// 2. No nodes available, queue the build
		log.Info("No available macOS nodes found. Queuing build.")
		build.Status.Phase = buildv1alpha1.PhaseQueued
		build.Status.Message = "No available macOS build agent. Build is waiting in queue."
		if err := r.updateStatusAndDB(ctx, build); err != nil {
			return ctrl.Result{}, err
		}
		// Requeue after a delay to check for nodes again
		return ctrl.Result{RequeueAfter: time.Minute * 1}, nil
	}

	// 3. Node found, schedule the pod
	selectedNode := nodeList.Items[0] // Simple strategy: pick the first one
	log.Info("Found available macOS node, scheduling build pod", "node", selectedNode.Name)

	buildPod, err := r.constructIOSPod(build, selectedNode.Name)
	if err != nil {
		log.Error(err, "Failed to construct build pod")
		return ctrl.Result{}, err
	}
	
	if err := r.Create(ctx, buildPod); err != nil {
		log.Error(err, "Failed to create build pod")
		return ctrl.Result{}, err
	}

	// 4. Update node and build status to "Running"
	log.Info("Successfully created build pod", "podName", buildPod.Name)
	
	originalNode := selectedNode.DeepCopy()
	selectedNode.Labels["build-status"] = "busy"
	if err := r.Patch(ctx, &selectedNode, client.MergeFrom(originalNode)); err != nil {
		log.Error(err, "Failed to label node as busy")
		// This is a tricky state; we created a pod but failed to label the node.
		// A robust implementation would add logic to delete the pod and retry.
		return ctrl.Result{}, err
	}

	build.Status.Phase = buildv1alpha1.PhaseRunning
	build.Status.Message = "Build pod created and is running on node " + selectedNode.Name
	build.Status.BuildPodName = buildPod.Name
	build.Status.StartTime = &metav1.Time{Time: time.Now()}
	if err := r.updateStatusAndDB(ctx, build); err != nil {
		return ctrl.Result{}, err
	}
	
	return ctrl.Result{}, nil
}

// Helper to construct the pod manifest
func (r *MobileBuildReconciler) constructIOSPod(build *buildv1alpha1.MobileBuild, nodeName string) (*corev1.Pod, error) {
	podName := fmt.Sprintf("%s-%s", build.Name, "ios-build") // Add random suffix in production
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:      podName,
			Namespace: build.Namespace,
			Labels: map[string]string{
				"app": "mobile-build",
				"build-uid": string(build.UID),
			},
			// Set the owner reference so the pod is garbage collected if the MobileBuild is deleted
			OwnerReferences: []metav1.OwnerReference{
				*metav1.NewControllerRef(build, buildv1alpha1.GroupVersion.WithKind("MobileBuild")),
			},
		},
		Spec: corev1.PodSpec{
			NodeName:      nodeName, // This is key: direct scheduling to the chosen node
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{
				{
					Name:  "build-container",
					Image: "my-custom-registry/ios-build-agent:latest", // Image with Xcode, fastlane, etc.
					Command: []string{"/bin/sh", "-c"},
					Args: []string{
						fmt.Sprintf("git clone %s app && cd app && git checkout %s && %s",
							build.Spec.GitRepositoryURL,
							build.Spec.GitRef,
							build.Spec.BuildCommand,
						),
					},
				},
			},
		},
	}
	return pod, nil
}
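
The updateStatusAndDB helper used above is not shown in the excerpts so far; it simply performs the two writes back to back. A minimal sketch — because the MySQL write is an upsert keyed on UID, retrying it on the next reconcile is safe if it fails after the status update succeeds:

// updateStatusAndDB writes the build's status subresource and mirrors it to MySQL.
func (r *MobileBuildReconciler) updateStatusAndDB(ctx context.Context, build *buildv1alpha1.MobileBuild) error {
	if err := r.Status().Update(ctx, build); err != nil {
		return fmt.Errorf("failed to update MobileBuild status: %w", err)
	}
	if err := r.DBClient.CreateOrUpdateBuildRecord(ctx, build); err != nil {
		return fmt.Errorf("failed to persist build record: %w", err)
	}
	return nil
}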

The TDD cycle for this was intense. We used the fake client provided by controller-runtime to inject fake Node objects into the test environment, allowing us to validate the queuing logic when no nodes were present, and the scheduling logic when a node was available.
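
A condensed example of that pattern, written as a plain Go test for brevity. The fake client comes from sigs.k8s.io/controller-runtime/pkg/client/fake, clientgoscheme is k8s.io/client-go/kubernetes/scheme, WithStatusSubresource requires a reasonably recent controller-runtime, and fakeRecorder is the hypothetical in-memory recorder sketched earlier:

func TestIOSBuildQueuesWhenNoMacNodesAvailable(t *testing.T) {
	scheme := runtime.NewScheme()
	_ = clientgoscheme.AddToScheme(scheme) // registers core types such as Node and Pod
	_ = buildv1alpha1.AddToScheme(scheme)

	build := &buildv1alpha1.MobileBuild{
		ObjectMeta: metav1.ObjectMeta{Name: "ios-build", Namespace: "default", UID: "test-uid"},
		Spec: buildv1alpha1.MobileBuildSpec{
			Type:             buildv1alpha1.BuildTypeIOS,
			GitRepositoryURL: "https://github.com/user/repo.git",
			GitRef:           "main",
			BuildCommand:     "fastlane build",
		},
	}

	// No Node objects are seeded, so the controller should queue the build.
	fakeClient := fake.NewClientBuilder().
		WithScheme(scheme).
		WithObjects(build).
		WithStatusSubresource(build).
		Build()

	r := &MobileBuildReconciler{Client: fakeClient, Scheme: scheme, DBClient: &fakeRecorder{}}

	res, err := r.reconcileIOSBuild(context.Background(), build)
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	if build.Status.Phase != buildv1alpha1.PhaseQueued {
		t.Fatalf("expected phase Queued, got %q", build.Status.Phase)
	}
	if res.RequeueAfter == 0 {
		t.Fatal("expected a delayed requeue while waiting for a node")
	}
}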

Phase 4: Status Monitoring and Cleanup

Once a pod is running, the controller’s job shifts to monitoring. It enters a new state (PhaseRunning). In this state, on each reconciliation, it fetches the pod it created (using the name stored in the MobileBuildStatus) and checks its phase.

If the pod’s phase is Succeeded, the controller updates the MobileBuild status to Succeeded, records the completion time, and most importantly, cleans up. Cleanup involves deleting the build pod and patching the macOS node’s label back to build-status=available, releasing it back into the pool. A similar flow handles the Failed state.

controllers/mobilebuild_controller.go:

// In Reconcile method...
case buildv1alpha1.PhaseRunning:
    log.Info("Processing build in Running phase")
    
    // Find the associated pod
    buildPod := &corev1.Pod{}
    err := r.Get(ctx, types.NamespacedName{Name: mobileBuild.Status.BuildPodName, Namespace: mobileBuild.Namespace}, buildPod)
    if err != nil {
        if errors.IsNotFound(err) {
            // Pod is gone unexpectedly. Mark build as failed.
            log.Info("Build pod not found. Marking build as failed.")
            mobileBuild.Status.Phase = buildv1alpha1.PhaseFailed
            mobileBuild.Status.Message = "Build pod was deleted unexpectedly."
            // ... update status and DB
            return ctrl.Result{}, nil
        }
        return ctrl.Result{}, err
    }

    switch buildPod.Status.Phase {
    case corev1.PodSucceeded:
        log.Info("Build pod succeeded. Finalizing build.")
        mobileBuild.Status.Phase = buildv1alpha1.PhaseSucceeded
        mobileBuild.Status.Message = "Build completed successfully."
        mobileBuild.Status.CompletionTime = &metav1.Time{Time: time.Now()}
        if err := r.updateStatusAndDB(ctx, &mobileBuild); err != nil {
            return ctrl.Result{}, err
        }
        // Cleanup
        return r.cleanupForBuild(ctx, &mobileBuild)

    case corev1.PodFailed:
        log.Info("Build pod failed. Finalizing build.")
        mobileBuild.Status.Phase = buildv1alpha1.PhaseFailed
        // In a real system, you'd fetch logs from the failed pod here and add them to the message
        mobileBuild.Status.Message = "Build pod failed execution."
        mobileBuild.Status.CompletionTime = &metav1.Time{Time: time.Now()}
        if err := r.updateStatusAndDB(ctx, &mobileBuild); err != nil {
            return ctrl.Result{}, err
        }
        // Cleanup
        return r.cleanupForBuild(ctx, &mobileBuild)

    default:
        // Pod is still running, pending, or in another state. Do nothing and requeue to check again later.
        return ctrl.Result{RequeueAfter: time.Second * 30}, nil
    }
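
The cleanupForBuild helper referenced above deletes the finished build pod and, for iOS builds, returns the macOS node to the pool. A sketch with error handling trimmed:

// cleanupForBuild deletes the build pod and flips the node's build-status label back to "available".
func (r *MobileBuildReconciler) cleanupForBuild(ctx context.Context, build *buildv1alpha1.MobileBuild) (ctrl.Result, error) {
	log := log.FromContext(ctx)

	buildPod := &corev1.Pod{}
	err := r.Get(ctx, types.NamespacedName{Name: build.Status.BuildPodName, Namespace: build.Namespace}, buildPod)
	if err != nil {
		// If the pod is already gone we cannot discover its node here; nothing more to do.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	nodeName := buildPod.Spec.NodeName
	if err := r.Delete(ctx, buildPod); err != nil && !errors.IsNotFound(err) {
		return ctrl.Result{}, err
	}

	// Release the macOS node back into the available pool.
	if build.Spec.Type == buildv1alpha1.BuildTypeIOS && nodeName != "" {
		node := &corev1.Node{}
		if err := r.Get(ctx, types.NamespacedName{Name: nodeName}, node); err == nil {
			original := node.DeepCopy()
			node.Labels["build-status"] = "available"
			if err := r.Patch(ctx, node, client.MergeFrom(original)); err != nil {
				log.Error(err, "Failed to release macOS node", "node", nodeName)
				return ctrl.Result{}, err
			}
		}
	}

	return ctrl.Result{}, nil
}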

The final result is a declarative, elastic system. A developer wanting to run an iOS build simply creates a resource:

apiVersion: build.my-company.com/v1alpha1
kind: MobileBuild
metadata:
  name: my-app-ios-release-1.2.3
spec:
  type: iOS
  gitRepositoryURL: "https://github.com/my-company/my-ios-app.git"
  gitRef: "release/1.2.3"
  buildCommand: "fastlane build_and_archive"

They can monitor it with kubectl get mobilebuilds, and the platform team can query the MySQL database for historical performance data. The static, fragile Jenkins agents are gone, replaced by an elastic, self-healing orchestration layer native to our cloud platform.
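
As an illustration of the kind of reporting this enables, a helper like the following could sit alongside the database client. It is hypothetical, not part of the controller, but shows how the build_records table supports simple aggregate queries:

// CountBuildsByStatus returns, for the last `days` days, how many builds ended in each status.
func (c *Client) CountBuildsByStatus(ctx context.Context, days int) (map[string]int, error) {
	query := `
		SELECT status, COUNT(*)
		FROM build_records
		WHERE created_at >= NOW() - INTERVAL ? DAY
		GROUP BY status;
	`
	rows, err := c.DB.QueryContext(ctx, query, days)
	if err != nil {
		return nil, fmt.Errorf("failed to query build counts: %w", err)
	}
	defer rows.Close()

	counts := make(map[string]int)
	for rows.Next() {
		var status string
		var count int
		if err := rows.Scan(&status, &count); err != nil {
			return nil, err
		}
		counts[status] = count
	}
	return counts, rows.Err()
}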

The current implementation, however, is not without its limitations. The node selection strategy is basic (“pick the first available”) and could be enhanced with more sophisticated scheduling based on node resources or build history. More critically, the system still relies on a pre-provisioned pool of macOS nodes. The logical next step is to integrate the controller with a cluster autoscaler-like mechanism to provision and terminate expensive EC2 Mac instances on-demand based on the depth of the build queue. This would be the final step in achieving a truly cost-efficient, hands-off mobile CI platform. Security is another area for improvement; running builds in pods directly on the host OS is not ideal. Exploring ephemeral, isolated VMs for each build using technologies like Anka or Tart would provide much stronger security guarantees.

