The shared, static pool of Android emulators had become the primary bottleneck in our CI pipeline. Contention was constant, with jobs queuing for hours waiting for a free slot. State corruption was rampant; tests would fail intermittently because a previous run left the emulator in a dirty state. Scaling the pool meant manually provisioning more oversized VMs, a slow and expensive process that never seemed to keep pace with developer demand. Each test run was a gamble, eroding trust in our end-to-end test suite. The core problem was treating these test environments as persistent, stateful pets instead of disposable, ephemeral cattle. The mandate was clear: build a system where a pristine, isolated Android environment could be provisioned on-demand for every single test run and destroyed immediately after.
Our backend infrastructure is built on AWS EKS, so leveraging Kubernetes was the obvious path. The concept was to package an Android emulator into a Docker container and manage it as a Pod. This would theoretically give us the isolation and scalability we needed. Vitest was the chosen test runner for our frontend and Node.js services, and unifying the testing toolchain was a significant goal. The challenge was to bridge the gap between a JavaScript test runner and a complex, stateful workload like an Android emulator running within a Kubernetes cluster.
Initial attempts using simple shell scripts in a CI pipeline to kubectl apply a Pod manifest were fragile. They were littered with sleep commands, crude polling loops, and complex cleanup logic that often failed, leaving orphaned emulator pods burning through expensive resources. This imperative approach was not robust enough for a real-world project, where reliability is paramount. We needed a declarative model, which is the cornerstone of Kubernetes. That led us to the Operator pattern: a custom Kubernetes controller that manages the entire lifecycle of an AndroidEmulator resource.
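For context, the imperative glue we were replacing looked roughly like the following. This is a reconstructed sketch, not our actual pipeline script; the manifest name, pod name, and timings are illustrative.
#!/bin/bash
# Fragile, imperative provisioning: blind sleeps, crude polling, best-effort cleanup.
set -e

kubectl apply -f emulator-pod.yaml   # hypothetical Pod manifest

# Hope the emulator is up after a fixed delay.
sleep 120

# Crude readiness poll with a hard-coded retry budget.
for i in $(seq 1 30); do
  kubectl exec emulator-pod -- adb shell getprop sys.boot_completed | grep -q 1 && break
  sleep 10
done

# Run the tests, then tear down. If anything above fails, the cleanup never
# runs and the orphaned pod keeps burning through expensive resources.
pnpm test
kubectl delete pod emulator-pod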
Phase 1: Containerizing the Uncontainerizable
The first hurdle was creating a stable Docker image that could run a headless Android emulator. This is non-trivial because the emulator relies on hardware acceleration via KVM, which is not available in a standard container environment.
The solution requires the container to run in a privileged security context, granting it access to the host’s KVM device at /dev/kvm.
Here is the finalized Dockerfile. It’s built on Ubuntu and meticulously installs only the necessary Android SDK components to keep the image size manageable.
# Use a stable base image
FROM ubuntu:22.04
# Avoid interactive prompts during package installation
ENV DEBIAN_FRONTEND=noninteractive
# Install dependencies for the Android SDK, KVM, and ADB
# (sdkmanager and avdmanager are Java tools, so a headless JDK is required)
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    wget \
    unzip \
    openjdk-17-jdk-headless \
    libnss3 \
    libxss1 \
    libasound2 \
    libxtst6 \
    pulseaudio \
    qemu-kvm \
    libpulse0 \
    libglu1-mesa \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*
# Set up Android SDK environment
ENV ANDROID_SDK_ROOT /opt/android/sdk
ENV PATH $PATH:${ANDROID_SDK_ROOT}/cmdline-tools/latest/bin:${ANDROID_SDK_ROOT}/platform-tools:${ANDROID_SDK_ROOT}/emulator
# Install Android command-line tools
RUN mkdir -p ${ANDROID_SDK_ROOT}/cmdline-tools && \
wget -q https://dl.google.com/android/repository/commandlinetools-linux-9477386_latest.zip -O /tmp/cmdline-tools.zip && \
unzip -q /tmp/cmdline-tools.zip -d /tmp/ && \
mv /tmp/cmdline-tools ${ANDROID_SDK_ROOT}/cmdline-tools/latest && \
rm /tmp/cmdline-tools.zip
# Accept licenses automatically
RUN yes | sdkmanager --licenses
# Install platform tools, emulator, and a system image (e.g., Android 12)
# A common mistake is to install too many images, bloating the container.
# Pin versions for reproducibility.
RUN sdkmanager "platform-tools" "emulator" "system-images;android-31;google_apis;x86_64"
# Create an Android Virtual Device (AVD)
# The configuration here is critical for headless operation.
RUN echo "no" | avdmanager create avd \
--force \
--name "test_emulator" \
--package "system-images;android-31;google_apis;x86_64" \
--device "pixel_6" \
--sdcard 256M
# Expose ADB ports
EXPOSE 5554 5555
# The entrypoint script is crucial for starting the emulator correctly
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
The accompanying entrypoint.sh is more than a simple startup command. It handles launching the emulator with specific flags required for headless, containerized operation and then waits indefinitely to keep the container running.
#!/bin/bash
set -e
# Start the Android emulator in the background
# -no-window: Essential for headless CI environments.
# -no-snapshot: Ensures a clean state for every run.
# -no-audio: Reduces unnecessary resource consumption.
# -verbose: Provides helpful logs for debugging startup issues.
# -writable-system: Sometimes needed for tests that modify system properties.
# -qemu -enable-kvm: Explicitly tells QEMU to use KVM.
emulator @test_emulator \
-no-window \
-no-snapshot \
-no-audio \
-verbose \
-writable-system \
-qemu -enable-kvm &
# Wait for the emulator to boot completely.
# A common pitfall is to assume the emulator is ready as soon as the process starts.
# The `sys.boot_completed` property is the reliable signal.
echo "Waiting for emulator to boot..."
until adb shell getprop sys.boot_completed | grep -m 1 "1"; do
sleep 1
done
echo "Emulator booted successfully."
# Keep the container running
# A simple tail on /dev/null is a lightweight way to prevent the container from exiting.
tail -f /dev/null
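Before handing the image to Kubernetes, it is worth a quick local smoke test on any Linux host that exposes /dev/kvm. The image tag and container name below are illustrative.
# Build the image and run it with access to the host's KVM device
docker build -t android-emulator:local .
docker run -d --name emu-smoke --privileged --device /dev/kvm android-emulator:local

# Follow the entrypoint logs until "Emulator booted successfully." appears,
# then confirm ADB sees the device inside the container
docker logs -f emu-smoke
docker exec emu-smoke adb devices

# Clean up
docker rm -f emu-smoke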
This image provides a self-contained, reproducible Android environment. The next challenge was finding a place to run it.
Phase 2: Preparing the EKS Infrastructure
Standard EKS node groups running on virtualized EC2 instances (like the m5 family) do not support nested virtualization. Attempting to run our privileged container on such a node results in the emulator failing to start, complaining about KVM not being available.
The only viable solution on AWS is to use bare metal instances. These instances provide direct access to the underlying hardware’s virtualization extensions (Intel VT-x/AMD-V). We provisioned a dedicated, auto-scaling node group using m5.metal instances, specifically for these emulator workloads.
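A quick sanity check on a candidate node (or from a debug pod with host access) confirms whether the virtualization extensions are actually exposed:
# The KVM device must exist on the host
ls -l /dev/kvm

# A non-zero count means VT-x (vmx) or AMD-V (svm) is exposed to the OS
grep -c -E 'vmx|svm' /proc/cpuinfo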
Using Terraform, we defined a dedicated node group with a taint to ensure only our emulator pods would be scheduled on this expensive hardware.
resource "aws_eks_node_group" "android_emulators" {
cluster_name = var.eks_cluster_name
node_group_name = "android-emulator-nodes"
instance_types = ["m5.metal"]
node_role_arn = aws_iam_role.eks_node_role.arn
subnet_ids = var.private_subnet_ids
scaling_config {
desired_size = 1
max_size = 5
min_size = 0 # Allows scaling to zero to save costs
}
# Taints prevent general workloads from being scheduled on these expensive nodes.
taint {
key = "workload"
value = "android-emulator"
effect = "NO_SCHEDULE"
}
labels = {
"workload-type" = "android-emulator"
}
# ... other configurations like security groups, AMI type, etc.
}
This infrastructure setup is critical. Without bare metal nodes and the correct taints, the entire system is a non-starter. The ability to scale to zero is a key FinOps consideration; these nodes should only be running when tests are active.
Phase 3: The Kubernetes Controller and CRD
With the image and infrastructure in place, we moved to the core of the solution: the custom controller. We defined a Custom Resource Definition (CRD) for AndroidEmulator. This allows developers to request an emulator declaratively.
androidemulator_crd.yaml:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: androidemulators.testing.our-company.com
spec:
  group: testing.our-company.com
  versions:
    - name: v1alpha1
      served: true
      storage: true
      # Status is a subresource so the controller can update it via Status().Update()
      subresources:
        status: {}
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                androidVersion:
                  type: string
                  description: "e.g., android-31"
                ttlSecondsAfterFinished:
                  type: integer
                  description: "Seconds to keep the resource after use before cleanup."
            status:
              type: object
              properties:
                phase:
                  type: string
                  description: "Current phase: Pending, Provisioning, Running, Terminated."
                adbConnectAddress:
                  type: string
                  description: "The host:port string for ADB connection."
  scope: Namespaced
  names:
    plural: androidemulators
    singular: androidemulator
    kind: AndroidEmulator
    shortNames:
      - andemu
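With the CRD installed, requesting an emulator is a matter of applying a small manifest like the one below. The name and namespace are illustrative; the Vitest setup shown later generates a unique name per run.
apiVersion: testing.our-company.com/v1alpha1
kind: AndroidEmulator
metadata:
  name: test-emu-example
  namespace: testing
spec:
  androidVersion: android-31
  ttlSecondsAfterFinished: 300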
The controller, written in Go using Kubebuilder, contains the reconciliation logic. Its job is to watch for AndroidEmulator resources and take action to make the actual state of the cluster match the desired state defined in the resource’s spec.
Here is a simplified snippet of the core Reconcile function:
// File: internal/controller/androidemulator_controller.go
package controller
import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
	"sigs.k8s.io/controller-runtime/pkg/log"

	testingv1alpha1 "your-repo/android-emulator-operator/api/v1alpha1"
)
const adbPort = 5555
func (r *AndroidEmulatorReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
log := log.FromContext(ctx)
var androidEmulator testingv1alpha1.AndroidEmulator
if err := r.Get(ctx, req.NamespacedName, &androidEmulator); err != nil {
return ctrl.Result{}, client.IgnoreNotFound(err)
}
// === Finalizer logic for cleanup ===
finalizerName := "androidemulators.testing.our-company.com/finalizer"
if androidEmulator.ObjectMeta.DeletionTimestamp.IsZero() {
// The object is not being deleted, so if it does not have our finalizer,
// let's add it and update the object.
if !controllerutil.ContainsFinalizer(&androidEmulator, finalizerName) {
controllerutil.AddFinalizer(&androidEmulator, finalizerName)
if err := r.Update(ctx, &androidEmulator); err != nil {
return ctrl.Result{}, err
}
}
} else {
// The object is being deleted
if controllerutil.ContainsFinalizer(&androidEmulator, finalizerName) {
// Run our finalizer logic.
log.Info("Cleaning up associated resources for AndroidEmulator")
// The implementation of cleanup (deleting Pod and Service) goes here.
// ...
// Remove our finalizer from the list and update it.
controllerutil.RemoveFinalizer(&androidEmulator, finalizerName)
if err := r.Update(ctx, &androidEmulator); err != nil {
return ctrl.Result{}, err
}
}
return ctrl.Result{}, nil
}
// === Reconciliation Logic ===
// 1. Check if the Pod already exists
foundPod := &corev1.Pod{}
err := r.Get(ctx, types.NamespacedName{Name: androidEmulator.Name, Namespace: androidEmulator.Namespace}, foundPod)
if err != nil && errors.IsNotFound(err) {
// Pod does not exist, let's create it
pod := r.podForAndroidEmulator(&androidEmulator)
if err = ctrl.SetControllerReference(&androidEmulator, pod, r.Scheme); err != nil {
return ctrl.Result{}, err
}
log.Info("Creating a new Pod", "Pod.Namespace", pod.Namespace, "Pod.Name", pod.Name)
if err := r.Create(ctx, pod); err != nil {
return ctrl.Result{}, err
}
// Also create the Service to expose ADB
svc := r.serviceForAndroidEmulator(&androidEmulator)
if err = ctrl.SetControllerReference(&androidEmulator, svc, r.Scheme); err != nil {
return ctrl.Result{}, err
}
log.Info("Creating a new Service", "Service.Namespace", svc.Namespace, "Service.Name", svc.Name)
if err := r.Create(ctx, svc); err != nil {
return ctrl.Result{}, err
}
// Update status and requeue
androidEmulator.Status.Phase = "Provisioning"
if err := r.Status().Update(ctx, &androidEmulator); err != nil {
return ctrl.Result{}, err
}
return ctrl.Result{Requeue: true}, nil
} else if err != nil {
return ctrl.Result{}, err
}
// 2. Pod exists, check its status
// The most important part is determining if the emulator is ready for connections.
// A common mistake is to just check if the Pod is 'Running'. The container can be running
// but the emulator inside could still be booting or have crashed.
// Our entrypoint script handles this, but the controller needs to reflect the final "ready" state.
if foundPod.Status.Phase == corev1.PodRunning && androidEmulator.Status.Phase != "Running" {
// A more robust check would involve probing the adb port or checking pod logs.
// For now, we assume if the pod is Running, the entrypoint script has succeeded.
androidEmulator.Status.Phase = "Running"
// The service name is stable and predictable
androidEmulator.Status.AdbConnectAddress = fmt.Sprintf("%s.%s.svc.cluster.local:%d", androidEmulator.Name, androidEmulator.Namespace, adbPort)
log.Info("Emulator is running", "AdbConnectAddress", androidEmulator.Status.AdbConnectAddress)
if err := r.Status().Update(ctx, &androidEmulator); err != nil {
return ctrl.Result{}, err
}
}
return ctrl.Result{}, nil
}
// podForAndroidEmulator returns a Pod definition for the emulator.
func (r *AndroidEmulatorReconciler) podForAndroidEmulator(m *testingv1alpha1.AndroidEmulator) *corev1.Pod {
// ... Pod definition ...
// This is where we specify the privileged context and volume mounts.
privileged := true
pod := &corev1.Pod{
// ... metadata ...
		Spec: corev1.PodSpec{
			// NodeSelector pins the Pod to the labeled bare metal nodes; the toleration
			// below only permits scheduling there, it does not require it.
			NodeSelector: map[string]string{
				"workload-type": "android-emulator",
			},
			// Toleration to run on our dedicated bare metal nodes
			Tolerations: []corev1.Toleration{
				{
					Key:      "workload",
					Operator: "Equal",
					Value:    "android-emulator",
					Effect:   "NoSchedule",
				},
			},
Containers: []corev1.Container{{
Image: "your-registry/android-emulator:latest",
Name: "emulator",
SecurityContext: &corev1.SecurityContext{
Privileged: &privileged,
},
VolumeMounts: []corev1.VolumeMount{
{
Name: "kvm",
MountPath: "/dev/kvm",
},
},
// ... ports, resources ...
}},
Volumes: []corev1.Volume{
{
Name: "kvm",
VolumeSource: corev1.VolumeSource{
HostPath: &corev1.HostPathVolumeSource{
Path: "/dev/kvm",
},
},
},
},
},
}
return pod
}
// serviceForAndroidEmulator returns a Service to expose the ADB port.
func (r *AndroidEmulatorReconciler) serviceForAndroidEmulator(m *testingv1alpha1.AndroidEmulator) *corev1.Service {
// ... Service definition for port 5555 ...
}
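The Service definition is elided above; a minimal sketch follows. It assumes the Pod created by podForAndroidEmulator carries an app: <CR name> label in its (elided) metadata, which the selector here mirrors.
// serviceForAndroidEmulator returns a ClusterIP Service exposing the ADB port.
// Sketch only: the "app" selector mirrors an assumed label on the emulator Pod.
// Additional imports: metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" and
// "k8s.io/apimachinery/pkg/util/intstr".
func (r *AndroidEmulatorReconciler) serviceForAndroidEmulator(m *testingv1alpha1.AndroidEmulator) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			// The Service name matches the CR name, which is what the status
			// field's <name>.<namespace>.svc.cluster.local address relies on.
			Name:      m.Name,
			Namespace: m.Namespace,
		},
		Spec: corev1.ServiceSpec{
			Selector: map[string]string{"app": m.Name}, // assumed Pod label
			Ports: []corev1.ServicePort{{
				Name:       "adb",
				Port:       adbPort,
				TargetPort: intstr.FromInt(adbPort),
				Protocol:   corev1.ProtocolTCP,
			}},
		},
	}
}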
This controller handles creation, updates the status with the connection address once ready, and, crucially, uses a finalizer to ensure that when an AndroidEmulator resource is deleted, the associated Pod and Service are garbage collected cleanly.
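In day-to-day use the resource behaves like any other Kubernetes object. A typical interactive session might look like this (manifest filename and resource name are illustrative):
# Create an emulator and watch it come up (the 'andemu' shortname comes from the CRD)
kubectl apply -f android-emulator.yaml
kubectl get andemu -n testing -w

# Read the ADB connection address once the phase is Running
kubectl get andemu test-emu-example -n testing -o jsonpath='{.status.adbConnectAddress}'

# Deleting the resource runs the finalizer, which tears down the Pod and Service
kubectl delete andemu test-emu-example -n testing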
Phase 4: Integrating with Vitest
The final piece was to orchestrate this system from our test suite. We used Vitest’s globalSetup and globalTeardown hooks to manage the lifecycle of the AndroidEmulator resource for the entire test run.
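The hooks are registered in the Vitest configuration. A vitest.config.ts along these lines wires them up (a minimal sketch; the file path and timeout values are assumptions):
// vitest.config.ts (sketch)
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    // Points at a module exporting `setup` and `teardown`, run once per suite
    globalSetup: './vitest.global-setup.ts',
    // E2E runs against a real emulator are slow; allow generous timeouts
    hookTimeout: 120_000,
    testTimeout: 120_000,
  },
});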
We used the @kubernetes/client-node library to interact with the EKS cluster’s API server from our Node.js-based test environment.
vitest.global-setup.ts:
import { KubeConfig, CustomObjectsApi } from '@kubernetes/client-node';
import { v4 as uuidv4 } from 'uuid';
import * as fs from 'fs';
import * as path from 'path';
const EMULATOR_GROUP = 'testing.our-company.com';
const EMULATOR_VERSION = 'v1alpha1';
const EMULATOR_PLURAL = 'androidemulators';
const NAMESPACE = 'testing';
export async function setup() {
console.log('Global setup: Provisioning Android Emulator...');
const kc = new KubeConfig();
// Assumes in-cluster config or local kubeconfig is correctly set up
kc.loadFromDefault();
const k8sApi = kc.makeApiClient(CustomObjectsApi);
const testRunId = uuidv4().slice(0, 8);
const emulatorName = `test-emu-${testRunId}`;
const emulatorManifest = {
apiVersion: `${EMULATOR_GROUP}/${EMULATOR_VERSION}`,
kind: 'AndroidEmulator',
metadata: {
name: emulatorName,
namespace: NAMESPACE,
},
spec: {
androidVersion: 'android-31',
},
};
try {
// 1. Create the AndroidEmulator resource
await k8sApi.createNamespacedCustomObject(
EMULATOR_GROUP,
EMULATOR_VERSION,
NAMESPACE,
EMULATOR_PLURAL,
emulatorManifest,
);
console.log(`AndroidEmulator resource '${emulatorName}' created. Waiting for it to become ready...`);
// 2. Poll the resource status until it is 'Running'
let adbConnectAddress = '';
const pollTimeout = 300 * 1000; // 5 minutes
const pollInterval = 5 * 1000; // 5 seconds
const startTime = Date.now();
while (Date.now() - startTime < pollTimeout) {
const { body: customObject } = await k8sApi.getNamespacedCustomObjectStatus(
EMULATOR_GROUP,
EMULATOR_VERSION,
NAMESPACE,
EMULATOR_PLURAL,
emulatorName,
);
const status = (customObject as any).status;
if (status && status.phase === 'Running' && status.adbConnectAddress) {
adbConnectAddress = status.adbConnectAddress;
console.log(`Emulator is ready! ADB address: ${adbConnectAddress}`);
break;
}
await new Promise(resolve => setTimeout(resolve, pollInterval));
}
if (!adbConnectAddress) {
throw new Error('Emulator provisioning timed out.');
}
// 3. Store the connection details for tests to use
// A temporary file is a simple way to pass state from globalSetup to teardown and tests.
const state = { emulatorName, adbConnectAddress };
fs.writeFileSync(path.join(__dirname, '.test-state.json'), JSON.stringify(state));
// Expose as an environment variable for the test processes
process.env.ADB_CONNECT_ADDRESS = adbConnectAddress;
} catch (err) {
console.error('Failed to provision Android Emulator:', err);
// Best effort cleanup
await teardown();
process.exit(1);
}
}
export async function teardown() {
console.log('Global teardown: Cleaning up Android Emulator...');
const stateFilePath = path.join(__dirname, '.test-state.json');
if (!fs.existsSync(stateFilePath)) {
return;
}
const state = JSON.parse(fs.readFileSync(stateFilePath, 'utf-8'));
const { emulatorName } = state;
const kc = new KubeConfig();
kc.loadFromDefault();
const k8sApi = kc.makeApiClient(CustomObjectsApi);
try {
// Deleting the custom resource triggers the finalizer in our controller
await k8sApi.deleteNamespacedCustomObject(
EMULATOR_GROUP,
EMULATOR_VERSION,
NAMESPACE,
EMULATOR_PLURAL,
emulatorName,
);
console.log(`AndroidEmulator resource '${emulatorName}' deleted.`);
} catch (err) {
console.error(`Failed to delete AndroidEmulator '${emulatorName}':`, err);
} finally {
fs.unlinkSync(stateFilePath);
}
}
An actual test file would then use this environment variable to connect.
example.spec.ts:
import { describe, it, expect, beforeAll } from 'vitest';
import { execSync } from 'child_process';
describe('Android Emulator E2E Test', () => {
const adbAddress = process.env.ADB_CONNECT_ADDRESS;
beforeAll(() => {
// Ensure we have the address before running tests
if (!adbAddress) {
throw new Error('ADB_CONNECT_ADDRESS is not set. Global setup may have failed.');
}
try {
// Connect to the dynamically provisioned emulator
console.log(`Connecting ADB to ${adbAddress}...`);
execSync(`adb connect ${adbAddress}`);
} catch (e) {
console.error('Failed to connect ADB:', e);
throw e;
}
});
it('should have a booted device available', () => {
const devicesOutput = execSync(`adb devices`).toString();
expect(devicesOutput).toContain(adbAddress);
expect(devicesOutput).toContain('device');
});
it('should be able to get a system property', () => {
const manufacturer = execSync(`adb -s ${adbAddress} shell getprop ro.product.manufacturer`).toString().trim();
expect(manufacturer).toBe('Google');
});
});
This architecture is illustrated by the following flow:
sequenceDiagram
    participant CI/CD as CI Pipeline (e.g. GitHub Actions)
    participant Vitest as Vitest Global Setup
    participant K8s API as Kubernetes API Server
    participant Controller as AndroidEmulator Controller
    participant EKS as EKS Node (m5.metal)
    CI/CD->>Vitest: Triggers `pnpm test`
    Vitest->>K8s API: CREATE AndroidEmulator CR
    K8s API->>Controller: Notifies of new CR
    Controller->>K8s API: CREATE Pod & Service
    K8s API->>EKS: Schedules Pod
    EKS->>EKS: Starts emulator container
    Note over EKS: Emulator boots...
    Controller->>K8s API: Polls Pod status
    Controller->>K8s API: UPDATE CR status to 'Running' with adb address
    loop Poll for readiness
        Vitest->>K8s API: GET AndroidEmulator CR Status
    end
    Vitest-->>Vitest: Status is 'Running', extract adb address
    Note over Vitest: Test execution begins...<br/>Tests connect to ADB address
    CI/CD->>Vitest: Test run finishes
    Vitest->>K8s API: DELETE AndroidEmulator CR
    K8s API->>Controller: Notifies of deletion (finalizer runs)
    Controller->>K8s API: DELETE Pod & Service
    K8s API->>EKS: Terminates Pod
The system is now fully declarative and automated. Developers no longer interact with emulators directly. They write their tests, and the platform provides the necessary environment, completely isolated and on-demand. Parallelism is no longer limited by a fixed pool of devices but by the scaling limits of our EKS node group.
This solution has its own trade-offs. The boot time for a fresh emulator can still exceed a minute, adding latency to every test run, which may be too slow for tight feedback loops. Future work could explore the emulator’s snapshotting capabilities: creating a “golden” snapshot and restoring it could cut boot time significantly. The cost of bare metal instances is also not negligible, so rigorous auto-scaling policies for the node group, including scaling to zero during off-peak hours, are essential to manage expenses. Finally, the current implementation lacks robust support for GPU acceleration, which would be a blocker for any tests involving heavy UI rendering or animations.