The operational brief was deceptively simple: manage the lifecycle of 15,000 single-purpose Android devices deployed globally in logistics hubs. The reality was a recurring nightmare of failed over-the-air (OTA) updates. A faulty system component update could soft-brick hundreds of devices in a remote warehouse, triggering expensive manual interventions—what the team grimly called “truck rolls.” Our previous “big bang” deployment strategy, pushing updates to all devices simultaneously, was a high-stakes gamble we kept losing. We needed a system that treated our Android fleet not like consumer phones, but like servers in a datacenter: managed, monitored, and updated with surgical precision.
Our initial attempt to solve this involved building a bespoke management solution, which quickly devolved into a mess of shell scripts and fragile state management. The core problem was idempotency and state convergence. We were reinventing configuration management, poorly. This is where the first unorthodox decision was made: treat the Android devices as Puppet nodes. This was met with skepticism. Puppet is for servers, not Android tablets. But the principles are the same. We needed to define a target state—a specific set of system properties, installed packages, and configuration files—and have an agent enforce it.
The first hurdle was getting a Puppet agent running on Android. A full-blown Ruby-based agent was a non-starter due to resource constraints. Instead, we built a lightweight, native Android service in Kotlin that acted as a client for a Puppet master. It fetched a compiled catalog (in JSON format) and used Android’s own tools (pm, am, settings, setprop) to apply the desired state. This wasn’t a true Puppet agent, but it implemented the core resource application logic.
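The heart of that service is a small catalog-application loop. The sketch below is illustrative rather than our production code: the resource schema, field names, and the injected run function are assumptions, and it presumes a rooted device whose system image lets the service exec pm, am, getprop, and setprop directly:
// CatalogApplier.kt (illustrative sketch, not the production agent)
data class CatalogResource(val type: String, val name: String, val params: Map<String, String>)

class CatalogApplier(private val run: (List<String>) -> String) {
    // Applies a single resource from the compiled catalog, checking current
    // state first so repeated runs converge without redundant side effects.
    fun apply(resource: CatalogResource) {
        when (resource.type) {
            "package" -> {
                // Only (re)install when the desired versionName is not already present.
                val dump = run(listOf("dumpsys", "package", resource.name))
                if (!dump.contains("versionName=${resource.params.getValue("version")}")) {
                    run(listOf("pm", "install", "-r", resource.params.getValue("apk_path")))
                }
            }
            "service" -> run(listOf("am", "startservice", "-n", resource.name))
            "setprop" -> {
                // Idempotent property write: skip when the value already matches.
                if (run(listOf("getprop", resource.name)).trim() != resource.params.getValue("value")) {
                    run(listOf("setprop", resource.name, resource.params.getValue("value")))
                }
            }
            else -> error("Unsupported resource type: ${resource.type}")
        }
    }
}
Injecting run (a thin wrapper around ProcessBuilder, for example) keeps the applier unit-testable without a device.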
A manifest for a target device looked like this. It ensures our core logistics application is installed at a specific version, a critical background service is enabled, and SELinux is set to permissive for our diagnostics tools.
# modules/android_device/manifests/logistics_terminal.pp
class android_device::logistics_terminal (
String $app_package_name = 'com.logistics.terminal',
String $app_apk_source = 'https://artifactory.internal/repo/apks/terminal-v2.1.3.apk',
String $service_name = 'com.logistics.telemetry.TelemetryService',
String $selinux_mode = 'permissive',
) {
# Ensure the core logistics application package is present at a specific version
# The 'package' resource type is implemented by our custom Android provider.
package { $app_package_name:
ensure => '2.1.3',
provider => 'android_pm',
source => $app_apk_source,
notify => Service[$service_name],
}
# Ensure the background telemetry service is always enabled.
# If the package is updated, this service is restarted.
service { $service_name:
ensure => 'running',
enable => true,
provider => 'android_am',
}
# A custom resource to manage Android system properties using 'setprop'.
# This is critical for low-level configuration.
# A pitfall here is that many props require root access and are not persistent across reboots.
# We had to modify the device's init.rc for properties that needed to persist.
android_setprop { 'persist.sys.selinux':
value => $selinux_mode,
}
# Manage a critical configuration file pushed to the device's storage.
file { '/sdcard/config/settings.json':
ensure => file,
owner => 'shell',
group => 'shell',
mode => '0644',
content => template('android_device/settings.json.erb'),
}
}
This gave us reliable, idempotent configuration management. But it didn’t solve the deployment risk. For that, we needed canary releases. This led us to Spinnaker. Spinnaker’s heritage is in deploying to cloud environments (AWS, GCP), not physical Android devices. We had to abstract our fleet into a concept Spinnaker could understand. We created a custom “Cloud Driver” for Spinnaker that treated our device inventory database as a source of server groups. A “server group” in our context was simply a collection of device serial numbers, grouped by geography, hardware model, or a special canary tag.
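Stripped of Spinnaker’s provider interfaces, the driver’s core job is a mapping from inventory rows to named server groups, something like the sketch below (the field names and naming scheme are assumptions, not the real driver code):
// Illustrative inventory-to-server-group mapping (not the actual clouddriver code).
data class Device(val serial: String, val region: String, val model: String, val tags: Set<String>)

data class DeviceServerGroup(val name: String, val serials: List<String>)

fun toServerGroups(inventory: List<Device>): List<DeviceServerGroup> {
    // Devices carrying the canary tag form their own group; everything else is
    // grouped by geography and hardware model.
    val (canary, rest) = inventory.partition { "canary" in it.tags }
    val canaryGroup = DeviceServerGroup("android-canary", canary.map { it.serial })
    val regionalGroups = rest
        .groupBy { "android-${it.region}-${it.model}" }
        .map { (name, devices) -> DeviceServerGroup(name, devices.map { it.serial }) }
    return listOf(canaryGroup) + regionalGroups
}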
With Spinnaker in place, we could design a deployment pipeline. But a canary release is useless without automated judgment. How do we know if the canary group is healthy? Relying on users to report issues was not an option. We needed hard data, which brought Prometheus into the picture.
We developed a minimal, native Prometheus exporter as an Android background service. It was written in Go and cross-compiled for ARM64, resulting in a tiny, efficient binary. It exposed critical device metrics over a local HTTP endpoint, which was then scraped by a Prometheus server within our internal network (devices were connected via VPN).
The exporter’s code focused on stability and minimal resource usage. A common mistake is to perform expensive operations synchronously within the metrics handler. Cheap reads (such as /proc lookups) happen inline in the Collect method, while slower collections run in a background goroutine that updates metrics in memory on a fixed interval, so the scrape endpoint always returns promptly.
// cmd/android-exporter/main.go
package main
import (
	"log"
	"net/http"
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)
var (
// Using a registry specific to our collector avoids metric collisions.
reg = prometheus.NewRegistry()
// Global lock to prevent race conditions on metric updates
// from the collector goroutine and reads from the HTTP handler.
// While prometheus client is thread-safe, our collection logic might not be.
mu = &sync.Mutex{}
)
// DeviceCollector implements the prometheus.Collector interface.
type DeviceCollector struct {
// Gauges are used for values that can go up or down.
cpuUsage *prometheus.GaugeVec
memoryUsage *prometheus.GaugeVec
batteryLevel prometheus.Gauge
// Counters are for values that only increase.
appCrashes *prometheus.Desc
kernelPanics *prometheus.Desc
}
// newDeviceCollector is our constructor.
func newDeviceCollector() *DeviceCollector {
return &DeviceCollector{
cpuUsage: prometheus.NewGaugeVec(prometheus.GaugeOpts{
Name: "android_cpu_usage_percent",
Help: "Current CPU usage of the device.",
}, []string{"core"}),
memoryUsage: prometheus.NewGaugeVec(prometheus.GaugeOpts{
Name: "android_memory_usage_bytes",
Help: "Memory usage statistics.",
}, []string{"type"}), // types: total, used, free
batteryLevel: prometheus.NewGauge(prometheus.GaugeOpts{
Name: "android_battery_level_percent",
Help: "Current battery level.",
}),
appCrashes: prometheus.NewDesc(
"android_app_crashes_total",
"Total number of crashes for the primary application since boot.",
[]string{"package_name"},
nil,
),
kernelPanics: prometheus.NewDesc(
"android_kernel_panics_total",
"Total number of kernel panics detected in dmesg.",
nil,
nil,
),
}
}
// Describe sends the static descriptions of all metrics to the provided channel.
func (c *DeviceCollector) Describe(ch chan<- *prometheus.Desc) {
c.cpuUsage.Describe(ch)
c.memoryUsage.Describe(ch)
c.batteryLevel.Describe(ch)
ch <- c.appCrashes
ch <- c.kernelPanics
}
// Collect is the core method where we gather the metrics.
// This is called by the Prometheus registry on every scrape.
// In a real-world project, the functions like 'readCPUUsage' would execute shell commands
// like 'top' or read from '/proc' filesystem. Error handling is critical.
func (c *DeviceCollector) Collect(ch chan<- prometheus.Metric) {
mu.Lock()
defer mu.Unlock()
// These functions would contain the platform-specific logic
// to get the actual data from the Android OS.
cpu, err := readCPUUsage() // returns map[string]float64
if err != nil {
log.Printf("ERROR: Failed to read CPU usage: %v", err)
} else {
for core, usage := range cpu {
c.cpuUsage.WithLabelValues(core).Set(usage)
}
}
mem, err := readMemoryInfo() // returns map[string]float64
if err != nil {
log.Printf("ERROR: Failed to read memory info: %v", err)
} else {
for key, val := range mem {
c.memoryUsage.WithLabelValues(key).Set(val)
}
}
	// Crash and kernel-panic counters are emitted directly as const metrics
	// because they are cheap to read on demand.
	ch <- prometheus.MustNewConstMetric(c.appCrashes, prometheus.CounterValue, getAppCrashCount("com.logistics.terminal"), "com.logistics.terminal")
	ch <- prometheus.MustNewConstMetric(c.kernelPanics, prometheus.CounterValue, getKernelPanicCount())
	// cpuUsage and memoryUsage were refreshed above; batteryLevel is kept up to
	// date by the background goroutine in main(). Emit their current values.
	c.cpuUsage.Collect(ch)
	c.memoryUsage.Collect(ch)
	c.batteryLevel.Collect(ch)
}
func main() {
collector := newDeviceCollector()
reg.MustRegister(collector)
// Background goroutine to update expensive or slow metrics periodically.
// This prevents scrapes from timing out if metric collection is slow.
go func() {
for {
// In a real implementation, this function would call Android APIs
// or shell out to get the battery level.
level, err := getBatteryLevel()
if err != nil {
log.Printf("WARN: could not get battery level: %v", err)
} else {
mu.Lock()
collector.batteryLevel.Set(level)
mu.Unlock()
}
time.Sleep(30 * time.Second)
}
}()
http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
log.Println("Starting Android Prometheus exporter on :9101")
log.Fatal(http.ListenAndServe(":9101", nil))
}
// Mocked data-gathering functions for demonstration.
func getBatteryLevel() (float64, error) { /* ... implementation ... */ return 87.5, nil }
func readCPUUsage() (map[string]float64, error) { /* ... */ return map[string]float64{"core0": 12.3, "core1": 25.0}, nil }
func readMemoryInfo() (map[string]float64, error) { /* ... */ return map[string]float64{"total": 4e9, "used": 1.5e9}, nil }
func getAppCrashCount(pkg string) float64 { /* ... check logcat ... */ return 0 }
func getKernelPanicCount() float64 { /* ... check dmesg/pstore ... */ return 0 }
With Puppet managing state and Prometheus providing metrics, the final piece was defining “health” in a way that was both unambiguous and automated. This is where Behavior-Driven Development (BDD) became unexpectedly crucial. We used Gherkin to write acceptance criteria for a successful update. This wasn’t for application feature testing; it was for operational health validation.
A feature file for our canary analysis looked like this:
# features/operational_health.feature
Feature: Post-Update Operational Health Validation
As an Operations Engineer
I want to verify the core functionality of a device after a system update
So that I can confidently promote the update to the entire fleet
Scenario: Verify core logistics application functionality and system stability
Given a device with serial number "CANARY_DEVICE_SN" has received the canary update
When I check the device's operational state after 15 minutes
Then the "com.logistics.terminal" application should be running
And the device should have a network round-trip-time less than 150 ms to "core.internal.api"
And the Prometheus metric "android_app_crashes_total{package_name='com.logistics.terminal'}" should have increased by at most 0
And the device should not have registered any new kernel panics since the update
These Gherkin steps were backed by step definitions (written in Python, running on a Jenkins worker) that performed the actual checks. They would SSH into the device, run ping commands, and query the Prometheus API for specific metrics related to that device.
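The crash-count assertion, for example, boils down to an instant query against the Prometheus HTTP API, captured once before the soak period and once after. Our step definitions do this from Python; the sketch below shows the equivalent call (host, label names, and the regex-based parsing are simplifications):
// Illustrative Prometheus instant query for one device's crash counter.
import java.net.HttpURLConnection
import java.net.URL
import java.net.URLEncoder

fun crashCount(prometheusBase: String, deviceSn: String, pkg: String): Double? {
    val promql = """android_app_crashes_total{package_name="$pkg",device_sn="$deviceSn"}"""
    val url = URL("$prometheusBase/api/v1/query?query=" + URLEncoder.encode(promql, "UTF-8"))
    val body = (url.openConnection() as HttpURLConnection).inputStream.bufferedReader().use { it.readText() }
    // The response is JSON; result[0].value is [timestamp, "<sample>"]. A real step
    // definition would use a JSON parser; a regex keeps this sketch dependency-free.
    return Regex(""""value":\s*\[[^,]+,\s*"([^"]+)"\]""").find(body)
        ?.groupValues?.get(1)
        ?.toDoubleOrNull()
}
The “increased by at most 0” step then simply compares the two readings.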
Finally, we tied everything together in a Spinnaker pipeline using its Automated Canary Analysis (ACA) engine, Kayenta.
The pipeline structure was as follows:
graph TD
    A[Start: Trigger on new APK/IMG artifact] --> B(Deploy Baseline)
    B --> C(Deploy Canary)
    C --> D{Automated Canary Analysis}
    D -- Analysis Period --> E[Run BDD Health Checks]
    E -- Success --> F[Collect Prometheus Metrics]
    F -- Compare --> D
    D -- Score > 95 --> G(Promote to Full Fleet)
    D -- Score <= 95 --> H(Automatic Rollback)
    H --> I[Alert On-Call]
    G --> J[End]
    I --> J
The Kayenta configuration was the heart of the automated judgment. We defined a set of key metrics and their expected behavior. We compared the canary group against the baseline group.
A snippet of the Spinnaker canary config:
{
"name": "android-fleet-canary-config",
"description": "Canary analysis for Android system updates.",
"configVersion": "1.0",
"applications": ["android-fleet-manager"],
"judge": {
"name": "NetflixACAJudge-v1.0",
"judgeConfigurations": {
"lookbackMins": 30
}
},
"metrics": [
{
"name": "AppCrashes",
"query": {
"type": "prometheus",
"serviceType": "prometheus",
"query": "rate(android_app_crashes_total{package_name='com.logistics.terminal',device_sn=~'${scope}'}[5m])"
},
"analysisConfigurations": {
"canary": {
"direction": "decrease",
"nanStrategy": "replace",
"critical": true
}
},
"scopeName": "default"
},
{
"name": "CPU_Usage_P95",
"query": {
"type": "prometheus",
"query": "histogram_quantile(0.95, sum(rate(android_cpu_usage_seconds_total{device_sn=~'${scope}'}[5m])) by (le))"
},
"analysisConfigurations": {
"canary": {
"direction": "decrease",
"mustBeLessThan": 1.5
}
},
"scopeName": "default"
},
{
"name": "BDD_Health_Check_Success",
"query": {
"type": "prometheus",
"query": "bdd_test_run_success{pipeline_id='${pipelineId}'}"
},
"analysisConfigurations": {
"canary": {
"direction": "increase",
"critical": true,
"mustBe": 1.0
}
},
"scopeName": "default"
}
],
"classifier": {
"groupWeights": {
"Errors": 60,
"Performance": 40
},
"scoreThresholds": {
"pass": 95,
"marginal": 75
}
}
}
The BDD_Health_Check_Success metric is a clever trick. The Jenkins job running our BDD tests would push a metric to a Prometheus Pushgateway upon success (bdd_test_run_success{pipeline_id="...",job="bdd_validator"} 1) or failure (... 0). This allowed us to integrate a binary pass/fail signal directly into Kayenta’s numerical scoring model, making it a critical, non-negotiable part of the analysis.
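The push itself is nothing more than an HTTP PUT of one sample, in the Prometheus text format, to the Pushgateway’s grouping URL. Our Jenkins job does this from Python; the sketch below shows the equivalent call (the Pushgateway address and error handling are assumptions):
// Illustrative push of the pass/fail signal to the Pushgateway.
import java.net.HttpURLConnection
import java.net.URL

fun pushBddResult(pushgatewayBase: String, pipelineId: String, success: Boolean) {
    // The grouping key (job=bdd_validator) lives in the URL path; the sample body
    // carries the pipeline_id label that the Kayenta query later selects on.
    val url = URL("$pushgatewayBase/metrics/job/bdd_validator")
    val sample = "bdd_test_run_success{pipeline_id=\"$pipelineId\"} ${if (success) 1 else 0}\n"
    val conn = url.openConnection() as HttpURLConnection
    conn.requestMethod = "PUT" // PUT replaces every metric in this grouping key
    conn.doOutput = true
    conn.setRequestProperty("Content-Type", "text/plain; version=0.0.4")
    conn.outputStream.use { it.write(sample.toByteArray()) }
    check(conn.responseCode in 200..299) { "Pushgateway returned HTTP ${conn.responseCode}" }
}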
The first time this system caught a bad release was a milestone. A minor change in a native library caused a memory leak that was only visible under sustained load. After 45 minutes in the canary phase, the android_memory_usage_bytes{type='used'} metric for the canary group began to deviate significantly from the baseline. The canary score dropped below the threshold, the pipeline automatically triggered a rollback (using Puppet to enforce the previous version’s manifest), and an alert was fired. Not a single production device outside the 20-unit canary group was affected. There was no late-night panic, no customer impact, and most importantly, no truck rolls.
This architecture is not without its own complexities. The custom Spinnaker driver requires ongoing maintenance. The Prometheus exporter consumes a non-zero amount of battery and CPU, a constant trade-off on resource-constrained devices. Furthermore, the security posture of running a Puppet-like agent and a web server on these devices required significant hardening, including strict firewall rules and mTLS communication back to our internal services. The tooling chain is intricate, and onboarding a new engineer requires a deep understanding of how these disparate systems are stitched together to serve a very specific purpose.