The fragmentation of deployment pipelines is a significant source of friction in engineering organizations. Backend teams standardize on Kubernetes manifests and GitOps, while mobile teams operate within the ecosystems of Fastlane or Gradle, often on dedicated macOS or Linux virtual machines. Progressive Web App (PWA) deployments, particularly the critical Service Worker lifecycle, are frequently managed by ad-hoc shell scripts. This separation creates operational silos, complicates release coordination, and makes end-to-end observability nearly impossible. The core technical problem is not the individual pipelines, but the lack of a unified control plane to manage, orchestrate, and visualize these disparate workflows. A developer should be able to track a single feature from backend microservice deployment to mobile app release and PWA update from a single interface, like a Kanban board.
Solution A: The Federated Tooling Approach
One immediate path is to treat this as an integration problem. We could maintain our best-of-breed, domain-specific tooling—Jenkins for mobile builds, ArgoCD for Kubernetes deployments, and custom GitHub Actions for PWA assets—and build a meta-layer on top. This orchestration layer would trigger jobs via API calls and poll for their status.
- Pros: This approach minimizes disruption to existing teams. It leverages established expertise and tooling. The initial implementation cost for the orchestration layer appears lower, as it’s primarily composed of API clients and state-tracking logic.
- Cons: In practice, this architecture becomes brittle. API-based integrations are imperative, not declarative, and lack transactional guarantees. A failure in the orchestration layer can leave downstream systems in an indeterminate state. Reconciling different authentication and authorization models (Jenkins API keys, AWS IAM roles, GitHub tokens) creates a significant security and management burden. Most importantly, it’s impossible to define complex dependencies declaratively, such as “only build the mobile app if the backend API deployment to canary succeeds and the PWA Service Worker update is confirmed active.”
Solution B: The Unified Kubernetes-Native Control Plane
A more robust, albeit more intensive, approach is to build a unified control plane on a single orchestration standard: Kubernetes. By using AWS EKS as the central managed orchestrator, we can model every stage of every pipeline—even those executing on external infrastructure like Alibaba Cloud—as a Kubernetes-native resource. Tools like Argo Workflows or Tekton allow us to define complex, polyglot pipelines as declarative YAML.
- Pros: This provides a single source of truth and a consistent, declarative API for all delivery processes. We gain unified logging, monitoring, and security through the Kubernetes ecosystem. Cross-domain dependencies become first-class citizens in our workflow definitions.
- Cons: The initial investment is substantial. It requires deep expertise in Kubernetes, including the development of custom controllers or complex workflow templates. We take on the operational burden of maintaining this internal developer platform. It also introduces the risk of creating an abstraction that is too rigid or complex for teams to adopt easily.
After evaluating the trade-offs, the decision was made to pursue Solution B. The long-term benefits of a stable, observable, and declarative system outweigh the initial engineering cost. It directly addresses the root cause of fragmentation rather than patching over it with a fragile integration layer. AWS EKS was chosen for its maturity as a managed control plane, while Alibaba Cloud ECS instances are leveraged for specialized, cost-effective Android build and test environments, creating a pragmatic hybrid-cloud architecture.
Core Architecture and Implementation
The system’s core is an Argo Workflows instance running on an EKS cluster. A WorkflowTemplate defines the Directed Acyclic Graph (DAG) for a complete release, encompassing backend, PWA, and mobile artifacts. The status of each step is propagated via a small, custom controller to an external Kanban API, providing real-time visualization for developers.
graph TD
    subgraph AWS EKS Control Plane
        A[Git Commit Trigger] --> B{Argo Workflows};
        B --> C{Workflow Controller};
        C --> D[Kanban API Update Pod];
        D -- "POST /card/move" --> E[External Kanban Board];
    end
    subgraph Workflow Execution
        B --> F[Backend Deploy Step: kubectl apply];
        B --> G[PWA Deploy Step: Service Worker Update];
        B --> H[Mobile Build Step: Remote Trigger];
        F -- Runs on --> I[EKS Worker Nodes];
        G -- Runs on --> I;
        H -- "Triggers job via API" --> J;
    end
    subgraph Alibaba Cloud
        J[ECS Instance: Android Build Farm];
    end
    I -- "Secure Connect (Site-to-Site VPN)" --> J;
    J -- "Uploads .apk" --> K[Alibaba Cloud OSS];
    G -- "Uploads sw.js" --> L[AWS S3 / CloudFront];
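In the diagram, the "Kanban API Update Pod" boils down to one idempotent HTTP call per state transition. A minimal sketch of that call, assuming a hypothetical KANBAN_API_URL endpoint and bearer token injected through the pod's environment (the real payload shape depends on the Kanban vendor):

#!/bin/sh
# kanban-update.sh - sketch of the status-propagation call made by the update pod.
# KANBAN_API_URL, KANBAN_TOKEN, and the /card/move payload shape are assumptions for illustration.
set -eu
: "${KANBAN_API_URL:?set to the Kanban service base URL}"
: "${KANBAN_TOKEN:?set to an API token}"
CARD_ID="${1:?usage: kanban-update.sh <card-id> <column>}"
NEW_COLUMN="${2:?usage: kanban-update.sh <card-id> <column>}"
# Retry a few times so a transient network blip does not fail the workflow step.
for attempt in 1 2 3; do
  if curl -s -f -X POST "${KANBAN_API_URL}/card/move" \
      -H "Authorization: Bearer ${KANBAN_TOKEN}" \
      -H "Content-Type: application/json" \
      -d "{\"card\": \"${CARD_ID}\", \"column\": \"${NEW_COLUMN}\"}" > /dev/null; then
    exit 0
  fi
  echo "Kanban update failed (attempt ${attempt}), retrying..." >&2
  sleep 5
done
# Fire-and-forget by design: a board update failure should not fail the release itself.
echo "Giving up on Kanban update for card ${CARD_ID}" >&2
exit 0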
The networking between the AWS EKS cluster and the Alibaba Cloud ECS instances is established via a site-to-site VPN managed with Terraform, ensuring secure communication for the remote job trigger and artifact retrieval.
Here is the Terraform configuration to provision the fundamental hybrid-cloud networking. In a production scenario, this would be more complex, likely involving Transit Gateways and more granular security group rules.
# main.tf - Manages hybrid network infrastructure
# --- AWS Provider Configuration ---
provider "aws" {
region = "us-east-1"
}
# --- Alibaba Cloud Provider Configuration ---
provider "alicloud" {
region = "cn-hangzhou"
access_key = var.alicloud_access_key
secret_key = var.alicloud_secret_key
}
# --- AWS Resources ---
resource "aws_vpc" "main" {
cidr_block = "10.0.0.0/16"
tags = {
Name = "eks-main-vpc"
}
}
resource "aws_customer_gateway" "ali_cgw" {
bgp_asn = 65000
ip_address = alicloud_vpn_gateway.main.internet_ip
type = "ipsec.1"
tags = {
Name = "ali-cloud-cgw"
}
}
resource "aws_vpn_gateway" "main" {
vpc_id = aws_vpc.main.id
tags = {
Name = "eks-vpn-gateway"
}
}
resource "aws_vpn_connection" "to_ali" {
vpn_gateway_id = aws_vpn_gateway.main.id
customer_gateway_id = aws_customer_gateway.ali_cgw.id
type = "ipsec.1"
static_routes_only = false # Use BGP for dynamic routing
tags = {
Name = "vpn-to-alicloud"
}
}
# --- Alibaba Cloud Resources ---
resource "alicloud_vpc" "main" {
vpc_name = "ali-build-vpc"
cidr_block = "192.168.0.0/16"
}
resource "alicloud_vswitch" "main" {
vswitch_name = "build-farm-vsw"
cidr_block = "192.168.1.0/24"
vpc_id = alicloud_vpc.main.id
zone_id = "cn-hangzhou-i"
}
resource "alicloud_vpn_gateway" "main" {
name = "ali-vpn-gw"
vpc_id = alicloud_vpc.main.id
bandwidth = 10 # Mbps
enable_ipsec = true
enable_ssl = false
payment_type = "PayAsYouGo"
vswitch_id = alicloud_vswitch.main.id
}
resource "alicloud_vpn_customer_gateway" "aws_cgw" {
name = "aws-eks-cgw"
ip_address = aws_vpn_connection.to_ali.tunnel1_address
# This IP is known after the AWS VPN connection is created.
# In a real setup, you might need to use outputs or data sources.
}
resource "alicloud_vpn_connection" "to_aws" {
name = "vpn-to-aws"
vpn_gateway_id = alicloud_vpn_gateway.main.id
customer_gateway_id = alicloud_vpn_customer_gateway.aws_cgw.id
local_subnet = ["192.168.0.0/16"]
remote_subnet = ["10.0.0.0/16"]
# In a production environment, use BGP for routing.
# This example uses static routes for simplicity.
effect_immediately = true
ike_config {
  # Must match tunnel1_preshared_key on the AWS VPN connection above.
  psk = var.vpn_preshared_key
}
}
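After terraform apply, it is worth confirming that at least one IPsec tunnel reports UP before routing build traffic through it. A quick sanity check from the AWS side, assuming the AWS CLI is configured (the connection ID below is a placeholder; it would come from Terraform state or an output you define):

# check-vpn.sh - confirm tunnel telemetry for the hybrid link.
VPN_CONNECTION_ID="vpn-0123456789abcdef0"
aws ec2 describe-vpn-connections \
  --vpn-connection-ids "${VPN_CONNECTION_ID}" \
  --query 'VpnConnections[0].VgwTelemetry[*].{Tunnel:OutsideIpAddress,Status:Status}' \
  --output table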
Mobile Build Workflow on a Remote Executor
The most complex part of this integration is executing a mobile build on an external, non-Kubernetes worker. We solve this by having a pod in the EKS workflow that acts as a client, securely calling a small agent running on the Alibaba Cloud ECS build instance.
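The agent on the ECS instance is deliberately thin: it accepts a job over HTTP, runs the Gradle build, and uploads the artifact to OSS. A rough sketch of the per-job script such an agent might invoke is shown below; the repository URL, Gradle task, signing setup, and bucket path are illustrative assumptions, not part of the pipeline described here.

#!/bin/bash
# android-build-job.sh - executed by the build agent on the Alibaba Cloud ECS instance per job.
# REPO_URL and ARTIFACT_BUCKET are hypothetical; the real agent injects these per request.
set -euo pipefail
COMMIT_SHA="${1:?usage: android-build-job.sh <commit-sha>}"
REPO_URL="git@example.com:org/mobile-app.git"
ARTIFACT_BUCKET="oss://my-android-artifacts"
WORKDIR=$(mktemp -d)
trap 'rm -rf "${WORKDIR}"' EXIT

git clone "${REPO_URL}" "${WORKDIR}"
cd "${WORKDIR}"
git checkout "${COMMIT_SHA}"

# Build a release APK; signing credentials are assumed to be provisioned on the build farm.
./gradlew --no-daemon assembleRelease

# Publish the artifact keyed by commit so downstream steps can locate it deterministically.
ossutil64 cp app/build/outputs/apk/release/app-release.apk \
  "${ARTIFACT_BUCKET}/builds/${COMMIT_SHA}/app-release.apk"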
The following Argo WorkflowTemplate defines the multi-stage release process.
# argo-workflow-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
name: unified-release-pipeline
spec:
entrypoint: release-dag
arguments:
parameters:
- name: git-commit-sha
value: "latest"
- name: kanban-card-id
value: "PROJ-123"
templates:
- name: release-dag
dag:
tasks:
- name: backend-deploy
template: deploy-backend-service
arguments:
parameters:
- name: git-commit-sha
value: "{{workflow.parameters.git-commit-sha}}"
- name: kanban-card-id
value: "{{workflow.parameters.kanban-card-id}}"
- name: pwa-deploy
template: deploy-pwa-assets
dependencies: [backend-deploy]
arguments:
parameters:
- name: git-commit-sha
value: "{{workflow.parameters.git-commit-sha}}"
- name: kanban-card-id
value: "{{workflow.parameters.kanban-card-id}}"
- name: android-build
template: trigger-android-build
dependencies: [backend-deploy] # Depends on backend being ready
arguments:
parameters:
- name: git-commit-sha
value: "{{workflow.parameters.git-commit-sha}}"
- name: kanban-card-id
value: "{{workflow.parameters.kanban-card-id}}"
- name: deploy-backend-service
inputs:
parameters:
- name: git-commit-sha
- name: kanban-card-id
container:
image: bitnami/kubectl:latest
command: ["/bin/sh", "-c"]
args:
- |
echo "Updating Kanban card {{inputs.parameters.kanban-card-id}} to 'Deploying Backend'..."
# In a real implementation, this would be an API call
# kubectl port-forward svc/kanban-updater 8080:80 &
# curl -X POST http://localhost:8080/update -d '{"card": "{{inputs.parameters.kanban-card-id}}", "status": "Deploying Backend"}'
echo "Deploying backend service with image tag {{inputs.parameters.git-commit-sha}}"
kubectl apply -k path/to/kustomize/overlay/production
# Production-grade check would involve waiting for rollout status
kubectl rollout status deployment/my-backend-service --timeout=5m
- name: trigger-android-build
inputs:
parameters:
- name: git-commit-sha
- name: kanban-card-id
# This pod acts as a client to the remote build agent
container:
image: curlimages/curl:7.72.0
command: ["/bin/sh", "-c"]
env:
- name: BUILD_AGENT_SECRET
valueFrom:
secretKeyRef:
name: alicloud-build-agent-secret
key: token
args:
- |
set -euo pipefail
echo "Updating Kanban card {{inputs.parameters.kanban-card-id}} to 'Building Android App'..."
# ... API call to Kanban service ...
# Securely trigger the remote build job
# The IP address is the private IP of the ECS instance over the VPN
BUILD_AGENT_URL="http://192.168.1.10:8080/build"
echo "Triggering build for commit {{inputs.parameters.git-commit-sha}}"
# The agent responds with a job ID for polling
JOB_ID=$(curl -s -f -X POST "${BUILD_AGENT_URL}" \
-H "Authorization: Bearer ${BUILD_AGENT_SECRET}" \
-d '{"commit_sha": "{{inputs.parameters.git-commit-sha}}"}')
if [ -z "$JOB_ID" ]; then
echo "Failed to trigger build job."
exit 1
fi
echo "Build job started with ID: ${JOB_ID}. Polling for completion..."
# Poll for status. In a real system, use websockets or a more robust mechanism.
STATUS="PENDING"
while [ "$STATUS" = "PENDING" ] || [ "$STATUS" = "RUNNING" ]; do
sleep 30
STATUS=$(curl -s -f -H "Authorization: Bearer ${BUILD_AGENT_SECRET}" "${BUILD_AGENT_URL}/${JOB_ID}/status")
echo "Current build status: ${STATUS}"
done
if [ "$STATUS" != "SUCCESS" ]; then
echo "Android build failed."
# ... update Kanban card to 'Build Failed' ...
exit 1
fi
echo "Android build successful. Artifacts are available in Alibaba Cloud OSS."
# ... update Kanban card to 'Build Succeeded' ...
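With the template registered in the cluster, a release can be kicked off manually from the Argo CLI, which is handy for testing before wiring up the Git trigger (the namespace and commit SHA below are placeholders):

# Submit a run from the WorkflowTemplate and stream its progress.
argo submit --from workflowtemplate/unified-release-pipeline \
  -n argo \
  -p git-commit-sha=3f2c9ab \
  -p kanban-card-id=PROJ-123 \
  --watch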
Service Worker Lifecycle Management
Deploying a Service Worker is more than just uploading a file; it requires careful versioning and cache invalidation to prevent users from being stuck with stale assets. A naive approach of overwriting sw.js in place is dangerous: an HTTP-cached copy of the old worker can keep serving stale assets long after the deploy. Our pipeline step automates a robust process.
This script would be executed inside the deploy-pwa-assets container step.
#!/bin/bash
# pwa-deploy-script.sh
set -euo pipefail
GIT_COMMIT_SHA=$1
ASSET_BUCKET="my-pwa-assets-s3-bucket"
CDN_DISTRIBUTION_ID="E12345ABCDEF"
ALI_ASSET_BUCKET="my-pwa-assets-oss-bucket-cn"
ALI_CDN_DOMAIN="cdn.my-app.com.cn"
echo "--- Building PWA assets ---"
# Assumes a build tool like Webpack or Rollup generates assets
npm install && npm run build
# --- Service Worker Versioning ---
# Calculate a hash of the service worker's content
SW_FILE="dist/sw.js"
SW_HASH=$(sha256sum "$SW_FILE" | awk '{print $1}')
SW_VERSIONED_FILE="dist/sw.${SW_HASH}.js"
mv "$SW_FILE" "$SW_VERSIONED_FILE"
echo "Versioned service worker: ${SW_VERSIONED_FILE}"
# --- Update Asset Manifest ---
# The web app needs to know the new service worker filename.
# We update a manifest file that the app fetches.
echo "{\"serviceWorker\": \"/sw.${SW_HASH}.js\"}" > dist/asset-manifest.json
echo "Generated asset manifest."
# --- Upload to Primary CDN (AWS S3 + CloudFront) ---
echo "Uploading assets to AWS S3 bucket: ${ASSET_BUCKET}"
# Use --delete to remove old files, but be careful with this in production
aws s3 sync dist/ s3://${ASSET_BUCKET}/ --acl public-read
# Invalidate the CloudFront cache for the root files to force clients to get the new manifest
echo "Invalidating CloudFront cache for critical paths..."
aws cloudfront create-invalidation --distribution-id ${CDN_DISTRIBUTION_ID} --paths "/index.html" "/asset-manifest.json"
# --- Upload to Secondary CDN (Alibaba Cloud OSS) for China region ---
# Assumes 'ossutil64' is configured in the container
echo "Uploading assets to Alibaba Cloud OSS bucket: ${ALI_ASSET_BUCKET}"
ossutil64 cp -r dist/ oss://${ALI_ASSET_BUCKET}/ -u
echo "Purging Alibaba Cloud CDN cache..."
# This requires using the aliyun CLI or SDK
aliyun cdn RefreshObjectCaches --ObjectPath "https://${ALI_CDN_DOMAIN}/index.html" --ObjectType File
aliyun cdn RefreshObjectCaches --ObjectPath "https://${ALI_CDN_DOMAIN}/asset-manifest.json" --ObjectType File
echo "--- PWA Deployment Complete ---"
Limitations and Future Iterations
This unified control plane architecture is powerful but introduces its own set of challenges. The primary limitation is the operational overhead of maintaining the platform itself. The custom controller for Kanban updates, the remote build agent on ECS, and the complex Argo WorkflowTemplates all represent internal software that requires lifecycle management. The hybrid-cloud networking, while functional, adds latency and a potential point of failure; a VPN disruption could halt all mobile build pipelines.
Furthermore, the current Kanban integration is a simple fire-and-forget API call. It lacks transactional integrity with the pipeline steps. A future iteration should employ a more robust event-driven architecture, where the workflow engine emits events to a message queue (like AWS SQS or Apache Kafka), and a separate consumer service is responsible for idempotently updating the Kanban board and other external systems. This decouples the core pipeline logic from its presentation-layer integrations. Finally, the rollback strategy for failed Service Worker deployments needs to be formalized. This could involve maintaining a manifest of the last known good version and implementing a pipeline step that can quickly revert the asset-manifest.json and trigger the necessary cache invalidations.
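As a starting point for that formalization, the previous manifest can be archived on every deploy and restored on demand. A rough sketch, reusing the variables from the deploy script; the asset-manifest.previous.json backup convention is an assumption, not an existing part of the pipeline:

#!/bin/bash
# pwa-rollback.sh - restore the last known good manifest and purge the edge caches.
set -euo pipefail
ASSET_BUCKET="my-pwa-assets-s3-bucket"
CDN_DISTRIBUTION_ID="E12345ABCDEF"
# The deploy step would first archive the current manifest, e.g.:
#   aws s3 cp "s3://${ASSET_BUCKET}/asset-manifest.json" "s3://${ASSET_BUCKET}/asset-manifest.previous.json"
aws s3 cp "s3://${ASSET_BUCKET}/asset-manifest.previous.json" \
  "s3://${ASSET_BUCKET}/asset-manifest.json" --cache-control "no-cache"
aws cloudfront create-invalidation --distribution-id "${CDN_DISTRIBUTION_ID}" \
  --paths "/index.html" "/asset-manifest.json"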