The transition from a simple Docker-in-Docker CI setup to a production-grade, secure software factory is fraught with non-obvious challenges. Our initial pipeline, running on Alibaba Cloud’s ACK (Container Service for Kubernetes), was a constant source of friction. Builds were monolithic and slow, especially within our growing TypeScript monorepo. A change in one small service triggered a full, uncached rebuild of the entire dependency tree, pushing pipeline times past 15 minutes. More critically, our security posture was weak; we had no mechanism for generating Software Bill of Materials (SBOMs) or cryptographically signing our container images, leaving our software supply chain vulnerable. The mandate was clear: create a new build process that is fast, daemonless for better security and resource isolation on our CI nodes, and produces verifiable, signed artifacts.
Our initial solution was to simply replace the `docker build` command with `buildah bud`. This offered a daemonless execution path but failed to address the core performance problem. Buildah's caching, like Docker's, is effective for `Containerfile` changes but doesn't intelligently handle changes in the source code before it is copied into the image. For a monorepo using `pnpm`, where a single lockfile governs a massive `node_modules` directory, this meant any dependency change invalidated the entire layer, forcing a lengthy reinstall.
This led to the realization that a declarative `Containerfile` was insufficient. We needed programmatic control over the build process to create more granular caching layers and to inject security steps like testing and signing. This is where Buildah's command-line primitives (`buildah from`, `buildah mount`, `buildah copy`, `buildah commit`) shine over its `bud` command. We decided to orchestrate the entire build within a shell script, giving us the fine-grained control necessary to solve our caching woes and integrate our security tooling.
The final architecture centers on a single `build.sh` script, executed within a custom CI runner image. The script runs `Vitest` as a quality gate, drives Buildah's command-line primitives for a multi-stage, cache-optimized image build, and integrates `syft` and `cosign` to generate and attach supply chain attestations. The whole process runs on Alibaba Cloud: ACK hosts the runners, ACR Enterprise Edition stores the OCI artifacts (images, signatures, SBOMs), and KMS manages the signing keys.
graph TD
    subgraph Alibaba Cloud CI/CD Pipeline
        A[Git Push to Monorepo] --> B{Trigger Pipeline on ACK Runner};
        B --> C[Execute build.sh];
        subgraph "build.sh Execution Flow"
            C --> D{1. Authenticate to ACR EE};
            D --> E{2. Run Quality Gate: `pnpm vitest run`};
            E -- Fails --> F[Abort Build];
            E -- Succeeds --> G{3. Build Dependencies Layer};
            G --> H{4. Build Application Layer};
            H --> I{5. Assemble Final Lean Image};
            I --> J{6. Push Image to ACR EE};
            J --> K{7. Generate and Attach SBOM with Syft};
            K --> L{8. Sign Image with Cosign + KMS};
        end
        L --> M[9. Verify Signature from ACR EE];
    end
The first step was designing the CI runner environment itself. This container needs all the necessary tools pre-installed to avoid downloading them on every run.
# Containerfile for the CI runner image
# This image contains all tools needed to execute our build script.
ARG ALIYUN_MIRROR=registry.aliyuncs.com
FROM ${ALIYUN_MIRROR}/ubuntu:22.04
# Base tooling; on Ubuntu 22.04, buildah and skopeo are available directly
# from the default repositories (the old projectatomic PPA is discontinued
# and only ever supported releases up to 18.04).
RUN apt-get update && \
apt-get install -y --no-install-recommends \
ca-certificates \
curl \
git \
gnupg \
wget \
buildah \
skopeo \
fuse-overlayfs && \
rm -rf /var/lib/apt/lists/*
# Install Node.js, pnpm (our package manager)
RUN curl -fsSL https://deb.nodesource.com/setup_18.x | bash - && \
apt-get install -y nodejs && \
npm install -g pnpm
# Install Cosign for image signing
RUN wget "https://github.com/sigstore/cosign/releases/download/v2.2.1/cosign-linux-amd64" -O /usr/local/bin/cosign && \
chmod +x /usr/local/bin/cosign
# Install Syft for SBOM generation
RUN curl -sSfL https://raw.githubusercontent.com/anchore/syft/main/install.sh | sh -s -- -b /usr/local/bin
# Set up storage for Buildah
RUN sed -i 's/#mount_program/mount_program/' /etc/containers/storage.conf && \
sed -i 's|#mountopt = "nodev,fsync=0"|mountopt = "nodev,fsync=0"|' /etc/containers/storage.conf
ENV LANG=C.UTF-8
WORKDIR /workspace
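Bootstrapping the runner is a one-time step: the image can be built with `buildah bud` (it is a plain Containerfile) and pushed to ACR so ACK can pull it. A sketch, with a hypothetical repository name:

# Build and publish the CI runner image itself (repository name is illustrative).
buildah bud -t "my-corp-registry.cn-hangzhou.cr.aliyuncs.com/my-corp-namespace/ci-runner:latest" -f Containerfile .
buildah push "my-corp-registry.cn-hangzhou.cr.aliyuncs.com/my-corp-namespace/ci-runner:latest"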
With the runner defined, the core logic resides in our `build.sh` script. It is designed to be idempotent and robust, with strict error handling.
#!/usr/bin/env bash
set -euo pipefail
# ==============================================================================
# Secure, Cache-Optimized Container Build Script
#
# This script orchestrates a multi-stage, daemonless container build process.
# Key features:
# 1. Quality Gate: Runs Vitest unit tests before proceeding.
# 2. Granular Caching: Creates separate layers for dependencies and source code
# to optimize rebuilds in a monorepo context.
# 3. Security: Generates an SBOM with Syft and signs the final image using
# Cosign with a key from Alibaba Cloud KMS.
# 4. Cloud Integration: Authenticates with Alibaba Cloud ACR using RAM roles.
# ==============================================================================
# --- Configuration ---
# These would typically be passed as environment variables in a CI system.
# ACR Enterprise Edition instance endpoint
readonly REGISTRY_HOST="my-corp-registry.cn-hangzhou.cr.aliyuncs.com"
readonly REGISTRY_NAMESPACE="my-corp-namespace"
readonly APP_NAME="my-app"
readonly GIT_COMMIT_SHA="${CI_COMMIT_SHA:-$(git rev-parse --short HEAD)}"
readonly BASE_IMAGE="docker.io/library/node:18-alpine"
# The final base must include a Node.js runtime; chainguard/static has no node binary.
readonly FINAL_IMAGE_BASE="cgr.dev/chainguard/node:latest"
readonly TARGET_IMAGE_TAG="${REGISTRY_HOST}/${REGISTRY_NAMESPACE}/${APP_NAME}:${GIT_COMMIT_SHA}"
readonly LATEST_IMAGE_TAG="${REGISTRY_HOST}/${REGISTRY_NAMESPACE}/${APP_NAME}:latest"
readonly SBOM_FILE_PATH="/tmp/${APP_NAME}-sbom.spdx.json"
readonly OCI_ARCHIVE_PATH="/tmp/${APP_NAME}-oci.tar"
# --- Helper Functions ---
log() {
echo "[BUILD_SCRIPT] [$(date +'%Y-%m-%dT%H:%M:%S%z')] - $1"
}
cleanup() {
log "Performing cleanup..."
# buildah rm --all might fail if there are no containers, so we ignore errors.
buildah rm --all > /dev/null 2>&1 || true
rm -f "${SBOM_FILE_PATH}" "${OCI_ARCHIVE_PATH}" /tmp/image-digest
log "Cleanup complete."
}
# Ensure cleanup runs on script exit or interruption
trap cleanup EXIT INT TERM
# --- Main Execution ---
# 1. Authentication
# In a production CI/CD on Alibaba Cloud, this would use OIDC or a RAM role
# associated with the ACK node or pod. For local execution, `buildah login` works.
log "Step 1: Authenticating with container registry..."
# Pass the password on stdin so it does not leak into the process list.
echo "${ACR_PASSWORD}" | buildah login -u "${ACR_USER}" --password-stdin "${REGISTRY_HOST}"
log "Authentication successful."
# 2. Quality Gate: Unit Tests
log "Step 2: Running quality gate (Vitest)..."
# The --passWithNoTests flag is crucial for services that might not have tests yet.
if ! pnpm vitest run --coverage=false --passWithNoTests; then
log "ERROR: Vitest tests failed. Aborting build."
exit 1
fi
log "Vitest tests passed."
# 3. Building the Dependency Layer (The core caching strategy)
log "Step 3: Building dependency layer..."
# We create a hash of the lockfile. This hash becomes the tag for our dependency image.
# If the lockfile hasn't changed, we can pull the existing layer instead of rebuilding.
PNPM_LOCK_HASH=$(sha256sum pnpm-lock.yaml | awk '{print $1}')
DEP_IMAGE_TAG="${REGISTRY_HOST}/${REGISTRY_NAMESPACE}/${APP_NAME}-deps:${PNPM_LOCK_HASH}"
# Check if the dependency image already exists in the registry
# (skopeo inspect works on plain images; buildah manifest inspect only handles manifest lists)
if skopeo inspect "docker://${DEP_IMAGE_TAG}" >/dev/null 2>&1; then
  log "Found existing dependency layer in registry: ${DEP_IMAGE_TAG}"
  # We use this existing image as the base for the next stage.
  BUILDER_BASE_IMAGE="${DEP_IMAGE_TAG}"
else
  log "No matching dependency layer found. Building a new one..."
  # Start from the base image
  builder=$(buildah from "${BASE_IMAGE}")
  # Mount the container's filesystem on the host
  mount_point=$(buildah mount "${builder}")
  # Copy only the necessary package manifests
  log "Copying package manifests..."
  mkdir -p "${mount_point}/app"
  cp package.json pnpm-lock.yaml pnpm-workspace.yaml "${mount_point}/app/"
  # node:18-alpine does not ship pnpm, so install it first.
  # --frozen-lockfile is a CI best practice: it fails if the lockfile is stale.
  log "Installing dependencies with pnpm..."
  buildah run "${builder}" -- sh -c "npm install -g pnpm && cd /app && pnpm install --frozen-lockfile"
  # Unmount before committing
  buildah unmount "${builder}"
  # Commit this state as our new dependency image and push it for future runs
  log "Committing and pushing dependency image: ${DEP_IMAGE_TAG}"
  buildah commit --squash "${builder}" "${DEP_IMAGE_TAG}"
  buildah push "${DEP_IMAGE_TAG}"
  buildah rm "${builder}"
  BUILDER_BASE_IMAGE="${DEP_IMAGE_TAG}"
fi
# 4. Building the Application Layer
log "Step 4: Building application layer..."
builder_app=$(buildah from "${BUILDER_BASE_IMAGE}")
log "Copying source code..."
# Apply .dockerignore to exclude unnecessary files; buildah copy only honors
# an ignore file when --contextdir is given.
buildah copy --contextdir . --ignorefile .dockerignore "${builder_app}" . /app/
log "Building TypeScript application..."
# This command builds the application, creating artifacts in the 'dist' directory.
buildah run "${builder_app}" -- sh -c "cd /app && pnpm build"
# 5. Assembling the Final Lean Image
log "Step 5: Assembling final production image..."
final_image_container=$(buildah from "${FINAL_IMAGE_BASE}")
log "Copying built artifacts from builder stage..."
# This mimics a multi-stage build by copying from one container to another.
buildah copy --from "${builder_app}" "${final_image_container}" /app/dist /app/
buildah copy --from "${builder_app}" "${final_image_container}" /app/node_modules /app/node_modules
# Configure runtime metadata for the final image
log "Configuring image metadata..."
buildah config \
--author="My Corp <[email protected]>" \
--label "org.opencontainers.image.source=https://github.com/my-corp/my-repo" \
--label "org.opencontainers.image.revision=${GIT_COMMIT_SHA}" \
--port 8080 \
--entrypoint '["node", "/app/index.js"]' \
"${final_image_container}"
log "Committing final application image: ${TARGET_IMAGE_TAG}"
# We use the digest of the commit to ensure we sign the exact image we built
image_digest=$(buildah commit --squash "${final_image_container}" "${TARGET_IMAGE_TAG}")
# 6. Push the Image, then Generate and Attach the SBOM
log "Step 6: Pushing image and generating SBOM with Syft..."
# cosign attaches SBOMs and signatures to references in the registry, so the
# image must be pushed first. --digestfile records the digest of exactly what
# was pushed, letting later steps sign and verify by digest rather than by tag.
buildah push --digestfile /tmp/image-digest "${TARGET_IMAGE_TAG}"
image_digest=$(cat /tmp/image-digest)
readonly IMAGE_REF_BY_DIGEST="${REGISTRY_HOST}/${REGISTRY_NAMESPACE}/${APP_NAME}@${image_digest}"
# Export the committed image to an OCI archive so syft scans exactly what we pushed.
buildah push "${TARGET_IMAGE_TAG}" "oci-archive:${OCI_ARCHIVE_PATH}"
syft "oci-archive:${OCI_ARCHIVE_PATH}" -o spdx-json > "${SBOM_FILE_PATH}"
log "SBOM generated at ${SBOM_FILE_PATH}"
log "Attaching SBOM to image in ACR..."
cosign attach sbom --sbom "${SBOM_FILE_PATH}" "${IMAGE_REF_BY_DIGEST}"
log "SBOM attached successfully."
# 7. Sign the Image with KMS
log "Step 7: Signing the image with Cosign and Alibaba Cloud KMS..."
# The KMS key URI format is specific to the cloud provider.
# Example: alikms://[key-id]?region=[region-id]
# Signing by digest guarantees we sign exactly the image we pushed;
# --yes skips cosign's interactive confirmation prompts in CI.
cosign sign --yes --key "${KMS_KEY_URI}" "${IMAGE_REF_BY_DIGEST}"
log "Image signed successfully."
# cosign has already pushed the signature; tag and push 'latest' for convenience.
buildah tag "${TARGET_IMAGE_TAG}" "${LATEST_IMAGE_TAG}"
buildah push "${LATEST_IMAGE_TAG}"
# 8. Verification (Crucial final step)
log "Step 8: Verifying image signature..."
# The verification step proves the entire chain of trust works: it checks the
# signature in the registry against the public half of the KMS key.
if cosign verify --key "${KMS_KEY_URI}" "${IMAGE_REF_BY_DIGEST}"; then
  log "Verification successful! Image digest: ${image_digest}"
else
  log "ERROR: Image signature verification failed!"
  exit 1
fi
log "Build process completed successfully."
A common pitfall in this approach is mishandling the build context. The `buildah copy` command, when copying from the host, is sensitive to the current working directory and only honors a `.dockerignore` file when one is supplied via `--contextdir`/`--ignorefile`. Forgetting this can bloat the image with development artifacts, build caches, and `.git` history, negating the benefits of a lean final image.
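For our layout, the ignore file looks roughly like this; the entries are illustrative and should be tuned to your repository (written as a heredoc here for completeness):

# Illustrative .dockerignore: the host's node_modules (installed for the Vitest
# gate) and prior build output must never leak into the build context.
cat > .dockerignore <<'EOF'
.git
**/node_modules
**/dist
coverage
*.log
.env*
EOF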
The most significant architectural decision was using a hash of `pnpm-lock.yaml` to tag our dependency layer. This is the linchpin of our caching strategy. In a monorepo, source code for different services changes constantly, but the core dependencies in the lockfile change much less frequently. By externalizing the dependency layer into its own image, we ensure that as long as `pnpm-lock.yaml` is unchanged, the CI pipeline can pull a pre-built, multi-gigabyte `node_modules` layer in seconds instead of rebuilding it, which can take several minutes. This single change reduced our average build time for code-only changes from 8 minutes to under 2 minutes.
The integration with Alibaba Cloud KMS via `cosign` was another critical piece. In a real-world project, managing private keys for signing is a significant security burden. Using a KMS avoids ever exposing the private key material to the CI environment. The runner only needs a RAM role with permission to perform signing operations with a specific key. This provides a hardware-backed, auditable, and easily rotatable signing mechanism. The command `cosign sign --key "alikms://<key-id>?region=<region-id>" <image>` is deceptively simple for the security it provides.
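Consumers of our images don't need KMS access to verify. Cosign can export the public half of the key once, and clusters or developers can verify against the resulting PEM file; a minimal sketch, assuming the exporter has KMS read permission:

# Export the public key once...
cosign public-key --key "${KMS_KEY_URI}" > my-app-signing.pub
# ...then anyone can verify without touching KMS.
cosign verify --key my-app-signing.pub "${TARGET_IMAGE_TAG}"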
This scripted Buildah approach, while more complex than a single `Containerfile`, provides the extensibility that modern DevSecOps practices demand. We've created distinct, controllable stages for testing, dependency management, code compilation, and security attestation. A `Containerfile` can't easily express "fail the build if tests fail" or "create a cache layer based on a file's hash." This programmatic control is the key to unlocking both performance and security in a complex build environment.
The current implementation, however, is not without its limitations. The build cache created by Buildah is local to the filesystem of the CI runner. While our dependency layer caching works around this for the most expensive step, intermediate build layers are still ephemeral. In our ACK cluster, where CI jobs can land on different nodes, this local cache is lost between runs. A potential future iteration is to configure Buildah to use a shared cache backend: an NFS volume mounted to all runners, or a dedicated registry repository for cache blobs via the `--cache-to` and `--cache-from` flags of `buildah build`. This would further reduce cold-start build times.
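For reference, a sketch of what the registry-backed cache could look like, assuming a recent Buildah (1.28+, newer than what Ubuntu 22.04 ships) and a hypothetical `buildcache` repository in ACR; note these flags apply to `buildah build`, so this path only helps the declarative portions of a pipeline:

# Shared-cache invocation; --layers must be enabled for the cache flags to apply.
buildah build --layers \
  --cache-from "${REGISTRY_HOST}/${REGISTRY_NAMESPACE}/buildcache" \
  --cache-to "${REGISTRY_HOST}/${REGISTRY_NAMESPACE}/buildcache" \
  -t "${TARGET_IMAGE_TAG}" -f Containerfile .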
Furthermore, the build script itself has become a critical piece of infrastructure. Its complexity requires its own testing and maintenance discipline; a mistake in the script could have wide-reaching consequences. This represents a trade-off: we've exchanged the simplicity of a declarative `Containerfile` for the power and performance of an imperative build script. For our use case, where build times and supply chain security are paramount, this trade-off is not just acceptable but necessary.
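A cheap first step toward that discipline is to gate changes to the script the same way we gate application code; a minimal sketch, assuming shellcheck is installed on the runner:

# Quick syntax check plus static analysis before the script is trusted to build.
bash -n build.sh
shellcheck --severity=warning build.sh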