A shared staging environment is a bottleneck. It’s a single point of failure where feature branches collide, data gets corrupted, and QA cycles are blocked. The cost of this friction is immense, manifesting as delayed releases and developer frustration. The objective became clear: create a fully automated system that provisions an isolated, production-like environment for every single pull request. A `git push` on a feature branch must result in a unique, publicly accessible HTTPS URL, with zero manual intervention.
This required orchestrating several components. The core of the problem isn’t just building and deploying an application; it’s dynamically managing the routing, SSL termination, and state for an arbitrary number of these ephemeral environments. After evaluating options, the architecture coalesced around a specific set of tools on Google Cloud Platform, chosen for their unique capabilities in this context.
- Google Cloud Run: Its serverless, scale-to-zero nature is ideal for ephemeral services. We only pay when an environment is actively being tested, and it handles all the underlying infrastructure.
- Google Cloud Build: Native integration with our Git repository provides the trigger mechanism. Its multi-step execution environment is perfect for our build, push, and deploy pipeline.
- esbuild: Build speed is paramount for developer experience. Waiting several minutes for a preview link negates the benefit. esbuild’s sub-second build times for a moderately complex frontend were a deciding factor over other bundlers.
- Cloud SQL for MySQL: We need a persistent, reliable source of truth to track the state of every environment: which commit it’s tied to, its generated Cloud Run URL, its status, and its custom domain. A managed relational database is the correct tool for this job.
- Caddy Server: This is the lynchpin. Caddy’s On-Demand TLS feature is the solution to the biggest challenge: provisioning SSL certificates for potentially thousands of dynamic subdomains (e.g., `pr-1234.my-app.com`, `feat-new-auth.my-app.com`). Managing these manually would be impossible.
The system is composed of three primary services running on GCP:
- The Application: The user-facing service we want to create previews for.
- The Orchestrator: A small Go service that acts as the brain, processing build webhooks and managing GCP resources.
- The Caddy Gateway: A single, public-facing Cloud Run service that acts as a reverse proxy for all preview environments.
The entire process is kicked off by a `git push` to a non-main branch, which triggers a Cloud Build pipeline.
```mermaid
sequenceDiagram
    participant Dev
    participant Git Repo
    participant Cloud Build
    participant esbuild
    participant Artifact Registry
    participant Cloud Run
    participant Orchestrator API
    participant Cloud SQL (MySQL)
    participant Caddy Gateway
    Dev->>Git Repo: git push feature-branch
    Git Repo->>Cloud Build: Trigger webhook
    Cloud Build->>esbuild: 1. Build frontend assets
    esbuild-->>Cloud Build: Return bundled assets
    Cloud Build->>Cloud Build: 2. Build Docker image
    Cloud Build->>Artifact Registry: 3. Push image
    Artifact Registry-->>Cloud Build: Confirm push
    Cloud Build->>Cloud Run: 4. Deploy image as new service (e.g., app-pr-1234)
    Cloud Run-->>Cloud Build: Return service URL (app-pr-1234.a.run.app)
    Cloud Build->>Orchestrator API: 5. POST /notify-deploy with Git info & service URL
    Orchestrator API->>Cloud SQL (MySQL): INSERT/UPDATE environment record
    Orchestrator API->>Caddy Gateway: 6. POST Caddy Admin API to add route
    Caddy Gateway-->>Orchestrator API: Confirm route added
    Orchestrator API-->>Cloud Build: Return 200 OK
    Cloud Build->>Git Repo: (Optional) Post comment on PR with URL
```
Database Schema: The Source of Truth
Before writing any orchestration code, defining the state we need to track is critical. A single MySQL table in Cloud SQL will suffice. The pitfall here is under-specifying this schema, leading to complex logic later. We need to store not just the URL, but enough metadata to manage the environment’s entire lifecycle.
```sql
-- Filename: schema.sql
-- Description: SQL schema for the preview environments state management table.
CREATE TABLE `preview_environments` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
`branch_name` VARCHAR(255) NOT NULL,
`git_commit_sha` VARCHAR(40) NOT NULL,
`cloud_run_service_name` VARCHAR(255) NOT NULL,
`cloud_run_url` VARCHAR(512) NOT NULL,
`custom_domain` VARCHAR(255) NOT NULL,
`status` ENUM('PENDING', 'DEPLOYING', 'ACTIVE', 'DELETING', 'INACTIVE', 'ERROR') NOT NULL DEFAULT 'PENDING',
`created_at` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
`updated_at` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
UNIQUE KEY `uk_branch_name` (`branch_name`),
UNIQUE KEY `uk_custom_domain` (`custom_domain`),
KEY `idx_status` (`status`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
```
- `branch_name`: The unique identifier for an environment. We enforce a one-to-one mapping.
- `cloud_run_service_name`: Essential for later operations, like deleting the service.
- `custom_domain`: The public-facing URL we will configure in Caddy, e.g., `pr-123.my-app.com`.
- `status`: A state machine to track the lifecycle. This is crucial for cleanup jobs and UI representation. A simple boolean `is_active` flag is insufficient for a real-world project. A minimal sketch of the transitions follows this list.
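To make that lifecycle concrete, here is a minimal sketch of the transitions in Go. The `Status` type and `canTransition` helper are illustrative only and are not part of the Orchestrator code shown later.

```go
// Status mirrors the ENUM column in preview_environments.
type Status string

const (
	StatusPending   Status = "PENDING"
	StatusDeploying Status = "DEPLOYING"
	StatusActive    Status = "ACTIVE"
	StatusDeleting  Status = "DELETING"
	StatusInactive  Status = "INACTIVE"
	StatusError     Status = "ERROR"
)

// allowedTransitions encodes the lifecycle: environments move forward through
// deployment, can be re-deployed while active, and end up inactive or errored.
var allowedTransitions = map[Status][]Status{
	StatusPending:   {StatusDeploying, StatusError},
	StatusDeploying: {StatusActive, StatusError},
	StatusActive:    {StatusDeploying, StatusDeleting, StatusError},
	StatusDeleting:  {StatusInactive, StatusError},
}

// canTransition reports whether moving from one status to another is allowed.
func canTransition(from, to Status) bool {
	for _, next := range allowedTransitions[from] {
		if next == to {
			return true
		}
	}
	return false
}
```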
The Build Pipeline: cloudbuild.yaml
This file defines the CI/CD process. It’s executed by Cloud Build and orchestrates the entire build and deployment. A common mistake is to put too much logic into shell scripts called by the pipeline. Instead, we keep the steps declarative and delegate complex logic to dedicated services, like our Orchestrator.
```yaml
# Filename: cloudbuild.yaml
# Description: Cloud Build pipeline for building and deploying a preview environment.
steps:
# Step 1: Install dependencies and build the frontend assets with esbuild
- name: 'node:18'
entrypoint: 'npm'
args: ['install']
id: 'Install Dependencies'
- name: 'node:18'
entrypoint: 'npm'
args: ['run', 'build'] # Assumes "build": "esbuild src/index.js --bundle --outfile=dist/bundle.js" in package.json
id: 'Build with esbuild'
# Step 2: Build the application Docker image
# This Dockerfile should copy the 'dist' directory from the previous step
- name: 'gcr.io/cloud-builders/docker'
args:
- 'build'
- '-t'
- 'us-central1-docker.pkg.dev/$PROJECT_ID/my-app-repo/app-preview:${SHORT_SHA}'
- '.'
id: 'Build Docker Image'
# Step 3: Push the image to Google Artifact Registry
- name: 'gcr.io/cloud-builders/docker'
args:
- 'push'
- 'us-central1-docker.pkg.dev/$PROJECT_ID/my-app-repo/app-preview:${SHORT_SHA}'
id: 'Push to Registry'
# Step 4: Deploy to Cloud Run
# The service name is dynamically generated to be unique per branch.
# gcloud run deploy creates or updates a service; we capture the returned
# service URL in the shared /workspace volume so the next step can read it.
- name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
  entrypoint: 'bash'
  args:
    - '-c'
    - |
      # --no-allow-unauthenticated keeps previews private; the Caddy gateway
      # must present an identity token when proxying to them.
      gcloud run deploy "preview-${_BRANCH_SLUG}" \
        --image="us-central1-docker.pkg.dev/$PROJECT_ID/my-app-repo/app-preview:${SHORT_SHA}" \
        --region=us-central1 \
        --platform=managed \
        --no-allow-unauthenticated \
        --format='value(status.url)' > /workspace/service_url.txt
  id: 'Deploy to Cloud Run'
# Step 5: Notify our Orchestrator service about the successful deployment.
# The Orchestrator URL is passed via a substitution variable. The identity
# token's audience must match the Orchestrator URL for Cloud Run to accept it.
- name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
  entrypoint: 'bash'
  args:
    - '-c'
    - |
      SERVICE_URL="$$(cat /workspace/service_url.txt)"
      TOKEN="$$(gcloud auth print-identity-token --audiences="${_ORCHESTRATOR_URL}")"
      curl -sf -X POST "${_ORCHESTRATOR_URL}/api/v1/environments/notify" \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer $${TOKEN}" \
        --data-raw "{\"branchName\": \"${BRANCH_NAME}\", \"commitSha\": \"${SHORT_SHA}\", \"cloudRunUrl\": \"$${SERVICE_URL}\"}"
  id: 'Notify Orchestrator'
# Available substitutions from the Cloud Build trigger:
# $PROJECT_ID, $SHORT_SHA, $BRANCH_NAME
# Custom substitutions:
# _ORCHESTRATOR_URL: URL of our orchestrator service.
# _BRANCH_SLUG: A sanitized version of the branch name suitable for a service name.
# This sanitization step is crucial and should be done in the trigger configuration or an initial script step.
```
A key detail here is securing the communication between Cloud Build and the Orchestrator. We configure the Orchestrator’s Cloud Run service to accept only authenticated requests, grant the Cloud Build service account the `roles/run.invoker` role on that service, and use `gcloud auth print-identity-token` to have Cloud Build mint an identity token for the request. Never use hardcoded API keys in these pipelines.
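For defense in depth, the Orchestrator can also verify the token itself rather than relying solely on Cloud Run’s IAM check. Below is a minimal sketch using the `google.golang.org/api/idtoken` package; the `EXPECTED_AUDIENCE` environment variable and the `requireIDToken` wrapper are names introduced here for illustration, not part of the pipeline above.

```go
// Filename: auth.go (Orchestrator Service) -- optional, illustrative middleware.
package main

import (
	"net/http"
	"os"
	"strings"

	"google.golang.org/api/idtoken"
)

// requireIDToken rejects requests whose Authorization header does not carry a
// Google-signed identity token minted for the expected audience.
func requireIDToken(next http.HandlerFunc) http.HandlerFunc {
	audience := os.Getenv("EXPECTED_AUDIENCE") // e.g., the Orchestrator's own URL.
	return func(w http.ResponseWriter, r *http.Request) {
		authz := r.Header.Get("Authorization")
		raw := strings.TrimPrefix(authz, "Bearer ")
		if raw == "" || raw == authz {
			http.Error(w, "missing bearer token", http.StatusUnauthorized)
			return
		}
		if _, err := idtoken.Validate(r.Context(), raw, audience); err != nil {
			http.Error(w, "invalid identity token", http.StatusForbidden)
			return
		}
		next(w, r)
	}
}
```

With this in place, registering the route as `http.HandleFunc("/api/v1/environments/notify", requireIDToken(notifyHandler))` would be the only change to `main`.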
The Orchestrator: A Go Microservice
This service is the central controller. It’s a simple stateless API that translates events from the build system into actions against our database and the Caddy Gateway. Using a compiled language like Go gives us a small, fast-starting container, perfect for Cloud Run.
```go
// Filename: main.go (Orchestrator Service)
package main
import (
	"bytes"
	"context"
	"database/sql"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"regexp"
	"strings"

	_ "github.com/go-sql-driver/mysql"
)
var (
db *sql.DB
caddyAdminAPI string // e.g., http://caddy-gateway:2019
appDomain string // e.g., my-app.com
nonAlphanumericRegex = regexp.MustCompile(`[^a-zA-Z0-9]+`)
)
type NotifyPayload struct {
BranchName string `json:"branchName"`
CommitSha string `json:"commitSha"`
CloudRunUrl string `json:"cloudRunUrl"`
}
func main() {
// In a real project, use a proper config library.
dbUser := os.Getenv("DB_USER")
dbPass := os.Getenv("DB_PASS")
dbHost := os.Getenv("DB_HOST") // Cloud SQL proxy handles this.
dbName := os.Getenv("DB_NAME")
caddyAdminAPI = os.Getenv("CADDY_ADMIN_API")
appDomain = os.Getenv("APP_DOMAIN")
var err error
dsn := fmt.Sprintf("%s:%s@tcp(%s)/%s?parseTime=true", dbUser, dbPass, dbHost, dbName)
	db, err = sql.Open("mysql", dsn)
	if err != nil {
		log.Fatalf("could not open database handle: %v", err)
	}
	defer db.Close()
	// sql.Open does not establish a connection; verify connectivity up front.
	if err := db.Ping(); err != nil {
		log.Fatalf("could not connect to database: %v", err)
	}
http.HandleFunc("/api/v1/environments/notify", notifyHandler)
port := os.Getenv("PORT")
if port == "" {
port = "8080"
}
log.Printf("starting orchestrator on port %s", port)
if err := http.ListenAndServe(":"+port, nil); err != nil {
log.Fatalf("server failed: %v", err)
}
}
func notifyHandler(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodPost {
http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
return
}
var payload NotifyPayload
if err := json.NewDecoder(r.Body).Decode(&payload); err != nil {
http.Error(w, "bad request", http.StatusBadRequest)
return
}
// Basic validation
if payload.BranchName == "" || payload.CommitSha == "" || payload.CloudRunUrl == "" {
http.Error(w, "missing fields in payload", http.StatusBadRequest)
return
}
// Sanitize branch name for subdomain and service name use.
slug := sanitizeBranchName(payload.BranchName)
customDomain := fmt.Sprintf("%s.%s", slug, appDomain)
cloudRunServiceName := fmt.Sprintf("preview-%s", slug)
ctx := r.Context()
// Use a transaction to ensure atomicity.
tx, err := db.BeginTx(ctx, nil)
if err != nil {
log.Printf("ERROR starting transaction: %v", err)
http.Error(w, "internal server error", http.StatusInternalServerError)
return
}
defer tx.Rollback() // Rollback on any error.
// Upsert logic: new branch creates, existing branch updates.
stmt := `
INSERT INTO preview_environments (branch_name, git_commit_sha, cloud_run_service_name, cloud_run_url, custom_domain, status)
VALUES (?, ?, ?, ?, ?, 'DEPLOYING')
ON DUPLICATE KEY UPDATE
git_commit_sha = VALUES(git_commit_sha),
cloud_run_url = VALUES(cloud_run_url),
status = VALUES(status)
`
_, err = tx.ExecContext(ctx, stmt, payload.BranchName, payload.CommitSha, cloudRunServiceName, payload.CloudRunUrl, customDomain)
if err != nil {
log.Printf("ERROR upserting environment: %v", err)
http.Error(w, "internal server error", http.StatusInternalServerError)
return
}
// After successfully updating DB, configure Caddy.
if err := configureCaddyRoute(ctx, customDomain, payload.CloudRunUrl); err != nil {
log.Printf("ERROR configuring caddy route: %v", err)
// The transaction will be rolled back, maintaining consistency.
http.Error(w, "failed to configure gateway", http.StatusInternalServerError)
return
}
// Mark as active only after Caddy is configured.
_, err = tx.ExecContext(ctx, "UPDATE preview_environments SET status = 'ACTIVE' WHERE branch_name = ?", payload.BranchName)
if err != nil {
log.Printf("ERROR updating status to active: %v", err)
http.Error(w, "internal server error", http.StatusInternalServerError)
return
}
if err := tx.Commit(); err != nil {
log.Printf("ERROR committing transaction: %v", err)
http.Error(w, "internal server error", http.StatusInternalServerError)
return
}
log.Printf("successfully processed deployment for branch: %s", payload.BranchName)
w.WriteHeader(http.StatusOK)
fmt.Fprintf(w, "ok")
}
func sanitizeBranchName(branch string) string {
// Lowercase, replace non-alphanumeric with hyphen, trim hyphens.
s := strings.ToLower(branch)
s = nonAlphanumericRegex.ReplaceAllString(s, "-")
s = strings.Trim(s, "-")
	// Keep the slug short so "preview-" + slug stays within Cloud Run's service
	// name length limit; re-trim in case truncation leaves a trailing hyphen.
	if len(s) > 40 {
		s = strings.Trim(s[:40], "-")
	}
return s
}
// configureCaddyRoute adds or updates a route in the Caddy Gateway via its Admin API.
func configureCaddyRoute(ctx context.Context, domain, upstreamUrl string) error {
// The upstream URL from Cloud Run is https://... We need to extract the host.
upstreamHost := strings.TrimPrefix(upstreamUrl, "https://")
// Caddy's route object structure.
// The `@id` is critical for idempotent updates.
route := map[string]interface{}{
"@id": domain, // Unique identifier for this route.
"match": []map[string]interface{}{
{
"host": []string{domain},
},
},
"handle": []map[string]interface{}{
{
"handler": "reverse_proxy",
"upstreams": []map[string]interface{}{
{
"dial": upstreamHost + ":443",
},
},
"transport": map[string]interface{}{
"protocol": "http",
"tls": map[string]interface{}{}, // Enable TLS for the backend connection.
},
},
},
}
jsonData, err := json.Marshal(route)
if err != nil {
return fmt.Errorf("failed to marshal Caddy route: %w", err)
}
	// Routes are addressable by their `@id` at /id/<id>. To make re-deploys of the
	// same branch idempotent, delete any existing route for this domain first
	// (a 404 here is fine), then append the fresh route to the server's route list.
	client := &http.Client{}
	deleteURL := fmt.Sprintf("%s/id/%s", caddyAdminAPI, domain)
	if delReq, err := http.NewRequestWithContext(ctx, http.MethodDelete, deleteURL, nil); err == nil {
		if delResp, err := client.Do(delReq); err == nil {
			delResp.Body.Close()
		}
	}
	// POST to the routes array appends a new element to the running config.
	url := fmt.Sprintf("%s/config/apps/http/servers/previews/routes", caddyAdminAPI)
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewBuffer(jsonData))
	if err != nil {
		return fmt.Errorf("failed to create Caddy request: %w", err)
	}
	req.Header.Set("Content-Type", "application/json")
	// In a real-world setup, the Admin API should be secured.
	// Here we assume network-level security (e.g., internal VPC).
	resp, err := client.Do(req)
	if err != nil {
		return fmt.Errorf("failed to send request to Caddy admin: %w", err)
	}
	defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
// Try to read body for better error message.
bodyBytes, _ := io.ReadAll(resp.Body)
return fmt.Errorf("caddy admin API returned non-200 status: %d. body: %s", resp.StatusCode, string(bodyBytes))
}
return nil
}
```
The Caddy Gateway: Dynamic TLS and Routing
This is the most critical and novel part of the architecture. A single Caddy instance, running as a Cloud Run service, will serve as the entry point for all preview domains. Its configuration is surprisingly minimal because the heavy lifting is done by its On-Demand TLS feature and its Admin API.
Here is the initial `Caddyfile` used to bootstrap the gateway service.
```caddyfile
# Filename: Caddyfile
# Description: Initial configuration for the Caddy Gateway service.
{
# This enables the Admin API, which is essential for our dynamic configuration.
# It's crucial to secure this. In Cloud Run, we can make it listen only on a
# specific port and not expose that port publicly.
admin :2019
# Name the HTTPS server "previews" so the Orchestrator can target it at
# /config/apps/http/servers/previews/routes through the Admin API.
servers :443 {
	name previews
}
# Enable On-Demand TLS. This tells Caddy to get certificates for any domain
# it receives a request for, provided it's allowed by the `ask` endpoint.
on_demand_tls {
# The 'ask' endpoint is a security measure to prevent abuse. Caddy will make a GET
# request to this URL before issuing a certificate. Our orchestrator must expose
# an endpoint that verifies the domain is in our database.
ask http://orchestrator-service-url/api/v1/tls/check
}
}
# This site block belongs to the server named 'previews' in the global options above.
# Routes are injected into it via the Admin API at /config/apps/http/servers/previews/routes.
https:// {
# Binds to all hostnames on port 443.
tls {
on_demand
}
# Initial state: no routes.
# We could add a default route here that returns a 404 or a landing page.
}
```
The magic is in the `on_demand_tls` global options block. When a request for `pr-123.my-app.com` first arrives, Caddy will:
- Pause the request.
- Make a GET request to our Orchestrator’s `/api/v1/tls/check?domain=pr-123.my-app.com`.
- Wait for the Orchestrator to check the `preview_environments` table: if a record with that `custom_domain` exists, it returns a `200 OK`; otherwise, a `403 Forbidden`. (A minimal sketch of this handler follows the list.)
- If it receives a `200 OK`, proceed to obtain a certificate from Let’s Encrypt for that domain.
- Resume the original request. The browser sees a valid HTTPS connection.
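The `main.go` shown earlier only registers the notification route, so the `ask` endpoint still needs to exist. Here is a minimal sketch of such a handler in the same package, assuming it is registered in `main` with `http.HandleFunc("/api/v1/tls/check", tlsCheckHandler)`:

```go
// tlsCheckHandler answers Caddy's On-Demand TLS "ask" request. Caddy sends
// GET /api/v1/tls/check?domain=<hostname> before issuing a certificate, and we
// approve only domains that exist in the preview_environments table.
func tlsCheckHandler(w http.ResponseWriter, r *http.Request) {
	domain := r.URL.Query().Get("domain")
	if domain == "" {
		http.Error(w, "missing domain parameter", http.StatusBadRequest)
		return
	}
	var exists bool
	err := db.QueryRowContext(r.Context(),
		"SELECT EXISTS(SELECT 1 FROM preview_environments WHERE custom_domain = ?)",
		domain,
	).Scan(&exists)
	if err != nil {
		log.Printf("ERROR checking domain %q: %v", domain, err)
		http.Error(w, "internal server error", http.StatusInternalServerError)
		return
	}
	if !exists {
		http.Error(w, "unknown domain", http.StatusForbidden)
		return
	}
	w.WriteHeader(http.StatusOK)
}
```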
The reverse proxy logic itself is not in the file; it’s injected by the `configureCaddyRoute` function in our orchestrator. That function dynamically builds a JSON object representing a Caddy route and POSTs it to Caddy’s Admin API. This adds a new route to the running configuration without requiring a restart. The route matches on the hostname (e.g., `pr-123.my-app.com`) and proxies the request to the correct internal Cloud Run URL (e.g., `app-pr-1234.a.run.app`).
The result is a highly dynamic system. The static `Caddyfile` just enables the core features; the actual routing table lives in Caddy’s memory and is managed entirely by our Orchestrator service based on the state in the MySQL database.
Limitations and Future Iterations
This architecture, while powerful, is not without its operational complexities. In a production environment, several areas would need further hardening.
First, the lifecycle management is incomplete. We have a creation path but no automated cleanup. A cron-triggered job is needed to query the Git provider’s API for merged or closed pull requests and then call the Orchestrator to tear down the associated Cloud Run service, delete the Caddy route, and mark the database record as `INACTIVE`. Without this, costs would quickly spiral out of control. A sketch of the teardown half of that job follows.
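As a sketch of what that teardown could look like on the Orchestrator side: the `teardownEnvironment` helper below is hypothetical, and deleting the Cloud Run service itself would go through the Cloud Run Admin API or `gcloud`, so it is only indicated as a comment.

```go
// teardownEnvironment removes the Caddy route for a branch and marks its record
// INACTIVE. A cleanup job would call this for every merged or closed branch.
func teardownEnvironment(ctx context.Context, branchName string) error {
	var domain, serviceName string
	err := db.QueryRowContext(ctx,
		"SELECT custom_domain, cloud_run_service_name FROM preview_environments WHERE branch_name = ?",
		branchName,
	).Scan(&domain, &serviceName)
	if err != nil {
		return fmt.Errorf("lookup for branch %q failed: %w", branchName, err)
	}

	// Remove the route from the Caddy Gateway; routes are addressable by @id.
	req, err := http.NewRequestWithContext(ctx, http.MethodDelete,
		fmt.Sprintf("%s/id/%s", caddyAdminAPI, domain), nil)
	if err != nil {
		return fmt.Errorf("failed to create Caddy delete request: %w", err)
	}
	resp, err := (&http.Client{}).Do(req)
	if err != nil {
		return fmt.Errorf("failed to delete Caddy route: %w", err)
	}
	resp.Body.Close()

	// Deleting the Cloud Run service identified by serviceName would happen here,
	// e.g. via the Cloud Run Admin API or a `gcloud run services delete` call.

	_, err = db.ExecContext(ctx,
		"UPDATE preview_environments SET status = 'INACTIVE' WHERE branch_name = ?",
		branchName)
	return err
}
```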
Second, Caddy’s configuration is stored in memory. If the Caddy Cloud Run instance restarts or scales, it loses all dynamically added routes. A robust solution would involve the Caddy instance, upon starting, calling back to the Orchestrator to fetch and load the full set of active routes from the MySQL database, thus rehydrating its state.
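One way to implement that rehydration is an Orchestrator endpoint that re-pushes every ACTIVE route to the gateway; a restarted Caddy instance, or a startup hook next to it, simply calls this endpoint. Here is a sketch reusing `configureCaddyRoute` from earlier; the `/api/v1/routes/sync` path and `syncRoutesHandler` name are introduced here for illustration.

```go
// syncRoutesHandler re-installs every ACTIVE preview route into the Caddy
// Gateway, rebuilding its in-memory routing table from the MySQL source of truth.
func syncRoutesHandler(w http.ResponseWriter, r *http.Request) {
	rows, err := db.QueryContext(r.Context(),
		"SELECT custom_domain, cloud_run_url FROM preview_environments WHERE status = 'ACTIVE'")
	if err != nil {
		log.Printf("ERROR querying active environments: %v", err)
		http.Error(w, "internal server error", http.StatusInternalServerError)
		return
	}
	defer rows.Close()

	restored := 0
	for rows.Next() {
		var domain, upstream string
		if err := rows.Scan(&domain, &upstream); err != nil {
			http.Error(w, "internal server error", http.StatusInternalServerError)
			return
		}
		if err := configureCaddyRoute(r.Context(), domain, upstream); err != nil {
			log.Printf("ERROR restoring route for %s: %v", domain, err)
			continue // One bad route should not block the rest.
		}
		restored++
	}
	fmt.Fprintf(w, "restored %d routes\n", restored)
}
```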
Third, database seeding for preview environments is a complex problem. The current setup provides infrastructure, but no data. Strategies must be developed to provide each preview environment with a safe, relevant, and isolated dataset. This could range from running seed scripts post-deployment to leveraging snapshotting features of Cloud SQL.
Finally, the security of Caddy’s On-Demand TLS `ask` endpoint is critical. While it prevents random actors from using your server to obtain certificates, it still relies on a simple HTTP check. For higher security, mutual TLS (mTLS) between Caddy and the Orchestrator would ensure that only the legitimate Caddy instance can request certificate validation.