The central bottleneck in our development cycle wasn’t coding; it was validation. A single, overloaded staging environment served dozens of developers, leading to constant merge conflicts, overwritten deployments, and a QA process that was perpetually blocked. A developer’s pull request could sit for days waiting for a clear window to be deployed and tested. This friction was unacceptable. We decided to build a system that would provision a completely isolated, fully functional preview environment for every single pull request and destroy it automatically upon merge or closure.
Our initial concept was an event-driven system triggered by GitHub webhooks. We needed an orchestrator to receive these events and manage the lifecycle of cloud infrastructure. The infrastructure itself had to be defined as code for repeatability. A dynamic routing layer was necessary to expose each ephemeral environment on a unique URL, and a database was required to track the state of dozens of concurrent environments. Finally, a simple UI would provide visibility for the team.
Technology selection was driven by a desire for a lean, scalable, and maintainable core. We chose OpenFaaS running on Kubernetes as our serverless orchestrator. The appeal was its event-driven nature; a simple function could encapsulate the logic for handling a “PR opened” event without the overhead of a persistent CI server. For infrastructure management, Terraform was the non-negotiable standard. The challenge was not choosing Terraform, but figuring out how to run it effectively within a stateless serverless function. Nginx, via the standard Kubernetes Ingress Controller, would handle the dynamic routing. For state tracking, a PostgreSQL database offered the transactional integrity we needed. A minimal React dashboard would serve as the control plane’s user interface. A common mistake at this stage is to underestimate the complexity of managing state and long-running processes in a serverless model. Our entire architecture hinged on solving this correctly.
The foundational layer is the Kubernetes cluster and the OpenFaaS installation, managed entirely by Terraform. This “meta” infrastructure provides the platform upon which our ephemeral environments will be built.
# main.tf - Management Cluster Infrastructure
provider "aws" {
  region = "us-east-1"
}

resource "aws_eks_cluster" "management_cluster" {
  name     = "ephemeral-env-platform"
  role_arn = aws_iam_role.eks_cluster_role.arn
  version  = "1.28"

  vpc_config {
    subnet_ids = module.vpc.private_subnets
  }

  # Ensure dependencies are met
  depends_on = [
    aws_iam_role_policy_attachment.eks_cluster_policy,
  ]
}

resource "aws_eks_node_group" "openfaas_nodes" {
  cluster_name    = aws_eks_cluster.management_cluster.name
  node_group_name = "openfaas-system-nodes"
  node_role_arn   = aws_iam_role.eks_node_role.arn
  subnet_ids      = module.vpc.private_subnets
  instance_types  = ["t3.medium"]

  scaling_config {
    desired_size = 2
    min_size     = 1
    max_size     = 3
  }
}

# Helm chart for OpenFaaS
resource "helm_release" "openfaas" {
  name             = "openfaas"
  repository       = "https://openfaas.github.io/faas-netes/"
  chart            = "openfaas"
  namespace        = "openfaas"
  create_namespace = true

  set {
    name  = "functionNamespace"
    value = "openfaas-fn"
  }

  set {
    name  = "serviceType"
    value = "LoadBalancer"
  }

  set {
    name  = "basic_auth"
    value = "true"
  }
}

# Helm chart for NGINX Ingress Controller
resource "helm_release" "nginx_ingress" {
  name             = "ingress-nginx"
  repository       = "https://kubernetes.github.io/ingress-nginx"
  chart            = "ingress-nginx"
  namespace        = "ingress-nginx"
  create_namespace = true

  set {
    name  = "controller.service.type"
    value = "LoadBalancer"
  }
}
This setup gives us the OpenFaaS gateway, which will receive webhooks, and the Nginx Ingress, which will later route traffic to the preview applications. The first major architectural hurdle was running Terraform from within a function. Serverless functions are ephemeral; their filesystems are discarded after execution, so storing the Terraform state file (terraform.tfstate) locally is not an option. The only robust solution is a remote backend. We configured an S3 bucket for state storage and a DynamoDB table for state locking to prevent race conditions when multiple PRs are opened simultaneously.
# backend.tf - Configuration for the ephemeral environment's state
terraform {
  backend "s3" {
    # The bucket and key will be dynamically configured by the orchestrator function
    # to ensure each environment has an isolated state.
    bucket         = "ephemeral-env-tfstate-bucket"
    key            = "placeholder.tfstate" # This will be replaced
    region         = "us-east-1"
    dynamodb_table = "terraform-state-locks"
    encrypt        = true
  }
}
The second, more severe problem is execution time. A terraform apply for even a moderately complex application can take several minutes, while most serverless platforms enforce timeouts of 60-900 seconds. Hitting this timeout would leave deployments in a dangerous, partially-applied state. The solution was to adopt an asynchronous pattern.
We designed two functions:
- github-webhook-dispatcher: A synchronous function that acts as the entry point. It validates the GitHub webhook payload, performs basic checks, writes an initial “pending” status to our PostgreSQL database, and then immediately triggers the second function asynchronously. It returns a 202 Accepted response to GitHub within milliseconds. A minimal sketch of this dispatcher appears after the list.
- terraform-orchestrator: A long-running asynchronous function that contains the core logic for running Terraform. It checks out the correct Git SHA, sets up the Terraform environment, and executes apply or destroy.
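The dispatcher's code is not reproduced elsewhere in this article, so here is a minimal sketch of what it could look like. It assumes the OpenFaaS golang-middleware style of handler (which receives the full HTTP request), that the dispatcher shares the same package, database handle, and models as the orchestrator shown next, a GITHUB_WEBHOOK_SECRET environment variable for signature validation, and the gateway's standard asynchronous invocation route (/async-function/<name>). The handler and field names are illustrative, not the exact code we ran.
// dispatcher.go - Sketch of the synchronous github-webhook-dispatcher function
package function

import (
	"bytes"
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"io"
	"net/http"
	"os"
)

// Minimal subset of GitHub's pull_request webhook payload.
type prEvent struct {
	Action      string `json:"action"` // "opened", "synchronize", "closed", ...
	PullRequest struct {
		Number int `json:"number"`
		Head   struct {
			SHA string `json:"sha"`
		} `json:"head"`
	} `json:"pull_request"`
}

// validSignature checks X-Hub-Signature-256 against a shared webhook secret.
func validSignature(body []byte, signature string) bool {
	mac := hmac.New(sha256.New, []byte(os.Getenv("GITHUB_WEBHOOK_SECRET")))
	mac.Write(body)
	expected := "sha256=" + hex.EncodeToString(mac.Sum(nil))
	return hmac.Equal([]byte(expected), []byte(signature))
}

// HandleWebhook validates the event, records a "pending" row, and hands the slow
// Terraform work to the orchestrator through the gateway's async endpoint.
func HandleWebhook(w http.ResponseWriter, r *http.Request) {
	body, _ := io.ReadAll(r.Body)
	if !validSignature(body, r.Header.Get("X-Hub-Signature-256")) {
		http.Error(w, "invalid signature", http.StatusUnauthorized)
		return
	}

	var event prEvent
	if err := json.Unmarshal(body, &event); err != nil {
		http.Error(w, "invalid payload", http.StatusBadRequest)
		return
	}

	action := "apply"
	if event.Action == "closed" {
		action = "destroy"
	}

	// Record the environment before any infrastructure work starts.
	db.Where(Environment{PullRequestID: event.PullRequest.Number}).
		Assign(Environment{Status: "pending", GitSHA: event.PullRequest.Head.SHA}).
		FirstOrCreate(&Environment{})

	// Fire-and-forget invocation; the OpenFaaS gateway queues async calls via NATS.
	payload, _ := json.Marshal(OrchestratorPayload{
		Action:        action,
		PullRequestID: event.PullRequest.Number,
		GitSHA:        event.PullRequest.Head.SHA,
	})
	resp, err := http.Post("http://gateway.openfaas:8080/async-function/terraform-orchestrator",
		"application/json", bytes.NewReader(payload))
	if err != nil {
		http.Error(w, "failed to queue orchestrator", http.StatusInternalServerError)
		return
	}
	resp.Body.Close()

	w.WriteHeader(http.StatusAccepted)
}
The important property is that nothing slow happens here: a signature check, one database write, one fire-and-forget POST, and the 202 goes straight back to GitHub.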
Here is the handler code for the terraform-orchestrator function, written in Go. The pitfall here is managing the execution environment; the function's Docker image must contain the Terraform binary and any necessary provider CLIs.
// handler.go - Core logic for the Terraform Orchestrator function
package function

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"os"
	"os/exec"
	"path/filepath"

	"gorm.io/driver/postgres"
	"gorm.io/gorm"
)

// Represents the state of an environment in our database
type Environment struct {
	gorm.Model
	PullRequestID      int    `gorm:"uniqueIndex"`
	Status             string // e.g., "pending", "creating", "active", "destroying", "destroyed", "failed"
	GitSHA             string
	TerraformWorkspace string
	PreviewURL         string
	Logs               string
}

// Payload received from the dispatcher function
type OrchestratorPayload struct {
	Action        string `json:"action"` // "apply" or "destroy"
	PullRequestID int    `json:"pull_request_id"`
	GitSHA        string `json:"git_sha"`
}

var db *gorm.DB

func init() {
	// A real-world project would use a more secure way to handle secrets.
	dsn := os.Getenv("DB_DSN")
	var err error
	db, err = gorm.Open(postgres.Open(dsn), &gorm.Config{})
	if err != nil {
		log.Fatalf("failed to connect database: %v", err)
	}
	db.AutoMigrate(&Environment{})
}

// Handle is the entrypoint for the OpenFaaS function
func Handle(ctx context.Context, req []byte) (string, error) {
	var payload OrchestratorPayload
	if err := json.Unmarshal(req, &payload); err != nil {
		return "", fmt.Errorf("invalid payload: %w", err)
	}
	workspaceName := fmt.Sprintf("pr-%d", payload.PullRequestID)

	// Update DB state
	db.Model(&Environment{}).Where("pull_request_id = ?", payload.PullRequestID).
		Update("Status", fmt.Sprintf("%sing", payload.Action))

	// Create a temporary directory for this execution
	tempDir, err := os.MkdirTemp("", "terraform-run-")
	if err != nil {
		updateStatus(payload.PullRequestID, "failed", "failed to create temp dir")
		return "", err
	}
	defer os.RemoveAll(tempDir)

	// In a real implementation, this would clone a specific repository and checkout the SHA.
	// For this example, we assume the Terraform modules are in the function context.
	if err := copyDirectory("./terraform", tempDir); err != nil {
		updateStatus(payload.PullRequestID, "failed", "failed to copy terraform files")
		return "", err
	}

	if err := runTerraform(tempDir, workspaceName, payload); err != nil {
		updateStatus(payload.PullRequestID, "failed", err.Error())
		return "", err
	}

	finalStatus := "active"
	if payload.Action == "destroy" {
		finalStatus = "destroyed"
	}
	updateStatus(payload.PullRequestID, finalStatus, "Operation successful")
	return "Orchestration complete.", nil
}
// runTerraform executes the required terraform commands in a specific directory
func runTerraform(dir, workspace string, payload OrchestratorPayload) error {
	// 1. Initialize Terraform with the correct remote backend key.
	// This is the critical step for state isolation.
	backendConfigKey := fmt.Sprintf("env/%s.tfstate", workspace)
	initCmd := exec.Command("terraform", "init", "-reconfigure",
		fmt.Sprintf("-backend-config=key=%s", backendConfigKey))
	initCmd.Dir = dir
	if output, err := runCommand(initCmd); err != nil {
		return fmt.Errorf("terraform init failed: %s\n%w", string(output), err)
	}

	// 2. Select or create the workspace.
	workspaceCmd := exec.Command("terraform", "workspace", "select", workspace)
	workspaceCmd.Dir = dir
	if _, err := runCommand(workspaceCmd); err != nil {
		// If selection fails, the workspace probably doesn't exist yet, so create it.
		createWorkspaceCmd := exec.Command("terraform", "workspace", "new", workspace)
		createWorkspaceCmd.Dir = dir
		if output, err := runCommand(createWorkspaceCmd); err != nil {
			return fmt.Errorf("terraform workspace new failed: %s\n%w", string(output), err)
		}
	}

	// 3. Apply or Destroy
	var tfCmd *exec.Cmd
	if payload.Action == "apply" {
		// Pass variables like the git SHA and PR number to the module.
		tfCmd = exec.Command("terraform", "apply", "-auto-approve",
			"-var", fmt.Sprintf("git_sha=%s", payload.GitSHA),
			"-var", fmt.Sprintf("pr_number=%d", payload.PullRequestID),
		)
	} else if payload.Action == "destroy" {
		// git_sha is still supplied because the variable has no default and
		// Terraform requires values for all declared variables, even on destroy.
		tfCmd = exec.Command("terraform", "destroy", "-auto-approve",
			"-var", fmt.Sprintf("git_sha=%s", payload.GitSHA),
			"-var", fmt.Sprintf("pr_number=%d", payload.PullRequestID),
		)
	} else {
		return fmt.Errorf("unknown action: %s", payload.Action)
	}
	tfCmd.Dir = dir
	if output, err := runCommand(tfCmd); err != nil {
		return fmt.Errorf("terraform %s failed: %s\n%w", payload.Action, string(output), err)
	}
	return nil
}
func runCommand(cmd *exec.Cmd) ([]byte, error) {
	// A production implementation needs much better logging and output streaming.
	log.Printf("Running command: %s", cmd.String())
	output, err := cmd.CombinedOutput()
	if err != nil {
		log.Printf("Command failed. Output:\n%s", string(output))
	}
	return output, err
}

func updateStatus(prID int, status string, message string) {
	// This function should be more robust, appending logs instead of replacing them.
	db.Model(&Environment{}).Where("pull_request_id = ?", prID).
		Updates(Environment{Status: status, Logs: message})
}
The core of the system is the Terraform module that defines a single ephemeral environment. It’s designed to be a self-contained slice of our application stack, including the Kubernetes deployment, a service, and an ingress rule.
# ./terraform/environment.tf - Module for a single preview environment
variable "pr_number" {
  type        = number
  description = "The pull request number."
}

variable "git_sha" {
  type        = string
  description = "The git commit SHA to deploy."
}

locals {
  # Construct a unique name and hostname for all resources.
  sanitized_name = "pr-${var.pr_number}"
  hostname       = "pr-${var.pr_number}.preview.my-company.com"
  image_tag      = substr(var.git_sha, 0, 7) # Use short SHA for image tag
}

resource "kubernetes_deployment" "app" {
  metadata {
    name      = local.sanitized_name
    namespace = "preview-apps"
    labels = {
      app = local.sanitized_name
    }
  }

  spec {
    replicas = 1

    selector {
      match_labels = {
        app = local.sanitized_name
      }
    }

    template {
      metadata {
        labels = {
          app = local.sanitized_name
        }
      }

      spec {
        container {
          image = "my-docker-registry/my-app:${local.image_tag}"
          name  = "application"

          port {
            container_port = 8080
          }

          # Production-grade deployments require readiness/liveness probes
          readiness_probe {
            http_get {
              path = "/healthz"
              port = 8080
            }
            initial_delay_seconds = 5
            period_seconds        = 10
          }
        }
      }
    }
  }
}

resource "kubernetes_service" "app_svc" {
  metadata {
    name      = local.sanitized_name
    namespace = "preview-apps"
  }

  spec {
    selector = {
      app = kubernetes_deployment.app.metadata.0.labels.app
    }
    port {
      port        = 80
      target_port = 8080
    }
  }
}

# This is where Nginx comes in. The ingress controller watches for these objects
# and automatically configures the underlying Nginx proxy.
resource "kubernetes_ingress_v1" "app_ingress" {
  metadata {
    name      = local.sanitized_name
    namespace = "preview-apps"
    annotations = {
      "kubernetes.io/ingress.class" = "nginx"
      # Add other annotations for TLS, etc.
    }
  }

  spec {
    rule {
      host = local.hostname
      http {
        path {
          path      = "/"
          path_type = "Prefix"
          backend {
            service {
              name = kubernetes_service.app_svc.metadata.0.name
              port {
                number = 80
              }
            }
          }
        }
      }
    }
  }
}

output "preview_url" {
  value = "https://${local.hostname}"
}
With terraform apply, this module creates a deployment running the specific container image for the PR, exposes it with a service, and creates an ingress rule. The Nginx Ingress Controller automatically detects the new ingress object and updates its configuration to route traffic for pr-123.preview.my-company.com to the correct pod.
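One gap worth noting: the sequence diagram later shows the orchestrator recording the preview URL, but the handler as written never populates PreviewURL. A small, hypothetical helper, assuming it is called right after a successful apply and lives alongside runTerraform, could read the module's preview_url output with terraform output -raw and persist it.
// recordPreviewURL reads the preview_url output from the applied workspace and
// stores it on the environment row so the dashboard can link to it.
// Sketch only; requires adding "strings" to handler.go's imports.
func recordPreviewURL(dir string, prID int) error {
	outCmd := exec.Command("terraform", "output", "-raw", "preview_url")
	outCmd.Dir = dir
	output, err := runCommand(outCmd)
	if err != nil {
		return fmt.Errorf("terraform output failed: %s\n%w", string(output), err)
	}
	url := strings.TrimSpace(string(output))
	return db.Model(&Environment{}).
		Where("pull_request_id = ?", prID).
		Update("PreviewURL", url).Error
}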
A critical aspect of production readiness is cleanup. If a “PR closed” webhook fails to trigger the destroy action, we are left with orphaned, costly resources. To mitigate this, we implemented a “reaper” function scheduled to run every hour.
// reaper.go - Cron-triggered function to clean up stale resources.
// Lives in the same package as the orchestrator, so it reuses db, Environment,
// and OrchestratorPayload from handler.go.
package function

import (
	"context"
	"fmt"
	"log"
)

func HandleReaper(ctx context.Context, req []byte) (string, error) {
	var staleEnvironments []Environment

	// Find envs that are 'active' but older than 24 hours,
	// or 'creating' for more than 1 hour.
	db.Where("status = ? AND created_at < NOW() - INTERVAL '24 hours'", "active").
		Or("status = ? AND created_at < NOW() - INTERVAL '1 hour'", "creating").
		Find(&staleEnvironments)

	for _, env := range staleEnvironments {
		log.Printf("Reaping stale environment for PR %d", env.PullRequestID)

		// Construct payload to trigger the orchestrator asynchronously.
		payload := OrchestratorPayload{
			Action:        "destroy",
			PullRequestID: env.PullRequestID,
			GitSHA:        env.GitSHA, // Supplied so terraform destroy has values for all variables
		}

		// This is a simplified async invocation call. A real implementation
		// would use the OpenFaaS client or a direct HTTP POST to /async-function/.
		triggerAsyncOrchestrator(payload)
	}
	return fmt.Sprintf("Reaped %d environments.", len(staleEnvironments)), nil
}
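The snippet above hand-waves triggerAsyncOrchestrator. A minimal sketch, assuming the in-cluster gateway address gateway.openfaas:8080 and OpenFaaS's standard /async-function/ route (reaper.go would also need bytes, encoding/json, and net/http in its imports):
// triggerAsyncOrchestrator queues a destroy (or apply) by POSTing the payload to
// the OpenFaaS gateway's asynchronous invocation endpoint for the orchestrator.
func triggerAsyncOrchestrator(payload OrchestratorPayload) {
	body, err := json.Marshal(payload)
	if err != nil {
		log.Printf("failed to marshal orchestrator payload: %v", err)
		return
	}
	resp, err := http.Post("http://gateway.openfaas:8080/async-function/terraform-orchestrator",
		"application/json", bytes.NewReader(body))
	if err != nil {
		log.Printf("failed to queue orchestrator invocation: %v", err)
		return
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusAccepted {
		log.Printf("unexpected status from gateway: %s", resp.Status)
	}
}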
The final piece was the React dashboard. It provided a simple view into the PostgreSQL database, showing a list of all active environments, their PR numbers, commit SHAs, and clickable preview URLs. It also had a crucial “Force Destroy” button that could manually trigger the destroy action for a specific environment if the automated process failed. The API for this frontend was another simple OpenFaaS function that performed read-only queries against the database, along the lines of the sketch below.
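That read-only API amounts to one query plus JSON serialization. A minimal sketch, again assuming the shared package and models, with an illustrative handler name:
// list_environments.go - Read-only API backing the React dashboard.
// Assumes the same package, db handle, and Environment model as handler.go.
package function

import (
	"encoding/json"
	"net/http"
)

// HandleListEnvironments returns every environment that has not been destroyed,
// newest first, for the dashboard to render.
func HandleListEnvironments(w http.ResponseWriter, r *http.Request) {
	var envs []Environment
	if err := db.Where("status <> ?", "destroyed").
		Order("created_at DESC").
		Find(&envs).Error; err != nil {
		http.Error(w, "query failed", http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(envs)
}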
The complete workflow looks like this:
sequenceDiagram
    participant GitHub
    participant Gateway as OpenFaaS Gateway
    participant Dispatcher as Dispatcher Fn
    participant DB as PostgreSQL DB
    participant Orchestrator as Orchestrator Fn (Async)
    participant Terraform

    GitHub->>+Gateway: Webhook (PR Opened)
    Gateway->>+Dispatcher: Invoke
    Dispatcher->>DB: INSERT {pr_id, status: 'pending'}
    Dispatcher->>-Gateway: Trigger Async Orchestrator
    Gateway-->>-GitHub: 202 Accepted
    Gateway->>+Orchestrator: Invoke
    Orchestrator->>DB: UPDATE status='creating'
    Orchestrator->>+Terraform: init, workspace new, apply
    Terraform-->>-Orchestrator: Resources Created
    Orchestrator->>-DB: UPDATE status='active', url='...'
While this system successfully solved our staging bottleneck, it's not without its own trade-offs and lingering issues. The most significant is cost: spinning up a full application stack, including database clones, for every PR is expensive, especially if those environments sit idle for long periods. Future iterations must explore cost-saving measures such as scaling deployments down to zero after a period of inactivity. The terraform apply process, while asynchronous, can still take 5-10 minutes, so the feedback loop for developers isn't instantaneous. We are investigating whether a Kubernetes-native tool like Crossplane could provision the in-cluster resources faster, reserving Terraform for the heavier lifting of external resources. Finally, the security model of a serverless function holding powerful cloud credentials to run Terraform is a major concern; it requires meticulous IAM policy design and a robust secret management strategy to minimize the blast radius of a potential compromise.