Implementing Pull-Request-Based Ephemeral Environments with an OpenFaaS Terraform Orchestrator


The central bottleneck in our development cycle wasn’t coding; it was validation. A single, overloaded staging environment served dozens of developers, leading to constant merge conflicts, overwritten deployments, and a QA process that was perpetually blocked. A developer’s pull request could sit for days waiting for a clear window to be deployed and tested. This friction was unacceptable. We decided to build a system that would provision a completely isolated, fully functional preview environment for every single pull request and destroy it automatically upon merge or closure.

Our initial concept was an event-driven system triggered by GitHub webhooks. We needed an orchestrator to receive these events and manage the lifecycle of cloud infrastructure. The infrastructure itself had to be defined as code for repeatability. A dynamic routing layer was necessary to expose each ephemeral environment on a unique URL, and a database was required to track the state of dozens of concurrent environments. Finally, a simple UI would provide visibility for the team.

Technology selection was driven by a desire for a lean, scalable, and maintainable core. We chose OpenFaaS running on Kubernetes as our serverless orchestrator. The appeal was its event-driven nature; a simple function could encapsulate the logic for handling a “PR opened” event without the overhead of a persistent CI server. For infrastructure management, Terraform was the non-negotiable standard. The challenge was not choosing Terraform, but figuring out how to run it effectively within a stateless serverless function. Nginx, deployed via the ingress-nginx controller for Kubernetes, would handle the dynamic routing. For state tracking, a PostgreSQL database offered the transactional integrity we needed. A minimal React dashboard would serve as the control plane’s user interface. A common mistake at this stage is to underestimate the complexity of managing state and long-running processes in a serverless model. Our entire architecture hinged on solving this correctly.

The foundational layer is the Kubernetes cluster and the OpenFaaS installation, managed entirely by Terraform. This “meta” infrastructure provides the platform upon which our ephemeral environments will be built.

# main.tf - Management Cluster Infrastructure

provider "aws" {
  region = "us-east-1"
}

resource "aws_eks_cluster" "management_cluster" {
  name     = "ephemeral-env-platform"
  role_arn = aws_iam_role.eks_cluster_role.arn
  version  = "1.28"

  vpc_config {
    subnet_ids = module.vpc.private_subnets
  }

  # Ensure dependencies are met
  depends_on = [
    aws_iam_role_policy_attachment.eks_cluster_policy,
  ]
}

resource "aws_eks_node_group" "openfaas_nodes" {
  cluster_name    = aws_eks_cluster.management_cluster.name
  node_group_name = "openfaas-system-nodes"
  node_role_arn   = aws_iam_role.eks_node_role.arn
  subnet_ids      = module.vpc.private_subnets
  instance_types  = ["t3.medium"]

  scaling_config {
    desired_size = 2
    min_size     = 1
    max_size     = 3
  }
}

# Helm chart for OpenFaaS
resource "helm_release" "openfaas" {
  name       = "openfaas"
  repository = "https://openfaas.github.io/faas-netes/"
  chart      = "openfaas"
  namespace  = "openfaas"
  create_namespace = true

  set {
    name  = "functionNamespace"
    value = "openfaas-fn"
  }
  set {
    name  = "serviceType"
    value = "LoadBalancer"
  }
  set {
    name  = "basic_auth"
    value = "true"
  }
}

# Helm chart for NGINX Ingress Controller
resource "helm_release" "nginx_ingress" {
  name       = "ingress-nginx"
  repository = "https://kubernetes.github.io/ingress-nginx"
  chart      = "ingress-nginx"
  namespace  = "ingress-nginx"
  create_namespace = true

  set {
    name  = "controller.service.type"
    value = "LoadBalancer"
  }
}

This setup gives us the OpenFaaS gateway, which will receive webhooks, and the Nginx Ingress, which will later route traffic to the preview applications. The first major architectural hurdle was running Terraform from within a function. Serverless functions are ephemeral; their filesystems are discarded after execution. Storing the Terraform state file (terraform.tfstate) locally is not an option. The only robust solution is to use a remote backend. We configured an S3 bucket for state storage and a DynamoDB table for state locking to prevent race conditions when multiple PRs are opened simultaneously.

# backend.tf - Configuration for the ephemeral environment's state

terraform {
  backend "s3" {
    # The bucket and key will be dynamically configured by the orchestrator function
    # to ensure each environment has an isolated state.
    bucket         = "ephemeral-env-tfstate-bucket"
    key            = "placeholder.tfstate" # This will be replaced
    region         = "us-east-1"
    dynamodb_table = "terraform-state-locks"
    encrypt        = true
  }
}

The second, more severe problem is execution time. A terraform apply for even a moderately complex application can take several minutes. Most serverless platforms have a timeout of 60-900 seconds. Hitting this timeout would leave deployments in a dangerous, partially-applied state. The solution was to adopt an asynchronous pattern.

We designed two functions:

  1. github-webhook-dispatcher: A synchronous function that acts as the entry point. It validates the GitHub webhook payload, performs basic checks, writes an initial “pending” status to our PostgreSQL database, and then immediately triggers the second function asynchronously. It returns a 202 Accepted response to GitHub within milliseconds. A sketch of this dispatcher follows the list.
  2. terraform-orchestrator: A long-running asynchronous function. This function contains the core logic for running Terraform. It checks out the correct Git SHA, sets up the Terraform environment, and executes apply or destroy.
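
The dispatcher’s job is deliberately thin. Below is a minimal sketch of what it might look like, sharing the package-level db handle and types from the orchestrator’s handler.go shown further down; the handler name, the OPENFAAS_GATEWAY environment variable, and the omitted HMAC verification are illustrative assumptions rather than our exact production code. The key detail is OpenFaaS’s asynchronous invocation path: POSTing to /async-function/<name> on the gateway queues the work and returns immediately, which is what makes the fire-and-forget hand-off possible.

// dispatcher.go - Sketch of the synchronous entry point (same package as handler.go,
// so db, Environment, and OrchestratorPayload are shared).

package function

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// Minimal subset of GitHub's pull_request webhook payload.
type prWebhook struct {
	Action      string `json:"action"` // "opened", "synchronize", "closed", ...
	PullRequest struct {
		Number int `json:"number"`
		Head   struct {
			SHA string `json:"sha"`
		} `json:"head"`
	} `json:"pull_request"`
}

// triggerAsyncOrchestrator queues the long-running function via the gateway's
// async endpoint. The reaper shown later reuses the same helper.
func triggerAsyncOrchestrator(p OrchestratorPayload) error {
	body, err := json.Marshal(p)
	if err != nil {
		return err
	}
	gateway := os.Getenv("OPENFAAS_GATEWAY") // e.g. http://gateway.openfaas:8080 (assumed)
	resp, err := http.Post(gateway+"/async-function/terraform-orchestrator",
		"application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("gateway rejected async invocation: %s", resp.Status)
	}
	return nil
}

func HandleDispatch(ctx context.Context, req []byte) (string, error) {
	// A production dispatcher must verify the X-Hub-Signature-256 HMAC before trusting the payload.
	var hook prWebhook
	if err := json.Unmarshal(req, &hook); err != nil {
		return "", fmt.Errorf("invalid webhook payload: %w", err)
	}

	action := "apply"
	if hook.Action == "closed" {
		action = "destroy"
	}

	// Record the request as 'pending' before any slow work happens.
	db.Where(Environment{PullRequestID: hook.PullRequest.Number}).
		Assign(Environment{Status: "pending", GitSHA: hook.PullRequest.Head.SHA}).
		FirstOrCreate(&Environment{})

	// Fire-and-forget: queue the orchestrator and return to GitHub immediately.
	if err := triggerAsyncOrchestrator(OrchestratorPayload{
		Action:        action,
		PullRequestID: hook.PullRequest.Number,
		GitSHA:        hook.PullRequest.Head.SHA,
	}); err != nil {
		return "", err
	}
	return "queued", nil
}

Because GitHub only requires a 2xx acknowledgment of the webhook, the simple string return is sufficient; all of the heavy lifting happens in the asynchronous orchestrator.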

Here is the handler code for the terraform-orchestrator function, written in Go. The pitfall here is managing the execution environment; the function’s Docker image must contain the Terraform binary, along with any provider plugins or cloud CLIs the run depends on.

// handler.go - Core logic for the Terraform Orchestrator function

package function

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"os"
	"os/exec"
	"path/filepath"

	"gorm.io/driver/postgres"
	"gorm.io/gorm"
)

// Represents the state of an environment in our database
type Environment struct {
	gorm.Model
	PullRequestID      int    `gorm:"uniqueIndex"`
	Status             string // e.g., "pending", "creating", "active", "destroying", "destroyed", "failed"
	GitSHA             string
	TerraformWorkspace string
	PreviewURL         string
	Logs               string
}

// Payload received from the dispatcher function
type OrchestratorPayload struct {
	Action        string `json:"action"` // "apply" or "destroy"
	PullRequestID int    `json:"pull_request_id"`
	GitSHA        string `json:"git_sha"`
}

var db *gorm.DB

func init() {
	// A real-world project would use a more secure way to handle secrets.
	dsn := os.Getenv("DB_DSN")
	var err error
	db, err = gorm.Open(postgres.Open(dsn), &gorm.Config{})
	if err != nil {
		log.Fatalf("failed to connect database: %v", err)
	}
	db.AutoMigrate(&Environment{})
}

// Handle is the entrypoint for the OpenFaaS function
func Handle(ctx context.Context, req []byte) (string, error) {
	var payload OrchestratorPayload
	if err := json.Unmarshal(req, &payload); err != nil {
		return "", fmt.Errorf("invalid payload: %w", err)
	}

	workspaceName := fmt.Sprintf("pr-%d", payload.PullRequestID)

	// Update DB state: "apply" moves the environment to "creating", "destroy" to "destroying",
	// matching the status values documented on the Environment model.
	transitionalStatus := "creating"
	if payload.Action == "destroy" {
		transitionalStatus = "destroying"
	}
	db.Model(&Environment{}).Where("pull_request_id = ?", payload.PullRequestID).Update("status", transitionalStatus)
	
	// Create a temporary directory for this execution
	tempDir, err := os.MkdirTemp("", "terraform-run-")
	if err != nil {
		updateStatus(payload.PullRequestID, "failed", "failed to create temp dir")
		return "", err
	}
	defer os.RemoveAll(tempDir)

	// In a real implementation, this would clone the application repository and check out payload.GitSHA.
	// For this example, we assume the Terraform modules ship inside the function image and
	// copyDirectory is a small recursive-copy helper (omitted here for brevity).
	if err := copyDirectory("./terraform", tempDir); err != nil {
		updateStatus(payload.PullRequestID, "failed", "failed to copy terraform files")
		return "", err
	}

	if err := runTerraform(tempDir, workspaceName, payload); err != nil {
		updateStatus(payload.PullRequestID, "failed", err.Error())
		return "", err
	}
	
	finalStatus := "active"
	if payload.Action == "destroy" {
		finalStatus = "destroyed"
	}
	updateStatus(payload.PullRequestID, finalStatus, "Operation successful")

	return "Orchestration complete.", nil
}

// runTerraform executes the required terraform commands in a specific directory
func runTerraform(dir, workspace string, payload OrchestratorPayload) error {
	// 1. Initialize Terraform with the correct remote backend key
	// This is the critical step for state isolation.
	backendConfigKey := fmt.Sprintf("env/%s.tfstate", workspace)
	initCmd := exec.Command("terraform", "init", "-reconfigure", fmt.Sprintf("-backend-config=key=%s", backendConfigKey))
	initCmd.Dir = dir
	if output, err := runCommand(initCmd); err != nil {
		return fmt.Errorf("terraform init failed: %s\n%w", string(output), err)
	}

	// 2. Select or create the workspace
	workspaceCmd := exec.Command("terraform", "workspace", "select", workspace)
	workspaceCmd.Dir = dir
	if _, err := runCommand(workspaceCmd); err != nil {
		// If it fails, workspace probably doesn't exist, so create it
		createWorkspaceCmd := exec.Command("terraform", "workspace", "new", workspace)
		createWorkspaceCmd.Dir = dir
		if output, err := runCommand(createWorkspaceCmd); err != nil {
			return fmt.Errorf("terraform workspace new failed: %s\n%w", string(output), err)
		}
	}

	// 3. Apply or Destroy
	var tfCmd *exec.Cmd
	if payload.Action == "apply" {
		// Pass variables like the git SHA and PR number to the module
		tfCmd = exec.Command("terraform", "apply", "-auto-approve",
			"-var", fmt.Sprintf("git_sha=%s", payload.GitSHA),
			"-var", fmt.Sprintf("pr_number=%d", payload.PullRequestID),
		)
	} else if payload.Action == "destroy" {
		// git_sha must still be passed: the module declares it without a default,
		// and terraform destroy evaluates the configuration before planning.
		tfCmd = exec.Command("terraform", "destroy", "-auto-approve",
			"-var", fmt.Sprintf("git_sha=%s", payload.GitSHA),
			"-var", fmt.Sprintf("pr_number=%d", payload.PullRequestID),
		)
	} else {
		return fmt.Errorf("unknown action: %s", payload.Action)
	}
	
	tfCmd.Dir = dir
	if output, err := runCommand(tfCmd); err != nil {
		return fmt.Errorf("terraform %s failed: %s\n%w", payload.Action, string(output), err)
	}
	
	return nil
}

func runCommand(cmd *exec.Cmd) ([]byte, error) {
	// A production implementation needs much better logging and output streaming
	log.Printf("Running command: %s", cmd.String())
	output, err := cmd.CombinedOutput()
	if err != nil {
		log.Printf("Command failed. Output:\n%s", string(output))
	}
	return output, err
}

func updateStatus(prID int, status string, message string) {
	// This function should be more robust, appending logs instead of replacing them.
	db.Model(&Environment{}).Where("pull_request_id = ?", prID).Updates(Environment{Status: status, Logs: message})
}

The core of the system is the Terraform module that defines a single ephemeral environment. It’s designed to be a self-contained slice of our application stack, including the Kubernetes deployment, a service, and an ingress rule.

# ./terraform/environment.tf - Module for a single preview environment

variable "pr_number" {
  type        = number
  description = "The pull request number."
}

variable "git_sha" {
  type        = string
  description = "The git commit SHA to deploy."
}

locals {
  # Construct a unique name and hostname for all resources.
  sanitized_name = "pr-${var.pr_number}"
  hostname       = "pr-${var.pr_number}.preview.my-company.com"
  image_tag      = substr(var.git_sha, 0, 7) # Use short SHA for image tag
}

resource "kubernetes_deployment" "app" {
  metadata {
    name      = local.sanitized_name
    namespace = "preview-apps"
    labels = {
      app = local.sanitized_name
    }
  }

  spec {
    replicas = 1

    selector {
      match_labels = {
        app = local.sanitized_name
      }
    }

    template {
      metadata {
        labels = {
          app = local.sanitized_name
        }
      }

      spec {
        container {
          image = "my-docker-registry/my-app:${local.image_tag}"
          name  = "application"

          port {
            container_port = 8080
          }
          
          # Production-grade deployments require readiness/liveness probes
          readiness_probe {
            http_get {
              path = "/healthz"
              port = 8080
            }
            initial_delay_seconds = 5
            period_seconds = 10
          }
        }
      }
    }
  }
}

resource "kubernetes_service" "app_svc" {
  metadata {
    name      = local.sanitized_name
    namespace = "preview-apps"
  }
  spec {
    selector = {
      app = kubernetes_deployment.app.metadata.0.labels.app
    }
    port {
      port        = 80
      target_port = 8080
    }
  }
}

# This is where Nginx comes in. The ingress controller watches for these objects
# and automatically configures the underlying Nginx proxy.
resource "kubernetes_ingress_v1" "app_ingress" {
  metadata {
    name      = local.sanitized_name
    namespace = "preview-apps"
    annotations = {
      # Still honored by older controllers; recent ingress-nginx releases rely on
      # spec.ingress_class_name below instead of this deprecated annotation.
      "kubernetes.io/ingress.class" = "nginx"
      # Add other annotations for TLS, etc.
    }
  }

  spec {
    ingress_class_name = "nginx"

    rule {
      host = local.hostname
      http {
        path {
          path      = "/"
          path_type = "Prefix"
          backend {
            service {
              name = kubernetes_service.app_svc.metadata.0.name
              port {
                number = 80
              }
            }
          }
        }
      }
    }
  }
}

output "preview_url" {
  value = "https://${local.hostname}"
}

With terraform apply, this module creates a deployment running the specific container image for the PR, exposes it with a service, and creates an ingress rule. The Nginx Ingress Controller automatically detects this new ingress object and updates its configuration to route traffic for pr-123.preview.my-company.com to the correct pod.
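
One detail the orchestrator above glosses over is feeding that preview_url output back into the database, which both the dashboard and the sequence diagram later in this post depend on. Here is a minimal, hypothetical helper for closing that gap, assuming it lives alongside handler.go and is called with the same working directory immediately after a successful apply:

// preview.go - hypothetical companion to handler.go (same package).

package function

import (
	"fmt"
	"os/exec"
	"strings"
)

// getPreviewURL reads the module's preview_url output from the workspace that was
// just applied; `terraform output -raw` prints the bare string value of an output.
func getPreviewURL(dir string) (string, error) {
	cmd := exec.Command("terraform", "output", "-raw", "preview_url")
	cmd.Dir = dir
	out, err := cmd.CombinedOutput()
	if err != nil {
		return "", fmt.Errorf("terraform output failed: %s\n%w", string(out), err)
	}
	return strings.TrimSpace(string(out)), nil
}

Calling this in Handle after a successful apply and writing the result into the PreviewURL column is what turns the dashboard rows into clickable links.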

A critical aspect of production readiness is cleanup. If a “PR closed” webhook fails to trigger the destroy action, we are left with orphaned, costly resources. To mitigate this, we implemented a “reaper” function scheduled to run every hour.

// reaper.go - Cron-triggered function to clean up stale resources
// (same package as handler.go, so db, Environment, and the async trigger are shared)

package function

import (
	"context"
	"fmt"
	"log"
)

func HandleReaper(ctx context.Context, req []byte) (string, error) {
	var staleEnvironments []Environment
	// Find envs that are 'active' but older than 24 hours,
	// or stuck in 'creating' for more than 1 hour.
	db.Where("status = ? AND created_at < NOW() - INTERVAL '24 hours'", "active").
		Or("status = ? AND created_at < NOW() - INTERVAL '1 hour'", "creating").
		Find(&staleEnvironments)

	for _, env := range staleEnvironments {
		log.Printf("Reaping stale environment for PR %d", env.PullRequestID)

		// Construct payload to trigger the orchestrator asynchronously
		payload := OrchestratorPayload{
			Action:        "destroy",
			PullRequestID: env.PullRequestID,
			GitSHA:        env.GitSHA, // required: the module declares git_sha without a default
		}

		// Queue the destroy through the gateway's /async-function/ endpoint;
		// a real implementation could also use the OpenFaaS client SDK.
		if err := triggerAsyncOrchestrator(payload); err != nil {
			log.Printf("failed to queue destroy for PR %d: %v", env.PullRequestID, err)
		}
	}

	return fmt.Sprintf("Reaped %d environments.", len(staleEnvironments)), nil
}

The final piece was the React dashboard. It provided a simple view into the PostgreSQL database, showing a list of all active environments, their PR numbers, commit SHAs, and clickable preview URLs. It also had a crucial “Force Destroy” button that could manually trigger the destroy action for a specific environment if the automated process failed. The API for this frontend was another simple OpenFaaS function that performed read-only queries against the database.
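
As an illustration, that read-only function can be little more than a single query. The handler name and the JSON shape below are assumptions rather than the exact code we shipped; it reuses the db handle and Environment model from handler.go:

// api.go - sketch of the read-only function backing the React dashboard (same package).

package function

import (
	"context"
	"encoding/json"
	"fmt"
)

func HandleListEnvironments(ctx context.Context, req []byte) (string, error) {
	var envs []Environment
	// Surface everything that is in flight or still running; destroyed envs are noise.
	if err := db.Where("status <> ?", "destroyed").
		Order("created_at DESC").
		Find(&envs).Error; err != nil {
		return "", fmt.Errorf("query failed: %w", err)
	}

	body, err := json.Marshal(envs)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

The “Force Destroy” button goes through the same path as a PR-closed webhook: it submits an OrchestratorPayload with Action set to "destroy" for the selected environment.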

The complete workflow looks like this:

sequenceDiagram
    participant GitHub
    participant Gateway as OpenFaaS Gateway
    participant Dispatcher as Dispatcher Fn
    participant DB as PostgreSQL DB
    participant Orchestrator as Orchestrator Fn (Async)
    participant Terraform

    GitHub->>+Gateway: Webhook (PR Opened)
    Gateway->>+Dispatcher: Invoke
    Dispatcher->>DB: INSERT {pr_id, status: 'pending'}
    Dispatcher->>-Gateway: Trigger Async Orchestrator
    Gateway-->>-GitHub: 202 Accepted
    Gateway->>+Orchestrator: Invoke
    Orchestrator->>DB: UPDATE status='creating'
    Orchestrator->>+Terraform: init, workspace new, apply
    Terraform-->>-Orchestrator: Resources Created
    Orchestrator->>DB: UPDATE status='active', url='...'
    deactivate Orchestrator

While this system successfully solved our staging bottleneck, it’s not without its own set of trade-offs and lingering issues. The most significant is cost. Spinning up a full application stack, including database clones, for every PR is expensive, especially if those environments sit idle for long periods. Future iterations must explore cost-saving measures like scaling deployments down to zero after a period of inactivity. The terraform apply process, while asynchronous, can still take 5-10 minutes, meaning the feedback loop for developers isn’t instantaneous. We are investigating if a Kubernetes-native tool like Crossplane could provision the in-cluster resources faster, reserving Terraform for the heavier lifting of external resources. Finally, the security model of a serverless function holding powerful cloud credentials to run Terraform is a major concern; it requires meticulous IAM policy design and a robust secret management strategy to minimize the blast radius of a potential compromise.

