Constructing a CAP-Aware Blue-Green Deployment Pipeline for tRPC Services with Tekton


The transition from rolling updates to a more robust deployment strategy became unavoidable. Our core user-profile service, built with tRPC for its end-to-end typesafety, was exhibiting alarming behavior during deployments. While technically “zero-downtime” in that pods were always running, the brief period in which old and new versions coexisted was causing data consistency anomalies. The service was designed with an AP (Availability, Partition Tolerance) posture in mind, relying on eventual consistency between its primary database and a Redis-based cache. Rolling updates, which gradually introduce new pods, created a state where clients could hit an old API version that was unaware of a schema change initiated by a new version, leading to transient data corruption. This wasn’t a failure of the service but of a deployment strategy ignorant of the system’s architectural trade-offs: a direct confrontation with the realities of the CAP theorem in a CI/CD context.

Our initial concept was to enforce atomicity on the deployment process itself: a blue-green strategy. The entire live environment (“blue”) would remain untouched while a complete, parallel environment (“green”) was spun up with the new version. Only after the green environment was verified as fully operational would we switch traffic instantaneously. This approach eliminates any window in which mixed versions serve production traffic. The challenge, however, shifted from the application runtime to the CI/CD pipeline. How could we automate this process reliably within our Kubernetes ecosystem? And, more critically, how could our pipeline intelligently determine that the “green” environment was truly “ready,” respecting its AP nature without demanding strict consistency (which it wasn’t designed for) during its health checks?

The technology selection was guided by our existing Kubernetes-native infrastructure and a philosophy of using declarative, composable tools. For the CI/CD engine, Tekton was the clear choice. Its CRD-based, container-centric model allowed us to define our pipeline steps as reusable Tasks directly within our cluster, avoiding the impedance mismatch of external CI systems trying to manipulate Kubernetes resources. The tRPC service was a non-negotiable part of the stack; its typesafety benefits were too significant to abandon. The task was to build a deployment harness around it. Finally, to enforce discipline, Prettier was integrated as a mandatory quality gate. A pipeline this complex could not be compromised by trivial formatting disputes or unreadable code. Code must be verifiable before it even attempts to build.
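
Throughout what follows, Prettier is invoked through a yarn script. The script names are our own convention rather than anything Prettier prescribes; we assume package.json entries along these lines (the build script is likewise an assumption):

{
  "scripts": {
    "build": "tsc -p tsconfig.json",
    "prettier:check": "prettier --check \"src/**/*.ts\"",
    "prettier:write": "prettier --write \"src/**/*.ts\""
  }
}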

The core of our solution lies in creating a series of Tekton Tasks that codify each step of the blue-green process, from code validation to the final traffic switch. This journey begins not in Kubernetes, but with the tRPC service itself, ensuring it’s designed for this deployment model.

The tRPC Service: Designed for Verifiable Readiness

A standard health check that returns a 200 OK is insufficient here. It confirms the process is running, but says nothing about its ability to correctly interact with its dependencies. We needed a readiness probe that aligned with our AP design. The service might temporarily serve slightly stale data from its cache, which is acceptable, but it must be able to connect to its primary database and message queue.

We defined a dedicated health router within our tRPC application with a special procedure, deepCheck.

// src/server/routers/health.ts
import { z } from 'zod';
import { publicProcedure, router } from '../trpc';
import { prisma } from '../prisma';
import { redisClient } from '../redis';

export const healthRouter = router({
  /**
   * A shallow health check, useful for K8s liveness probes.
   * Just confirms the server is up and responding.
   */
  basic: publicProcedure.query(() => {
    return { status: 'ok' };
  }),

  /**
   * A deep readiness check. This is what our Tekton pipeline will call.
   * It verifies connectivity to critical downstream services.
   * In an AP system, this check confirms the node is ready to join the cluster
   * and will eventually become consistent. It doesn't check for data consistency itself.
   */
  deepCheck: publicProcedure
    .output(
      z.object({
        status: z.string(),
        database: z.string(),
        cache: z.string(),
      })
    )
    .query(async () => {
      let dbStatus: 'ok' | 'error' = 'ok';
      let cacheStatus: 'ok' | 'error' = 'ok';

      try {
        // A simple, non-blocking query to check DB connectivity.
        await prisma.$queryRaw`SELECT 1`;
      } catch (e) {
        dbStatus = 'error';
        console.error('Database connectivity check failed', e);
      }

      try {
        // Check Redis connectivity.
        const pong = await redisClient.ping();
        if (pong !== 'PONG') {
          throw new Error('Redis ping failed');
        }
      } catch (e) {
        cacheStatus = 'error';
        console.error('Cache connectivity check failed', e);
      }
      
      const overallStatus = dbStatus === 'ok' && cacheStatus === 'ok' ? 'ready' : 'degraded';

      if (overallStatus === 'degraded') {
          // This will cause the procedure to throw, which our health check script can interpret as a failure.
          throw new Error(`Deep health check failed: DB=${dbStatus}, Cache=${cacheStatus}`);
      }

      return {
        status: overallStatus,
        database: dbStatus,
        cache: cacheStatus,
      };
    }),
});

// The main appRouter combines this with other routers.
// ...
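
For completeness, here is a minimal sketch of that composition, assuming a conventional tRPC v10 layout (the userRouter is an illustrative stand-in for the service’s real routers):

// src/server/routers/_app.ts (illustrative)
import { router } from '../trpc';
import { healthRouter } from './health';
import { userRouter } from './user'; // hypothetical sibling router

export const appRouter = router({
  // Nesting under 'health' yields the HTTP paths /health.basic and
  // /health.deepCheck used by the probes and pipeline checks below.
  health: healthRouter,
  user: userRouter,
});

export type AppRouter = typeof appRouter;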

The accompanying Dockerfile is a standard multi-stage build to keep the final image lean.

# ---- Base ----
FROM node:18-alpine AS base
WORKDIR /app

# ---- Dependencies ----
FROM base AS deps
COPY package.json yarn.lock ./
RUN yarn install --frozen-lockfile --production=false

# ---- Builder ----
FROM base AS builder
COPY --from=deps /app/node_modules ./node_modules
COPY . .
# Run code format check as part of the build.
# The pipeline will do this earlier, but this is a safety net.
RUN yarn prettier:check
RUN yarn build

# ---- Production ----
FROM base AS production
ENV NODE_ENV=production
COPY --from=deps /app/node_modules ./node_modules
COPY --from=builder /app/dist ./dist
COPY package.json yarn.lock ./
# Prune dev dependencies for a smaller image
RUN yarn install --frozen-lockfile --production=true

# Expose the port the tRPC server will run on
EXPOSE 3000

CMD ["node", "dist/server/index.js"]

Tekton Pipeline Implementation: A Symphony of Tasks

With the service prepared, we can define the pipeline. We’ll break it down into a series of distinct, reusable Tekton Tasks.

Task 1: Code Quality Gate

The first step in any sane pipeline is to validate the source code itself. This Task clones the repository and runs linting and formatting checks. If this fails, the entire PipelineRun halts immediately, saving compute resources and providing fast feedback.

# tekton/tasks/code-quality.yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: code-quality-gate
spec:
  description: >-
    This task clones a repository, installs dependencies, and runs
    linting and prettier checks.
  params:
    - name: repository-url
      description: The git repository URL to clone from.
    - name: revision
      description: The git revision to check out.
      default: main
  workspaces:
    - name: source
      description: A workspace for the cloned repository.
  results:
    - name: commit-sha
      description: The SHA of the checked-out revision, used downstream to tag the image.
  steps:
    - name: git-clone
      image: alpine/git:v2.36.1
      script: |
        git clone $(params.repository-url) .
        git checkout $(params.revision)
        # Publish the resolved SHA as a Task result so the Pipeline can
        # reference it as $(tasks.code-quality.results.commit-sha).
        git rev-parse HEAD | tr -d '\n' > $(results.commit-sha.path)
      workingDir: $(workspaces.source.path)

    - name: install-dependencies
      image: node:18-alpine
      script: |
        yarn install --frozen-lockfile
      workingDir: $(workspaces.source.path)
    
    - name: prettier-check
      image: node:18-alpine
      script: |
        # In a real-world project, you'd run all checks (lint, etc.)
        # Here we focus on Prettier as the example quality gate.
        echo "Running Prettier format check..."
        yarn prettier:check
      workingDir: $(workspaces.source.path)

A common mistake here is to bundle too many unrelated actions into a single step. By separating cloning, installation, and checking, we get clearer logs and easier debugging if one part fails.

Task 2: Build and Push Container Image

Once the code is validated, we build the container image. We use Kaniko, which builds images from a Dockerfile inside a container without needing a Docker daemon, making it more secure and Kubernetes-friendly.

# tekton/tasks/build-and-push.yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: build-and-push-kaniko
spec:
  description: >-
    Builds a container image using Kaniko and pushes it to a registry.
  params:
    - name: imageUrl
      description: The URL of the image to build and push. (e.g., gcr.io/my-project/my-app)
    - name: imageTag
      description: The tag for the image.
      default: latest
    - name: pathToDockerfile
      description: The path to the Dockerfile.
      default: ./Dockerfile
    - name: pathToContext
      description: The build context path.
      default: .
  workspaces:
    - name: source
      description: Contains the repository source and Dockerfile.
  steps:
    - name: build-and-push
      image: gcr.io/kaniko-project/executor:v1.9.0
      args:
        - --dockerfile=$(params.pathToDockerfile)
        - --context=$(workspaces.source.path)/$(params.pathToContext)
        - --destination=$(params.imageUrl):$(params.imageTag)
      # Assumes a Kubernetes secret named 'docker-registry-credentials' is mounted
      # via a ServiceAccount to provide authentication to the registry.
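
The Task deliberately omits the authentication wiring. What we assume is Tekton’s standard ServiceAccount-based credential initialization: a basic-auth Secret annotated for the target registry and linked to the ServiceAccount the PipelineRun executes under. A sketch, with illustrative names and elided credentials:

# tekton/auth/registry-credentials.yaml (illustrative)
apiVersion: v1
kind: Secret
metadata:
  name: docker-registry-credentials
  annotations:
    tekton.dev/docker-0: https://gcr.io # the registry these credentials apply to
type: kubernetes.io/basic-auth
stringData:
  username: <registry-username>
  password: <registry-password>
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: deploy-pipeline-sa
secrets:
  - name: docker-registry-credentials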

Task 3: Deploy to Green Environment

This is where the blue-green logic begins. This Task takes the newly built image and deploys it as the “green” version. It doesn’t receive any production traffic yet. We use Kustomize to manage manifest variations, but for clarity, here’s the core kubectl action.

# tekton/tasks/deploy-green.yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: deploy-green
spec:
  description: >-
    Deploys the new version of the application to the 'green' environment.
    It creates a new Deployment with a 'color: green' label.
  params:
    - name: imageUrl
      description: The full URL of the new image to deploy.
    - name: deploymentName
      description: The base name of the deployment.
    - name: appNamespace
      description: The namespace where the app is deployed.
  steps:
    - name: apply-green-deployment
      image: bitnami/kubectl:1.24
      script: |
        # In a real project, this manifest would come from a Git repo (GitOps)
        # or be parameterized with kustomize/helm.
        # For this example, we define it inline.
        cat <<EOF | kubectl apply -n $(params.appNamespace) -f -
        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: $(params.deploymentName)-green
          labels:
            app: $(params.deploymentName)
            color: green
        spec:
          replicas: 2
          selector:
            matchLabels:
              app: $(params.deploymentName)
              color: green
          template:
            metadata:
              labels:
                app: $(params.deploymentName)
                color: green
            spec:
              containers:
              - name: app
                image: $(params.imageUrl)
                ports:
                - containerPort: 3000
                readinessProbe:
                  httpGet:
                    path: /health.basic # Use the fast, shallow check for pod readiness
                    port: 3000
                  initialDelaySeconds: 15
                  periodSeconds: 10
        EOF

Notice the readiness probe uses the health.basic endpoint. This is intentional. The pod-level probe should be fast and simple to let Kubernetes know the process is running. The much deeper, application-level health check is the responsibility of the next Task.
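
One piece of supporting infrastructure is implied rather than created by this Task: the deep check in the next Task reaches the green pods through a cluster-internal Service. A minimal sketch, assuming the app name user-profile and a namespace matching the appNamespace parameter:

# k8s/service-green.yaml -- internal-only, used solely for verification
apiVersion: v1
kind: Service
metadata:
  name: user-profile-green
  namespace: production # assumption: same as the appNamespace parameter
spec:
  selector:
    app: user-profile
    color: green # only green pods back this service
  ports:
    - port: 3000
      targetPort: 3000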

Task 4: Verify Green Environment Readiness

This is the most critical Task and where our CAP-aware logic resides. It repeatedly polls the health.deepCheck tRPC endpoint of the green service until it reports a “ready” status or a timeout is reached. This check ensures that the new version can connect to all its dependencies before we even consider sending traffic its way.

# tekton/tasks/verify-green.yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: verify-green
spec:
  description: >-
    Performs a deep health check against the green deployment to ensure
    it's ready to receive traffic.
  params:
    - name: deploymentName
      description: The base name of the deployment.
    - name: appNamespace
      description: The namespace where the app is deployed.
  steps:
    - name: wait-for-rollout
      image: bitnami/kubectl:1.24
      script: |
        kubectl rollout status deployment/$(params.deploymentName)-green -n $(params.appNamespace) --timeout=2m

    - name: deep-health-check
      # A plain curl container suffices here: tRPC query procedures are exposed
      # as HTTP GET endpoints, so no dedicated tRPC client is required.
      image: curlimages/curl:7.83.1
      script: |
        # The green deployment is not exposed externally; we reach it via the
        # cluster-internal DNS name of the 'green' verification Service
        # (shown earlier alongside the deploy-green Task).
        GREEN_SERVICE_URL="http://$(params.deploymentName)-green.$(params.appNamespace).svc.cluster.local:3000"

        echo "Starting deep health check against ${GREEN_SERVICE_URL}..."

        # Poll for up to 90 seconds (18 attempts, 5 seconds apart). This runs
        # under POSIX sh, so we use seq rather than bash brace expansion.
        for i in $(seq 1 18); do
          # The path for a tRPC query procedure is `[router].[procedure]`.
          # deepCheck takes no input, so a bare GET is enough; a dependency
          # failure surfaces as a thrown error, i.e., a non-200 status.
          res=$(curl -s -o /dev/null -w "%{http_code}" "${GREEN_SERVICE_URL}/health.deepCheck" || echo "000")

          if [ "$res" = "200" ]; then
            echo "Deep health check successful. Green environment is ready."
            exit 0
          fi
          echo "Health check returned ${res}. Retrying in 5 seconds..."
          sleep 5
        done

        echo "Green environment did not become ready in time."
        exit 1

The pitfall here is using the wrong level of check. A simple check on the pod state is not enough. We must invoke the application’s own logic that validates its dependencies, confirming its place in the distributed system before promoting it.

Task 5: Promote Green to Blue (The Traffic Switch)

Once the green environment is verified, we perform the atomic switch. This is done by patching the main Kubernetes Service to change its label selector from color: blue to color: green. Kubernetes handles the rest, redirecting all new traffic instantly.

# tekton/tasks/promote-green.yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: promote-green
spec:
  description: >-
    Patches the main service to select the 'green' deployment, making it live.
  params:
    - name: serviceName
      description: The name of the main production service.
    - name: appNamespace
      description: The namespace where the app is deployed.
  steps:
    - name: switch-traffic
      image: bitnami/kubectl:1.24
      script: |
        echo "Switching production traffic to green deployment..."
        # This is the atomic switch: a single strategic-merge patch on the
        # Service selector, applied by the API server in one write.
        kubectl patch svc $(params.serviceName) -n $(params.appNamespace) -p '{"spec":{"selector":{"color":"green"}}}'
        echo "Traffic switch complete."

Task 6: Decommission Old Blue

After a successful switch, we need to clean up the old environment. This Task simply deletes the old blue Deployment. In a more cautious setup, you might leave it for a period to allow for a quick rollback, but for this flow, we remove it.

# tekton/tasks/decommission-blue.yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: decommission-blue
spec:
  description: >-
    Deletes the old 'blue' deployment after a successful switch to green.
  params:
    - name: deploymentName
      description: The base name for the deployment.
    - name: appNamespace
      description: The namespace where the app is deployed.
  steps:
    - name: delete-old-deployment
      image: bitnami/kubectl:1.24
      script: |
        echo "Deleting old blue deployment..."
        # --ignore-not-found keeps the step green on the very first run,
        # when no blue deployment exists yet.
        kubectl delete deployment $(params.deploymentName)-blue -n $(params.appNamespace) --ignore-not-found=true
        echo "Old deployment decommissioned."

Assembling the Tekton Pipeline

With all the Tasks defined, we chain them together in a Pipeline. This object defines the execution order, data flow (via workspaces), and parameters.

graph TD
    A[Start] --> B(Git Clone & Quality Gate);
    B --> C{Quality OK?};
    C -- Yes --> D[Build & Push Image];
    C -- No --> E[Fail];
    D --> F[Deploy Green];
    F --> G[Verify Green Readiness];
    G --> H{Green Ready?};
    H -- Yes --> I[Promote Green to Blue];
    H -- No --> J[Abort & Clean Up Green];
    I --> K[Decommission Old Blue];
    K --> L[End];
    J --> L;

# tekton/pipeline.yaml
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: trpc-blue-green-deploy
spec:
  description: >-
    A full blue-green deployment pipeline for the tRPC service.
  params:
    - name: repository-url
      type: string
    - name: revision
      type: string
      default: "main"
    - name: app-name
      type: string
      default: "user-profile"
    - name: image-url
      type: string
      description: Base image URL without tag
    - name: app-namespace
      type: string
  workspaces:
    - name: shared-workspace
  
  tasks:
    - name: code-quality
      taskRef:
        name: code-quality-gate
      workspaces:
        - name: source
          workspace: shared-workspace
      params:
        - name: repository-url
          value: $(params.repository-url)
        - name: revision
          value: $(params.revision)

    - name: build-image
      taskRef:
        name: build-and-push-kaniko
      runAfter: [code-quality]
      workspaces:
        - name: source
          workspace: shared-workspace
      params:
        - name: imageUrl
          value: $(params.image-url)
        - name: imageTag
          value: $(tasks.code-quality.results.commit-sha) # Emitted by the git-clone step in Task 1
    
    - name: deploy-to-green
      taskRef:
        name: deploy-green
      runAfter: [build-image]
      params:
        - name: imageUrl
          value: "$(params.image-url):$(tasks.code-quality.results.commit-sha)"
        - name: deploymentName
          value: $(params.app-name)
        - name: appNamespace
          value: $(params.app-namespace)

    - name: verify-green-environment
      taskRef:
        name: verify-green
      runAfter: [deploy-to-green]
      params:
        - name: deploymentName
          value: $(params.app-name)
        - name: appNamespace
          value: $(params.app-namespace)
          
    - name: promote-to-production
      taskRef:
        name: promote-green
      runAfter: [verify-green-environment]
      params:
        - name: serviceName
          value: $(params.app-name) # Assumes the service name matches the app name
        - name: appNamespace
          value: $(params.app-namespace)

    # After the switch, the green deployment IS the new live version, but the
    # next run expects the live version under the 'blue' name. Deployment
    # selectors are immutable, so we cannot relabel in place; instead, once
    # the old blue is gone, we recreate the same image under the blue name,
    # repoint the service (a second flip between identical images, so it is
    # safe), and drop the green copy. Script-based and somewhat fragile; see
    # the discussion below.
    - name: relabel-green-to-blue
      runAfter: [cleanup-old-blue]
      params:
        - name: app-name
          value: $(params.app-name)
        - name: app-namespace
          value: $(params.app-namespace)
        - name: image-url
          value: "$(params.image-url):$(tasks.code-quality.results.commit-sha)"
      taskSpec:
        params:
          - name: app-name
          - name: app-namespace
          - name: image-url
        steps:
          - name: recreate-as-blue
            image: bitnami/kubectl:1.24
            script: |
              # The manifest mirrors the inline one in Task 3 with blue labels.
              cat <<EOF | kubectl apply -n $(params.app-namespace) -f -
              apiVersion: apps/v1
              kind: Deployment
              metadata:
                name: $(params.app-name)-blue
                labels: {app: $(params.app-name), color: blue}
              spec:
                replicas: 2
                selector:
                  matchLabels: {app: $(params.app-name), color: blue}
                template:
                  metadata:
                    labels: {app: $(params.app-name), color: blue}
                  spec:
                    containers:
                    - name: app
                      image: $(params.image-url)
                      ports:
                      - containerPort: 3000
              EOF
              kubectl rollout status deployment/$(params.app-name)-blue -n $(params.app-namespace) --timeout=2m
              kubectl patch svc $(params.app-name) -n $(params.app-namespace) -p '{"spec":{"selector":{"color":"blue"}}}'
              kubectl delete deployment $(params.app-name)-green -n $(params.app-namespace)
    - name: cleanup-old-blue
      taskRef:
        name: decommission-blue
      runAfter: [promote-to-production] # Remove the old blue only once green is serving traffic
      params:
        - name: deploymentName
          value: $(params.app-name)
        - name: appNamespace
          value: $(params.app-namespace)

The final Pipeline definition reveals the true complexity. The flow is logical, but managing state (which color is currently live?) and ensuring atomicity at each step is paramount. Cycling the label from green back to blue is a common pain point: because Deployment selectors are immutable, the live deployment must be recreated under the blue name rather than relabeled in place. A more robust implementation might use a GitOps-style controller that derives the desired state from a Git repository, with the pipeline merely updating that repository. The script-based approach shown here works, but it is fragile.
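
For completeness, a PipelineRun to trigger the whole flow might look like the following sketch; the volume-backed workspace and the ServiceAccount name tie back to the registry-auth wiring assumed earlier, and the parameter values are illustrative:

# tekton/pipelinerun.yaml (illustrative)
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: trpc-blue-green-deploy-run-
spec:
  serviceAccountName: deploy-pipeline-sa # carries the registry credentials
  pipelineRef:
    name: trpc-blue-green-deploy
  params:
    - name: repository-url
      value: https://github.com/example/user-profile.git
    - name: image-url
      value: gcr.io/my-project/user-profile
    - name: app-namespace
      value: production
  workspaces:
    - name: shared-workspace
      volumeClaimTemplate:
        spec:
          accessModes: [ReadWriteOnce]
          resources:
            requests:
              storage: 1Gi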

This entire structure provides a deployment mechanism that understands and respects the architectural decisions of the application it serves. It elevates the CI/CD pipeline from a simple build-and-run tool to a critical component of the system’s overall consistency and availability strategy. The key was to stop treating deployment as a generic process and instead tailor it to the specific guarantees—and lack thereof—provided by our tRPC service, a direct application of CAP theorem principles to the operational domain.

The current solution still has boundaries. Database schema migrations are an entirely separate and unsolved problem that must happen out-of-band. The traffic switch is instantaneous and total; there is no provision for canarying a small percentage of traffic to the green environment for smoke testing under real load. Rollback, while possible by manually repatching the service to the old blue (if not yet decommissioned), is not an automated, one-click action within the pipeline. These represent the next frontier of maturity for this system, likely requiring integration with a service mesh to manage fine-grained traffic shaping and more sophisticated release strategies.
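
Until then, the manual rollback amounts to a single selector patch, assuming the old blue deployment has not yet been decommissioned (names match the examples above):

# Manual rollback sketch: repoint the stable service at the previous color.
kubectl patch svc user-profile -n production -p '{"spec":{"selector":{"color":"blue"}}}'

Promoting that one-liner into a first-class, automated pipeline task is the obvious next step.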

