The transition from rolling updates to a more robust deployment strategy became unavoidable. Our core user-profile service, built with tRPC for its end-to-end typesafety, was exhibiting alarming behavior during deployments. While technically “zero-downtime” in that pods were always running, the brief period where old and new versions coexisted was causing data consistency anomalies. The service was designed with an AP (Availability, Partition Tolerance) posture in mind, relying on eventual consistency between its primary database and a Redis-based cache. Rolling updates, which gradually introduce new pods, created a state where clients could hit an old API version that was unaware of a schema change initiated by a new version, leading to transient data corruption. This wasn’t a failure of the service, but a failure of a deployment strategy that was ignorant of the system’s architectural trade-offs, a direct confrontation with the realities of the CAP theorem in a CI/CD context.
Our initial concept was to enforce atomicity on the deployment process itself: a blue-green strategy. The entire live environment (“blue”) would remain untouched while a complete, parallel environment (“green”) was spun up with the new version. Only after the green environment was verified as fully operational would we switch traffic instantaneously. This approach prevents mixed versions from ever serving production traffic simultaneously. The challenge, however, shifted from the application runtime to the CI/CD pipeline. How could we automate this process reliably within our Kubernetes ecosystem? And more critically, how could our pipeline intelligently determine that the “green” environment was truly “ready,” respecting its AP nature without demanding strict consistency (which it wasn’t designed for) during its health checks?
The technology selection was guided by our existing Kubernetes-native infrastructure and a philosophy of using declarative, composable tools. For the CI/CD engine, Tekton was the clear choice. Its CRD-based, container-centric model allowed us to define our pipeline steps as reusable Tasks directly within our cluster, avoiding the impedance mismatch of external CI systems trying to manipulate Kubernetes resources. The tRPC service was a non-negotiable part of the stack; its typesafety benefits were too significant to abandon. The task was to build a deployment harness around it. Finally, to enforce discipline, Prettier was integrated as a mandatory quality gate. A pipeline this complex could not be compromised by trivial formatting disputes or unreadable code. Code must be verifiable before it even attempts to build.
The core of our solution lies in creating a series of Tekton Tasks that codify each step of the blue-green process, from code validation to the final traffic switch. This journey begins not in Kubernetes, but with the tRPC service itself, ensuring it’s designed for this deployment model.
The tRPC Service: Designed for Verifiable Readiness
A standard health check that returns a 200 OK is insufficient here. It confirms the process is running, but says nothing about its ability to correctly interact with its dependencies. We needed a readiness probe that aligned with our AP design. The service might temporarily serve slightly stale data from its cache, which is acceptable, but it must be able to connect to its primary database and its Redis cache.
We defined a dedicated health router within our tRPC application with a special procedure, deepCheck.
// src/server/routers/health.ts
import { z } from 'zod';
import { publicProcedure, router } from '../trpc';
import { prisma } from '../prisma';
import { redisClient } from '../redis';
export const healthRouter = router({
/**
* A shallow health check, useful for K8s liveness probes.
* Just confirms the server is up and responding.
*/
basic: publicProcedure.query(() => {
return { status: 'ok' };
}),
/**
* A deep readiness check. This is what our Tekton pipeline will call.
* It verifies connectivity to critical downstream services.
* In an AP system, this check confirms the node is ready to join the cluster
* and will eventually become consistent. It doesn't check for data consistency itself.
*/
deepCheck: publicProcedure
.output(
z.object({
status: z.string(),
database: z.string(),
cache: z.string(),
})
)
.query(async () => {
let dbStatus: 'ok' | 'error' = 'ok';
let cacheStatus: 'ok' | 'error' = 'ok';
try {
// A simple, non-blocking query to check DB connectivity.
await prisma.$queryRaw`SELECT 1`;
} catch (e) {
dbStatus = 'error';
console.error('Database connectivity check failed', e);
}
try {
// Check Redis connectivity.
const pong = await redisClient.ping();
if (pong !== 'PONG') {
throw new Error('Redis ping failed');
}
} catch (e) {
cacheStatus = 'error';
console.error('Cache connectivity check failed', e);
}
const overallStatus = dbStatus === 'ok' && cacheStatus === 'ok' ? 'ready' : 'degraded';
if (overallStatus === 'degraded') {
// This will cause the procedure to throw, which our health check script can interpret as a failure.
throw new Error(`Deep health check failed: DB=${dbStatus}, Cache=${cacheStatus}`);
}
return {
status: overallStatus,
database: dbStatus,
cache: cacheStatus,
};
}),
});
// The main appRouter combines this with other routers.
// ...
The accompanying Dockerfile is a standard multi-stage build to keep the final image lean.
# ---- Base ----
FROM node:18-alpine AS base
WORKDIR /app
# ---- Dependencies ----
FROM base AS deps
COPY package.json yarn.lock ./
RUN yarn install --frozen-lockfile --production=false
# ---- Builder ----
FROM base AS builder
COPY --from=deps /app/node_modules ./node_modules
COPY . .
# Run code format check as part of the build.
# The pipeline will do this earlier, but this is a safety net.
RUN yarn prettier:check
RUN yarn build
# ---- Production ----
FROM base AS production
ENV NODE_ENV=production
COPY --from=deps /app/node_modules ./node_modules
COPY --from=builder /app/dist ./dist
COPY package.json yarn.lock ./
# Prune dev dependencies for a smaller image
RUN yarn install --frozen-lockfile --production=true
# Expose the port the tRPC server will run on
EXPOSE 3000
CMD ["node", "dist/server/index.js"]
Tekton Pipeline Implementation: A Symphony of Tasks
With the service prepared, we can define the pipeline. We’ll break it down into a series of distinct, reusable Tekton Tasks.
Task 1: Code Quality Gate
The first step in any sane pipeline is to validate the source code itself. This Task clones the repository and runs linting and formatting checks. If this fails, the entire PipelineRun halts immediately, saving compute resources and providing fast feedback.
# tekton/tasks/code-quality.yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
name: code-quality-gate
spec:
description: >-
This task clones a repository, installs dependencies, and runs
linting and prettier checks.
params:
- name: repository-url
description: The git repository URL to clone from.
- name: revision
description: The git revision to check out.
default: main
workspaces:
- name: source
description: A workspace for the cloned repository.
  results:
    - name: commit-sha
      description: The commit SHA that was checked out, used downstream as the image tag.
  steps:
    - name: git-clone
      image: alpine/git:v2.36.1
      script: |
        git clone $(params.repository-url) .
        git checkout $(params.revision)
        # Record the resolved SHA as a Task result so later tasks can tag the image with it.
        git rev-parse HEAD | tr -d '\n' > $(results.commit-sha.path)
      workingDir: $(workspaces.source.path)
- name: install-dependencies
image: node:18-alpine
script: |
yarn install --frozen-lockfile
workingDir: $(workspaces.source.path)
- name: prettier-check
image: node:18-alpine
script: |
# In a real-world project, you'd run all checks (lint, etc.)
# Here we focus on Prettier as the example quality gate.
echo "Running Prettier format check..."
yarn prettier:check
workingDir: $(workspaces.source.path)
A common mistake here is to bundle too many unrelated actions into a single step. By separating cloning, installation, and checking, we get clearer logs and easier debugging if one part fails.
Task 2: Build and Push Container Image
Once the code is validated, we build the container image. We use Kaniko, which builds images from a Dockerfile inside a container without needing a Docker daemon, making it more secure and Kubernetes-friendly.
# tekton/tasks/build-and-push.yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
name: build-and-push-kaniko
spec:
description: >-
Builds a container image using Kaniko and pushes it to a registry.
params:
- name: imageUrl
description: The URL of the image to build and push. (e.g., gcr.io/my-project/my-app)
- name: imageTag
description: The tag for the image.
default: latest
- name: pathToDockerfile
description: The path to the Dockerfile.
default: ./Dockerfile
- name: pathToContext
description: The build context path.
default: .
workspaces:
- name: source
description: Contains the repository source and Dockerfile.
steps:
- name: build-and-push
image: gcr.io/kaniko-project/executor:v1.9.0
args:
- --dockerfile=$(params.pathToDockerfile)
- --context=$(workspaces.source.path)/$(params.pathToContext)
- --destination=$(params.imageUrl):$(params.imageTag)
# Assumes a Kubernetes secret named 'docker-registry-credentials' is mounted
# via a ServiceAccount to provide authentication to the registry.
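That last comment carries an assumption worth making explicit. One common way to satisfy it is a dockerconfigjson Secret attached to the ServiceAccount that executes the PipelineRun. The names below (build-bot, docker-registry-credentials) are illustrative, not defined elsewhere in this article, and depending on your Tekton and Kaniko versions you may instead need to mount the secret directly into /kaniko/.docker.
# tekton/registry-credentials.yaml (illustrative sketch; names are assumptions)
apiVersion: v1
kind: Secret
metadata:
  name: docker-registry-credentials
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: <base64-encoded docker config>
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: build-bot
secrets:
  - name: docker-registry-credentials
imagePullSecrets:
  - name: docker-registry-credentials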
Task 3: Deploy to Green Environment
This is where the blue-green logic begins. This Task takes the newly built image and deploys it as the “green” version. It doesn’t receive any production traffic yet. We use Kustomize to manage manifest variations, but for clarity, here’s the core kubectl action.
# tekton/tasks/deploy-green.yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
name: deploy-green
spec:
description: >-
Deploys the new version of the application to the 'green' environment.
It creates a new Deployment with a 'color: green' label.
params:
- name: imageUrl
description: The full URL of the new image to deploy.
- name: deploymentName
description: The base name of the deployment.
steps:
- name: apply-green-deployment
image: bitnami/kubectl:1.24
script: |
# In a real project, this manifest would come from a Git repo (GitOps)
# or be parameterized with kustomize/helm.
# For this example, we define it inline.
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
name: $(params.deploymentName)-green
labels:
app: $(params.deploymentName)
color: green
spec:
replicas: 2
selector:
matchLabels:
app: $(params.deploymentName)
color: green
template:
metadata:
labels:
app: $(params.deploymentName)
color: green
spec:
containers:
- name: app
image: $(params.imageUrl)
ports:
- containerPort: 3000
readinessProbe:
httpGet:
path: /health.basic # Use the fast, shallow check for pod readiness
port: 3000
initialDelaySeconds: 15
periodSeconds: 10
EOF
Notice the readiness probe uses the health.basic endpoint. This is intentional. The pod-level probe should be fast and simple to let Kubernetes know the process is running. The much deeper, application-level health check is the responsibility of the next Task.
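Incidentally, the router’s comments describe basic as suitable for liveness probes as well; a sketch of the extra probe block that could sit alongside the readinessProbe in the container spec above, with illustrative timing values:
# Optional livenessProbe for the container spec above (illustrative values)
livenessProbe:
  httpGet:
    path: /health.basic
    port: 3000
  initialDelaySeconds: 15
  periodSeconds: 20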
Task 4: Verify Green Environment Readiness
This is the most critical Task and where our CAP-aware logic resides. It repeatedly polls the health.deepCheck tRPC endpoint of the green service until it reports a “ready” status or a timeout is reached. This check ensures that the new version can connect to all its dependencies before we even consider sending traffic its way.
# tekton/tasks/verify-green.yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
name: verify-green
spec:
description: >-
Performs a deep health check against the green deployment to ensure
it's ready to receive traffic.
params:
- name: deploymentName
description: The base name of the deployment.
- name: appNamespace
description: The namespace where the app is deployed.
steps:
- name: wait-for-rollout
image: bitnami/kubectl:1.24
script: |
kubectl rollout status deployment/$(params.deploymentName)-green -n $(params.appNamespace) --timeout=2m
- name: deep-health-check
      # This step uses a plain curl container. tRPC query procedures are exposed over
      # HTTP GET, so curl is sufficient here; a dedicated tRPC-aware client (analogous
      # to how grpcurl is used for gRPC) would only be needed for richer response assertions.
image: curlimages/curl:7.83.1
script: |
set -e
# The service for green is not exposed externally. We use the internal K8s DNS name.
# We need a headless service to query the pods directly, or a temporary green service.
# Let's assume a 'user-profile-green' service exists for this check.
GREEN_SERVICE_URL="http://$(params.deploymentName)-green.$(params.appNamespace).svc.cluster.local:3000"
echo "Starting deep health check against ${GREEN_SERVICE_URL}..."
        # Poll for up to 90 seconds.
        # The path for a tRPC query procedure is `[router].[procedure]`;
        # this procedure takes no input, so no URL-encoded JSON is appended.
        for i in $(seq 1 18); do
          # '|| true' keeps 'set -e' from aborting the loop while the service is still unreachable.
          res=$(curl -s -o /dev/null -w "%{http_code}" "${GREEN_SERVICE_URL}/health.deepCheck" || true)
          if [ "$res" = "200" ]; then
            echo "Deep health check successful. Green environment is ready."
            exit 0
          fi
          echo "Health check returned '${res}'. Retrying in 5 seconds..."
          sleep 5
        done
echo "Green environment did not become ready in time."
exit 1
The pitfall here is using the wrong level of check. A simple check on the pod state is not enough. We must invoke the application’s own logic that validates its dependencies, confirming its place in the distributed system before promoting it.
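One more assumption is buried in that script: it resolves $(params.deploymentName)-green via cluster DNS, so a green-only Service must already exist, and nothing in the pipeline creates it. A minimal sketch of what that Service could look like, using the user-profile naming from the pipeline parameters:
# k8s/user-profile-green-service.yaml (assumed to exist; not created by the pipeline)
apiVersion: v1
kind: Service
metadata:
  name: user-profile-green
spec:
  selector:
    app: user-profile
    color: green # only ever routes to the candidate pods
  ports:
    - port: 3000
      targetPort: 3000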
Task 5: Promote Green to Blue (The Traffic Switch)
Once the green environment is verified, we perform the atomic switch. This is done by patching the main Kubernetes Service to change its label selector from color: blue to color: green. Kubernetes handles the rest, redirecting all new traffic instantly.
# tekton/tasks/promote-green.yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
name: promote-green
spec:
description: >-
Patches the main service to select the 'green' deployment, making it live.
params:
- name: serviceName
description: The name of the main production service.
steps:
- name: switch-traffic
image: bitnami/kubectl:1.24
script: |
echo "Switching production traffic to green deployment..."
# This is the atomic switch.
kubectl patch svc $(params.serviceName) -p '{"spec":{"selector":{"color":"green"}}}'
echo "Traffic switch complete."
Task 6: Decommission Old Blue
After a successful switch, we need to clean up the old environment. This Task simply deletes the old blue Deployment. In a more cautious setup, you might leave it for a period to allow for a quick rollback, but for this flow, we remove it.
# tekton/tasks/decommission-blue.yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
name: decommission-blue
spec:
description: >-
Deletes the old 'blue' deployment after a successful switch to green.
params:
- name: deploymentName
description: The base name for the deployment.
steps:
- name: delete-old-deployment
image: bitnami/kubectl:1.24
script: |
echo "Deleting old blue deployment..."
        # --ignore-not-found ensures the step succeeds even if the blue deployment
        # doesn't exist (e.g., on the very first deployment).
        kubectl delete deployment $(params.deploymentName)-blue --ignore-not-found=true
echo "Old deployment decommissioned."
Assembling the Tekton Pipeline
With all the Tasks defined, we chain them together in a Pipeline. This object defines the execution order, data flow (via workspaces), and parameters.
graph TD
    A[Start] --> B(Git Clone & Quality Gate);
    B --> C{Quality OK?};
    C -- Yes --> D[Build & Push Image];
    C -- No --> E[Fail];
    D --> F[Deploy Green];
    F --> G[Verify Green Readiness];
    G --> H{Green Ready?};
    H -- Yes --> I[Promote Green to Blue];
    H -- No --> J[Abort & Clean Up Green];
    I --> K[Decommission Old Blue];
    K --> L[End];
    J --> L;
# tekton/pipeline.yaml
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
name: trpc-blue-green-deploy
spec:
description: >-
A full blue-green deployment pipeline for the tRPC service.
params:
- name: repository-url
type: string
- name: revision
type: string
default: "main"
- name: app-name
type: string
default: "user-profile"
- name: image-url
type: string
description: Base image URL without tag
- name: app-namespace
type: string
workspaces:
- name: shared-workspace
tasks:
- name: code-quality
taskRef:
name: code-quality-gate
workspaces:
- name: source
workspace: shared-workspace
params:
- name: repository-url
value: $(params.repository-url)
- name: revision
value: $(params.revision)
- name: build-image
taskRef:
name: build-and-push-kaniko
runAfter: [code-quality]
workspaces:
- name: source
workspace: shared-workspace
params:
- name: imageUrl
value: $(params.image-url)
        - name: imageTag
          value: $(tasks.code-quality.results.commit-sha) # Emitted as a result by the quality gate's git-clone step
- name: deploy-to-green
taskRef:
name: deploy-green
runAfter: [build-image]
params:
- name: imageUrl
value: "$(params.image-url):$(tasks.build-image.results.imageTag)"
- name: deploymentName
value: $(params.app-name)
- name: verify-green-environment
taskRef:
name: verify-green
runAfter: [deploy-to-green]
params:
- name: deploymentName
value: $(params.app-name)
- name: appNamespace
value: $(params.app-namespace)
- name: promote-to-production
taskRef:
name: promote-green
runAfter: [verify-green-environment]
params:
- name: serviceName
value: $(params.app-name) # Assumes service name matches app name
    # Ideally the new live deployment would now be relabeled from 'green' to 'blue'
    # to prepare for the next run, but there is no 'kubectl rename deployment' command,
    # the selector is immutable, and relabeling only the Deployment object does not
    # change the pod labels the Service routes on. This logic still needs refinement
    # for a truly robust cycle: after the switch, the green deployment IS the new live
    # version, and the pipeline has to remember which color is currently serving traffic.
    - name: relabel-green-to-blue
      runAfter: [promote-to-production]
      params:
        - name: app-name
          value: $(params.app-name)
      taskSpec:
        params:
          - name: app-name
        steps:
          - name: relabel
            image: bitnami/kubectl:1.24
            script: |
              kubectl label deployment $(params.app-name)-green color=blue --overwrite=true
- name: cleanup-old-blue
taskRef:
name: decommission-blue
runAfter: [promote-to-production] # This should run after the switch
params:
- name: deploymentName
value: $(params.app-name)
The final Pipeline definition reveals the true complexity. The flow is logical, but managing state (which color is currently live?) and ensuring atomicity at each step is paramount. The relabeling logic from green back to blue is a common pain point. A more robust implementation might use a GitOps-style controller that derives the desired state from a Git repository, with the pipeline merely updating that repository. The current script-based approach works but is more fragile.
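For completeness, triggering a run means creating a PipelineRun that binds the parameters and a workspace volume. A minimal sketch, assuming the cluster can provision a small PersistentVolumeClaim on demand and reusing the illustrative build-bot ServiceAccount from earlier; the repository and registry values are placeholders:
# tekton/pipelinerun.yaml (illustrative sketch)
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: trpc-blue-green-deploy-run-
spec:
  pipelineRef:
    name: trpc-blue-green-deploy
  serviceAccountName: build-bot
  params:
    - name: repository-url
      value: https://github.com/example/user-profile-service.git
    - name: app-name
      value: user-profile
    - name: image-url
      value: gcr.io/my-project/user-profile
    - name: app-namespace
      value: production
  workspaces:
    - name: shared-workspace
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 1Gi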
This entire structure provides a deployment mechanism that understands and respects the architectural decisions of the application it serves. It elevates the CI/CD pipeline from a simple build-and-run tool to a critical component of the system’s overall consistency and availability strategy. The key was to stop treating deployment as a generic process and instead tailor it to the specific guarantees—and lack thereof—provided by our tRPC service, a direct application of CAP theorem principles to the operational domain.
The current solution still has boundaries. Database schema migrations are an entirely separate and unsolved problem that must happen out-of-band. The traffic switch is instantaneous and total; there is no provision for canarying a small percentage of traffic to the green environment for smoke testing under real load. Rollback, while possible by manually repatching the service to the old blue (if not yet decommissioned), is not an automated, one-click action within the pipeline. These represent the next frontier of maturity for this system, likely requiring integration with a service mesh to manage fine-grained traffic shaping and more sophisticated release strategies.