The initial workflow was a predictable bottleneck. A data science team would finalize a new statistical model using Python and SciPy, run it against a validation dataset on a shared compute server, and then manually transfer the resulting CSV files and matplotlib plots to a shared drive. A separate frontend team would then be notified via email to manually update a static Nuxt.js dashboard that displayed these results. This process was not only glacial but also riddled with opportunities for human error—wrong files being copied, versions becoming mismatched, and no auditable trail of which model version produced which result. Rolling back a faulty update was a painful, manual scavenger hunt. The mandate was clear: automate this entire lifecycle, from code commit to result visualization, in a way that was repeatable, auditable, and resilient.
Our initial concept was a GitOps-driven pipeline. A git push to the model repository’s main branch should serve as the single trigger for the entire validation and deployment sequence. This immediately brought us to the choice of a CI/CD engine. We operate entirely on Kubernetes, so a cloud-native solution was non-negotiable. While Jenkins is powerful, its plugin-heavy nature and reliance on a persistent master felt like a step backward in a world of declarative, ephemeral infrastructure. We needed something that treated pipelines as code, defined by the same Kubernetes primitives we used for our applications. Tekton was the logical choice. Its Custom Resource Definitions (Task, Pipeline, PipelineRun) integrated seamlessly into our existing kubectl and GitOps workflows.
For the frontend, the team was already proficient with Vue.js, making Nuxt.js a natural fit. Its server-side rendering capability was a key advantage, as it could pre-render pages with the latest model results, improving initial load times for stakeholders. The core challenge wasn’t in the individual components but in the orchestration: how could Tekton manage a long-running, computationally intensive SciPy workload, handle its unique artifacts, and conditionally trigger a Nuxt.js frontend deployment, all within a single, coherent pipeline?
The first step was to decouple the SciPy model from its development environment by containerizing it. A common mistake here is to create bloated, insecure images. Our first pass at a Dockerfile was naive:
# Dockerfile.v1 - The naive approach
FROM python:3.9
WORKDIR /app
# This copies everything, including virtual environments, IDE configs, etc.
COPY . .
# This can be slow and pull in unnecessary dev dependencies
RUN pip install -r requirements.txt
CMD ["python", "run_model.py"]
This produced a multi-gigabyte image and was slow to build. In a real-world project, build time is a critical component of CI/CD feedback loops. We moved to a multi-stage build to produce a lean, production-ready artifact.
# Dockerfile.v2 - Production-grade multi-stage build
# --- Builder Stage ---
# Use a full-featured base image to build dependencies
FROM python:3.9-slim-buster as builder
WORKDIR /usr/src/app
# Install build-time dependencies for scientific packages
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
gfortran \
libopenblas-dev \
liblapack-dev
# Create a virtual environment to isolate packages
ENV VIRTUAL_ENV=/opt/venv
RUN python -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
# Copy only the requirements file first to leverage Docker layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# --- Final Stage ---
# Use a minimal base image for the final artifact
FROM python:3.9-slim-buster
WORKDIR /app
# Copy the virtual environment from the builder stage
COPY --from=builder /opt/venv /opt/venv
# Copy the application source code
COPY src/ /app/src/
# Activate the virtual environment
ENV PATH="/opt/venv/bin:$PATH"
# Set up a non-root user for security
RUN useradd --create-home appuser
USER appuser
# Define the entrypoint for the model execution
ENTRYPOINT ["python", "src/run_model.py"]
This approach dramatically reduced the final image size and improved security by running as a non-root user. The pitfall here is managing system-level dependencies required by libraries like NumPy and SciPy, which is why the builder stage explicitly installs gfortran and libopenblas-dev.
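A .dockerignore alongside the Dockerfile keeps the build context itself small, which speeds up both variants and prevents stray local artifacts from leaking into images; a minimal sketch with illustrative entries:
# .dockerignore - keep local artifacts out of the build context (entries are illustrative)
.git
.venv/
venv/
__pycache__/
*.pyc
.ipynb_checkpoints/
.idea/
.vscode/
results/
data/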
With a reliable way to build the model container, we designed the Tekton Tasks. A Tekton Pipeline is just a directed acyclic graph (DAG) of Tasks, where each Task is a series of steps executed in a pod. Our pipeline required several distinct logical units.
First, a standard git-clone task to fetch the source code. Tekton Hub provides a catalog of reusable tasks for this. Next, we needed to build and push the container image from within the cluster. Using Docker-in-Docker is a common but problematic pattern due to its security implications. Instead, we opted for Kaniko, which builds images in userspace.
Here is the Tekton Task for building the model image:
# tekton/tasks/build-model-image.yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
name: kaniko-build-model
spec:
params:
- name: imageUrl
description: The URL of the image to build and push
type: string
- name: dockerfilePath
description: Path to the Dockerfile
type: string
default: ./Dockerfile
  workspaces:
    - name: source
      description: The workspace containing the source code and Dockerfile
  results:
    - name: image-url
      description: The pushed image reference, consumed by downstream tasks
  steps:
    - name: build-and-push
      image: gcr.io/kaniko-project/executor:v1.9.0
      # Kaniko doesn't need privileged access, a major security win
      securityContext:
        runAsUser: 0 # Kaniko requires root to unpack the base image filesystem
      args:
        - --dockerfile=$(params.dockerfilePath)
        - --context=dir://$(workspaces.source.path)
        - --destination=$(params.imageUrl)
        # Add --no-push for local testing or dry runs
    - name: write-image-url
      image: busybox
      script: |
        #!/bin/sh
        # Expose the pushed image reference as a Task result for later Pipeline tasks
        echo -n "$(params.imageUrl)" > "$(results.image-url.path)"
The most critical part was the run-model-validation task. Our SciPy script was designed to take an input data path and produce an output directory with results. In Tekton, passing data between tasks is handled by Workspaces. A Workspace is an abstraction over a storage volume (like a PersistentVolumeClaim) that can be mounted by multiple Tasks in a Pipeline.
# tekton/tasks/run-model-validation.yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
name: run-model-validation
spec:
params:
- name: modelImageUrl
description: The fully qualified image name of the model to run
type: string
workspaces:
- name: validation-data
description: Workspace for input data and output results
results:
- name: validation-status
description: The outcome of the validation ('success' or 'failure').
  steps:
    - name: execute-model
      image: $(params.modelImageUrl)
      # The declared workspace is mounted automatically by Tekton; no explicit volumes are needed
      script: |
        #!/bin/sh
        set -e # Exit immediately if a command exits with a non-zero status.
        echo "Starting model validation..."
        # The script reads from and writes to the shared workspace
        INPUT_DIR="$(workspaces.validation-data.path)/input"
        OUTPUT_DIR="$(workspaces.validation-data.path)/output"
        # Create output directory
        mkdir -p $OUTPUT_DIR
        # The Python script handles its own logging.
        # It must exit with 0 on success and non-zero on failure.
        python /app/src/run_model.py --input $INPUT_DIR --output $OUTPUT_DIR
        echo "Model validation script finished successfully."
        # A simple success condition: check if a results file was created.
        if [ -f "$OUTPUT_DIR/summary.json" ]; then
          echo -n "success" | tee $(results.validation-status.path)
        else
          echo "Error: summary.json not found in output."
          echo -n "failure" | tee $(results.validation-status.path)
          exit 1
        fi
A significant problem emerged here: our model validation could take anywhere from 30 minutes to several hours. A Tekton TaskRun pod is not designed to be a long-running batch job. If the cluster node experiences issues or the pod is evicted, the entire run fails without a clear state. The correct pattern for this in a real-world project is to decouple job submission from job execution. The Tekton Task should not run the computation itself; it should submit a Kubernetes Job and then monitor it.
This architectural shift adds complexity but provides immense resilience. The Kubernetes Job controller ensures the pod runs to completion, handling retries and node failures.
Here is the refined Task definition:
# tekton/tasks/run-model-validation-job.yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
name: run-model-validation-as-job
spec:
params:
- name: modelImageUrl
description: The image of the model to run.
- name: jobNamePrefix
description: Prefix for the Kubernetes Job name.
type: string
default: model-validation-
workspaces:
- name: shared-data
description: PVC for input/output.
steps:
- name: submit-job
image: bitnami/kubectl:latest
script: |
#!/bin/sh
set -e
JOB_NAME="$(params.jobNamePrefix)$(context.taskRun.name)"
echo "Submitting Kubernetes Job: $JOB_NAME"
# Dynamically create the Job manifest
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
name: ${JOB_NAME}
spec:
template:
spec:
containers:
- name: model-runner
image: $(params.modelImageUrl)
args: ["--input", "/data/input", "--output", "/data/output"]
volumeMounts:
- name: data-volume
mountPath: /data
volumes:
- name: data-volume
persistentVolumeClaim:
claimName: $(workspaces.shared-data.claimName)
restartPolicy: Never
backoffLimit: 2 # Retry twice on failure
EOF
echo "Waiting for Job ${JOB_NAME} to complete..."
# This is the monitoring loop. It's more robust than just running the process.
kubectl wait --for=condition=complete --timeout=4h job/${JOB_NAME}
# Optional: Check for job failure and explicitly fail the task
JOB_STATUS=$(kubectl get job ${JOB_NAME} -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}')
if [ "${JOB_STATUS}" = "True" ]; then
echo "Job ${JOB_NAME} failed."
kubectl logs job/${JOB_NAME} # Dump logs for debugging
exit 1
fi
echo "Job ${JOB_NAME} completed successfully."
With these core Tasks defined, we assembled them into a Pipeline. The pipeline ensures a logical flow, passing parameters and workspaces between tasks. We also introduced conditional execution using when expressions, so the deployment tasks only run if the validation was successful.
graph TD
    A[Start] --> B(git-clone);
    B --> C(build-model-image);
    C --> D(run-model-validation-job);
    D --> E{Validation Successful?};
    E -- Yes --> F(publish-results-to-s3);
    F --> G(build-and-deploy-frontend);
    G --> H[End];
    E -- No --> I(notify-failure);
    I --> H;
Here is a snippet of the Pipeline definition showing the flow:
# tekton/pipeline.yaml
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
name: model-validation-and-deploy
spec:
workspaces:
- name: shared-workspace
params:
# ... git url, revision, etc.
tasks:
- name: fetch-source
taskRef:
name: git-clone
workspaces:
- name: output
workspace: shared-workspace
# ... params
- name: build-model
taskRef:
name: kaniko-build-model
runAfter: [fetch-source]
workspaces:
- name: source
workspace: shared-workspace
# ... params for image name, etc.
- name: run-validation
taskRef:
name: run-model-validation-job
runAfter: [build-model]
workspaces:
- name: shared-data
workspace: shared-workspace # Using same PVC for simplicity here
params:
- name: modelImageUrl
value: $(tasks.build-model.results.image-url) # Get image URL from previous task
- name: deploy-frontend
taskRef:
name: build-and-deploy-nuxt
runAfter: [run-validation]
# THIS IS THE CRITICAL PART FOR CONDITIONAL EXECUTION
when:
- input: "$(tasks.run-validation.results.validation-status)"
operator: in
values: ["success"]
# ...
The final piece of the puzzle was integrating the Nuxt.js frontend. The frontend doesn’t participate in the pipeline execution; it consumes the pipeline’s output. We designed the publish-results task to upload the validation artifacts (CSV, plots, summary JSON) to an S3-compatible object store. Crucially, it also overwrites a manifest.json file at a known location in the bucket:
// s3://model-results/manifest.json
{
"latestRunId": "model-validation-xyz-123",
"lastUpdated": "2023-10-27T10:00:00Z",
"status": "SUCCESS",
"artifacts": {
"summary": "results/model-validation-xyz-123/summary.json",
"plot": "results/model-validation-xyz-123/performance_plot.png"
},
"history": [
{ "runId": "model-validation-xyz-123", "status": "SUCCESS", "timestamp": "..." },
{ "runId": "model-validation-abc-456", "status": "FAILURE", "timestamp": "..." }
]
}
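We have not reproduced the full publish-results task here; a minimal sketch of the upload step, assuming the AWS CLI image, a Secret named s3-credentials holding the access keys, and that manifest.json has already been assembled into the run's output directory (all of these names are illustrative):
# tekton/tasks/publish-results.yaml (sketch; bucket, secret, and param names are illustrative)
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: publish-results
spec:
  params:
    - name: runId
      type: string
    - name: bucket
      type: string
      default: model-results
  workspaces:
    - name: shared-data
  steps:
    - name: upload
      image: amazon/aws-cli:2.13.0
      # Access keys come from a Secret; an S3-compatible endpoint can be configured via environment
      envFrom:
        - secretRef:
            name: s3-credentials
      script: |
        #!/bin/sh
        set -e
        OUTPUT_DIR="$(workspaces.shared-data.path)/output"
        # Upload all artifacts for this run under a versioned prefix
        aws s3 cp "$OUTPUT_DIR" "s3://$(params.bucket)/results/$(params.runId)/" --recursive
        # Overwrite the manifest the dashboard reads (assumed to be assembled earlier in the run)
        aws s3 cp "$OUTPUT_DIR/manifest.json" "s3://$(params.bucket)/manifest.json"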
The Nuxt.js application was configured to fetch this manifest.json file on the server-side during page generation (asyncData or useFetch). This meant users always saw the latest successfully validated results upon visiting the dashboard. We avoided a complex real-time WebSocket setup, as a simple polling mechanism (or even just data fetched at deploy time) was sufficient for our use case. The pragmatic choice often involves trading real-time complexity for operational simplicity.
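A minimal sketch of the dashboard page using Nuxt 3's useFetch, assuming the bucket is exposed through a public runtime config value named resultsBaseUrl (illustrative):
<!-- pages/index.vue (sketch; resultsBaseUrl is an assumed public runtime config value) -->
<script setup lang="ts">
interface Manifest {
  latestRunId: string
  lastUpdated: string
  status: 'SUCCESS' | 'FAILURE'
  artifacts: { summary: string; plot: string }
}

const config = useRuntimeConfig()
// Runs on the server during SSR, so the first render already contains the latest results
const { data: manifest, error } = await useFetch<Manifest>(
  `${config.public.resultsBaseUrl}/manifest.json`
)
</script>

<template>
  <main>
    <p v-if="error">Could not load model results.</p>
    <section v-else-if="manifest">
      <h1>Latest validated run: {{ manifest.latestRunId }}</h1>
      <p>Status: {{ manifest.status }} ({{ manifest.lastUpdated }})</p>
      <img :src="`${config.public.resultsBaseUrl}/${manifest.artifacts.plot}`" alt="Model performance plot" />
    </section>
  </main>
</template>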
The build-and-deploy-nuxt task itself was straightforward: another multi-stage Dockerfile to build the Nuxt application, and a final step using kubectl to set image on our existing Kubernetes Deployment for the frontend, triggering a rolling update.
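A condensed sketch of that task, assuming Kaniko builds the frontend image from a frontend/ directory and the Deployment is named results-dashboard with a container named nuxt (all of these names are illustrative):
# tekton/tasks/build-and-deploy-nuxt.yaml (sketch; deployment, container, and path names are illustrative)
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: build-and-deploy-nuxt
spec:
  params:
    - name: frontendImageUrl
      type: string
  workspaces:
    - name: source
  steps:
    - name: build-and-push
      image: gcr.io/kaniko-project/executor:v1.9.0
      securityContext:
        runAsUser: 0 # Kaniko requires root to unpack base image layers
      args:
        - --dockerfile=Dockerfile
        - --context=dir://$(workspaces.source.path)/frontend
        - --destination=$(params.frontendImageUrl)
    - name: rollout
      image: bitnami/kubectl:latest
      script: |
        #!/bin/sh
        set -e
        # Point the existing Deployment at the new image; this triggers a rolling update
        kubectl set image deployment/results-dashboard nuxt=$(params.frontendImageUrl)
        kubectl rollout status deployment/results-dashboard --timeout=5m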
The entire system is triggered by a webhook from our Git repository, which is received by a Tekton EventListener. This listener parses the Git payload, extracts information like the commit SHA, and uses a TriggerTemplate to instantiate and run the PipelineRun.
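The trigger plumbing is mostly boilerplate; a trimmed sketch of the binding and template, assuming a GitHub-style webhook payload (the EventListener and interceptor wiring are omitted, and the resource and PVC names are illustrative):
# tekton/triggers.yaml (trimmed sketch; assumes a GitHub-style webhook payload)
apiVersion: triggers.tekton.dev/v1beta1
kind: TriggerBinding
metadata:
  name: model-repo-binding
spec:
  params:
    # Pulled from the webhook payload by the EventListener
    - name: git-url
      value: $(body.repository.clone_url)
    - name: git-revision
      value: $(body.head_commit.id)
---
apiVersion: triggers.tekton.dev/v1beta1
kind: TriggerTemplate
metadata:
  name: model-validation-template
spec:
  params:
    - name: git-url
    - name: git-revision
  resourcetemplates:
    - apiVersion: tekton.dev/v1beta1
      kind: PipelineRun
      metadata:
        generateName: model-validation-
      spec:
        pipelineRef:
          name: model-validation-and-deploy
        params:
          - name: git-url
            value: $(tt.params.git-url)
          - name: git-revision
            value: $(tt.params.git-revision)
        workspaces:
          - name: shared-workspace
            persistentVolumeClaim:
              claimName: model-pipeline-pvc # the single shared PVC discussed below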
The final result is a system where a data scientist can push a code change, and within minutes (or hours, depending on the model), see the validated results appear on the dashboard automatically. The entire history is captured in the PipelineRun logs in Kubernetes and the versioned artifacts in our object store.
This architecture is not without its limitations. The use of a single, shared PVC for the workspace is a simplification. In a high-concurrency scenario, multiple pipeline runs would conflict. A more robust solution involves dynamically provisioning a new PVC for each PipelineRun, which Tekton supports through a volumeClaimTemplate on the workspace binding (sketched below). Furthermore, the error reporting from the SciPy script back to the pipeline is coarse: it is just a success/failure exit code. A future iteration would involve the Python script emitting a structured JSON log that the pipeline could parse, allowing for more granular failure analysis and reporting directly in the Nuxt UI. Finally, this pipeline only deploys the results; a true MLOps pipeline would also version and deploy the model itself as a callable API endpoint, possibly using a canary release strategy managed by a service mesh or ingress controller, which would be the next logical evolution of this platform.
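For reference, moving from the shared PVC to a per-run claim is a small change to the PipelineRun spec generated by the TriggerTemplate; a sketch (the storage request is illustrative):
# In the TriggerTemplate's PipelineRun spec: provision a fresh PVC per run instead of the shared claim
workspaces:
  - name: shared-workspace
    volumeClaimTemplate:
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi # illustrative size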