The initial workflow was a predictable bottleneck. A data science team would finalize a new statistical model using Python and SciPy, run it against a validation dataset on a shared compute server, and then manually transfer the resulting CSV files and matplotlib plots to a shared drive. A separate frontend team would then be notified via email to manually update a static Nuxt.js dashboard that displayed these results. This process was not only glacial but also riddled with opportunities for human error—wrong files being copied, versions becoming mismatched, and no auditable trail of which model version produced which result. Rolling back a faulty update was a painful, manual scavenger hunt. The mandate was clear: automate this entire lifecycle, from code commit to result visualization, in a way that was repeatable, auditable, and resilient.
Our initial concept was a GitOps-driven pipeline. A git push to the model repository’s main branch should serve as the single trigger for the entire validation and deployment sequence. This immediately brought us to the choice of a CI/CD engine. We operate entirely on Kubernetes, so a cloud-native solution was non-negotiable. While Jenkins is powerful, its plugin-heavy nature and reliance on a persistent master felt like a step backward in a world of declarative, ephemeral infrastructure. We needed something that treated pipelines as code, defined by the same Kubernetes primitives we used for our applications. Tekton was the logical choice. Its Custom Resource Definitions (Task, Pipeline, PipelineRun) integrated seamlessly into our existing kubectl and GitOps workflows.
For the frontend, the team was already proficient with Vue.js, making Nuxt.js a natural fit. Its server-side rendering capability was a key advantage, as it could pre-render pages with the latest model results, improving initial load times for stakeholders. The core challenge wasn’t in the individual components but in the orchestration: how could Tekton manage a long-running, computationally intensive SciPy workload, handle its unique artifacts, and conditionally trigger a Nuxt.js frontend deployment, all within a single, coherent pipeline?
The first step was to decouple the SciPy model from its development environment by containerizing it. A common mistake here is to create bloated, insecure images. Our first pass at a Dockerfile was naive:
# Dockerfile.v1 - The naive approach
FROM python:3.9
WORKDIR /app
# This copies everything, including virtual environments, IDE configs, etc.
COPY . .
# This can be slow and pull in unnecessary dev dependencies
RUN pip install -r requirements.txt
CMD ["python", "run_model.py"]
This produced a multi-gigabyte image and was slow to build. In a real-world project, build time is a critical component of CI/CD feedback loops. We moved to a multi-stage build to produce a lean, production-ready artifact.
# Dockerfile.v2 - Production-grade multi-stage build
# --- Builder Stage ---
# Use a full-featured base image to build dependencies
FROM python:3.9-slim-buster as builder
WORKDIR /usr/src/app
# Install build-time dependencies for scientific packages
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
gfortran \
libopenblas-dev \
liblapack-dev
# Create a virtual environment to isolate packages
ENV VIRTUAL_ENV=/opt/venv
RUN python -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
# Copy only the requirements file first to leverage Docker layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# --- Final Stage ---
# Use a minimal base image for the final artifact
FROM python:3.9-slim-buster
WORKDIR /app
# Copy the virtual environment from the builder stage
COPY --from=builder /opt/venv /opt/venv
# Copy the application source code
COPY src/ /app/src/
# Activate the virtual environment
ENV PATH="/opt/venv/bin:$PATH"
# Set up a non-root user for security
RUN useradd --create-home appuser
USER appuser
# Define the entrypoint for the model execution
ENTRYPOINT ["python", "src/run_model.py"]
This approach dramatically reduced the final image size and improved security by running as a non-root user. The pitfall here is managing system-level dependencies required by libraries like NumPy and SciPy, which is why the builder stage explicitly installs gfortran and libopenblas-dev.
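A .dockerignore alongside the Dockerfile keeps the build context itself small, which speeds up both variants and prevents stray local artifacts from leaking into images; a minimal sketch with illustrative entries:
# .dockerignore - keep local artifacts out of the build context (entries are illustrative)
.git
.venv/
venv/
__pycache__/
*.pyc
.ipynb_checkpoints/
.idea/
.vscode/
results/
data/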
With a reliable way to build the model container, we designed the Tekton Tasks. A Tekton Pipeline is just a directed acyclic graph (DAG) of Tasks, where each Task is a series of steps executed in a pod. Our pipeline required several distinct logical units.
First, a standard git-clone task to fetch the source code. Tekton Hub provides a catalog of reusable tasks for this. Next, we needed to build and push the container image from within the cluster. Using Docker-in-Docker is a common but problematic pattern due to its security implications. Instead, we opted for Kaniko, which builds images in userspace.
Here is the Tekton Task for building the model image:
# tekton/tasks/build-model-image.yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
name: kaniko-build-model
spec:
params:
- name: imageUrl
description: The URL of the image to build and push
type: string
- name: dockerfilePath
description: Path to the Dockerfile
type: string
default: ./Dockerfile
  workspaces:
    - name: source
      description: The workspace containing the source code and Dockerfile
  results:
    - name: image-url
      description: The pushed image reference, consumed by downstream tasks
  steps:
    - name: build-and-push
      image: gcr.io/kaniko-project/executor:v1.9.0
      # Kaniko doesn't need privileged access, a major security win
      securityContext:
        runAsUser: 0 # Kaniko requires root to unpack the base image filesystem
      args:
        - --dockerfile=$(params.dockerfilePath)
        - --context=dir://$(workspaces.source.path)
        - --destination=$(params.imageUrl)
        # Add --no-push for local testing or dry runs
    - name: write-image-url
      image: busybox
      script: |
        #!/bin/sh
        # Expose the pushed image reference as a Task result for later Pipeline tasks
        echo -n "$(params.imageUrl)" > "$(results.image-url.path)"
The most critical part was the run-model-validation task. Our SciPy script was designed to take an input data path and produce an output directory with results. In Tekton, passing data between tasks is handled by Workspaces. A Workspace is an abstraction over a storage volume (like a PersistentVolumeClaim) that can be mounted by multiple Tasks in a Pipeline.
# tekton/tasks/run-model-validation.yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
name: run-model-validation
spec:
params:
- name: modelImageUrl
description: The fully qualified image name of the model to run
type: string
workspaces:
- name: validation-data
description: Workspace for input data and output results
results:
- name: validation-status
description: The outcome of the validation ('success' or 'failure').
  steps:
    - name: execute-model
      image: $(params.modelImageUrl)
      # The declared workspace is mounted automatically by Tekton; no explicit volumes are needed
      script: |
        #!/bin/sh
        set -e # Exit immediately if a command exits with a non-zero status.
        echo "Starting model validation..."
        # The script reads from and writes to the shared workspace
        INPUT_DIR="$(workspaces.validation-data.path)/input"
        OUTPUT_DIR="$(workspaces.validation-data.path)/output"
        # Create output directory
        mkdir -p $OUTPUT_DIR
        # The Python script handles its own logging.
        # It must exit with 0 on success and non-zero on failure.
        python /app/src/run_model.py --input $INPUT_DIR --output $OUTPUT_DIR
        echo "Model validation script finished successfully."
        # A simple success condition: check if a results file was created.
        if [ -f "$OUTPUT_DIR/summary.json" ]; then
          echo -n "success" | tee $(results.validation-status.path)
        else
          echo "Error: summary.json not found in output."
          echo -n "failure" | tee $(results.validation-status.path)
          exit 1
        fi
A significant problem emerged here: our model validation could take anywhere from 30 minutes to several hours. A Tekton TaskRun pod is not designed to be a long-running batch job. If the cluster node experiences issues or the pod is evicted, the entire run fails without a clear state. The correct pattern for this in a real-world project is to decouple job submission from job execution. The Tekton Task should not run the computation itself; it should submit a Kubernetes Job and then monitor it.
This architectural shift adds complexity but provides immense resilience. The Kubernetes Job controller ensures the pod runs to completion, handling retries and node failures.
Here is the refined Task definition:
# tekton/tasks/run-model-validation-job.yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
name: run-model-validation-as-job
spec:
params:
- name: modelImageUrl
description: The image of the model to run.
- name: jobNamePrefix
description: Prefix for the Kubernetes Job name.
type: string
default: model-validation-
workspaces:
- name: shared-data
description: PVC for input/output.
steps:
- name: submit-job
image: bitnami/kubectl:latest
script: |
#!/bin/sh
set -e
JOB_NAME="$(params.jobNamePrefix)$(context.taskRun.name)"
echo "Submitting Kubernetes Job: $JOB_NAME"
# Dynamically create the Job manifest
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
name: ${JOB_NAME}
spec:
template:
spec:
containers:
- name: model-runner
image: $(params.modelImageUrl)
args: ["--input", "/data/input", "--output", "/data/output"]
volumeMounts:
- name: data-volume
mountPath: /data
volumes:
- name: data-volume
persistentVolumeClaim:
claimName: $(workspaces.shared-data.claimName)
restartPolicy: Never
backoffLimit: 2 # Retry twice on failure
EOF
echo "Waiting for Job ${JOB_NAME} to complete..."
# This is the monitoring loop. It's more robust than just running the process.
kubectl wait --for=condition=complete --timeout=4h job/${JOB_NAME}
# Optional: Check for job failure and explicitly fail the task
JOB_STATUS=$(kubectl get job ${JOB_NAME} -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}')
if [ "${JOB_STATUS}" = "True" ]; then
echo "Job ${JOB_NAME} failed."
kubectl logs job/${JOB_NAME} # Dump logs for debugging
exit 1
fi
echo "Job ${JOB_NAME} completed successfully."
With these core Tasks defined, we assembled them into a Pipeline. The pipeline ensures a logical flow, passing parameters and workspaces between tasks. We also introduced conditional execution using when expressions, so the deployment tasks only run if the validation was successful.
graph TD
    A[Start] --> B(git-clone);
    B --> C(build-model-image);
    C --> D(run-model-validation-job);
    D --> E{Validation Successful?};
    E -- Yes --> F(publish-results-to-s3);
    F --> G(build-and-deploy-frontend);
    G --> H[End];
    E -- No --> I(notify-failure);
    I --> H;
Here is a snippet of the Pipeline definition showing the flow:
# tekton/pipeline.yaml
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
name: model-validation-and-deploy
spec:
workspaces:
- name: shared-workspace
params:
# ... git url, revision, etc.
tasks:
- name: fetch-source
taskRef:
name: git-clone
workspaces:
- name: output
workspace: shared-workspace
# ... params
- name: build-model
taskRef:
name: kaniko-build-model
runAfter: [fetch-source]
workspaces:
- name: source
workspace: shared-workspace
# ... params for image name, etc.
- name: run-validation
taskRef:
name: run-model-validation-job
runAfter: [build-model]
workspaces:
- name: shared-data
workspace: shared-workspace # Using same PVC for simplicity here
params:
- name: modelImageUrl
value: $(tasks.build-model.results.image-url) # Get image URL from previous task
- name: deploy-frontend
taskRef:
name: build-and-deploy-nuxt
runAfter: [run-validation]
# THIS IS THE CRITICAL PART FOR CONDITIONAL EXECUTION
when:
- input: "$(tasks.run-validation.results.validation-status)"
operator: in
values: ["success"]
# ...
The final piece of the puzzle was integrating the Nuxt.js frontend. The frontend doesn’t participate in the pipeline execution; it consumes the pipeline’s output. We designed the publish-results task to upload the validation artifacts (CSV, plots, summary JSON) to an S3-compatible object store. Crucially, it also overwrites a manifest.json file at a known location in the bucket:
// s3://model-results/manifest.json
{
"latestRunId": "model-validation-xyz-123",
"lastUpdated": "2023-10-27T10:00:00Z",
"status": "SUCCESS",
"artifacts": {
"summary": "results/model-validation-xyz-123/summary.json",
"plot": "results/model-validation-xyz-123/performance_plot.png"
},
"history": [
{ "runId": "model-validation-xyz-123", "status": "SUCCESS", "timestamp": "..." },
{ "runId": "model-validation-abc-456", "status": "FAILURE", "timestamp": "..." }
]
}
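We have not reproduced the full publish-results task here; a minimal sketch of the upload step, assuming the AWS CLI image, a Secret named s3-credentials holding the access keys, and that manifest.json has already been assembled into the run's output directory (all of these names are illustrative):
# tekton/tasks/publish-results.yaml (sketch; bucket, secret, and param names are illustrative)
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: publish-results
spec:
  params:
    - name: runId
      type: string
    - name: bucket
      type: string
      default: model-results
  workspaces:
    - name: shared-data
  steps:
    - name: upload
      image: amazon/aws-cli:2.13.0
      # Access keys come from a Secret; an S3-compatible endpoint can be configured via environment
      envFrom:
        - secretRef:
            name: s3-credentials
      script: |
        #!/bin/sh
        set -e
        OUTPUT_DIR="$(workspaces.shared-data.path)/output"
        # Upload all artifacts for this run under a versioned prefix
        aws s3 cp "$OUTPUT_DIR" "s3://$(params.bucket)/results/$(params.runId)/" --recursive
        # Overwrite the manifest the dashboard reads (assumed to be assembled earlier in the run)
        aws s3 cp "$OUTPUT_DIR/manifest.json" "s3://$(params.bucket)/manifest.json"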
The Nuxt.js application was configured to fetch this manifest.json file on the server-side during page generation (asyncData or useFetch). This meant users always saw the latest successfully validated results upon visiting the dashboard. We avoided a complex real-time WebSocket setup, as a simple polling mechanism (or even just data fetched at deploy time) was sufficient for our use case. The pragmatic choice often involves trading real-time complexity for operational simplicity.
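A minimal sketch of the dashboard page using Nuxt 3's useFetch, assuming the bucket is exposed through a public runtime config value named resultsBaseUrl (illustrative):
<!-- pages/index.vue (sketch; resultsBaseUrl is an assumed public runtime config value) -->
<script setup lang="ts">
interface Manifest {
  latestRunId: string
  lastUpdated: string
  status: 'SUCCESS' | 'FAILURE'
  artifacts: { summary: string; plot: string }
}

const config = useRuntimeConfig()
// Runs on the server during SSR, so the first render already contains the latest results
const { data: manifest, error } = await useFetch<Manifest>(
  `${config.public.resultsBaseUrl}/manifest.json`
)
</script>

<template>
  <main>
    <p v-if="error">Could not load model results.</p>
    <section v-else-if="manifest">
      <h1>Latest validated run: {{ manifest.latestRunId }}</h1>
      <p>Status: {{ manifest.status }} ({{ manifest.lastUpdated }})</p>
      <img :src="`${config.public.resultsBaseUrl}/${manifest.artifacts.plot}`" alt="Model performance plot" />
    </section>
  </main>
</template>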
The build-and-deploy-nuxt task itself was straightforward: another multi-stage Dockerfile to build the Nuxt application, and a final step using kubectl to set image on our existing Kubernetes Deployment for the frontend, triggering a rolling update.
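A condensed sketch of that task, assuming Kaniko builds the frontend image from a frontend/ directory and the Deployment is named results-dashboard with a container named nuxt (all of these names are illustrative):
# tekton/tasks/build-and-deploy-nuxt.yaml (sketch; deployment, container, and path names are illustrative)
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: build-and-deploy-nuxt
spec:
  params:
    - name: frontendImageUrl
      type: string
  workspaces:
    - name: source
  steps:
    - name: build-and-push
      image: gcr.io/kaniko-project/executor:v1.9.0
      securityContext:
        runAsUser: 0 # Kaniko requires root to unpack base image layers
      args:
        - --dockerfile=Dockerfile
        - --context=dir://$(workspaces.source.path)/frontend
        - --destination=$(params.frontendImageUrl)
    - name: rollout
      image: bitnami/kubectl:latest
      script: |
        #!/bin/sh
        set -e
        # Point the existing Deployment at the new image; this triggers a rolling update
        kubectl set image deployment/results-dashboard nuxt=$(params.frontendImageUrl)
        kubectl rollout status deployment/results-dashboard --timeout=5m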
The entire system is triggered by a webhook from our Git repository, which is received by a Tekton EventListener. This listener parses the Git payload, extracts information like the commit SHA, and uses a TriggerTemplate to instantiate and run the PipelineRun.
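The trigger plumbing is mostly boilerplate; a trimmed sketch of the binding and template, assuming a GitHub-style webhook payload (the EventListener and interceptor wiring are omitted, and the resource and PVC names are illustrative):
# tekton/triggers.yaml (trimmed sketch; assumes a GitHub-style webhook payload)
apiVersion: triggers.tekton.dev/v1beta1
kind: TriggerBinding
metadata:
  name: model-repo-binding
spec:
  params:
    # Pulled from the webhook payload by the EventListener
    - name: git-url
      value: $(body.repository.clone_url)
    - name: git-revision
      value: $(body.head_commit.id)
---
apiVersion: triggers.tekton.dev/v1beta1
kind: TriggerTemplate
metadata:
  name: model-validation-template
spec:
  params:
    - name: git-url
    - name: git-revision
  resourcetemplates:
    - apiVersion: tekton.dev/v1beta1
      kind: PipelineRun
      metadata:
        generateName: model-validation-
      spec:
        pipelineRef:
          name: model-validation-and-deploy
        params:
          - name: git-url
            value: $(tt.params.git-url)
          - name: git-revision
            value: $(tt.params.git-revision)
        workspaces:
          - name: shared-workspace
            persistentVolumeClaim:
              claimName: model-pipeline-pvc # the single shared PVC discussed below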
The final result is a system where a data scientist can push a code change, and within minutes (or hours, depending on the model), see the validated results appear on the dashboard automatically. The entire history is captured in the PipelineRun logs in Kubernetes and the versioned artifacts in our object store.
This architecture is not without its limitations. The use of a single, shared PVC for the workspace is a simplification. In a high-concurrency scenario, multiple pipeline runs would conflict. A more robust solution involves dynamically provisioning a new PVC for each PipelineRun, which Tekton supports through a volumeClaimTemplate on the workspace binding (sketched below). Furthermore, the error reporting from the SciPy script back to the pipeline is coarse: it is just a success/failure exit code. A future iteration would involve the Python script emitting a structured JSON log that the pipeline could parse, allowing for more granular failure analysis and reporting directly in the Nuxt UI. Finally, this pipeline only deploys the results; a true MLOps pipeline would also version and deploy the model itself as a callable API endpoint, possibly using a canary release strategy managed by a service mesh or ingress controller, which would be the next logical evolution of this platform.
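For reference, moving from the shared PVC to a per-run claim is a small change to the PipelineRun spec generated by the TriggerTemplate; a sketch (the storage request is illustrative):
# In the TriggerTemplate's PipelineRun spec: provision a fresh PVC per run instead of the shared claim
workspaces:
  - name: shared-workspace
    volumeClaimTemplate:
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi # illustrative size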