The initial architecture was deceptively simple: a synchronous API endpoint that accepted a JSON payload, generated a moderately complex scientific plot using Matplotlib, and returned the PNG image. This worked for a dozen concurrent users. It collapsed completely at a hundred. Requests backed up, workers saturated, and timeouts became the norm. The core problem was coupling the immediate user request to a long-running, CPU-intensive task. The clear path forward was a decoupled, asynchronous pipeline, but the implementation details were where the real engineering challenges lay.
Our goal became to build a system that could ingest thousands of visualization requests per minute, process them reliably, and scale compute resources on demand without manual intervention. The user would submit a job and get a job ID back immediately; the result could be polled for later. This pointed directly to a message queue and a pool of scalable workers.
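To make that contract concrete, here is a minimal sketch of the submit side, assuming a boto3 client and an illustrative queue URL; the real API endpoint, job-ID scheme, and results store are deliberately left out of this log:

# submit_job.py - illustrative sketch of the async submit/poll contract.
# The queue URL is a placeholder; clients talk to our API endpoint, not to SQS directly.
import json
import uuid

import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/visualization-jobs-queue"
sqs = boto3.client("sqs", region_name="us-east-1")

def submit_job(payload: dict) -> str:
    """Enqueue a visualization request and return a job ID immediately."""
    job_id = str(uuid.uuid4())
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"job_id": job_id, **payload}),
    )
    # The client polls for the finished artifact later using this ID.
    return job_id

if __name__ == "__main__":
    print(submit_job({"plot_type": "scatter", "data": {"x": [1, 2], "y": [3, 4]}}))

Everything downstream of that send_message call is what the rest of this post is about.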
The technology selection process was driven by pragmatism and our existing ecosystem. We run on AWS, so a managed Kubernetes service was a given. AWS EKS abstracts away the control plane’s operational burden, which is a significant win. For the queue, AWS SQS was the default choice—it’s robust, managed, and its dead-letter queue (DLQ) feature is critical for building resilient systems. The real debate was around the compute layer. We could have built a custom Kubernetes deployment with Horizontal Pod Autoscalers (HPAs) listening to SQS queue depth, but that felt like reinventing the wheel. OpenFaaS, running on top of EKS, offered a higher level of abstraction. Its serverless model, with built-in SQS triggers and auto-scaling based on in-flight requests, promised to handle the boilerplate orchestration, letting us focus on the function code.
Finally, for automation, we chose Ansible. While Terraform is excellent for declarative infrastructure provisioning, our team’s expertise was stronger with Ansible. More importantly, Ansible is adept at both provisioning infrastructure (via AWS modules) and handling the subsequent configuration management—like installing Helm charts, creating Kubernetes secrets, and managing application deployments. A single tool to manage the entire lifecycle from VPC to running function was a compelling proposition.
This is the log of how we built that pipeline, the configuration code that underpins it, and the pitfalls discovered along the way.
Stage 1: Provisioning the EKS Foundation with Ansible
The first step is laying the groundwork: the EKS cluster itself. A common mistake here is to create a monolithic Ansible playbook. In a real-world project, this becomes unmaintainable. We structure our Ansible project using roles, promoting reusability and separation of concerns.
The project structure looks like this:
ansible-openfaas-eks/
├── ansible.cfg
├── inventory/
│   └── hosts
├── roles/
│   ├── 01_eks_cluster/
│   │   └── tasks/
│   │       └── main.yml
│   ├── 02_k8s_tools/
│   │   └── tasks/
│   │       └── main.yml
│   ├── 03_openfaas_deploy/
│   │   └── tasks/
│   │       └── main.yml
│   └── 04_sqs_setup/
│       └── tasks/
│           └── main.yml
└── playbook.yml
The playbook.yml orchestrates these roles. The first role, 01_eks_cluster, handles the creation of the VPC, subnets, and the EKS cluster. We use the amazon.aws.eks_cluster module, which is a high-level abstraction that simplifies this process significantly.
Here is a core part of roles/01_eks_cluster/tasks/main.yml. It’s not just about creating the cluster; it’s about doing it idempotently and capturing the output for subsequent steps.
# roles/01_eks_cluster/tasks/main.yml
- name: Ensure VPC for EKS exists
amazon.aws.ec2_vpc_net:
name: "openfaas-eks-vpc"
cidr_block: "10.10.0.0/16"
region: "{{ aws_region }}"
state: present
tags:
Project: "OpenFaaS-Viz-Pipeline"
register: vpc_result
- name: Create two public subnets for EKS
amazon.aws.ec2_vpc_subnet:
state: present
vpc_id: "{{ vpc_result.vpc.id }}"
cidr: "{{ item }}"
region: "{{ aws_region }}"
map_public: yes
tags:
"kubernetes.io/cluster/openfaas-viz-cluster": "shared"
loop:
- "10.10.1.0/24"
- "10.10.2.0/24"
register: subnet_results
- name: Create the EKS cluster IAM role
amazon.aws.iam_role:
name: "EKSClusterRole-VizPipeline"
state: present
assume_role_policy_document: "{{ lookup('file', 'eks_cluster_assume_role_policy.json') }}"
managed_policy:
- "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
register: cluster_role
- name: Create the EKS node group IAM role
amazon.aws.iam_role:
name: "EKSNodeRole-VizPipeline"
state: present
assume_role_policy_document: "{{ lookup('file', 'ec2_assume_role_policy.json') }}"
managed_policy:
- "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
- "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
- "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
register: node_role
- name: Provision the EKS cluster
# This task can take 15-20 minutes. It's idempotent.
amazon.aws.eks_cluster:
name: "openfaas-viz-cluster"
state: present
region: "{{ aws_region }}"
role_arn: "{{ cluster_role.arn }}"
vpc_config:
subnet_ids: "{{ subnet_results.results | map(attribute='subnet.id') | list }}"
public_access: yes
private_access: no
- name: Create a managed node group for OpenFaaS functions
# We specify instance types suitable for CPU-bound Matplotlib tasks
amazon.aws.eks_nodegroup:
name: "faas-workers"
cluster_name: "openfaas-viz-cluster"
state: present
node_role: "{{ node_role.arn }}"
subnets: "{{ subnet_results.results | map(attribute='subnet.id') | list }}"
instance_types: ["c5.large", "c5a.large"]
scaling_config:
min_size: 1
max_size: 5
desired_size: 2
disk_size: 40
region: "{{ aws_region }}"
tags:
Purpose: "OpenFaaS-Function-Workers"
The key here is the use of register to capture results and pass them to subsequent tasks, and the explicit definition of IAM roles with the minimal necessary policies. The pitfall to avoid is using overly permissive IAM roles, a common mistake in quick tutorials.
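As an optional sanity check after the playbook run (not part of the roles themselves), a few lines of boto3 confirm that the cluster and node group actually reached ACTIVE; the names below match the tasks above:

# check_eks.py - optional post-provisioning smoke test (assumes boto3 credentials).
import boto3

eks = boto3.client("eks", region_name="us-east-1")

cluster = eks.describe_cluster(name="openfaas-viz-cluster")["cluster"]
nodegroup = eks.describe_nodegroup(
    clusterName="openfaas-viz-cluster", nodegroupName="faas-workers"
)["nodegroup"]

print("cluster status:   ", cluster["status"])           # expect ACTIVE
print("cluster endpoint: ", cluster["endpoint"])
print("nodegroup status: ", nodegroup["status"])          # expect ACTIVE
print("nodegroup scaling:", nodegroup["scalingConfig"])   # min/max/desired from the playbook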
Stage 2: Deploying OpenFaaS with Ansible and Helm
Once the EKS cluster is running, we need kubectl configured and Helm installed to deploy OpenFaaS. The 02_k8s_tools role handles this setup on the Ansible control node. Then, 03_openfaas_deploy uses the community.kubernetes.helm module to deploy OpenFaaS.
Using the Helm module is superior to shelling out to helm commands because it’s more declarative and idempotent.
# roles/03_openfaas_deploy/tasks/main.yml
- name: Add OpenFaaS Helm repository
community.kubernetes.helm_repository:
name: openfaas
repo_url: "https://openfaas.github.io/faas-netes/"
state: present
- name: Create namespaces for OpenFaaS
community.kubernetes.k8s:
name: "{{ item }}"
api_version: v1
kind: Namespace
state: present
loop:
- openfaas
- openfaas-fn
- name: Generate a password for the OpenFaaS gateway
# We store this securely, for example in Ansible Vault
ansible.builtin.command: "openssl rand -base64 32"
register: faas_password
changed_when: false
- name: Create secret for OpenFaaS gateway password
community.kubernetes.k8s:
state: present
definition:
apiVersion: v1
kind: Secret
metadata:
name: basic-auth
namespace: openfaas
type: Opaque
stringData:
basic-auth-user: admin
basic-auth-password: "{{ faas_password.stdout }}"
- name: Deploy OpenFaaS using Helm chart
community.kubernetes.helm:
name: openfaas
chart_ref: openfaas/openfaas
release_namespace: openfaas
state: present
values:
functionNamespace: "openfaas-fn"
generateBasicAuth: false # We created the secret manually
serviceType: LoadBalancer
operator:
create: true
faasnetes:
readTimeout: "5m"
writeTimeout: "5m"
imagePullPolicy: "Always"
queueWorker:
ackWait: "5m" # Critical for long-running functions
A crucial detail here is increasing the timeouts (readTimeout, writeTimeout, ackWait). Matplotlib can take a non-trivial amount of time to generate a complex plot, especially during a cold start. The default timeouts of 60 seconds are often too short for data processing tasks, leading to premature request termination and unnecessary retries. Setting them to 5 minutes provides a safe buffer.
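A quick way to choose those values with data rather than guesswork is to time a worst-case plot locally. The harness below is a rough sketch with synthetic data; the array sizes are arbitrary placeholders, not our production payload:

# time_plot.py - rough harness for sizing the gateway and queue-worker timeouts.
# The data volume is arbitrary; substitute a realistic worst-case payload.
import time

import matplotlib
matplotlib.use("Agg")  # headless backend, same as MPLBACKEND=Agg in the function
import matplotlib.pyplot as plt
import numpy as np

start = time.perf_counter()

x = np.random.rand(500_000)
y = np.random.rand(500_000)
fig, ax = plt.subplots()
ax.scatter(x, y, s=1)
fig.savefig("/tmp/timing-test.png", format="png")
plt.close(fig)

print(f"render + save took {time.perf_counter() - start:.1f}s")

If the worst case creeps anywhere near the configured timeout, raise the timeout or shrink the payload before it becomes a production incident.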
Stage 3: The SQS Bridge and Resilience Pattern
This is where the architecture’s resilience is defined. We need an SQS queue for incoming jobs and a Dead-Letter Queue (DLQ) to catch messages that fail processing repeatedly. If a function fails to process a message, the OpenFaaS SQS connector does not delete it, so the message becomes visible again in the queue once its visibility timeout expires. If it fails a configured number of times, SQS automatically moves it to the DLQ for manual inspection. This prevents a single malformed message from poisoning the queue and halting the entire pipeline.
Ansible manages this AWS infrastructure cleanly.
# roles/04_sqs_setup/tasks/main.yml
- name: Create the Dead-Letter Queue (DLQ) for failed jobs
amazon.aws.sqs_queue:
name: "visualization-jobs-dlq"
region: "{{ aws_region }}"
state: present
# Keep failed messages for 14 days for analysis
message_retention_period: 1209600
register: dlq
- name: Create the main SQS queue with a Redrive Policy
amazon.aws.sqs_queue:
name: "visualization-jobs-queue"
region: "{{ aws_region }}"
state: present
visibility_timeout: 300 # Must be > function execution time
redrive_policy:
deadLetterTargetArn: "{{ dlq.queue_arn }}"
maxReceiveCount: 3 # Retry a failing message 3 times before moving to DLQ
register: main_queue
- name: Store queue URLs for the function deployment
ansible.builtin.set_fact:
main_queue_url: "{{ main_queue.queue_url }}"
dlq_queue_arn: "{{ dlq.queue_arn }}"
The maxReceiveCount: 3 is a critical business decision. It balances giving a transient failure a chance to resolve against the latency of processing valid messages behind a failing one. The visibility_timeout of 300 seconds (5 minutes) is synchronized with our OpenFaaS gateway timeouts: if the function takes 4 minutes to run, the message must not become visible again in the queue during that time, or another worker would pick it up and perform duplicate work.
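In practice, “manual inspection” of the DLQ means a small script rather than clicking through the console. A sketch along these lines (the queue URL is a placeholder) prints each failed payload along with how many delivery attempts it consumed:

# inspect_dlq.py - peek at messages that exhausted their retries (illustrative).
import boto3

DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/visualization-jobs-dlq"
sqs = boto3.client("sqs", region_name="us-east-1")

resp = sqs.receive_message(
    QueueUrl=DLQ_URL,
    MaxNumberOfMessages=10,
    AttributeNames=["ApproximateReceiveCount"],
    VisibilityTimeout=60,  # hide the batch briefly while we look at it
)

for msg in resp.get("Messages", []):
    attempts = msg["Attributes"]["ApproximateReceiveCount"]
    print(f"--- delivered {attempts} times ---")
    print(msg["Body"][:500])  # first 500 chars of the failed payload
# No delete_message() calls: this is inspection only, the messages stay in the DLQ.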
Stage 4: The Visualization Function - Code and Container
Now for the core logic. The function lives in a directory with a stack.yml manifest, a handler.py, and a requirements.txt.
stack.yml defines the function, its dependencies, and, crucially, the SQS trigger annotation.
# stack.yml
version: 1.0
provider:
name: openfaas
gateway: http://127.0.0.1:8080
functions:
plot-generator:
lang: python3-debian
handler: ./plot-generator
image: my-docker-registry/plot-generator:latest
secrets:
- aws-credentials # Kubernetes secret containing AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
environment:
# AWS_REGION is needed by the SQS connector
AWS_REGION: "us-east-1"
# Disable Matplotlib's GUI backend in a headless environment
MPLBACKEND: "Agg"
annotations:
# This connects the function to the SQS queue
topic: "arn:aws:sqs:us-east-1:123456789012:visualization-jobs-queue"
The topic annotation is the magic that instructs the OpenFaaS SQS connector to poll our queue and invoke this function. The aws-credentials secret must be created in the openfaas-fn namespace to grant the connector permissions to access SQS.
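Whether that secret is created by another Ansible task or by hand, the shape is the same. As an illustration with the official kubernetes Python client, with placeholder credential values and key names following the comment in stack.yml:

# create_aws_secret.py - illustrative one-off for the aws-credentials secret.
# Placeholder values; key names follow the comment in stack.yml.
from kubernetes import client, config

config.load_kube_config()  # kubeconfig written when the EKS cluster was provisioned

secret = client.V1Secret(
    metadata=client.V1ObjectMeta(name="aws-credentials", namespace="openfaas-fn"),
    type="Opaque",
    string_data={
        "AWS_ACCESS_KEY_ID": "AKIA...REPLACE_ME",
        "AWS_SECRET_ACCESS_KEY": "REPLACE_ME",
    },
)

client.CoreV1Api().create_namespaced_secret(namespace="openfaas-fn", body=secret)
print("created secret openfaas-fn/aws-credentials")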
The function code itself needs to be robust. It must handle malformed input and potential errors from Matplotlib.
plot-generator/handler.py:
# plot-generator/handler.py
import json
import logging
import sys
import os
import base64
# A common pitfall: Matplotlib can cause issues with font caches in containers.
# It's best practice to set the cache directory to a writable location.
os.environ['MPLCONFIGDIR'] = "/tmp/matplotlib"
import matplotlib.pyplot as plt
import numpy as np
# Configure logging to be more informative than default prints
logging.basicConfig(
stream=sys.stdout,
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
def handle(event, context):
"""
Handles an SQS event to generate a Matplotlib plot.
Args:
event (object): The event payload from SQS via OpenFaaS.
The message body is in event.body.
context (object): The invocation context.
Returns:
A dict with statusCode and a body. On success, the body contains a
base64 encoded PNG. On failure, it contains an error message.
An HTTP 5xx status code signals OpenFaaS to not delete the SQS message.
"""
try:
# The SQS message body is passed as a byte string in event.body
body = json.loads(event.body)
logger.info(f"Processing job for plot type: {body.get('plot_type')}")
# Basic input validation
if 'data' not in body or 'plot_type' not in body:
logger.error("Invalid payload: 'data' or 'plot_type' missing.")
return {
"statusCode": 400, # Bad request, don't retry
"body": "Invalid payload: 'data' and 'plot_type' are required."
}
# Matplotlib plotting logic
fig, ax = plt.subplots()
if body['plot_type'] == 'scatter':
x = np.array(body['data']['x'])
y = np.array(body['data']['y'])
ax.scatter(x, y)
elif body['plot_type'] == 'histogram':
data = np.array(body['data']['values'])
ax.hist(data, bins=body.get('bins', 10))
else:
logger.error(f"Unsupported plot type: {body['plot_type']}")
return {
"statusCode": 400, # Bad request, don't retry
"body": f"Unsupported plot type: {body['plot_type']}"
}
ax.set_title(body.get('title', 'Generated Plot'))
# Save plot to a memory buffer instead of a file
from io import BytesIO
buf = BytesIO()
plt.savefig(buf, format='png')
plt.close(fig) # Important: release memory used by the plot
buf.seek(0)
image_base64 = base64.b64encode(buf.getvalue()).decode('utf-8')
logger.info("Successfully generated plot.")
return {
"statusCode": 200,
"headers": {"Content-Type": "application/json"},
"body": json.dumps({
"message": "Plot generated successfully",
"image_png_base64": image_base64
})
}
    except json.JSONDecodeError as e:
        logger.exception("Failed to decode JSON from SQS message body.")
        # A malformed message will never succeed on retry. Ideally a validation layer
        # in front of the queue would reject it before it gets here; failing that, we
        # return a 5xx so the redrive policy parks it in the DLQ after maxReceiveCount
        # attempts, where it can be inspected instead of silently disappearing.
        return {"statusCode": 500, "body": f"JSON Decode Error: {str(e)}"}
except Exception as e:
logger.exception("An unexpected error occurred during plot generation.")
# Return a 500-level error. The OpenFaaS SQS connector will see this
# and will NOT delete the message from the queue. SQS will make it
# visible again after the timeout for another attempt.
return {
"statusCode": 502, # Using 502 Bad Gateway to indicate upstream failure
"body": f"Plot generation failed: {str(e)}"
}
The error handling logic is paramount. A statusCode of 200 signals success, and the SQS connector deletes the message. Any other status code (particularly 5xx) signals failure, and the message is kept for another attempt. After maxReceiveCount failures, it goes to the DLQ. We also must explicitly close the Matplotlib figure (plt.close(fig)) to prevent memory leaks in the function worker process, which can be reused across invocations.
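Before building the image, it is worth exercising handle() locally. The stub event below only mimics the .body attribute the handler reads; it is not the real watchdog event object, but it is enough to check both the success path and the bad-input path:

# test_handler_local.py - run the function handler outside OpenFaaS.
# The Event stub only mimics the .body attribute handle() reads; it is not the
# real watchdog event object.
import base64
import json

import handler  # plot-generator/handler.py

class Event:
    def __init__(self, body: str):
        self.body = body

good = Event(json.dumps({
    "plot_type": "scatter",
    "title": "Local smoke test",
    "data": {"x": [1, 2, 3, 4], "y": [10, 20, 15, 30]},
}))
bad = Event("this is not json")

ok = handler.handle(good, context=None)
assert ok["statusCode"] == 200
png = base64.b64decode(json.loads(ok["body"])["image_png_base64"])
with open("/tmp/smoke-test.png", "wb") as f:
    f.write(png)
print(f"wrote /tmp/smoke-test.png ({len(png)} bytes)")

err = handler.handle(bad, context=None)
assert err["statusCode"] == 500  # would be retried and eventually land in the DLQ
print("malformed payload handled as expected")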
The final architectural diagram, representing both the success and failure paths, solidifies the design.
graph TD
    subgraph Ansible Control Node
        A[playbook.yml] -- Provisions & Configures --> B((AWS EKS));
        A -- Provisions & Configures --> C((AWS SQS));
    end
    subgraph User Interaction
        D[Client App] -- 1. POST /job (JSON Payload) --> E(API Endpoint);
    end
    subgraph AWS Cloud
        E -- 2. Pushes message --> C(visualization-jobs-queue);
        C -- 3. SQS Event (maxReceiveCount: 3) --> F{OpenFaaS SQS Connector};
        C -- After 3 failures --> G(visualization-jobs-dlq);
        subgraph EKS Cluster [B]
            F -- 4. Invokes function --> H[plot-generator Pod];
        end
        H -- 5a. On Success (HTTP 200) --> F;
        F -- 6a. Deletes message --> C;
        H -- 5b. On Failure (HTTP 502) --> F;
        F -- 6b. NACK (message becomes visible again) --> C;
    end
    subgraph Operations
        I[On-call Engineer] -- Investigates --> G;
    end
This system achieves the initial goals. It’s asynchronous, scalable due to OpenFaaS on EKS, and resilient thanks to the SQS Redrive Policy. Ansible provides a single, idempotent workflow for creating and managing the entire stack.
However, this architecture isn’t without its limitations. The cold start performance of a Python function with heavy dependencies like Matplotlib and NumPy can be a significant factor. For workloads requiring sub-second latency after a period of inactivity, this design might be insufficient. Solutions could involve OpenFaaS’s scale-to-one-or-more setting (minReplicas) to keep at least one pod warm, at the cost of idle resources. Furthermore, the reliance on Ansible for both provisioning and configuration, while convenient, blurs the line between infrastructure state and software state. In a larger, more complex system, splitting these concerns (Terraform for the underlying EKS and VPC infrastructure, Ansible purely for the in-cluster software deployment) could lead to a more robust and maintainable IaC practice. The current observability is also minimal, relying on function logs. A production-grade system would require exporting metrics on queue depth, function execution duration, and error rates to a monitoring system like Prometheus to provide true operational insight.
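As a stopgap until proper exporters are wired in, even a small cron-driven script that pushes queue depth to a Prometheus Pushgateway gives an early-warning signal; the gateway address, metric names, and queue URL below are placeholders:

# queue_depth_exporter.py - stopgap observability: push SQS backlog to a Pushgateway.
# Gateway address, metric names, and queue URL are placeholders.
import boto3
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/visualization-jobs-queue"

sqs = boto3.client("sqs", region_name="us-east-1")
attrs = sqs.get_queue_attributes(
    QueueUrl=QUEUE_URL,
    AttributeNames=["ApproximateNumberOfMessages", "ApproximateNumberOfMessagesNotVisible"],
)["Attributes"]

registry = CollectorRegistry()
backlog = Gauge("viz_jobs_queue_backlog", "Messages waiting in the queue", registry=registry)
in_flight = Gauge("viz_jobs_in_flight", "Messages currently being processed", registry=registry)
backlog.set(int(attrs["ApproximateNumberOfMessages"]))
in_flight.set(int(attrs["ApproximateNumberOfMessagesNotVisible"]))

push_to_gateway("pushgateway.monitoring:9091", job="viz-queue-depth", registry=registry)

It is deliberately crude, but it answers the question that matters most for this pipeline: is work arriving faster than the functions can drain it?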