The initial architecture was deceptively simple: a synchronous API endpoint that accepted a JSON payload, generated a moderately complex scientific plot using Matplotlib, and returned the PNG image. This worked for a dozen concurrent users. It collapsed completely at a hundred. Requests backed up, workers saturated, and timeouts became the norm. The core problem was coupling the immediate user request to a long-running, CPU-intensive task. The clear path forward was a decoupled, asynchronous pipeline, but the implementation details were where the real engineering challenges lay.
Our goal became to build a system that could ingest thousands of visualization requests per minute, process them reliably, and scale compute resources on demand without manual intervention. The user would submit a job and get a job ID back immediately; the result could be polled for later. This pointed directly to a message queue and a pool of scalable workers.
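To make that contract concrete, here is a minimal sketch of the submit side, assuming a boto3 client and an illustrative queue URL; the real API endpoint, job-ID scheme, and results store are deliberately left out of this log:

# submit_job.py - illustrative sketch of the async submit/poll contract.
# The queue URL is a placeholder; clients talk to our API endpoint, not to SQS directly.
import json
import uuid

import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/visualization-jobs-queue"
sqs = boto3.client("sqs", region_name="us-east-1")

def submit_job(payload: dict) -> str:
    """Enqueue a visualization request and return a job ID immediately."""
    job_id = str(uuid.uuid4())
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"job_id": job_id, **payload}),
    )
    # The client polls for the finished artifact later using this ID.
    return job_id

if __name__ == "__main__":
    print(submit_job({"plot_type": "scatter", "data": {"x": [1, 2], "y": [3, 4]}}))

Everything downstream of that send_message call is what the rest of this post is about.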
The technology selection process was driven by pragmatism and our existing ecosystem. We run on AWS, so a managed Kubernetes service was a given. AWS EKS abstracts away the control plane’s operational burden, which is a significant win. For the queue, AWS SQS was the default choice—it’s robust, managed, and its dead-letter queue (DLQ) feature is critical for building resilient systems. The real debate was around the compute layer. We could have built a custom Kubernetes deployment with Horizontal Pod Autoscalers (HPAs) listening to SQS queue depth, but that felt like reinventing the wheel. OpenFaaS, running on top of EKS, offered a higher level of abstraction. Its serverless model, with built-in SQS triggers and auto-scaling based on in-flight requests, promised to handle the boilerplate orchestration, letting us focus on the function code.
Finally, for automation, we chose Ansible. While Terraform is excellent for declarative infrastructure provisioning, our team’s expertise was stronger with Ansible. More importantly, Ansible is adept at both provisioning infrastructure (via AWS modules) and handling the subsequent configuration management—like installing Helm charts, creating Kubernetes secrets, and managing application deployments. A single tool to manage the entire lifecycle from VPC to running function was a compelling proposition.
This is the log of how we built that pipeline, the configuration code that underpins it, and the pitfalls discovered along the way.
Stage 1: Provisioning the EKS Foundation with Ansible
The first step is laying the groundwork: the EKS cluster itself. A common mistake here is to create a monolithic Ansible playbook. In a real-world project, this becomes unmaintainable. We structure our Ansible project using roles, promoting reusability and separation of concerns.
The project structure looks like this:
ansible-openfaas-eks/
├── ansible.cfg
├── inventory/
│   └── hosts
├── roles/
│   ├── 01_eks_cluster/
│   │   └── tasks/
│   │       └── main.yml
│   ├── 02_k8s_tools/
│   │   └── tasks/
│   │       └── main.yml
│   ├── 03_openfaas_deploy/
│   │   └── tasks/
│   │       └── main.yml
│   └── 04_sqs_setup/
│       └── tasks/
│           └── main.yml
└── playbook.yml
The playbook.yml orchestrates these roles. The first role, 01_eks_cluster, handles the creation of the VPC, subnets, and the EKS cluster. We use the amazon.aws.eks_cluster module, which is a high-level abstraction that simplifies this process significantly.
Here is a core part of roles/01_eks_cluster/tasks/main.yml. It’s not just about creating the cluster; it’s about doing it idempotently and capturing the output for subsequent steps.
# roles/01_eks_cluster/tasks/main.yml
- name: Ensure VPC for EKS exists
amazon.aws.ec2_vpc_net:
name: "openfaas-eks-vpc"
cidr_block: "10.10.0.0/16"
region: "{{ aws_region }}"
state: present
tags:
Project: "OpenFaaS-Viz-Pipeline"
register: vpc_result
- name: Create two public subnets for EKS
amazon.aws.ec2_vpc_subnet:
state: present
vpc_id: "{{ vpc_result.vpc.id }}"
cidr: "{{ item }}"
region: "{{ aws_region }}"
map_public: yes
tags:
"kubernetes.io/cluster/openfaas-viz-cluster": "shared"
loop:
- "10.10.1.0/24"
- "10.10.2.0/24"
register: subnet_results
- name: Create the EKS cluster IAM role
amazon.aws.iam_role:
name: "EKSClusterRole-VizPipeline"
state: present
assume_role_policy_document: "{{ lookup('file', 'eks_cluster_assume_role_policy.json') }}"
managed_policy:
- "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
register: cluster_role
- name: Create the EKS node group IAM role
amazon.aws.iam_role:
name: "EKSNodeRole-VizPipeline"
state: present
assume_role_policy_document: "{{ lookup('file', 'ec2_assume_role_policy.json') }}"
managed_policy:
- "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
- "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
- "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
register: node_role
- name: Provision the EKS cluster
# This task can take 15-20 minutes. It's idempotent.
amazon.aws.eks_cluster:
name: "openfaas-viz-cluster"
state: present
region: "{{ aws_region }}"
role_arn: "{{ cluster_role.arn }}"
vpc_config:
subnet_ids: "{{ subnet_results.results | map(attribute='subnet.id') | list }}"
public_access: yes
private_access: no
- name: Create a managed node group for OpenFaaS functions
# We specify instance types suitable for CPU-bound Matplotlib tasks
amazon.aws.eks_nodegroup:
name: "faas-workers"
cluster_name: "openfaas-viz-cluster"
state: present
node_role: "{{ node_role.arn }}"
subnets: "{{ subnet_results.results | map(attribute='subnet.id') | list }}"
instance_types: ["c5.large", "c5a.large"]
scaling_config:
min_size: 1
max_size: 5
desired_size: 2
disk_size: 40
region: "{{ aws_region }}"
tags:
Purpose: "OpenFaaS-Function-Workers"
The key here is the use of register to capture results and pass them to subsequent tasks, and the explicit definition of IAM roles with the minimal necessary policies. The pitfall to avoid is using overly permissive IAM roles, a common mistake in quick tutorials.
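As an optional sanity check after the playbook run (not part of the roles themselves), a few lines of boto3 confirm that the cluster and node group actually reached ACTIVE; the names below match the tasks above:

# check_eks.py - optional post-provisioning smoke test (assumes boto3 credentials).
import boto3

eks = boto3.client("eks", region_name="us-east-1")

cluster = eks.describe_cluster(name="openfaas-viz-cluster")["cluster"]
nodegroup = eks.describe_nodegroup(
    clusterName="openfaas-viz-cluster", nodegroupName="faas-workers"
)["nodegroup"]

print("cluster status:   ", cluster["status"])           # expect ACTIVE
print("cluster endpoint: ", cluster["endpoint"])
print("nodegroup status: ", nodegroup["status"])          # expect ACTIVE
print("nodegroup scaling:", nodegroup["scalingConfig"])   # min/max/desired from the playbook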
Stage 2: Deploying OpenFaaS with Ansible and Helm
Once the EKS cluster is running, we need kubectl configured and Helm installed to deploy OpenFaaS. The 02_k8s_tools role handles this setup on the Ansible control node. Then, 03_openfaas_deploy uses the community.kubernetes.helm module to deploy OpenFaaS.
Using the Helm module is superior to shelling out to helm commands because it’s more declarative and idempotent.
# roles/03_openfaas_deploy/tasks/main.yml
- name: Add OpenFaaS Helm repository
community.kubernetes.helm_repository:
name: openfaas
repo_url: "https://openfaas.github.io/faas-netes/"
state: present
- name: Create namespaces for OpenFaaS
community.kubernetes.k8s:
name: "{{ item }}"
api_version: v1
kind: Namespace
state: present
loop:
- openfaas
- openfaas-fn
- name: Generate a password for the OpenFaaS gateway
# We store this securely, for example in Ansible Vault
ansible.builtin.command: "openssl rand -base64 32"
register: faas_password
changed_when: false
- name: Create secret for OpenFaaS gateway password
community.kubernetes.k8s:
state: present
definition:
apiVersion: v1
kind: Secret
metadata:
name: basic-auth
namespace: openfaas
type: Opaque
stringData:
basic-auth-user: admin
basic-auth-password: "{{ faas_password.stdout }}"
- name: Deploy OpenFaaS using Helm chart
community.kubernetes.helm:
name: openfaas
chart_ref: openfaas/openfaas
release_namespace: openfaas
state: present
values:
functionNamespace: "openfaas-fn"
generateBasicAuth: false # We created the secret manually
serviceType: LoadBalancer
operator:
create: true
faasnetes:
readTimeout: "5m"
writeTimeout: "5m"
imagePullPolicy: "Always"
queueWorker:
ackWait: "5m" # Critical for long-running functions
A crucial detail here is increasing the timeouts (readTimeout, writeTimeout, ackWait). Matplotlib can take a non-trivial amount of time to generate a complex plot, especially during a cold start. The default timeouts of 60 seconds are often too short for data processing tasks, leading to premature request termination and unnecessary retries. Setting them to 5 minutes provides a safe buffer.
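A quick way to choose those values with data rather than guesswork is to time a worst-case plot locally. The harness below is a rough sketch with synthetic data; the array sizes are arbitrary placeholders, not our production payload:

# time_plot.py - rough harness for sizing the gateway and queue-worker timeouts.
# The data volume is arbitrary; substitute a realistic worst-case payload.
import time

import matplotlib
matplotlib.use("Agg")  # headless backend, same as MPLBACKEND=Agg in the function
import matplotlib.pyplot as plt
import numpy as np

start = time.perf_counter()

x = np.random.rand(500_000)
y = np.random.rand(500_000)
fig, ax = plt.subplots()
ax.scatter(x, y, s=1)
fig.savefig("/tmp/timing-test.png", format="png")
plt.close(fig)

print(f"render + save took {time.perf_counter() - start:.1f}s")

If the worst case creeps anywhere near the configured timeout, raise the timeout or shrink the payload before it becomes a production incident.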
Stage 3: The SQS Bridge and Resilience Pattern
This is where the architecture’s resilience is defined. We need an SQS queue for incoming jobs and a Dead-Letter Queue (DLQ) to catch messages that fail processing repeatedly. If a function fails to process a message, the OpenFaaS SQS connector does not delete it, so the message becomes visible again in the queue once its visibility timeout expires. If it fails a configured number of times, SQS automatically moves it to the DLQ for manual inspection. This prevents a single malformed message from poisoning the queue and halting the entire pipeline.
Ansible manages this AWS infrastructure cleanly.
# roles/04_sqs_setup/tasks/main.yml
- name: Create the Dead-Letter Queue (DLQ) for failed jobs
amazon.aws.sqs_queue:
name: "visualization-jobs-dlq"
region: "{{ aws_region }}"
state: present
# Keep failed messages for 14 days for analysis
message_retention_period: 1209600
register: dlq
- name: Create the main SQS queue with a Redrive Policy
amazon.aws.sqs_queue:
name: "visualization-jobs-queue"
region: "{{ aws_region }}"
state: present
visibility_timeout: 300 # Must be > function execution time
redrive_policy:
deadLetterTargetArn: "{{ dlq.queue_arn }}"
maxReceiveCount: 3 # Retry a failing message 3 times before moving to DLQ
register: main_queue
- name: Store queue URLs for the function deployment
ansible.builtin.set_fact:
main_queue_url: "{{ main_queue.queue_url }}"
dlq_queue_arn: "{{ dlq.queue_arn }}"
The maxReceiveCount: 3 is a critical business decision. It balances giving a transient failure a chance to resolve against the latency of processing valid messages behind a failing one. The visibility_timeout of 300 seconds (5 minutes) is synchronized with our OpenFaaS gateway timeouts: if the function takes 4 minutes to run, the message must not become visible again in the queue during that time, or another worker would pick it up and perform duplicate work.
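In practice, “manual inspection” of the DLQ means a small script rather than clicking through the console. A sketch along these lines (the queue URL is a placeholder) prints each failed payload along with how many delivery attempts it consumed:

# inspect_dlq.py - peek at messages that exhausted their retries (illustrative).
import boto3

DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/visualization-jobs-dlq"
sqs = boto3.client("sqs", region_name="us-east-1")

resp = sqs.receive_message(
    QueueUrl=DLQ_URL,
    MaxNumberOfMessages=10,
    AttributeNames=["ApproximateReceiveCount"],
    VisibilityTimeout=60,  # hide the batch briefly while we look at it
)

for msg in resp.get("Messages", []):
    attempts = msg["Attributes"]["ApproximateReceiveCount"]
    print(f"--- delivered {attempts} times ---")
    print(msg["Body"][:500])  # first 500 chars of the failed payload
# No delete_message() calls: this is inspection only, the messages stay in the DLQ.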
Stage 4: The Visualization Function - Code and Container
Now for the core logic. The function lives in a directory with a stack.yml manifest, a handler.py, and a requirements.txt.
stack.yml defines the function, its dependencies, and, crucially, the SQS trigger annotation.
# stack.yml
version: 1.0
provider:
name: openfaas
gateway: http://127.0.0.1:8080
functions:
plot-generator:
lang: python3-debian
handler: ./plot-generator
image: my-docker-registry/plot-generator:latest
secrets:
- aws-credentials # Kubernetes secret containing AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
environment:
# AWS_REGION is needed by the SQS connector
AWS_REGION: "us-east-1"
# Disable Matplotlib's GUI backend in a headless environment
MPLBACKEND: "Agg"
annotations:
# This connects the function to the SQS queue
topic: "arn:aws:sqs:us-east-1:123456789012:visualization-jobs-queue"
The topic annotation is the magic that instructs the OpenFaaS SQS connector to poll our queue and invoke this function. The aws-credentials secret must be created in the openfaas-fn namespace to grant the connector permissions to access SQS.
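Whether that secret is created by another Ansible task or by hand, the shape is the same. As an illustration with the official kubernetes Python client, with placeholder credential values and key names following the comment in stack.yml:

# create_aws_secret.py - illustrative one-off for the aws-credentials secret.
# Placeholder values; key names follow the comment in stack.yml.
from kubernetes import client, config

config.load_kube_config()  # kubeconfig written when the EKS cluster was provisioned

secret = client.V1Secret(
    metadata=client.V1ObjectMeta(name="aws-credentials", namespace="openfaas-fn"),
    type="Opaque",
    string_data={
        "AWS_ACCESS_KEY_ID": "AKIA...REPLACE_ME",
        "AWS_SECRET_ACCESS_KEY": "REPLACE_ME",
    },
)

client.CoreV1Api().create_namespaced_secret(namespace="openfaas-fn", body=secret)
print("created secret openfaas-fn/aws-credentials")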
The function code itself needs to be robust. It must handle malformed input and potential errors from Matplotlib.
plot-generator/handler.py:
# plot-generator/handler.py
import json
import logging
import sys
import os
import base64
# A common pitfall: Matplotlib can cause issues with font caches in containers.
# It's best practice to set the cache directory to a writable location.
os.environ['MPLCONFIGDIR'] = "/tmp/matplotlib"
import matplotlib.pyplot as plt
import numpy as np
# Configure logging to be more informative than default prints
logging.basicConfig(
stream=sys.stdout,
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
def handle(event, context):
"""
Handles an SQS event to generate a Matplotlib plot.
Args:
event (object): The event payload from SQS via OpenFaaS.
The message body is in event.body.
context (object): The invocation context.
Returns:
A dict with statusCode and a body. On success, the body contains a
base64 encoded PNG. On failure, it contains an error message.
An HTTP 5xx status code signals OpenFaaS to not delete the SQS message.
"""
try:
# The SQS message body is passed as a byte string in event.body
body = json.loads(event.body)
logger.info(f"Processing job for plot type: {body.get('plot_type')}")
# Basic input validation
if 'data' not in body or 'plot_type' not in body:
logger.error("Invalid payload: 'data' or 'plot_type' missing.")
return {
"statusCode": 400, # Bad request, don't retry
"body": "Invalid payload: 'data' and 'plot_type' are required."
}
# Matplotlib plotting logic
fig, ax = plt.subplots()
if body['plot_type'] == 'scatter':
x = np.array(body['data']['x'])
y = np.array(body['data']['y'])
ax.scatter(x, y)
elif body['plot_type'] == 'histogram':
data = np.array(body['data']['values'])
ax.hist(data, bins=body.get('bins', 10))
else:
logger.error(f"Unsupported plot type: {body['plot_type']}")
return {
"statusCode": 400, # Bad request, don't retry
"body": f"Unsupported plot type: {body['plot_type']}"
}
ax.set_title(body.get('title', 'Generated Plot'))
# Save plot to a memory buffer instead of a file
from io import BytesIO
buf = BytesIO()
plt.savefig(buf, format='png')
plt.close(fig) # Important: release memory used by the plot
buf.seek(0)
image_base64 = base64.b64encode(buf.getvalue()).decode('utf-8')
logger.info("Successfully generated plot.")
return {
"statusCode": 200,
"headers": {"Content-Type": "application/json"},
"body": json.dumps({
"message": "Plot generated successfully",
"image_png_base64": image_base64
})
}
    except json.JSONDecodeError as e:
        logger.exception("Failed to decode JSON from SQS message body.")
        # A malformed message will never succeed on retry. Ideally a validation layer
        # in front of the queue would reject it before it gets here; failing that, we
        # return a 5xx so the redrive policy parks it in the DLQ after maxReceiveCount
        # attempts, where it can be inspected instead of silently disappearing.
        return {"statusCode": 500, "body": f"JSON Decode Error: {str(e)}"}
except Exception as e:
logger.exception("An unexpected error occurred during plot generation.")
# Return a 500-level error. The OpenFaaS SQS connector will see this
# and will NOT delete the message from the queue. SQS will make it
# visible again after the timeout for another attempt.
return {
"statusCode": 502, # Using 502 Bad Gateway to indicate upstream failure
"body": f"Plot generation failed: {str(e)}"
}
The error handling logic is paramount. A statusCode of 200 signals success, and the SQS connector deletes the message. Any other status code (particularly 5xx) signals failure, and the message is kept for another attempt. After maxReceiveCount failures, it goes to the DLQ. We also must explicitly close the Matplotlib figure (plt.close(fig)) to prevent memory leaks in the function worker process, which can be reused across invocations.
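Before building the image, it is worth exercising handle() locally. The stub event below only mimics the .body attribute the handler reads; it is not the real watchdog event object, but it is enough to check both the success path and the bad-input path:

# test_handler_local.py - run the function handler outside OpenFaaS.
# The Event stub only mimics the .body attribute handle() reads; it is not the
# real watchdog event object.
import base64
import json

import handler  # plot-generator/handler.py

class Event:
    def __init__(self, body: str):
        self.body = body

good = Event(json.dumps({
    "plot_type": "scatter",
    "title": "Local smoke test",
    "data": {"x": [1, 2, 3, 4], "y": [10, 20, 15, 30]},
}))
bad = Event("this is not json")

ok = handler.handle(good, context=None)
assert ok["statusCode"] == 200
png = base64.b64decode(json.loads(ok["body"])["image_png_base64"])
with open("/tmp/smoke-test.png", "wb") as f:
    f.write(png)
print(f"wrote /tmp/smoke-test.png ({len(png)} bytes)")

err = handler.handle(bad, context=None)
assert err["statusCode"] == 500  # would be retried and eventually land in the DLQ
print("malformed payload handled as expected")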
The final architectural diagram, representing both the success and failure paths, solidifies the design.
graph TD
    subgraph Ansible Control Node
        A[playbook.yml] -- Provisions & Configures --> B((AWS EKS));
        A -- Provisions & Configures --> C((AWS SQS));
    end
    subgraph User Interaction
        D[Client App] -- 1. POST /job (JSON Payload) --> E(API Endpoint);
    end
    subgraph AWS Cloud
        E -- 2. Pushes message --> C(visualization-jobs-queue);
        C -- 3. SQS Event (maxReceiveCount: 3) --> F{OpenFaaS SQS Connector};
        C -- After 3 failures --> G(visualization-jobs-dlq);
        subgraph EKS Cluster [B]
            F -- 4. Invokes function --> H[plot-generator Pod];
        end
        H -- 5a. On Success (HTTP 200) --> F;
        F -- 6a. Deletes message --> C;
        H -- 5b. On Failure (HTTP 502) --> F;
        F -- 6b. NACK (message becomes visible again) --> C;
    end
    subgraph Operations
        I[On-call Engineer] -- Investigates --> G;
    end
This system achieves the initial goals. It’s asynchronous, scalable due to OpenFaaS on EKS, and resilient thanks to the SQS Redrive Policy. Ansible provides a single, idempotent workflow for creating and managing the entire stack.
However, this architecture isn’t without its limitations. The cold start performance of a Python function with heavy dependencies like Matplotlib and NumPy can be a significant factor. For workloads requiring sub-second latency after a period of inactivity, this design might be insufficient. Solutions could involve OpenFaaS’s scale-to-one-or-more setting (minReplicas) to keep at least one pod warm, at the cost of idle resources. Furthermore, the reliance on Ansible for both provisioning and configuration, while convenient, blurs the line between infrastructure state and software state. In a larger, more complex system, splitting these concerns (Terraform for the underlying EKS and VPC infrastructure, Ansible purely for the in-cluster software deployment) could lead to a more robust and maintainable IaC practice. The current observability is also minimal, relying on function logs. A production-grade system would require exporting metrics on queue depth, function execution duration, and error rates to a monitoring system like Prometheus to provide true operational insight.
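As a stopgap until proper exporters are wired in, even a small cron-driven script that pushes queue depth to a Prometheus Pushgateway gives an early-warning signal; the gateway address, metric names, and queue URL below are placeholders:

# queue_depth_exporter.py - stopgap observability: push SQS backlog to a Pushgateway.
# Gateway address, metric names, and queue URL are placeholders.
import boto3
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/visualization-jobs-queue"

sqs = boto3.client("sqs", region_name="us-east-1")
attrs = sqs.get_queue_attributes(
    QueueUrl=QUEUE_URL,
    AttributeNames=["ApproximateNumberOfMessages", "ApproximateNumberOfMessagesNotVisible"],
)["Attributes"]

registry = CollectorRegistry()
backlog = Gauge("viz_jobs_queue_backlog", "Messages waiting in the queue", registry=registry)
in_flight = Gauge("viz_jobs_in_flight", "Messages currently being processed", registry=registry)
backlog.set(int(attrs["ApproximateNumberOfMessages"]))
in_flight.set(int(attrs["ApproximateNumberOfMessagesNotVisible"]))

push_to_gateway("pushgateway.monitoring:9091", job="viz-queue-depth", registry=registry)

It is deliberately crude, but it answers the question that matters most for this pipeline: is work arriving faster than the functions can drain it?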