A 15% latency regression in the p99 response time for our core checkout API made it to production. The root cause was a seemingly innocuous library update. Our existing CI performance checks, which only asserted against average response time, remained green. The average hadn’t moved much, but the tail latency had ballooned. This incident was the catalyst for building a system that moved beyond single-value assertions and towards comprehensive, visual performance regression analysis on every pull request. The goal was simple: no developer should have to guess the performance impact of their change; a clear, data-driven visualization should tell them.
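To make that failure mode concrete, here is a small illustrative sketch (synthetic numbers, not data from the incident): a change that sends a small fraction of requests down a slow path barely moves the mean but inflates the tail.

import numpy as np

rng = np.random.default_rng(42)

# Baseline: latencies tightly clustered around 100 ms
baseline = rng.normal(100, 5, 100_000)

# "Regressed" build: most requests are unchanged, but ~2% hit a slow path (+200 ms)
current = rng.normal(100, 5, 100_000)
current[rng.random(100_000) < 0.02] += 200

for name, sample in (("baseline", baseline), ("current", current)):
    print(f"{name}: mean={sample.mean():.1f} ms  p99={np.percentile(sample, 99):.1f} ms")

# The mean shifts by only a few percent, so an average-only assertion stays green,
# while the p99 roughly triples. This is exactly the regression that reached production.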
Our initial concept was an automated pipeline that would:
- On every PR, deploy the new version of the service to an isolated, production-like staging environment.
- Run a standardized load test against it.
- Simultaneously, run the same test against the current main branch baseline.
- Generate a comparative visual report showing the full latency distributions of both versions.
- Attach this report to the PR and fail the build if a statistically significant regression was detected.
This required integrating our application code, infrastructure management, CI orchestration, and a powerful data visualization tool. After some deliberation, we settled on a stack. A monorepo to manage the coupled concerns of application code and its performance tests. Docker Swarm for a pragmatic, low-overhead orchestration of our testing environments. CircleCI to run the workflow. And Python’s Seaborn library, running headlessly in CI, to generate the statistical visualizations that were the entire point of this endeavor.
In a real-world project, the choice of Docker Swarm over Kubernetes often comes down to operational cost. For an internal, ephemeral testing platform, the complexity of managing a full-blown Kubernetes cluster, its networking, and RBAC was overkill. Docker Swarm’s simplicity, using familiar docker-compose syntax via docker stack deploy, provided the necessary container orchestration with a fraction of the setup and maintenance burden. This was a classic “good enough” engineering decision that let us focus on the pipeline’s logic rather than the underlying infrastructure.
The monorepo structure was non-negotiable. A change to an API endpoint in the application service must be reviewed and merged in the same PR as the corresponding change to its load test script. Separating these into different repositories creates coordination overhead and risk. We organized our project to reflect this coupling.
# Monorepo Root
.
├── .circleci
│ └── config.yml # The heart of the automation
├── apps
│ └── api-service # The Go application under test
│ ├── Dockerfile
│ ├── go.mod
│ └── main.go
├── perf-tests
│ └── scenarios
│ └── checkout_api.js # k6 load test script
├── report-generator
│ ├── Dockerfile
│ ├── generate_report.py # The Python/Seaborn script
│ └── requirements.txt
└── docker-stack.yml # For Docker Swarm deployment
Our application under test is a simple Go service with endpoints designed to exhibit different performance profiles. This allows us to test the sensitivity of our reporting pipeline.
apps/api-service/main.go:
package main
import (
"encoding/json"
"log"
"math/rand"
"net/http"
"time"
)
func main() {
http.HandleFunc("/fast", fastHandler)
http.HandleFunc("/variable", variableLatencyHandler) // Endpoint with jitter
log.Println("API Service starting on port 8080")
if err := http.ListenAndServe(":8080", nil); err != nil {
log.Fatalf("Could not start server: %s\n", err)
}
}
func fastHandler(w http.ResponseWriter, r *http.Request) {
time.Sleep(50 * time.Millisecond) // Consistent 50ms latency
w.WriteHeader(http.StatusOK)
json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
}
// This handler introduces variability, a common source of p99 regressions.
func variableLatencyHandler(w http.ResponseWriter, r *http.Request) {
// Base latency of 80ms
baseSleep := 80 * time.Millisecond
// 1 in 10 requests gets an additional 200ms of latency
if rand.Intn(10) == 0 {
baseSleep += 200 * time.Millisecond
}
time.Sleep(baseSleep)
w.WriteHeader(http.StatusOK)
json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
}
The Dockerfile for this service is standard. The key is to produce a small, self-contained image.
apps/api-service/Dockerfile:
# Stage 1: Build the Go binary
FROM golang:1.21-alpine AS builder
WORKDIR /app
COPY go.mod ./
RUN go mod download
COPY . .
# Build the binary with optimizations
RUN CGO_ENABLED=0 GOOS=linux go build -v -o api-service .
# Stage 2: Create the final small image
FROM alpine:latest
WORKDIR /root/
# Copy the binary from the builder stage
COPY --from=builder /app/api-service .
# Expose port and run
EXPOSE 8080
CMD ["./api-service"]
The Docker Swarm stack definition file describes the services needed for the performance test. For each PR, we deploy a new stack with a unique name to ensure isolation.
docker-stack.yml:
version: '3.8'
services:
api_service:
# The image tag will be dynamically replaced by CircleCI
image: my-private-registry/api-service:placeholder
networks:
- perf_test_net
deploy:
replicas: 3
update_config:
parallelism: 1
delay: 10s
restart_policy:
condition: on-failure
networks:
perf_test_net:
driver: overlay
The core of the automation lies within the CircleCI configuration. We define a workflow that chains together building the image, deploying two separate stacks to Swarm (one for the PR branch, one for the main baseline), running tests against both, generating a report, and finally tearing everything down. A common pitfall here is not ensuring the teardown step runs regardless of whether the tests pass or fail, leading to orphaned resources.
.circleci/config.yml:
version: 2.1
# Orbs simplify complex operations
orbs:
  # Orb versions elided here; pin to the versions you have vetted
  k6: loadimpact/k6@x.y.z
  aws-cli: circleci/aws-cli@x.y.z
  path-filtering: circleci/path-filtering@x.y.z
# Reusable executor for Docker operations
executors:
docker-executor:
docker:
- image: cimg/base:2023.08
environment:
SWARM_MANAGER_IP: << pipeline.parameters.swarm_manager_ip >>
# DOCKER_HOST is used by the docker client
DOCKER_HOST: ssh://ci-user@<< pipeline.parameters.swarm_manager_ip >>
# Pipeline parameters to inject dynamic configuration
parameters:
  swarm_manager_ip:
    type: string
    default: "192.168.1.100" # Example IP
  # Set to true by the path-filtering mapping below; declared alongside the
  # continuation workflow in .circleci/continue_config.yml
  run_perf_tests:
    description: "A trigger parameter for the performance workflow"
    type: boolean
    default: false
workflows:
  # Setup workflow: path filtering decides whether the expensive tests run at all
  build_test_and_report:
    jobs:
      - path-filtering/filter:
          base-revision: main
          config-path: .circleci/continue_config.yml
          mapping: |
            apps/api-service/.* run_perf_tests true
            perf-tests/.* run_perf_tests true
            report-generator/.* run_perf_tests true
  # Continuation workflow, gated on the parameter set by path filtering
  performance_pipeline:
    when: << pipeline.parameters.run_perf_tests >>
    jobs:
      - build_images:
          context: docker-registry-creds
      - run_perf_test_suite:
          context: swarm-creds
          requires:
            - build_images
jobs:
build_images:
executor: docker-executor
steps:
- checkout
- setup_remote_docker:
version: 20.10.18
- run:
name: Log in to Docker Registry
command: echo $DOCKER_REG_PASS | docker login -u $DOCKER_REG_USER --password-stdin $DOCKER_REG_URL
- run:
name: Build and Push API Service Image
command: |
docker build -t $DOCKER_REG_URL/api-service:$CIRCLE_SHA1 apps/api-service
docker push $DOCKER_REG_URL/api-service:$CIRCLE_SHA1
- run:
name: Build and Push Report Generator Image
command: |
docker build -t $DOCKER_REG_URL/report-generator:latest report-generator
docker push $DOCKER_REG_URL/report-generator:latest
run_perf_test_suite:
executor: docker-executor
steps:
- checkout
- setup_remote_docker
- aws-cli/setup
- run:
name: Setup SSH for Docker Swarm access
command: |
mkdir -p ~/.ssh
echo "$SWARM_PRIVATE_KEY" > ~/.ssh/id_rsa
chmod 600 ~/.ssh/id_rsa
ssh-keyscan -H << pipeline.parameters.swarm_manager_ip >> >> ~/.ssh/known_hosts
# Deploy, test, and collect results for the PR branch
- deploy_and_test:
stack_name: "pr-${CIRCLE_BUILD_NUM}"
image_tag: "${CIRCLE_SHA1}"
output_file: "pr_results.json"
# Deploy, test, and collect results for the main branch baseline
- deploy_and_test:
stack_name: "main-${CIRCLE_BUILD_NUM}"
image_tag: "main" # Assuming 'main' tag points to the latest main build
output_file: "main_results.json"
- generate_and_upload_report:
pr_results: "pr_results.json"
main_results: "main_results.json"
- run:
name: Always Teardown Stacks
command: |
docker stack rm "pr-${CIRCLE_BUILD_NUM}" || true
docker stack rm "main-${CIRCLE_BUILD_NUM}" || true
when: always
# Reusable command definitions to avoid repetition
commands:
deploy_and_test:
parameters:
stack_name: { type: string }
image_tag: { type: string }
output_file: { type: string }
steps:
- run:
name: Deploy << parameters.stack_name >> stack to Swarm
command: |
sed "s/image:.*/image: $DOCKER_REG_URL\/api-service:<< parameters.image_tag >>/" docker-stack.yml > docker-stack-<< parameters.stack_name >>.yml
docker stack deploy -c docker-stack-<< parameters.stack_name >>.yml << parameters.stack_name >>
# A common mistake is not waiting for the service to be ready.
# In a production script, you'd poll the service status until it's converged.
echo "Waiting for service to stabilize..."
sleep 30
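              # A hedged sketch of a convergence check to use instead of the fixed sleep:
              # poll until every replica of the service reports as running, with a bounded
              # number of attempts (left commented out here).
              # for i in $(seq 1 30); do
              #   replicas=$(docker service ls --filter "name=<< parameters.stack_name >>_api_service" --format '{{.Replicas}}')
              #   [ "$replicas" = "3/3" ] && break
              #   sleep 5
              # done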
- k6/run:
script: perf-tests/scenarios/checkout_api.js
out: json=<< parameters.output_file >>
env: "TARGET_URL=http://<< parameters.stack_name >>_api_service:8080"
- persist_to_workspace:
root: .
paths:
- << parameters.output_file >>
generate_and_upload_report:
parameters:
pr_results: { type: string }
main_results: { type: string }
steps:
- attach_workspace:
at: /tmp/workspace
- run:
name: Generate Visual Report
command: |
docker run --rm \
-v /tmp/workspace:/data \
$DOCKER_REG_URL/report-generator:latest \
python generate_report.py \
--current /data/<< parameters.pr_results >> \
--baseline /data/<< parameters.main_results >> \
--output /data/report.png \
--endpoint "/variable"
- store_artifacts:
path: /tmp/workspace/report.png
This CircleCI configuration is complex but robust. It uses path filtering to avoid running expensive tests on documentation changes. It parameterizes secrets and environment-specific details. Crucially, it uses a custom command, deploy_and_test, to encapsulate the logic for deploying a stack, running k6, and saving the results, which is then called twice: once for the PR and once for the baseline.
The k6 test script itself defines the load profile.
perf-tests/scenarios/checkout_api.js:
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Trend } from 'k6/metrics';
// Custom Trend metric to capture response times for a specific endpoint
const variableLatencyTrend = new Trend('variable_endpoint_latency', true);
export const options = {
stages: [
{ duration: '30s', target: 20 }, // Ramp up to 20 virtual users
{ duration: '1m', target: 20 }, // Stay at 20 VUs for 1 minute
{ duration: '10s', target: 0 }, // Ramp down
],
thresholds: {
// We don't fail here; failure is determined by the report generator.
'http_req_failed': ['rate<0.01'],
'http_req_duration': ['p(95)<500'],
},
};
export default function () {
const targetHost = __ENV.TARGET_URL || 'http://localhost:8080';
// Test the endpoint with variable latency
const res = http.get(`${targetHost}/variable`);
check(res, { 'status was 200': (r) => r.status === 200 });
  variableLatencyTrend.add(res.timings.duration, { endpoint: '/variable' }); // Tag the sample so the report generator can filter by endpoint
sleep(1);
}
The final piece is the Python script that consumes the k6 JSON output and uses Seaborn to create the comparison plot. This script is the “brain” of the operation, turning raw numbers into actionable insight.
report-generator/generate_report.py:
import argparse
import json
import sys
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
def load_k6_results(file_path: str, endpoint_filter: str) -> pd.DataFrame:
"""Loads k6 JSON output and filters for relevant metrics."""
latencies = []
with open(file_path, 'r') as f:
for line in f:
try:
data = json.loads(line)
                # Only 'Point' entries carry individual samples. The metric name lives in
                # data['metric']; the 'endpoint' tag is attached by the k6 script when it
                # records the custom Trend metric.
                if (data.get('type') == 'Point'
                        and data.get('metric') == 'variable_endpoint_latency'
                        and data['data'].get('tags', {}).get('endpoint') == endpoint_filter):
                    latencies.append(data['data']['value'])
except (json.JSONDecodeError, KeyError):
continue
if not latencies:
# A common pitfall is empty results files. Handle this gracefully.
print(f"Warning: No latency data found in {file_path} for endpoint {endpoint_filter}")
return pd.DataFrame()
return pd.DataFrame({'latency_ms': latencies})
def create_comparison_plot(current_df: pd.DataFrame, baseline_df: pd.DataFrame, output_path: str, endpoint: str):
"""Generates and saves a violin plot comparing two latency distributions."""
# Add a 'version' column to each dataframe before concatenating
current_df['version'] = 'Current PR'
baseline_df['version'] = 'Baseline (main)'
combined_df = pd.concat([current_df, baseline_df])
sns.set_theme(style="whitegrid")
plt.figure(figsize=(12, 8))
# A violin plot is excellent for visualizing distribution shape and density.
ax = sns.violinplot(x='latency_ms', y='version', data=combined_df, palette='muted', inner='quartile', cut=0)
plt.title(f'Performance Regression Analysis for Endpoint: {endpoint}', fontsize=16)
plt.xlabel('Response Time (ms)', fontsize=12)
plt.ylabel('')
# Calculate and annotate key statistics
current_p95 = np.percentile(current_df['latency_ms'], 95)
baseline_p95 = np.percentile(baseline_df['latency_ms'], 95)
current_avg = current_df['latency_ms'].mean()
baseline_avg = baseline_df['latency_ms'].mean()
p95_change = ((current_p95 - baseline_p95) / baseline_p95) * 100 if baseline_p95 else 0
stats_text = (
f"Baseline (main):\n"
f" - Avg: {baseline_avg:.2f} ms\n"
f" - p95: {baseline_p95:.2f} ms\n\n"
f"Current PR:\n"
f" - Avg: {current_avg:.2f} ms\n"
f" - p95: {current_p95:.2f} ms\n\n"
f"p95 Change: {p95_change:+.2f}%"
)
# Place text box on the plot
props = dict(boxstyle='round', facecolor='wheat', alpha=0.5)
ax.text(0.95, 0.95, stats_text, transform=ax.transAxes, fontsize=10,
verticalalignment='top', horizontalalignment='right', bbox=props)
plt.tight_layout()
plt.savefig(output_path)
print(f"Report generated at {output_path}")
# The crucial step: return a failure code if regression threshold is met
# In a real-world scenario, this threshold would be more nuanced.
REGRESSION_THRESHOLD_PERCENT = 10.0
if p95_change > REGRESSION_THRESHOLD_PERCENT:
print(f"Error: p95 latency regression of {p95_change:.2f}% detected, exceeding threshold of {REGRESSION_THRESHOLD_PERCENT}%.")
return 1
return 0
def main():
parser = argparse.ArgumentParser(description="Generate performance comparison report.")
parser.add_argument("--current", required=True, help="Path to current run's k6 JSON output.")
parser.add_argument("--baseline", required=True, help="Path to baseline run's k6 JSON output.")
parser.add_argument("--output", required=True, help="Path to save the output PNG report.")
parser.add_argument("--endpoint", required=True, help="The specific endpoint URL path to analyze.")
args = parser.parse_args()
current_data = load_k6_results(args.current, args.endpoint)
baseline_data = load_k6_results(args.baseline, args.endpoint)
if current_data.empty or baseline_data.empty:
print("Error: Cannot generate report due to missing data in one or both result files.")
        sys.exit(1)
exit_code = create_comparison_plot(current_data, baseline_data, args.output, args.endpoint)
    sys.exit(exit_code)
if __name__ == "__main__":
main()
When this pipeline runs, the final artifact stored in CircleCI is a PNG file. This image clearly shows the two latency distributions side-by-side. A developer can immediately see if their change has tightened the distribution, had no effect, or, more critically, introduced a long tail of slow requests. This visual feedback loop, automated and integrated directly into the development workflow, is infinitely more powerful than a simple pass/fail number. It transforms performance from an abstract metric into a concrete, visual artifact that fosters a culture of performance awareness.
This system is not without its limitations. The primary challenge is the “noisy neighbor” problem in the testing environment; resource contention can introduce variability and cause flaky test results. A more robust solution would involve running multiple baseline tests and using a moving average of their results to establish a more stable performance benchmark. Furthermore, the pass/fail logic is currently a simple percentage threshold. A more statistically rigorous approach would employ hypothesis testing, such as a Mann-Whitney U test, to determine if the difference between the two distributions is significant, reducing the rate of false positives from random noise. Finally, while Docker Swarm served well as a starting point, scaling this system to support dozens of teams might necessitate a move to Kubernetes for its superior multi-tenancy, resource management, and extensibility via the operator pattern.
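As a sketch of that last idea, the fixed-percentage check in generate_report.py could be replaced by a helper along these lines, assuming SciPy were added to report-generator/requirements.txt (the function name and thresholds are illustrative, not part of the pipeline above):

import numpy as np
from scipy.stats import mannwhitneyu  # assumes scipy is added to requirements.txt

def is_significant_regression(current_ms, baseline_ms, alpha=0.01, min_p95_shift_pct=5.0):
    """Hypothetical replacement for the fixed-threshold check: flag a regression only
    if the PR's latencies are stochastically greater than the baseline's (one-sided
    Mann-Whitney U test) AND the p95 shift is large enough to matter in practice."""
    _, p_value = mannwhitneyu(current_ms, baseline_ms, alternative='greater')
    baseline_p95 = np.percentile(baseline_ms, 95)
    p95_shift_pct = (np.percentile(current_ms, 95) - baseline_p95) / baseline_p95 * 100
    return p_value < alpha and p95_shift_pct > min_p95_shift_pct

Pairing the significance test with a minimum practical effect size keeps large samples from failing builds over differences that are statistically detectable but operationally irrelevant.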