The requirement was for zero-dependency, sub-second visibility into the performance of a new service stack. Standard observability solutions like Prometheus and Grafana felt disproportionately heavy for an initial deployment, introducing operational overhead we weren’t ready to absorb. The goal was to create a closed-loop monitoring system where the service stack reports on its own health, presenting the data on an extremely lightweight, server-rendered dashboard. The entire feedback loop, from request hitting the edge to its metrics appearing on the dashboard, needed to be near-instantaneous.
This led to a design where our web server, Caddy, and our application server, Sanic, would stream their own performance metrics into a time-series database, InfluxDB. The same Sanic application would then query this data and use Server-Side Rendering (SSR) to build a real-time status page. This avoids the complexity of a client-side rendering framework and ensures the dashboard itself has a minimal performance footprint.
The architecture establishes a direct data flow:
graph TD
    subgraph "User Interaction"
        A[User Request] --> B{Caddy};
        E[Dashboard View] --> B;
    end

    subgraph "Application Stack"
        B -- Reverse Proxy --> C[Sanic App];
        C -- HTTP Response --> B;
        B -- HTML Response --> E;
    end

    subgraph "Telemetry Pipeline"
        B -- Structured JSON Logs --> D[Sanic Background Worker];
        C -- Middleware Metrics --> F[InfluxDB Writer];
        D -- Parsed Metrics --> F;
        F -- Batch Writes --> G[(InfluxDB)];
    end

    subgraph "Dashboard Rendering"
        C -- Dashboard Route --> H[InfluxDB Querier];
        H -- Flux Query --> G;
        G -- Time-Series Data --> H;
        H -- Render Context --> I[Jinja2 SSR Engine];
        I -- Generated HTML --> C;
    end

    style F fill:#f9f,stroke:#333,stroke-width:2px
    style D fill:#f9f,stroke:#333,stroke-width:2px
This entire system is designed to run within a single docker-compose setup, making it portable and self-contained. The initial pain point of observability overhead is solved by integrating the monitoring capability directly into the application’s runtime.
Phase 1: Foundational Infrastructure with Docker Compose
In a real-world project, cohesive container orchestration is non-negotiable. We’ll define our three core services: caddy, app (Sanic), and influxdb. The key is ensuring they share a network and that volumes are correctly configured for Caddy’s configuration and InfluxDB’s data persistence.
docker-compose.yml:
version: '3.8'

services:
  influxdb:
    image: influxdb:2.7
    container_name: telemetry_influxdb
    volumes:
      - influxdb_data:/var/lib/influxdb2
    ports:
      - "8086:8086"
    environment:
      - DOCKER_INFLUXDB_INIT_MODE=setup
      - DOCKER_INFLUXDB_INIT_USERNAME=admin
      - DOCKER_INFLUXDB_INIT_PASSWORD=password123
      - DOCKER_INFLUXDB_INIT_ORG=my-org
      - DOCKER_INFLUXDB_INIT_BUCKET=telemetry
      - DOCKER_INFLUXDB_INIT_ADMIN_TOKEN=my-super-secret-token
    networks:
      - monitor_net

  app:
    build: .
    container_name: telemetry_app
    ports:
      - "8000:8000"
    depends_on:
      - influxdb
    environment:
      - INFLUXDB_URL=http://influxdb:8086
      - INFLUXDB_TOKEN=my-super-secret-token
      - INFLUXDB_ORG=my-org
      - INFLUXDB_BUCKET=telemetry
      - SANIC_APP_HOST=0.0.0.0
      - SANIC_APP_PORT=8000
    volumes:
      - ./app:/app
    networks:
      - monitor_net

  caddy:
    image: caddy:2.7-alpine
    container_name: telemetry_caddy
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile
      - caddy_data:/data
      - caddy_config:/config
    depends_on:
      - app
    networks:
      - monitor_net

networks:
  monitor_net:
    driver: bridge

volumes:
  influxdb_data:
  caddy_data:
  caddy_config:
This configuration initializes InfluxDB with a default organization, bucket, and user token, which the Sanic application will use to connect. Caddy is configured via a mounted Caddyfile.
Phase 2: Configuring Caddy for Structured Logging
The default Caddy access logs are human-readable, but for automated processing, we need structured data. Caddy’s support for JSON logging is a critical feature for this architecture. We configure it to proxy requests to our Sanic application and, more importantly, to format its logs as JSON, which will be the input for our telemetry pipeline.
Caddyfile:
# Global options block
{
    # Configure the default logger to emit structured JSON on stdout.
    # This is the lifeblood of our Caddy-side telemetry.
    log {
        output stdout
        format json {
            # Emit timestamps as RFC 3339 strings with millisecond precision.
            time_format "2006-01-02T15:04:05.000Z07:00"
        }
        level INFO
    }
}

# Define the primary site.
# Using a placeholder for production domains. Caddy handles HTTPS automatically.
# For local dev, it issues a certificate from its own internal CA.
localhost {
    # Access logs are off by default; the log directive enables them for this site.
    log

    # Reverse proxy all requests to the Sanic application container.
    # The 'app' hostname is resolved by Docker's internal DNS.
    reverse_proxy app:8000
}
With output stdout and format json, Caddy will write detailed, machine-readable logs for every request to its standard output. Docker captures that output as the container’s log stream, which our Sanic background worker reads. The log entries contain precise timing, status codes, and URI information.
Phase 3: Building the Asynchronous Sanic Application
The core of our system is the Sanic application. It serves three purposes: handling regular API requests, processing telemetry data in the background, and rendering the SSR dashboard.
First, the project structure:
.
├── Caddyfile
├── Dockerfile
├── docker-compose.yml
├── package.json
├── requirements.txt
└── app
    ├── __init__.py
    ├── core
    │   ├── __init__.py
    │   ├── influx_client.py
    │   └── log_processor.py
    ├── server.py
    ├── static
    │   └── css
    │       └── dashboard.css
    ├── styles
    │   ├── _base.scss
    │   ├── _variables.scss
    │   └── dashboard.scss
    └── templates
        └── dashboard.html
The Dockerfile for the Sanic app must install Python dependencies and also Node.js/Sass for our styling pipeline.
Dockerfile:
# Stage 1: Build CSS from SCSS
FROM node:18-alpine AS builder
WORKDIR /build
# Keep the repository layout so the paths in the build-css script resolve.
COPY app/styles/ ./app/styles/
COPY package.json .
RUN npm install
RUN npm run build-css

# Stage 2: Build the final Python application
FROM python:3.11-slim
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY ./app .
# Copy compiled CSS from the builder stage
COPY --from=builder /build/app/static/css/dashboard.css ./static/css/dashboard.css
ENV PYTHONUNBUFFERED=1
ENV SANIC_APP_HOST=0.0.0.0
ENV SANIC_APP_PORT=8000
# /app is itself the Python package (it contains __init__.py), so expose its
# parent directory on the import path and load the app as app.server.
ENV PYTHONPATH=/
CMD ["sanic", "app.server:app", "--host", "0.0.0.0", "--port", "8000", "--workers=1"]
requirements.txt:
sanic==23.6.0
sanic-ext==23.6.0
influxdb-client==1.37.0
jinja2==3.1.2
aiofiles==23.1.0
python-dotenv==1.0.0
Phase 4: Robust InfluxDB Integration
Connecting to a database in an async application requires careful management of the connection lifecycle. A common mistake is creating a new client for every request. Instead, we initialize the client when the Sanic application starts and close it gracefully on shutdown.
app/core/influx_client.py:
import os
from logging import getLogger

from influxdb_client import Point
from influxdb_client.client.influxdb_client_async import InfluxDBClientAsync
from sanic import Sanic

logger = getLogger(__name__)


class InfluxDB:
    """A wrapper for managing the InfluxDB client lifecycle within a Sanic app."""

    def __init__(self):
        self._client: InfluxDBClientAsync | None = None
        self._write_api = None
        self._query_api = None

    async def connect(self, app: Sanic):
        """Initialize the async client and APIs."""
        url = os.environ["INFLUXDB_URL"]
        token = os.environ["INFLUXDB_TOKEN"]
        org = os.environ["INFLUXDB_ORG"]
        logger.info(f"Connecting to InfluxDB at {url}")
        self._client = InfluxDBClientAsync(url=url, token=token, org=org)
        self._write_api = self._client.write_api()
        self._query_api = self._client.query_api()
        app.ctx.influx_bucket = os.environ["INFLUXDB_BUCKET"]
        logger.info("InfluxDB connection established.")

    async def disconnect(self):
        """Gracefully close the client."""
        if self._client:
            await self._client.close()
            logger.info("InfluxDB connection closed.")

    async def write_point(self, point: Point):
        """Write a single data point. In a high-throughput system, batching is better."""
        if not self._write_api:
            logger.error("InfluxDB write API not available.")
            return
        try:
            await self._write_api.write(bucket=os.environ["INFLUXDB_BUCKET"], record=point)
        except Exception as e:
            # In production, this needs a retry mechanism or DLQ.
            logger.error(f"Failed to write point to InfluxDB: {e}")

    async def query(self, query: str):
        """Execute a Flux query."""
        if not self._query_api:
            logger.error("InfluxDB query API not available.")
            return None
        try:
            return await self._query_api.query(query=query)
        except Exception as e:
            logger.error(f"Failed to query InfluxDB: {e}")
            return None


# Singleton instance
influx_db = InfluxDB()
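The docstring on write_point notes that batching is preferable under load. A minimal sketch of that idea follows, assuming a hypothetical PointBatcher helper that is not part of the original code: points accumulate in memory and are handed to a flush callback once batch_size is reached. The callback stands in for a single write call carrying a list of records; the names and the threshold are illustrative only.

```python
import asyncio
from typing import Any, Awaitable, Callable


class PointBatcher:
    """Hypothetical helper: buffers points and flushes them in groups."""

    def __init__(self, flush: Callable[[list[Any]], Awaitable[None]], batch_size: int = 100):
        self._flush = flush
        self._batch_size = batch_size
        self._buffer: list[Any] = []

    async def add(self, point: Any) -> None:
        """Queue one point; flush when the buffer reaches batch_size."""
        self._buffer.append(point)
        if len(self._buffer) >= self._batch_size:
            await self.flush()

    async def flush(self) -> None:
        """Send whatever is buffered as one batch, then clear the buffer."""
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        await self._flush(batch)


async def demo() -> list[list[int]]:
    sent: list[list[int]] = []

    async def fake_write(batch: list[int]) -> None:
        # Stand-in for one database write carrying many records.
        sent.append(batch)

    batcher = PointBatcher(fake_write, batch_size=3)
    for i in range(7):
        await batcher.add(i)
    await batcher.flush()  # drain the remainder, e.g. on shutdown
    return sent


batches = asyncio.run(demo())
print(batches)
```

A real implementation would probably also flush on a timer, so that a quiet period does not leave points stranded in the buffer.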
We integrate this into the main application file, server.py, using Sanic’s lifecycle listeners.
app/server.py:
from sanic import Sanic, Request, response
from sanic.log import logger
from sanic_ext import render
from influxdb_client import Point
import time

from .core.influx_client import influx_db
from .core.log_processor import start_log_processing

# --- Application Setup ---
app = Sanic("TelemetryApp")
app.config.TEMPLATING_ENABLE_ASYNC = True


# --- Lifecycle Hooks ---
# These run in the worker process that serves requests, which is where the
# InfluxDB client and the background task need to live.
@app.before_server_start
async def start(app: Sanic, _):
    await influx_db.connect(app)
    # Start the background task for processing Caddy logs
    app.add_task(start_log_processing(app))


@app.after_server_stop
async def stop(app: Sanic, _):
    await influx_db.disconnect()


# --- Middleware for Sanic Metrics ---
@app.middleware("request")
async def measure_request_time(request: Request):
    request.ctx.start_time = time.perf_counter()


@app.middleware("response")
async def record_sanic_metric(request: Request, response):
    if hasattr(request.ctx, "start_time"):
        duration_ms = (time.perf_counter() - request.ctx.start_time) * 1000
        point = (
            Point("request_metrics")
            .tag("source", "sanic")
            .tag("endpoint", request.name or "unknown")
            .tag("method", request.method)
            .tag("status_code", response.status)
            .field("duration_ms", duration_ms)
        )
        # Using add_task to avoid blocking the response path
        app.add_task(influx_db.write_point(point))


# --- API Routes ---
@app.get("/api/ping")
async def ping(request: Request):
    return response.json({"message": "pong"})


# --- SSR Dashboard Route ---
@app.get("/")
@app.ext.template("dashboard.html")
async def dashboard(request: Request):
    # This will be filled in Phase 6
    return {"data": "Placeholder"}
The middleware demonstrates a key pattern: it calculates the request duration and then uses app.add_task to send the metric to InfluxDB. This fires off the write operation without making the client wait for the database write to complete, which is crucial for maintaining low latency.
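The effect of scheduling the write instead of awaiting it can be seen in a small standalone sketch. It uses plain asyncio.create_task as a stand-in for app.add_task, and slow_write is a hypothetical coroutine that simulates a database write; neither the names nor the delay come from the application.

```python
import asyncio

events: list[str] = []
background_tasks: set[asyncio.Task] = set()


async def slow_write(metric: str) -> None:
    """Hypothetical stand-in for a database write that takes a while."""
    await asyncio.sleep(0.05)
    events.append(f"written:{metric}")


async def handle_request() -> str:
    # Schedule the write and return immediately; the caller does not wait for it.
    task = asyncio.create_task(slow_write("duration_ms=12.5"))
    background_tasks.add(task)  # keep a reference so the task is not garbage collected
    task.add_done_callback(background_tasks.discard)
    events.append("response_sent")
    return "pong"


async def main() -> str:
    body = await handle_request()
    # Let outstanding writes finish before the event loop shuts down.
    await asyncio.gather(*background_tasks)
    return body


result = asyncio.run(main())
print(events)
```

The response is recorded before the write completes, which is the ordering the middleware relies on.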
Phase 5: Processing Caddy Logs Asynchronously
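This is where the system becomes self-aware. We need a process to consume the structured JSON logs from Caddy. A simple but effective way to handle this within our containerized setup is to read Docker’s log stream for the Caddy container. We’ll use asyncio.create_subprocess_exec to run docker logs -f and process the stream line by line. This only works if the app container has the docker CLI installed and can reach the Docker daemon, for example through a mounted /var/run/docker.sock; the Dockerfile and Compose file shown earlier do not set that up.
app/core/log_processor.py: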
import asyncio
import json
from logging import getLogger

from sanic import Sanic
from influxdb_client import Point

from .influx_client import influx_db

logger = getLogger(__name__)

# The container name must match what's in docker-compose.yml
CADDY_CONTAINER_NAME = "telemetry_caddy"


async def process_log_stream(stream):
    """Reads from a stream and processes each line as a JSON log entry."""
    while True:
        line = await stream.readline()
        if not line:
            # An empty read means EOF: the docker logs process has exited,
            # so return and let the caller restart it.
            break
        try:
            log_entry = json.loads(line.decode('utf-8'))
            # A pitfall here is assuming fields always exist.
            # Real-world code needs more robust validation.
            if log_entry.get("logger") == "http.log.access":
                req = log_entry.get("request", {})
                point = (
                    Point("request_metrics")
                    .tag("source", "caddy")
                    .tag("host", req.get("host"))
                    .tag("method", req.get("method"))
                    .tag("status_code", log_entry.get("status"))
                    .field("duration_ms", log_entry.get("duration", 0) * 1000)
                    .field("uri", req.get("uri"))
                    .field("size", log_entry.get("size", 0))
                    .time(log_entry.get("ts"))  # Use Caddy's timestamp
                )
                await influx_db.write_point(point)
        except (json.JSONDecodeError, KeyError) as e:
            # Malformed lines are skipped, with a warning for debugging.
            logger.warning(f"Could not parse Caddy log line: {line.strip()}, error: {e}")
        except Exception as e:
            logger.error(f"Unexpected error processing log line: {e}")


async def start_log_processing(app: Sanic):
    """
    Starts a subprocess to tail Caddy's logs and processes them.
    This is a fragile approach for production but demonstrates the concept.
    A dedicated log shipper (Vector, Fluentd) is the production-grade solution.
    """
    logger.info("Starting Caddy log processor...")
    while True:
        try:
            proc = await asyncio.create_subprocess_exec(
                "docker", "logs", "-f", "--since", "1s", CADDY_CONTAINER_NAME,
                stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.PIPE
            )
            logger.info(f"Connected to Docker logs stream for '{CADDY_CONTAINER_NAME}'")
            await process_log_stream(proc.stdout)
            # If the process exits, wait and retry
            await proc.wait()
            stderr_output = await proc.stderr.read()
            if stderr_output:
                logger.error(f"Log processor subprocess error: {stderr_output.decode().strip()}")
        except FileNotFoundError:
            logger.error("`docker` command not found. Is Docker installed and in the PATH?")
            break  # Stop trying if docker command doesn't exist
        except Exception as e:
            logger.error(f"Log processor crashed: {e}. Restarting in 10 seconds.")
        await asyncio.sleep(10)
This background task is resilient; if the docker logs command fails or the container isn’t ready, it will wait and retry. In a production scenario, you would replace this with a more robust log shipping agent that writes directly to a message queue or an HTTP endpoint on the Sanic app.
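The comment in process_log_stream warns against assuming fields always exist. One way to tighten that, sketched here with a hypothetical parse_access_log helper, is to validate each line before building a point. The sample line is hand-written for illustration and contains only the fields the processor above reads; real Caddy entries carry more.

```python
import json
from typing import Optional


def parse_access_log(line: str) -> Optional[dict]:
    """Return the metric fields from one JSON log line, or None if unusable."""
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(entry, dict) or entry.get("logger") != "http.log.access":
        return None
    request = entry.get("request")
    duration = entry.get("duration")
    status = entry.get("status")
    # Reject entries that lack the fields the dashboard depends on.
    if not isinstance(request, dict) or not isinstance(duration, (int, float)):
        return None
    if not isinstance(status, int):
        return None
    return {
        "host": str(request.get("host", "unknown")),
        "method": str(request.get("method", "unknown")),
        "uri": str(request.get("uri", "")),
        "status_code": str(status),
        "duration_ms": float(duration) * 1000.0,
        "size": int(entry.get("size", 0)),
    }


# Hand-written sample limited to the fields used above.
sample = json.dumps({
    "logger": "http.log.access",
    "request": {"host": "localhost", "method": "GET", "uri": "/api/ping"},
    "duration": 0.25,
    "size": 18,
    "status": 200,
})

parsed = parse_access_log(sample)
print(parsed)
print(parse_access_log("not json"))
print(parse_access_log('{"logger": "tls"}'))
```

Returning None for anything unexpected lets the caller skip the line without a broad exception handler.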
Phase 6: Server-Side Rendering the Dashboard with Jinja2 and Flux
The final piece is the dashboard itself. The Sanic route will execute several queries against InfluxDB using the Flux language. These queries aggregate the raw metrics into meaningful statistics. The results are then passed to a Jinja2 template for rendering.
Updated app/server.py dashboard route:
import asyncio

# ... (other imports and routes) ...


@app.get("/")
@app.ext.template("dashboard.html")
async def dashboard(request: Request):
    """
    Queries InfluxDB for metrics and renders them server-side.
    """
    bucket = app.ctx.influx_bucket
    time_range = 'start: -1h'  # Query data from the last hour

    # A common mistake is to run queries serially. asyncio.gather runs them concurrently.
    # Every query filters on duration_ms so that each request is counted once
    # and string and numeric field values are never grouped together.
    query_tasks = {
        "request_counts": f'''
            from(bucket: "{bucket}")
              |> range({time_range})
              |> filter(fn: (r) => r._measurement == "request_metrics" and r._field == "duration_ms")
              |> group(columns: ["source"])
              |> count()
              |> group()
        ''',
        "avg_latency": f'''
            from(bucket: "{bucket}")
              |> range({time_range})
              |> filter(fn: (r) => r._measurement == "request_metrics" and r._field == "duration_ms")
              |> group(columns: ["source"])
              |> mean()
              |> group()
        ''',
        "status_codes": f'''
            from(bucket: "{bucket}")
              |> range({time_range})
              |> filter(fn: (r) => r._measurement == "request_metrics" and r._field == "duration_ms")
              |> group(columns: ["status_code"])
              |> count()
              |> group()
        ''',
        "latency_over_time": f'''
            from(bucket: "{bucket}")
              |> range({time_range})
              |> filter(fn: (r) => r._measurement == "request_metrics" and r._field == "duration_ms")
              |> aggregateWindow(every: 1m, fn: mean, createEmpty: false)
              |> yield(name: "mean_latency")
        '''
    }

    results = await asyncio.gather(
        *(influx_db.query(q) for q in query_tasks.values())
    )

    # Process raw query results into a clean dictionary for the template
    data_context = {}
    raw_results = dict(zip(query_tasks.keys(), results))

    # Helper function to parse Flux results
    def parse_flux_result(tables, key_col, val_col="_value"):
        data = {}
        if tables:
            for table in tables:
                for record in table.records:
                    data[record.values.get(key_col)] = record.get_value()
        return data

    data_context["request_counts"] = parse_flux_result(raw_results.get("request_counts"), "source")
    data_context["avg_latency"] = parse_flux_result(raw_results.get("avg_latency"), "source")
    data_context["status_codes"] = parse_flux_result(raw_results.get("status_codes"), "status_code")

    # Process time-series data for the chart
    latency_data = []
    if raw_results.get("latency_over_time"):
        for table in raw_results["latency_over_time"]:
            for record in table.records:
                latency_data.append({
                    "time": record.get_time().isoformat(),
                    "value": f"{record.get_value():.2f}"
                })
    data_context["latency_over_time"] = latency_data

    return {"data": data_context}
The queries are run concurrently using asyncio.gather for maximum efficiency. The raw Flux results, which are structured as a list of tables, are parsed into a simple dictionary that the template can easily consume.
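The pattern of pairing query names with concurrently gathered results can be tried in isolation. In this sketch fake_query is a hypothetical stand-in for influx_db.query that just sleeps and echoes its input. The point is that asyncio.gather returns results in the order the awaitables were passed, so zipping them back onto the dictionary keys is safe even when the queries finish in a different order.

```python
import asyncio
import time


async def fake_query(flux: str, delay: float) -> str:
    """Hypothetical stand-in for a database query."""
    await asyncio.sleep(delay)
    return f"result of {flux}"


async def run_all() -> tuple[dict, float]:
    queries = {
        "request_counts": ("count()", 0.2),
        "avg_latency": ("mean()", 0.1),
        "status_codes": ("group()", 0.2),
    }
    started = time.perf_counter()
    # All three coroutines wait at the same time instead of one after another.
    results = await asyncio.gather(
        *(fake_query(flux, delay) for flux, delay in queries.values())
    )
    elapsed = time.perf_counter() - started
    return dict(zip(queries.keys(), results)), elapsed


named_results, elapsed = asyncio.run(run_all())
print(named_results)
print(f"elapsed: {elapsed:.2f}s")
```

Run serially, the three simulated queries would take about half a second; gathered, the total is close to the slowest single query.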
The Jinja2 template then renders this data.
app/templates/dashboard.html:
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta http-equiv="refresh" content="5">
    <title>Live Telemetry Dashboard</title>
    <link rel="stylesheet" href="{{ url_for('static', name='css/dashboard.css') }}">
</head>
<body>
    <div class="container">
        <header>
            <h1>Live System Telemetry (Last 1 Hour)</h1>
        </header>
        <main>
            <section class="metrics-grid">
                <div class="card">
                    <h2>Requests (Caddy)</h2>
                    <p class="metric">{{ data.request_counts.get('caddy', 0) }}</p>
                </div>
                <div class="card">
                    <h2>Requests (Sanic)</h2>
                    <p class="metric">{{ data.request_counts.get('sanic', 0) }}</p>
                </div>
                <div class="card">
                    <h2>Avg Latency (Caddy)</h2>
                    <p class="metric">{{ "%.2f"|format(data.avg_latency.get('caddy', 0)) }} ms</p>
                </div>
                <div class="card">
                    <h2>Avg Latency (Sanic)</h2>
                    <p class="metric">{{ "%.2f"|format(data.avg_latency.get('sanic', 0)) }} ms</p>
                </div>
            </section>
            <section class="details-grid">
                <div class="card">
                    <h3>Status Code Distribution</h3>
                    <table>
                        <thead>
                            <tr><th>Status</th><th>Count</th></tr>
                        </thead>
                        <tbody>
                            {% for code, count in data.status_codes.items()|sort %}
                            <tr><td>{{ code }}</td><td>{{ count }}</td></tr>
                            {% else %}
                            <tr><td colspan="2">No data</td></tr>
                            {% endfor %}
                        </tbody>
                    </table>
                </div>
                <div class="card">
                    <h3>Latency Over Time (ms)</h3>
                    <div class="table-scroll">
                        <table>
                            <thead>
                                <tr><th>Timestamp</th><th>Avg Latency</th></tr>
                            </thead>
                            <tbody>
                                {% for point in data.latency_over_time|reverse %}
                                <tr><td>{{ point.time }}</td><td>{{ point.value }}</td></tr>
                                {% else %}
                                <tr><td colspan="2">No data</td></tr>
                                {% endfor %}
                            </tbody>
                        </table>
                    </div>
                </div>
            </section>
        </main>
    </div>
</body>
</html>
A simple <meta http-equiv="refresh" content="5"> tag is used for auto-refreshing, staying true to the minimal/zero JavaScript philosophy of this design.
Phase 7: Maintainable Styling with SCSS
Plain CSS for a dashboard can quickly become a mess. Using SCSS allows for variables, nesting, and mixins, which drastically improves maintainability.
package.json:
{
  "name": "telemetry-dashboard-styles",
  "version": "1.0.0",
  "scripts": {
    "build-css": "sass app/styles/dashboard.scss app/static/css/dashboard.css --style=compressed"
  },
  "devDependencies": {
    "sass": "^1.68.0"
  }
}
app/styles/_variables.scss:
$primary-bg: #1a1a2e;
$secondary-bg: #16213e;
$card-bg: #0f3460;
$text-color: #e94560;
$text-light: #dcdcdc;
$border-color: #e94560;
$font-family: 'Consolas', 'Menlo', monospace;
app/styles/dashboard.scss:
@import 'variables';
@import 'base';

.container {
  width: 90%;
  max-width: 1200px;
  margin: 2rem auto;
}

header h1 {
  text-align: center;
  margin-bottom: 2rem;
  color: $text-color;
  font-size: 2rem;
}

.metrics-grid, .details-grid {
  display: grid;
  gap: 1.5rem;
  margin-bottom: 2rem;
}

.metrics-grid {
  grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
}

.details-grid {
  grid-template-columns: 1fr 2fr;
}

.card {
  background-color: $card-bg;
  padding: 1.5rem;
  border-radius: 8px;
  border: 1px solid $border-color;
  color: $text-light;

  h2, h3 {
    margin-top: 0;
    margin-bottom: 1rem;
    color: $text-color;
    border-bottom: 1px solid darken($border-color, 10%);
    padding-bottom: 0.5rem;
  }

  .metric {
    font-size: 2.5rem;
    font-weight: bold;
    text-align: center;
    margin: 0;
    color: white;
  }
}

table {
  width: 100%;
  border-collapse: collapse;

  th, td {
    padding: 0.75rem;
    text-align: left;
    border-bottom: 1px solid $secondary-bg;
  }

  th {
    color: $text-color;
  }
}

.table-scroll {
  max-height: 400px;
  overflow-y: auto;
}
Running npm run build-css (as done in our Dockerfile) compiles these structured SCSS files into a single, minified CSS file that is served statically by Sanic.
This implementation achieves the initial goal: a self-contained, high-performance monitoring system. Caddy and Sanic metrics are captured and stored with minimal overhead, and the SSR dashboard provides immediate insight without the weight of a traditional observability stack.
The primary limitation of this design is the log processing mechanism. Tailing Docker logs via a subprocess is functional for a demonstration but lacks the robustness for a production system. It’s susceptible to failures if the Docker daemon is unresponsive or if the log format changes unexpectedly. A production-grade architecture would replace this component with a dedicated log shipper like Vector, configured to parse the Caddy JSON logs and forward them to a dedicated /ingest endpoint on the Sanic application. Furthermore, the dashboard is read-only; adding features like dynamic time-range selection would require introducing client-side state management, deviating from the initial SSR-purity principle. The error handling for database writes is also simplistic, lacking a retry strategy or a dead-letter queue, which would be essential for guaranteeing data integrity in a less-than-perfect network environment.
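As one illustration of the missing retry strategy, the sketch below wraps an async write in exponential backoff. The write_with_retry helper, the attempt count, and the delays are all assumptions made for the example, and flaky_write merely simulates a database that fails twice before succeeding.

```python
import asyncio
from typing import Awaitable, Callable


async def write_with_retry(
    write: Callable[[], Awaitable[None]],
    attempts: int = 4,
    base_delay: float = 0.01,
) -> bool:
    """Try the write up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            await write()
            return True
        except Exception:
            if attempt == attempts - 1:
                break
            await asyncio.sleep(base_delay * (2 ** attempt))
    # Out of attempts: the caller decides whether to drop or park the point.
    return False


calls = {"count": 0}


async def flaky_write() -> None:
    """Simulated write that fails on the first two calls."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("database unavailable")


async def always_failing() -> None:
    """Simulated write that never succeeds."""
    raise ConnectionError("database unavailable")


succeeded = asyncio.run(write_with_retry(flaky_write))
failed_result = asyncio.run(write_with_retry(always_failing, attempts=2))
print(succeeded, calls["count"], failed_result)
```

A point that still fails after the last attempt could be appended to a local file or queue, which is the role a dead-letter queue would play.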