The initial deployment of our new passwordless authentication service, built on WebAuthn, across the Docker Swarm cluster felt like a success. The service was stateless, scaled beautifully, and user feedback was positive. The operational tranquility was shattered during a routine security audit. The request was simple: “Provide a comprehensive log of all WebAuthn registration and authentication failures, correlated by user ID and source IP, from all cluster nodes over the last 90 days.” We couldn’t. Our logging was a chaotic stream of unstructured text sent to stdout, aggregated by a basic log collector. It was impossible to perform the kind of structured, forensic analysis required. The ephemeral nature of containers meant logs from failed tasks were often lost forever. This wasn’t just an operational inconvenience; it was a critical security and compliance failure.
Our immediate goal was to engineer a robust, auditable pipeline for security-sensitive events. The architecture had to guarantee message delivery, enforce a structured format, and deposit the data into a system capable of complex, relational queries. After evaluating the components we already had in production, the proposed data flow became clear: Application services running on Docker Swarm would generate structured JSON events for every WebAuthn operation. The Docker fluentd logging driver would capture these events and forward them to a per-node Fluentd agent. These agents would buffer, filter, and forward the events to a central Fluentd aggregator, which would then be responsible for batch-inserting the structured data into a dedicated table in our existing SQL Server instance.
graph TD
    subgraph "Docker Swarm Node 1"
        A1[App Container 1] -- stdout/stderr --> D1[Docker Daemon];
        D1 -- fluentd driver --> F_Agent1[Fluentd Agent Service];
    end
    subgraph "Docker Swarm Node 2"
        A2[App Container 2] -- stdout/stderr --> D2[Docker Daemon];
        D2 -- fluentd driver --> F_Agent2[Fluentd Agent Service];
    end
    subgraph "Docker Swarm Node N"
        AN[App Container N] -- stdout/stderr --> DN[Docker Daemon];
        DN -- fluentd driver --> F_AgentN[Fluentd Agent Service];
    end
    F_Agent1 -- TCP Forward --> F_Aggregator[Fluentd Aggregator Service];
    F_Agent2 -- TCP Forward --> F_Aggregator;
    F_AgentN -- TCP Forward --> F_Aggregator;
    subgraph "Central Services"
        F_Aggregator -- Batch INSERT --> SQL[SQL Server];
    end
    subgraph "Analyst"
        SecOps[Security Analyst] -- SQL Query --> SQL;
    end
This design leverages existing infrastructure (Docker Swarm, SQL Server) while introducing Fluentd as the critical transport and processing layer. The choice of SQL Server over something like Elasticsearch was deliberate. While Elasticsearch excels at full-text search, our primary requirement was strict schema enforcement and the ability to perform relational joins and aggregations for audit reports—a task for which SQL is unparalleled.
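To make that concrete, here is the kind of aggregation we wanted analysts to be able to run, sketched against the security_audit_events table defined later in Phase 5 (the five-IP threshold is purely illustrative):

-- Per-user failure summary over the last 90 days, flagging accounts probed from many addresses.
SELECT
    user_id,
    COUNT(*) AS failure_count,
    COUNT(DISTINCT client_ip) AS distinct_source_ips,
    MIN(event_timestamp) AS first_failure,
    MAX(event_timestamp) AS last_failure
FROM dbo.security_audit_events
WHERE event_type = 'WEBAUTHN_AUTH_FAILURE'
  AND event_timestamp >= DATEADD(day, -90, SYSUTCDATETIME())
GROUP BY user_id
HAVING COUNT(DISTINCT client_ip) > 5
ORDER BY failure_count DESC;

Reports in this shape, plus the ability to join against our existing account tables, were the deciding factor.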
Phase 1: Application-Level Structured Event Generation
The foundation of any reliable logging pipeline is the quality of the data at its source. Unstructured string logs are the enemy of automation and analysis. We mandated that all security-sensitive events must be logged as single-line JSON objects to stdout.
Here is a snippet from our Go-based authentication service, demonstrating how we handle a WebAuthn login assertion. We use the zerolog library for high-performance, structured logging.
package main
import (
	"net/http"
	"os"

	"github.com/go-webauthn/webauthn/webauthn"
	"github.com/rs/zerolog"
)
// In a real application, these would be managed properly.
var webAuthn *webauthn.WebAuthn
var userStore = make(map[string]webauthn.User)
var sessionStore = make(map[string]webauthn.SessionData)
// Global logger instance
var securityLogger zerolog.Logger
func init() {
// The logger is configured to write structured JSON to stdout.
// We add a static field 'log_type' to easily filter these specific logs in Fluentd.
securityLogger = zerolog.New(os.Stdout).With().
Timestamp().
Str("service", "auth-service").
Str("log_type", "security_audit").
Logger()
}
// Simplified WebAuthn login handler
func handleLogin(w http.ResponseWriter, r *http.Request) {
// ... (Code to parse request, get user from database)
var user webauthn.User // Assume this is fetched
// Assume the session data stored during BeginLogin was looked up in sessionStore.
var sessionData webauthn.SessionData
// The core WebAuthn validation logic
credential, err := webAuthn.FinishLogin(user, sessionData, r)
// This is the critical logging section.
if err != nil {
// Log the failure event with rich context.
securityLogger.Error().
Str("event_type", "WEBAUTHN_AUTH_FAILURE").
Str("user_id", string(user.WebAuthnID())). // WebAuthnID returns a []byte
Str("client_ip", r.RemoteAddr). // Note: RemoteAddr is "IP:port"
Str("user_agent", r.UserAgent()).
AnErr("error", err). // Structured error logging
Msg("WebAuthn assertion validation failed")
http.Error(w, "Authentication failed", http.StatusUnauthorized)
return
}
// Log the success event.
securityLogger.Info().
Str("event_type", "WEBAUTHN_AUTH_SUCCESS").
Str("user_id", string(user.WebAuthnID())).
Str("credential_id", string(credential.ID)).
Str("client_ip", r.RemoteAddr).
Str("user_agent", r.UserAgent()).
Uint32("sign_count", credential.Authenticator.SignCount).
Msg("WebAuthn assertion validated successfully")
// ... (Code to establish user session)
w.WriteHeader(http.StatusOK)
w.Write([]byte("Login successful"))
}
func main() {
// This setup is for demonstration. A real app would have more complex handlers.
http.HandleFunc("/login", handleLogin)
// A common mistake is to not handle server shutdown gracefully.
// This can lead to lost logs if the logger buffer hasn't flushed.
server := &http.Server{Addr: ":8080"}
// ... setup graceful shutdown logic ...
err := server.ListenAndServe()
if err != nil && err != http.ErrServerClosed {
// Use the same structured logger for application lifecycle events.
securityLogger.Fatal().Err(err).Msg("Server failed to start")
}
}
The key takeaway is the discipline enforced by this approach. Every security event produces a JSON object with a consistent schema (event_type, user_id, etc.). This preemptively solves the parsing nightmare that plagues traditional text-based logs. The inclusion of log_type: "security_audit" provides a simple, high-performance way for Fluentd to distinguish these critical events from general application debug logs.
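For reference, a single failure event from the handler above lands on stdout looking roughly like this (one line in reality; field values and ordering are illustrative, and zerolog adds the level, time, and message fields itself):

{"level":"error","service":"auth-service","log_type":"security_audit","event_type":"WEBAUTHN_AUTH_FAILURE","user_id":"u-1029","client_ip":"203.0.113.42:51204","user_agent":"Mozilla/5.0 (...)","error":"assertion signature could not be verified","time":"2024-05-14T09:21:07Z","message":"WebAuthn assertion validation failed"}

Every key in this object maps onto a column later in the pipeline.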
Phase 2: Docker Swarm and Fluentd Agent Deployment
With the application producing the correct format, the next step is to configure the infrastructure to transport it. We deploy Fluentd as a global service in Docker Swarm, ensuring exactly one agent instance runs on every worker node, matching the placement of our application services. This agent is responsible for receiving logs from all containers on its node.
Here is the relevant section of our docker-stack.yml file:
version: '3.8'

services:
  # ... other services like traefik, databases, etc.
  auth-service:
    image: my-registry/auth-service:1.2.3
    networks:
      - internal_net
    deploy:
      replicas: 3
      mode: replicated
      placement:
        constraints: [node.role == worker]
      update_config:
        parallelism: 1
        delay: 10s
      restart_policy:
        condition: on-failure
    # This is the crucial part for log forwarding.
    logging:
      driver: "fluentd"
      options:
        # We point to the local fluentd agent, listening on the default port.
        fluentd-address: 127.0.0.1:24224
        # The tag is essential for routing and metadata. We enrich it with Swarm metadata.
        tag: "swarm.audit.{{.Node.Hostname}}.{{.Service.Name}}.{{.Task.ID}}"
        # Asynchronous logging prevents the application from blocking on log writes.
        mode: "non-blocking"
        max-buffer-size: "4m"

  fluentd-agent:
    image: my-registry/fluentd-agent:latest
    networks:
      - internal_net
    volumes:
      # Mount the Docker socket to allow fluentd to access container metadata.
      - /var/run/docker.sock:/var/run/docker.sock
      # Mount a host path for file-based buffering to survive agent restarts.
      - /mnt/fluentd-buffer:/fluentd/log/buffer
    ports:
      # Publish the port in host mode so the logging driver on each node reaches
      # its local agent directly, rather than an arbitrary agent via the ingress routing mesh.
      - target: 24224
        published: 24224
        protocol: tcp
        mode: host
      - target: 24224
        published: 24224
        protocol: udp
        mode: host
    deploy:
      # Global mode ensures one agent per node.
      mode: global
      placement:
        constraints: [node.role == worker]
      restart_policy:
        condition: on-failure

networks:
  internal_net:
    driver: overlay
A common pitfall here is neglecting the fluentd-agent’s own resource management. Without a persistent volume for its buffer (/mnt/fluentd-buffer), any agent restart would cause the loss of all logs currently in its memory buffer. This is unacceptable for an audit trail. We use a file-based buffer to ensure at-least-once delivery semantics between the node and the central aggregator.
Phase 3: Fluentd Agent Configuration for Filtering and Forwarding
The configuration for the per-node fluentd-agent is focused on three tasks: receiving logs, filtering for only the relevant security events, and forwarding them reliably to the aggregator.
Here is the fluentd.conf used in the fluentd-agent image:
# /etc/fluent/fluentd.conf on the agent
# Source: Listen for logs from the Docker logging driver.
<source>
@type forward
@id input_docker
port 24224
bind 0.0.0.0
</source>
# Filter 1: We only care about logs from our auth-service.
# The tag was set to "swarm.audit.{{.Node.Hostname}}.{{.Service.Name}}.{{.Task.ID}}"
<filter swarm.audit.**>
@type grep
# The fluentd logging driver adds container_name to every record; we only keep
# records whose container name contains 'auth-service'.
# In a real-world project, you might use more specific naming or tags.
<regexp>
  key container_name
  pattern /auth-service/
</regexp>
</filter>
# Filter 2: Parse the JSON log message and enrich the record.
# This filter only applies to records that passed the previous grep.
<filter swarm.audit.**>
@type parser
key_name log
reserve_data true # Keep the original log field
<parse>
@type json
</parse>
</filter>
# Filter 3: A final check to ensure this is a security audit log.
# This is our second layer of defense, based on the 'log_type' field from the app.
<filter swarm.audit.**>
@type grep
<regexp>
key log_type
pattern /^security_audit$/
</regexp>
</filter>
# Filter 4: Add metadata from the Docker Swarm tag into the record itself.
# This makes node and service information queryable in SQL Server.
<filter swarm.audit.**>
@type record_transformer
<record>
# Use record_transformer's built-in tag_parts placeholders to split the tag.
# Tag format: swarm.audit.NODE_HOSTNAME.SERVICE_NAME.TASK_ID
node_hostname ${tag_parts[2]}
service_name ${tag_parts[3]}
task_id ${tag_parts[4]}
</record>
</filter>
# Match: Forward processed security logs to the central aggregator.
<match swarm.audit.**>
@type forward
@id output_forward
send_timeout 60s
recover_wait 10s
hard_timeout 60s
# List of aggregator nodes. Fluentd will load balance and handle failover.
<server>
host fluentd-aggregator-1.internal
port 24225
</server>
<server>
host fluentd-aggregator-2.internal
port 24225
</server>
# This buffer is the most critical part for ensuring data is not lost.
<buffer>
@type file
path /fluentd/log/buffer/security.audit
flush_mode interval
flush_interval 5s
retry_type exponential_backoff
retry_wait 1s
retry_max_interval 60s
retry_timeout 12h
chunk_limit_size 8M
total_limit_size 1G # Max disk space for buffer
overflow_action block # Block the input if buffer is full, creating backpressure.
</buffer>
</match>
# All other logs that don't match are discarded or sent elsewhere.
<match **>
@type null
</match>
The <buffer> configuration is paramount. overflow_action block creates backpressure, preventing the agent from dropping messages if the aggregator is down for an extended period. That backpressure propagates to the Docker logging driver; because the driver runs in non-blocking mode, its 4m ring buffer becomes the last line of defense, which, while not ideal, is preferable to the agent silently discarding security data. The retry_type exponential_backoff prevents a thundering herd problem when the aggregator comes back online.
Phase 4: Fluentd Aggregator and SQL Server Integration
The aggregator’s role is simpler: receive logs from all agents and write them to SQL Server. The complexity here lies in the interaction with the database. We use the fluent-plugin-sql plugin.
First, the Dockerfile for our aggregator image must install the necessary components:
FROM fluent/fluentd:v1.16-debian-1
# Install build tools and FreeTDS; the tiny_tds gem used by the SQL Server
# adapter compiles against FreeTDS. This is for a Debian-based image. Adjust for Alpine.
USER root
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    build-essential freetds-dev && \
    # Clean up
    apt-get clean && rm -rf /var/lib/apt/lists/*
# Install the Fluentd plugin and the ActiveRecord SQL Server adapter it relies on.
RUN gem install fluent-plugin-sql tiny_tds activerecord-sqlserver-adapter
USER fluent
Next, the fluentd.conf for the aggregator:
# /etc/fluent/fluentd.conf on the aggregator
# Source: Listen for logs from the agents.
<source>
@type forward
@id input_agents
port 24225
bind 0.0.0.0
</source>
# Match: The core logic to write to SQL Server.
<match swarm.audit.**>
@type sql
@id output_sqlserver
# Connection details are loaded from environment variables for security.
host "#{ENV['SQL_SERVER_HOST']}"
port "#{ENV['SQL_SERVER_PORT']}"
database "#{ENV['SQL_SERVER_DATABASE']}"
adapter sqlserver
username "#{ENV['SQL_SERVER_USER']}"
password "#{ENV['SQL_SERVER_PASSWORD']}"
# The target table for our audit events, and the mapping from record keys to
# database columns. This provides a strong contract between the log format and
# the DB schema. fluent-plugin-sql expects these inside a <table> section.
<table>
  table security_audit_events
  column_mapping 'timestamp:event_timestamp,event_type:event_type,user_id:user_id,client_ip:client_ip,user_agent:user_agent_raw,error:error_details,credential_id:credential_id,sign_count:sign_count,node_hostname:swarm_node,service_name:swarm_service,task_id:swarm_task_id'
</table>
# Buffer configuration is just as critical here as on the agent.
# If the database is down, we must not lose the data.
<buffer>
@type file
path /fluentd/log/buffer/sql.output
flush_interval 10s
chunk_limit_size 16M # Larger chunks for better DB insert performance
queue_limit_length 1024 # Max number of chunks in queue
retry_max_times 15
retry_type exponential_backoff
# We set a shorter timeout here. If the DB is down for more than 2 hours,
# we want manual intervention. Alerts should be configured on buffer size.
retry_timeout 2h
</buffer>
</match>
A common mistake is hardcoding credentials in this file. We use Ruby expression substitution (#{ENV['VAR']}) to pull sensitive information from environment variables, which are injected securely at runtime. The column_mapping is another critical piece; it acts as a schema enforcement layer. If a log record is missing a required key or has a malformed one, the insert will fail for that chunk, which can then be investigated from the Fluentd logs.
Phase 5: Database Schema and Final Validation
The final component is the SQL Server table itself. The schema must match the data being sent by Fluentd and be optimized for the types of queries the security team will run.
-- DDL for the security audit events table in SQL Server
IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[security_audit_events]') AND type in (N'U'))
BEGIN
CREATE TABLE [dbo].[security_audit_events](
[id] [bigint] IDENTITY(1,1) NOT NULL,
[event_timestamp] [datetime2](7) NOT NULL,
[ingested_at] [datetime2](7) NOT NULL CONSTRAINT [DF_security_audit_events_ingested_at] DEFAULT (SYSUTCDATETIME()),
[event_type] [varchar](50) NOT NULL,
[user_id] [varchar](255) NULL,
[client_ip] [varchar](45) NULL,
[user_agent_raw] [nvarchar](1024) NULL,
[error_details] [nvarchar](max) NULL,
[credential_id] [varchar](512) NULL,
[sign_count] [int] NULL,
[swarm_node] [varchar](255) NOT NULL,
[swarm_service] [varchar](255) NOT NULL,
[swarm_task_id] [varchar](255) NOT NULL,
CONSTRAINT [PK_security_audit_events] PRIMARY KEY CLUSTERED ([id] ASC)
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
END
GO
-- Indexes are crucial for query performance.
CREATE NONCLUSTERED INDEX [IX_security_audit_events_timestamp_type] ON [dbo].[security_audit_events]
(
[event_timestamp] DESC,
[event_type] ASC
)
INCLUDE([user_id], [client_ip])
GO
CREATE NONCLUSTERED INDEX [IX_security_audit_events_user_id] ON [dbo].[security_audit_events]
(
[user_id] ASC
)
WHERE [user_id] IS NOT NULL
GO
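The ingested_at column, populated by its SYSUTCDATETIME() default, also doubles as a cheap health check for the pipeline itself. A sketch of the idea (the one-hour window is arbitrary): comparing it with event_timestamp for recent rows exposes end-to-end lag from container stdout to the table without touching Fluentd's own metrics.

-- Approximate end-to-end pipeline lag for events ingested in the last hour.
SELECT
    AVG(DATEDIFF(second, event_timestamp, ingested_at)) AS avg_lag_seconds,
    MAX(DATEDIFF(second, event_timestamp, ingested_at)) AS max_lag_seconds
FROM dbo.security_audit_events
WHERE ingested_at >= DATEADD(hour, -1, SYSUTCDATETIME());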
With this table in place, the original request from the security team becomes a straightforward SQL query:
SELECT
event_timestamp,
user_id,
client_ip,
error_details,
swarm_node
FROM
dbo.security_audit_events
WHERE
event_type = 'WEBAUTHN_AUTH_FAILURE'
AND event_timestamp >= DATEADD(day, -90, SYSUTCDATETIME())
ORDER BY
event_timestamp DESC;
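And the “correlated by user ID and source IP” half of the request is just a grouping away; a sketch of one way to slice it:

-- Failures grouped by user and source address over the same 90-day window.
SELECT
    user_id,
    client_ip,
    COUNT(*) AS failure_count,
    MAX(event_timestamp) AS last_failure
FROM dbo.security_audit_events
WHERE event_type = 'WEBAUTHN_AUTH_FAILURE'
  AND event_timestamp >= DATEADD(day, -90, SYSUTCDATETIME())
GROUP BY user_id, client_ip
ORDER BY failure_count DESC, last_failure DESC;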
The pipeline successfully transforms a chaotic, distributed stream of text into a structured, queryable, and resilient audit trail.
The current implementation provides a robust baseline, but it has boundaries. The integrity of the log is only guaranteed from the point the Docker daemon receives it. A compromised container could still write malicious or forged log entries to stdout. A higher level of assurance would require application-level log signing. Furthermore, while SQL Server is excellent for structured queries, its cost and performance characteristics might become a concern if the event volume grows by orders of magnitude. At that point, the Fluentd aggregator’s configuration could be changed to route data to a more cost-effective cold storage or a specialized data warehouse, using the SQL Server instance as a hot-tier query cache for recent events. This pipeline is not a final state, but an extensible foundation for security observability.