Constructing a Resilient Audit Trail for WebAuthn Events from Docker Swarm to SQL Server via Fluentd


The initial deployment of our new passwordless authentication service, built on WebAuthn, across the Docker Swarm cluster felt like a success. The service was stateless, scaled beautifully, and user feedback was positive. The operational tranquility was shattered during a routine security audit. The request was simple: “Provide a comprehensive log of all WebAuthn registration and authentication failures, correlated by user ID and source IP, from all cluster nodes over the last 90 days.” We couldn’t. Our logging was a chaotic stream of unstructured text sent to stdout, aggregated by a basic log collector. It was impossible to perform the kind of structured, forensic analysis required. The ephemeral nature of containers meant logs from failed tasks were often lost forever. This wasn’t just an operational inconvenience; it was a critical security and compliance failure.

Our immediate goal was to engineer a robust, auditable pipeline for security-sensitive events. The architecture had to guarantee message delivery, enforce a structured format, and deposit the data into a system capable of complex, relational queries. After evaluating the components we already had in production, the proposed data flow became clear: Application services running on Docker Swarm would generate structured JSON events for every WebAuthn operation. The Docker fluentd logging driver would capture these events and forward them to a per-node Fluentd agent. These agents would buffer, filter, and forward the events to a central Fluentd aggregator, which would then be responsible for batch-inserting the structured data into a dedicated table in our existing SQL Server instance.

graph TD
    subgraph Docker Swarm Node 1
        A1[App Container 1] -- stdout/stderr --> D1[Docker Daemon];
        D1 -- fluentd driver --> F_Agent1[Fluentd Agent Service];
    end
    subgraph Docker Swarm Node 2
        A2[App Container 2] -- stdout/stderr --> D2[Docker Daemon];
        D2 -- fluentd driver --> F_Agent2[Fluentd Agent Service];
    end
    subgraph Docker Swarm Node N
        AN[App Container N] -- stdout/stderr --> DN[Docker Daemon];
        DN -- fluentd driver --> F_AgentN[Fluentd Agent Service];
    end

    F_Agent1 -- TCP Forward --> F_Aggregator[Fluentd Aggregator Service];
    F_Agent2 -- TCP Forward --> F_Aggregator;
    F_AgentN -- TCP Forward --> F_Aggregator;

    subgraph Central Services
        F_Aggregator -- Batch INSERT --> SQL[SQL Server];
    end

    subgraph Analyst
        SecOps[Security Analyst] -- SQL Query --> SQL;
    end

This design leverages existing infrastructure (Docker Swarm, SQL Server) while introducing Fluentd as the critical transport and processing layer. The choice of SQL Server over something like Elasticsearch was deliberate. While Elasticsearch excels at full-text search, our primary requirement was strict schema enforcement and the ability to perform relational joins and aggregations for audit reports—a task for which SQL is unparalleled.
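To make that concrete: once the events are relational rows, the report the auditors asked for reduces to a single aggregation. Here is a sketch against the security_audit_events table defined later in Phase 5 (column names follow the schema introduced there):

-- Illustrative audit report: authentication failures per user and source IP, per day.
SELECT
    CAST(event_timestamp AS date) AS event_day,
    user_id,
    client_ip,
    COUNT(*) AS failure_count
FROM dbo.security_audit_events
WHERE event_type = 'WEBAUTHN_AUTH_FAILURE'
  AND event_timestamp >= DATEADD(day, -90, SYSUTCDATETIME())
GROUP BY CAST(event_timestamp AS date), user_id, client_ip
ORDER BY event_day DESC, failure_count DESC;

Reproducing this kind of grouped, schema-enforced report in a document store is possible, but it is exactly the workload a relational engine handles natively.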

Phase 1: Application-Level Structured Event Generation

The foundation of any reliable logging pipeline is the quality of the data at its source. Unstructured string logs are the enemy of automation and analysis. We mandated that all security-sensitive events must be logged as single-line JSON objects to stdout.

Here is a snippet from our Go-based authentication service, demonstrating how we handle a WebAuthn login assertion. We use the zerolog library for high-performance, structured logging.

package main

import (
	"encoding/base64"
	"net/http"
	"os"

	"github.com/go-webauthn/webauthn/webauthn"
	"github.com/rs/zerolog"
)

// In a real application, these would be managed properly.
var webAuthn *webauthn.WebAuthn
var userStore = make(map[string]webauthn.User)
var sessionStore = make(map[string]webauthn.SessionData)

// Global logger instance
var securityLogger zerolog.Logger

func init() {
	// The logger is configured to write structured JSON to stdout.
	// We add a static field 'log_type' to easily filter these specific logs in Fluentd.
	securityLogger = zerolog.New(os.Stdout).With().
		Timestamp().
		Str("service", "auth-service").
		Str("log_type", "security_audit").
		Logger()
}

// Simplified WebAuthn login handler
func handleLogin(w http.ResponseWriter, r *http.Request) {
	// ... (Code to parse request, get user from database)
	var user webauthn.User // Assume this is fetched
	
	// The core WebAuthn validation logic
	credential, err := webAuthn.Get(r, user, webauthn.WrapMap(sessionStore))

	// This is the critical logging section.
	if err != nil {
		// Log the failure event with rich context.
		securityLogger.Error().
			Str("event_type", "WEBAUTHN_AUTH_FAILURE").
			Str("user_id", user.WebAuthnID()).
			Str("client_ip", r.RemoteAddr).
			Str("user_agent", r.UserAgent()).
			AnErr("error", err). // Structured error logging
			Msg("WebAuthn assertion validation failed")
			
		http.Error(w, "Authentication failed", http.StatusUnauthorized)
		return
	}

	// Log the success event.
	securityLogger.Info().
		Str("event_type", "WEBAUTHN_AUTH_SUCCESS").
		Str("user_id", user.WebAuthnID()).
		Str("credential_id", string(credential.ID)).
		Str("client_ip", r.RemoteAddr).
		Str("user_agent", r.UserAgent()).
		Uint32("sign_count", credential.Authenticator.SignCount).
		Msg("WebAuthn assertion validated successfully")

	// ... (Code to establish user session)
	w.WriteHeader(http.StatusOK)
	w.Write([]byte("Login successful"))
}

func main() {
	// This setup is for demonstration. A real app would have more complex handlers.
	http.HandleFunc("/login", handleLogin)

	// A common mistake is to not handle server shutdown gracefully.
	// This can lead to lost logs if the logger buffer hasn't flushed.
	server := &http.Server{Addr: ":8080"}
	// ... setup graceful shutdown logic ...
	err := server.ListenAndServe()
	if err != nil && err != http.ErrServerClosed {
		// Use the same structured logger for application lifecycle events.
		securityLogger.Fatal().Err(err).Msg("Server failed to start")
	}
}

The key takeaway is the discipline enforced by this approach. Every security event produces a JSON object with a consistent schema (event_type, user_id, etc.). This preemptively solves the parsing nightmare that plagues traditional text-based logs. The inclusion of log_type: "security_audit" provides a simple, high-performance way for Fluentd to distinguish these critical events from general application debug logs.
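For reference, a single failure event from the handler above reaches stdout as one line of JSON along these lines (values and field order are illustrative):

{"level":"error","time":"2025-01-15T08:42:11Z","service":"auth-service","log_type":"security_audit","event_type":"WEBAUTHN_AUTH_FAILURE","user_id":"u-20481","client_ip":"10.0.3.17:51734","user_agent":"Mozilla/5.0 (...)","error":"assertion validation error","message":"WebAuthn assertion validation failed"}

The Docker fluentd driver wraps this line in a record of its own, alongside container_id, container_name, and source fields, which is why the agent configuration in Phase 3 first greps on container_name and then parses the log key as JSON.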

Phase 2: Docker Swarm and Fluentd Agent Deployment

With the application producing the correct format, the next step is to configure the infrastructure to transport it. We deploy Fluentd as a global service in Docker Swarm, ensuring one agent runs on every worker node, matching the placement of the application tasks. This agent is responsible for receiving logs from all containers on its node.

Here is the relevant section of our docker-stack.yml file:

version: '3.8'

services:
  # ... other services like traefik, databases, etc.

  auth-service:
    image: my-registry/auth-service:1.2.3
    networks:
      - internal_net
    deploy:
      replicas: 3
      mode: replicated
      placement:
        constraints: [node.role == worker]
      update_config:
        parallelism: 1
        delay: 10s
      restart_policy:
        condition: on-failure
    # This is the crucial part for log forwarding.
    logging:
      driver: "fluentd"
      options:
        # We point to the local fluentd agent, listening on the default port.
        fluentd-address: 127.0.0.1:24224
        # The tag is essential for routing. The log driver's tag template only
        # exposes container fields ({{.ID}}, {{.Name}}, ...); in Swarm the
        # container name already embeds the service name, slot, and task ID.
        tag: "swarm.audit.{{.Name}}"
        # Blocking delivery (the Docker default) is deliberate: if the local
        # agent cannot accept events, writes to stdout stall rather than drop
        # data, which is the right trade-off for an audit trail.
        mode: "blocking"

  fluentd-agent:
    image: my-registry/fluentd-agent:latest
    networks:
      - internal_net
    environment:
      # Swarm expands service templates in environment values, so each agent
      # learns the hostname of the node it is scheduled on.
      - NODE_HOSTNAME={{.Node.Hostname}}
    volumes:
      # Mount the Docker socket to allow fluentd to access container metadata.
      - /var/run/docker.sock:/var/run/docker.sock
      # Mount a host path for file-based buffering to survive agent restarts.
      - /mnt/fluentd-buffer:/fluentd/log/buffer
    ports:
      # Publish in host mode so the local daemon's logging driver reaches the
      # agent on the same node; the default ingress mode would route connections
      # through the Swarm routing mesh to an arbitrary node.
      - target: 24224
        published: 24224
        protocol: tcp
        mode: host
      - target: 24224
        published: 24224
        protocol: udp
        mode: host
    deploy:
      # Global mode ensures one agent per node.
      mode: global
      placement:
        constraints: [node.role == worker]
      restart_policy:
        condition: on-failure

networks:
  internal_net:
    driver: overlay

A common pitfall here is neglecting the fluentd-agent’s own resource management. Without a persistent volume for its buffer (/mnt/fluentd-buffer), any agent restart would cause the loss of all logs currently in its memory buffer. This is unacceptable for an audit trail. We use a file-based buffer to ensure at-least-once delivery semantics between the node and the central aggregator.
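File buffering protects events the agent has already accepted; for the hop between agent and aggregator, delivery can additionally be acknowledged per chunk. This is a small, optional addition to the forward output shown in the next section (a sketch, not part of the baseline configuration):

# Inside the <match swarm.audit.**> forward output on the agent:
  require_ack_response true    # the aggregator must acknowledge each chunk
  ack_response_timeout 30s     # unacknowledged chunks are retried from the file buffer

The cost is slightly higher latency and the possibility of duplicate rows after a retry, which is why the semantics remain at-least-once rather than exactly-once.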

Phase 3: Fluentd Agent Configuration for Filtering and Forwarding

The configuration for the per-node fluentd-agent is focused on three tasks: receiving logs, filtering for only the relevant security events, and forwarding them reliably to the aggregator.

Here is the fluentd.conf used in the fluentd-agent image:

# /etc/fluent/fluentd.conf on the agent

# Source: Listen for logs from the Docker logging driver.
<source>
  @type forward
  @id input_docker
  port 24224
  bind 0.0.0.0
</source>

# Filter 1: We only care about logs from our auth-service.
# The fluentd logging driver adds container_id, container_name, and source to
# every record, and the Swarm container name embeds the service name.
<filter swarm.audit.**>
  @type grep
  # Keep only records whose container name contains 'auth-service'.
  # In a real-world project, you might use more specific naming or tags.
  <regexp>
    key container_name
    pattern /auth-service/
  </regexp>
</filter>

# Filter 2: Parse the JSON log message and enrich the record.
# This filter only applies to records that passed the previous grep.
<filter swarm.audit.**>
  @type parser
  key_name log
  reserve_data true # Keep the original log field
  <parse>
    @type json
  </parse>
</filter>

# Filter 3: A final check to ensure this is a security audit log.
# This is our second layer of defense, based on the 'log_type' field from the app.
<filter swarm.audit.**>
  @type grep
  <regexp>
    key log_type
    pattern /^security_audit$/
  </regexp>
</filter>

# Filter 4: Add metadata from the tag and the agent's environment into the record.
# This makes node and service information queryable in SQL Server.
<filter swarm.audit.**>
    @type record_transformer
    <record>
        # Tag format: swarm.audit.STACK_SERVICENAME.SLOT.TASK_ID
        # ({{.Name}} in the log-driver tag expands to the Swarm container name).
        service_name ${tag_parts[2]}
        task_id ${tag_parts[4]}
        # The node hostname comes from the NODE_HOSTNAME variable injected into
        # the agent container by the Swarm service template in the stack file.
        node_hostname "#{ENV['NODE_HOSTNAME']}"
    </record>
</filter>

# Match: Forward processed security logs to the central aggregator.
<match swarm.audit.**>
  @type forward
  @id output_forward
  send_timeout 60s
  recover_wait 10s
  hard_timeout 60s

  # List of aggregator nodes. Fluentd will load balance and handle failover.
  <server>
    host fluentd-aggregator-1.internal
    port 24225
  </server>
  <server>
    host fluentd-aggregator-2.internal
    port 24225
  </server>
  
  # This buffer is the most critical part for ensuring data is not lost.
  <buffer>
    @type file
    path /fluentd/log/buffer/security.audit
    flush_mode interval
    flush_interval 5s
    retry_type exponential_backoff
    retry_wait 1s
    retry_max_interval 60s
    retry_timeout 12h
    chunk_limit_size 8M
    total_limit_size 1G # Max disk space for buffer
    overflow_action block # Block the input if buffer is full, creating backpressure.
  </buffer>
</match>

# All other logs that don't match are discarded or sent elsewhere.
<match **>
    @type null
</match>

The <buffer> configuration is paramount. overflow_action block creates backpressure, preventing the agent from dropping messages if the aggregator is down for an extended period. The application’s logging call will block, which, while not ideal, is preferable to silently losing security data. The retry_type exponential_backoff prevents a thundering herd problem when the aggregator comes back online.

Phase 4: Fluentd Aggregator and SQL Server Integration

The aggregator’s role is simpler: receive logs from all agents and write them to SQL Server. The complexity here lies in the interaction with the database. We use the fluent-plugin-sql plugin.

First, the Dockerfile for our aggregator image must install the necessary components:

FROM fluent/fluentd:v1.16-debian-1

# Install build tools and FreeTDS, which the tiny_tds gem links against to talk
# to SQL Server. This is for the Debian-based image; adjust packages for Alpine.
USER root
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    build-essential freetds-dev && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

# Install the Fluentd SQL plugin, the ActiveRecord adapter for SQL Server, and tiny_tds.
RUN gem install fluent-plugin-sql activerecord-sqlserver-adapter tiny_tds

USER fluent

Next, the fluentd.conf for the aggregator:

# /etc/fluent/fluentd.conf on the aggregator

# Source: Listen for logs from the agents.
<source>
  @type forward
  @id input_agents
  port 24225
  bind 0.0.0.0
</source>

# Match: The core logic to write to SQL Server.
<match swarm.audit.**>
  @type sql
  @id output_sqlserver
  
  # Connection details are loaded from environment variables for security.
  host "#{ENV['SQL_SERVER_HOST']}"
  port "#{ENV['SQL_SERVER_PORT']}"
  database "#{ENV['SQL_SERVER_DATABASE']}"
  adapter sqlserver
  username "#{ENV['SQL_SERVER_USER']}"
  password "#{ENV['SQL_SERVER_PASSWORD']}"

  # The target table and its column mapping live in a <table> section, which is
  # how fluent-plugin-sql expects them. The mapping of Fluentd record keys to
  # database columns is a strong contract between the log format and the DB schema.
  <table>
    table security_audit_events
    column_mapping 'timestamp:event_timestamp,event_type:event_type,user_id:user_id,client_ip:client_ip,user_agent:user_agent_raw,error:error_details,credential_id:credential_id,sign_count:sign_count,node_hostname:swarm_node,service_name:swarm_service,task_id:swarm_task_id'
  </table>

  # Buffer configuration is just as critical here as on the agent.
  # If the database is down, we must not lose the data.
  <buffer>
    @type file
    path /fluentd/log/buffer/sql.output
    flush_interval 10s
    chunk_limit_size 16M # Larger chunks for better DB insert performance
    queue_limit_length 1024 # Max number of chunks in queue
    retry_max_times 15
    retry_type exponential_backoff
    # We set a shorter timeout here. If the DB is down for more than 2 hours,
    # we want manual intervention. Alerts should be configured on buffer size.
    retry_timeout 2h 
  </buffer>
</match>

A common mistake is hardcoding credentials in this file. We use Ruby expression substitution (#{ENV['VAR']}) to pull sensitive information from environment variables, which are injected securely at runtime. The column_mapping is another critical piece; it acts as a schema enforcement layer. If a log record is missing a required key or has a malformed one, the insert will fail for that chunk, which can then be investigated from the Fluentd logs.
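How those environment variables reach the aggregator container is an infrastructure detail. One way to handle it (a sketch; the service, image, and secret names here are assumptions, not taken from the stack file above) is to set the non-sensitive values directly on the service and source the password from a Docker secret:

  fluentd-aggregator:
    image: my-registry/fluentd-aggregator:latest
    networks:
      - internal_net
    environment:
      SQL_SERVER_HOST: sqlserver.internal
      SQL_SERVER_PORT: "1433"
      SQL_SERVER_DATABASE: security_audit
      SQL_SERVER_USER: fluentd_writer
    secrets:
      # Mounted at /run/secrets/sql_server_password; the image's entrypoint is
      # assumed to export it as SQL_SERVER_PASSWORD before starting Fluentd.
      - sql_server_password

secrets:
  sql_server_password:
    external: true

Keeping the password out of both the stack file and the image means a docker service inspect or a leaked compose file never exposes it.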

Phase 5: Database Schema and Final Validation

The final component is the SQL Server table itself. The schema must match the data being sent by Fluentd and be optimized for the types of queries the security team will run.

-- DDL for the security audit events table in SQL Server
IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[security_audit_events]') AND type in (N'U'))
BEGIN
CREATE TABLE [dbo].[security_audit_events](
	[id] [bigint] IDENTITY(1,1) NOT NULL,
	[event_timestamp] [datetime2](7) NOT NULL,
	[ingested_at] [datetime2](7) NOT NULL CONSTRAINT [DF_security_audit_events_ingested_at] DEFAULT (SYSUTCDATETIME()),
	[event_type] [varchar](50) NOT NULL,
	[user_id] [varchar](255) NULL,
	[client_ip] [varchar](45) NULL,
	[user_agent_raw] [nvarchar](1024) NULL,
	[error_details] [nvarchar](max) NULL,
	[credential_id] [varchar](512) NULL,
	[sign_count] [int] NULL,
	[swarm_node] [varchar](255) NOT NULL,
	[swarm_service] [varchar](255) NOT NULL,
	[swarm_task_id] [varchar](255) NOT NULL,
    CONSTRAINT [PK_security_audit_events] PRIMARY KEY CLUSTERED ([id] ASC)
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
END
GO

-- Indexes are crucial for query performance.
CREATE NONCLUSTERED INDEX [IX_security_audit_events_timestamp_type] ON [dbo].[security_audit_events]
(
	[event_timestamp] DESC,
	[event_type] ASC
)
INCLUDE([user_id], [client_ip])
GO

CREATE NONCLUSTERED INDEX [IX_security_audit_events_user_id] ON [dbo].[security_audit_events]
(
	[user_id] ASC
)
WHERE [user_id] IS NOT NULL
GO

With this table in place, the original request from the security team becomes a straightforward SQL query:

SELECT
    event_timestamp,
    user_id,
    client_ip,
    error_details,
    swarm_node
FROM
    dbo.security_audit_events
WHERE
    event_type = 'WEBAUTHN_AUTH_FAILURE'
    AND event_timestamp >= DATEADD(day, -90, SYSUTCDATETIME())
ORDER BY
    event_timestamp DESC;

The pipeline successfully transforms a chaotic, distributed stream of text into a structured, queryable, and resilient audit trail.

The current implementation provides a robust baseline, but it has limits. The integrity of the log is only guaranteed from the point the Docker daemon receives it. A compromised container could still write malicious or forged log entries to stdout. A higher level of assurance would require application-level log signing. Furthermore, while SQL Server is excellent for structured queries, its cost and performance characteristics might become a concern if the event volume grows by orders of magnitude. At that point, the Fluentd aggregator’s configuration could be changed to route data to a more cost-effective cold storage or a specialized data warehouse, using the SQL Server instance as a hot-tier query cache for recent events. This pipeline is not a final state, but an extensible foundation for security observability.

