Orchestrating TiDB Chaos Experiments with Tekton Pipelines and WAF-based L7 Fault Injection


The post-mortem was clear. A transient network partition between two availability zones caused a minor leadership election hiccup in our TiDB cluster. TiDB itself recovered in seconds, as designed. The real problem was the cascading failure in the upstream Go services. A combination of default HTTP client timeouts, no request jitter in the retry logic, and a stampeding herd of retries exhausted the connection pool to the newly elected TiDB leader. We had a resilient database but a brittle application stack. It was obvious that unit and integration tests were insufficient; we needed to validate the resilience of the entire socio-technical system.

Our initial concept was to build a declarative, automated chaos engineering platform. It couldn’t be a separate, siloed tool. It had to live within our existing Kubernetes-native CI/CD ecosystem. The goal was to define chaos experiments as code, store them in Git, and run them as part of a pre-production deployment pipeline. The system needed to inject faults at multiple layers, observe the system’s response in a quantifiable way, and automatically determine a pass/fail outcome.

The technology selection process was driven by pragmatism and a preference for leveraging our existing stack. For orchestration, Tekton was the logical choice. It’s a Kubernetes CRD-based pipeline engine, meaning our experiments could be defined as PipelineRun YAMLs. This fit perfectly with our GitOps approach. The system under test was our user-service application, a Go-based microservice backed by a multi-node TiDB cluster.

For observability, we already used Zipkin for distributed tracing. This was a critical component. We weren’t just checking if a service was “up”; we needed to analyze the trace data during an experiment to see how latency was affected, if retries were happening correctly, and if circuit breakers were tripping as expected.

The most contentious decision was the fault injection mechanism. The obvious choice for L7 faults (e.g., HTTP errors, latency) is a service mesh like Istio. However, introducing a full service mesh into our test environments solely for this purpose felt like overkill. The operational overhead was a real concern. The pitfall here is adopting a complex technology for a narrow use case. Instead, we took an unconventional route. Our ingress is fronted by a programmable WAF that has a rich administrative API. We realized we could treat the WAF as a fault injector: dynamically create rules to match specific traffic and return a 503 error or add a delay, for a configurable percentage of requests.

Finally, we needed a way to manage the “blast radius” and experiment parameters dynamically without modifying the Tekton Pipeline definition itself. Consul’s Key-Value store was already part of our infrastructure for service configuration, making it a natural fit for storing the runtime configuration of our chaos experiments. A Tekton Task would simply read from a predefined key in Consul to fetch its parameters before execution.

This architecture—Tekton for orchestration, a programmable WAF for L7 injection, Zipkin for observation, and Consul for dynamic configuration—formed the foundation of our internal resilience testing platform targeting our critical TiDB-backed services.

The Tekton Pipeline Structure

The core of the system is a Tekton Pipeline. It defines the sequence of operations for a single chaos experiment. The structure is designed to be generic, with specifics passed in via parameters and workspaces.

graph TD
    A[Start] --> B(fetch-experiment-config);
    B --> C(setup-steady-state-probe);
    C --> D{Inject Fault};
    D -- WAF L7 Fault --> E1[inject-http-fault];
    D -- Kube API L4 Fault --> E2[delete-tikv-pod];
    E1 --> F(apply-load);
    E2 --> F;
    F --> G(wait-for-duration);
    G --> H(revert-fault);
    H --> I(analyze-traces);
    I --> J{Check SLO};
    J -- Pass --> K(Report Success);
    J -- Fail --> L(Report Failure);

The corresponding Tekton Pipeline resource is verbose but declarative. It wires together shared Workspaces for passing data between Tasks and defines a clear, repeatable workflow.

apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: tidb-chaos-experiment-pipeline
spec:
  description: |
    Runs a chaos experiment against a TiDB-backed service.
    1. Fetches experiment configuration from Consul.
    2. Injects a specified fault (e.g., L7 via WAF, or pod deletion).
    3. Applies synthetic load to the target service.
    4. Waits for the experiment duration.
    5. Reverts the fault.
    6. Analyzes Zipkin traces to verify system behavior against SLOs.
  params:
    - name: consul-key
      type: string
      description: The Consul KV key where the experiment configuration is stored.
    - name: load-generator-image
      type: string
      description: Container image for the load generator.
    - name: target-service-url
      type: string
      description: The URL of the service to target with load.
  workspaces:
    - name: shared-data
      description: Workspace for sharing files like experiment configs and results.
  tasks:
    - name: fetch-config-from-consul
      taskRef:
        name: consul-kv-read
      params:
        - name: key
          value: $(params.consul-key)
      workspaces:
        - name: output
          workspace: shared-data

    - name: inject-l7-fault-via-waf
      taskRef:
        name: waf-fault-injector
      runAfter: ["fetch-config-from-consul"]
      params:
        - name: config-path
          value: "config.json"
        - name: action
          value: "apply"
      workspaces:
        - name: config-source
          workspace: shared-data
      # This task should only run if the config specifies an L7 fault.
      # In a real implementation this would be gated with a 'when' expression;
      # a sketch of that gate follows the Pipeline definition below.

    - name: apply-synthetic-load
      taskRef:
        name: k6-load-test
      runAfter: ["inject-l7-fault-via-waf"]
      params:
        - name: SCRIPT_PATH
          value: /workspace/source/load-script.js
        - name: TARGET_URL
          value: $(params.target-service-url)
      workspaces:
        - name: source
          workspace: shared-data

    - name: revert-l7-fault-via-waf
      taskRef:
        name: waf-fault-injector
      runAfter: ["apply-synthetic-load"]
      params:
        - name: config-path
          value: "config.json"
        - name: action
          value: "revert"
      workspaces:
        - name: config-source
          workspace: shared-data

    - name: analyze-zipkin-traces
      taskRef:
        name: zipkin-trace-analyzer
      runAfter: ["revert-l7-fault-via-waf"]
      params:
        - name: config-path
          value: "config.json"
      workspaces:
        - name: config-source
          workspace: shared-data
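
The conditional noted in the comments above maps directly onto Tekton when expressions. A minimal sketch, under the assumption that the consul-kv-read Task is extended to surface the fault type as a result named fault-type (the version shown above does not emit it yet):

    - name: inject-l7-fault-via-waf
      taskRef:
        name: waf-fault-injector
      runAfter: ["fetch-config-from-consul"]
      # Only run when the fetched config declares an L7 HTTP fault.
      when:
        - input: "$(tasks.fetch-config-from-consul.results.fault-type)"
          operator: in
          values: ["L7_HTTP_ERROR"]
      # params and workspaces are identical to the definition above.

The L4 pod-deletion branch from the diagram can be gated the same way with a different values entry.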

Dynamic Configuration with Consul

Hardcoding experiment parameters like fault type, percentage, or duration directly into the PipelineRun is inflexible. We manage this through a JSON object stored in Consul KV. The fetch-config-from-consul task retrieves this at runtime.

Consul KV Path: chaos/experiments/user-service/http-503

Value (JSON):

{
  "experimentName": "user-service-503-resilience",
  "enabled": true,
  "fault": {
    "type": "L7_HTTP_ERROR",
    "targetHost": "user-service.production.svc",
    "errorCode": 503,
    "percentage": 10
  },
  "durationSeconds": 120,
  "load": {
    "rps": 50
  },
  "slo": {
    "maxP99LatencyMs": 1500,
    "minSuccessRate": 0.90
  },
  "zipkin": {
    "serviceName": "user-service",
    "lookbackSeconds": 180
  }
}

This configuration-as-code approach, stored outside the pipeline definition, allows the SRE team to tweak, enable, or disable experiments via a simple Consul API call or UI change, without touching any CI/CD YAML.
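
On the read side, the consul-kv-read Task referenced by the pipeline does not need a Consul client binary at all: Consul's KV HTTP API returns the stored value verbatim when queried with the ?raw flag. A minimal sketch of that Task, where the curl image tag and the Consul address are assumptions made for illustration:

apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: consul-kv-read
spec:
  params:
    - name: key
      type: string
      description: The Consul KV key to read.
  workspaces:
    - name: output
      description: Where the fetched configuration is written as config.json.
  steps:
    - name: fetch
      image: curlimages/curl:8.5.0  # assumption: any image with curl works
      env:
        - name: CONSUL_HTTP_ADDR
          value: "http://consul.service.consul:8500"  # assumed Consul address
      script: |
        #!/bin/sh
        set -eu
        # '?raw' returns the value without Consul's base64 envelope.
        curl -sf "${CONSUL_HTTP_ADDR}/v1/kv/$(params.key)?raw" \
          > "$(workspaces.output.path)/config.json"

Writing the value to config.json in the shared workspace is what lets the downstream tasks pick it up via their config-path parameter.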

Custom Tekton Task for WAF-based Fault Injection

This is the most critical and custom piece of the implementation. We created a container image based on a Go program that acts as a Tekton Task. This program parses the configuration from the workspace and interacts with our WAF’s admin API.

Here is a simplified but functional representation of the main.go for this custom task. A common mistake is to write such tools without proper error handling or clear logging, making pipeline debugging nearly impossible.

package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"os"
	"path/filepath"
	"time"
)

// Represents the experiment configuration read from the workspace file.
type ExperimentConfig struct {
	Fault struct {
		Type       string `json:"type"`
		TargetHost string `json:"targetHost"`
		ErrorCode  int    `json:"errorCode"`
		Percentage int    `json:"percentage"`
	} `json:"fault"`
}

// Represents the payload for the WAF Admin API.
type WAFRulePayload struct {
	ID          string `json:"id"`
	Description string `json:"description"`
	Priority    int    `json:"priority"`
	Match       struct {
		Host   string `json:"host"`
		Path   string `json:"path"`
	} `json:"match"`
	Action      struct {
		Type       string `json:"type"` // e.g., "deny", "add_delay"
		StatusCode int    `json:"statusCode"`
	} `json:"action"`
	SamplingRate float64 `json:"samplingRate"` // 0.0 to 1.0
}


// getWAFAdminURL retrieves the WAF API endpoint from an environment variable.
// In a production setup, this would come from a secret.
func getWAFAdminURL() string {
	url := os.Getenv("WAF_ADMIN_API_URL")
	if url == "" {
		log.Fatal("FATAL: WAF_ADMIN_API_URL environment variable not set.")
	}
	return url
}

func main() {
	// Tekton passes Task parameters to the step as command-line arguments,
	// e.g. --config-path config.json --action apply.
	configPath := flag.String("config-path", "", "path to the experiment config, relative to the workspace root")
	action := flag.String("action", "", "either 'apply' or 'revert'")
	flag.Parse()

	if *configPath == "" || *action == "" {
		log.Fatal("Usage: <program> --config-path <path> --action <apply|revert>")
	}
	actionFlag := *action

	// Tekton mounts workspaces at /workspace/<workspace-name>.
	fullConfigPath := filepath.Join("/workspace/config-source", *configPath)

	log.Printf("Action: %s. Reading config from: %s", actionFlag, fullConfigPath)

	configData, err := os.ReadFile(fullConfigPath)
	if err != nil {
		log.Fatalf("FATAL: could not read config file %s: %v", fullConfigPath, err)
	}

	var config ExperimentConfig
	if err := json.Unmarshal(configData, &config); err != nil {
		log.Fatalf("FATAL: could not parse experiment config JSON: %v", err)
	}
	
	// The rule ID must be predictable to allow for reverting it later.
	ruleID := fmt.Sprintf("chaos-tekton-%s", config.Fault.TargetHost)

	client := &http.Client{Timeout: 15 * time.Second}
	wafURL := getWAFAdminURL()

	switch actionFlag {
	case "apply":
		log.Printf("Applying chaos rule '%s' to host '%s'", ruleID, config.Fault.TargetHost)
		
		payload := WAFRulePayload{
			ID:          ruleID,
			Description: "Chaos engineering fault injection via Tekton",
			Priority:    1, // High priority to ensure it's evaluated first.
			Match: struct {
				Host string `json:"host"`
				Path string `json:"path"`
			}{
				Host: config.Fault.TargetHost,
				Path: "/*",
			},
			Action: struct {
				Type       string `json:"type"`
				StatusCode int    `json:"statusCode"`
			}{
				Type:       "deny",
				StatusCode: config.Fault.ErrorCode,
			},
			SamplingRate: float64(config.Fault.Percentage) / 100.0,
		}
		
		if err := sendWAFRequest(client, http.MethodPost, wafURL+"/rules", payload); err != nil {
			log.Fatalf("FATAL: failed to apply WAF rule: %v", err)
		}
		log.Println("Successfully applied WAF fault injection rule.")

	case "revert":
		log.Printf("Reverting chaos rule '%s'", ruleID)
		ruleURL := fmt.Sprintf("%s/rules/%s", wafURL, ruleID)

		if err := sendWAFRequest(client, http.MethodDelete, ruleURL, nil); err != nil {
			// Don't fail the pipeline if the rule is already gone, just log a warning.
			log.Printf("WARN: failed to revert WAF rule (maybe it was already removed?): %v", err)
		} else {
			log.Println("Successfully reverted WAF fault injection rule.")
		}
		
	default:
		log.Fatalf("FATAL: unknown action '%s'. Must be 'apply' or 'revert'.", actionFlag)
	}
}

func sendWAFRequest(client *http.Client, method, url string, payload interface{}) error {
	var reqBody []byte
	var err error

	if payload != nil {
		reqBody, err = json.Marshal(payload)
		if err != nil {
			return fmt.Errorf("failed to marshal request payload: %w", err)
		}
	}
	
	req, err := http.NewRequestWithContext(context.Background(), method, url, bytes.NewBuffer(reqBody))
	if err != nil {
		return fmt.Errorf("failed to create HTTP request: %w", err)
	}
	req.Header.Set("Content-Type", "application/json")
	// The auth token is injected via a Kubernetes Secret mapped to this env var.
	req.Header.Set("X-Auth-Token", os.Getenv("WAF_ADMIN_API_TOKEN"))

	resp, err := client.Do(req)
	if err != nil {
		return fmt.Errorf("HTTP request to WAF admin API failed: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode >= 400 {
		body, _ := io.ReadAll(resp.Body)
		return fmt.Errorf("WAF admin API returned error status %d: %s", resp.StatusCode, string(body))
	}

	return nil
}

This code is then packaged into a Docker image and referenced in a Task definition.
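
The Task definition that wraps the image is mostly plumbing: it maps the Tekton params onto the binary's flags and injects the WAF credentials from a Kubernetes Secret. A sketch, where the image reference and the Secret name waf-admin-credentials are illustrative rather than the real values:

apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: waf-fault-injector
spec:
  params:
    - name: config-path
      type: string
      description: Path to the experiment config inside the workspace.
    - name: action
      type: string
      description: Either 'apply' or 'revert'.
  workspaces:
    - name: config-source
      description: Workspace containing the fetched experiment config.
  steps:
    - name: run
      image: registry.example.internal/chaos/waf-fault-injector:latest  # illustrative
      args:
        - "--config-path"
        - "$(params.config-path)"
        - "--action"
        - "$(params.action)"
      env:
        - name: WAF_ADMIN_API_URL
          valueFrom:
            secretKeyRef:
              name: waf-admin-credentials  # assumed Secret name
              key: url
        - name: WAF_ADMIN_API_TOKEN
          valueFrom:
            secretKeyRef:
              name: waf-admin-credentials
              key: token

Because the workspace is named config-source, Tekton mounts it at /workspace/config-source, which is exactly the path the Go program joins against.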

Analyzing Impact with Zipkin

The most important step is analyze-zipkin-traces. A “successful” chaos experiment is not one where nothing fails, but one where the system degrades gracefully according to its design. This task queries the Zipkin API to get aggregate data for the target service during the experiment window.

The custom Tekton Task for this step needs to:

  1. Read the experiment configuration to get the serviceName and slo thresholds.
  2. Calculate the start and end timestamps for the query.
  3. Query the Zipkin /api/v2/traces and /api/v2/dependencies endpoints.
  4. Process the returned trace data to calculate the actual P99 latency and success rate.
  5. Compare these calculated values against the SLOs defined in the Consul config.
  6. Exit with status code 0 on success (SLOs met) or 1 on failure (SLOs breached).

A key code snippet for this analysis task might look like this:

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"sort"
	"time"
)

// Simplified SLO definition
type SLO struct {
	MaxP99LatencyMs int     `json:"maxP99LatencyMs"`
	MinSuccessRate  float64 `json:"minSuccessRate"`
}

// ... other structs and main function setup ...

func analyzeTraces(zipkinURL, serviceName string, start, end time.Time, slo SLO) error {
	// Construct Zipkin API query URL
	// The query finds all root spans for the service within the time window.
	query := fmt.Sprintf(
		"%s/api/v2/traces?serviceName=%s&endTs=%d&lookback=%d&limit=10000",
		zipkinURL,
		serviceName,
		end.UnixMilli(),
		end.Sub(start).Milliseconds(),
	)

	resp, err := http.Get(query)
	if err != nil {
		return fmt.Errorf("failed to query Zipkin API: %w", err)
	}
	defer resp.Body.Close()

	// Zipkin returns a list of traces; each trace is a list of spans.
	var traces [][]Span
	if err := json.NewDecoder(resp.Body).Decode(&traces); err != nil {
		return fmt.Errorf("failed to decode Zipkin response: %w", err)
	}
	
	if len(traces) == 0 {
		return fmt.Errorf("no traces found for service '%s' in the given time window", serviceName)
	}

	var latencies []int
	var errorCount, totalCount int

	for _, trace := range traces {
		if len(trace) > 0 {
			// Simplification: treat the first span as the root. Strictly, the root
			// is the span with no parentId; the API does not guarantee ordering.
			rootSpan := trace[0]
			// We only care about server-side root spans for this analysis.
			if rootSpan.Kind == "SERVER" && rootSpan.LocalEndpoint.ServiceName == serviceName {
				totalCount++
				latencies = append(latencies, rootSpan.Duration/1000) // Duration is in microseconds
				if _, ok := rootSpan.Tags["error"]; ok {
					errorCount++
				}
			}
		}
	}
	
	if totalCount == 0 {
		return fmt.Errorf("found traces, but none were root spans for service '%s'", serviceName)
	}
	
	// Calculate metrics
	successRate := float64(totalCount-errorCount) / float64(totalCount)
	sort.Ints(latencies)
	p99Index := int(float64(len(latencies))*0.99) - 1
	if p99Index < 0 { p99Index = 0 }
	p99Latency := latencies[p99Index]

	log.Printf("Analysis Results: P99 Latency: %dms, Success Rate: %.2f%%", p99Latency, successRate*100)
	
	// Check against SLOs
	if p99Latency > slo.MaxP99LatencyMs {
		return fmt.Errorf("SLO BREACHED: P99 latency %dms is greater than threshold %dms", p99Latency, slo.MaxP99LatencyMs)
	}
	if successRate < slo.MinSuccessRate {
		return fmt.Errorf("SLO BREACHED: Success rate %.2f%% is less than threshold %.2f%%", successRate*100, slo.MinSuccessRate*100)
	}

	log.Println("SLO validation PASSED.")
	return nil
}

type Span struct {
	// Simplified Zipkin Span model
	Kind          string            `json:"kind"`
	Duration      int               `json:"duration"`
	LocalEndpoint Endpoint          `json:"localEndpoint"`
	Tags          map[string]string `json:"tags"`
}

type Endpoint struct {
	ServiceName string `json:"serviceName"`
}
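
The entrypoint elided above only has to translate the result of analyzeTraces into an exit code, because Tekton marks a TaskRun, and therefore the whole PipelineRun, as failed when any step exits non-zero. A minimal sketch of that wrapper:

// Sketch: wiring analyzeTraces into the step's exit status. log.Fatalf exits
// with status 1, which fails the TaskRun and the surrounding PipelineRun.
func runAnalysis(zipkinURL, serviceName string, start, end time.Time, slo SLO) {
	if err := analyzeTraces(zipkinURL, serviceName, start, end, slo); err != nil {
		log.Fatalf("FATAL: %v", err)
	}
	// Returning normally lets main exit 0: SLOs held and the experiment passed.
}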

This automated, data-driven analysis is what elevates this from simple fault injection to true resilience engineering. It provides an objective, repeatable measure of the system’s performance under duress.

The primary limitation of this implementation is the fidelity of the fault injector. Using a WAF for L7 errors is a clever workaround but cannot accurately simulate deeper network issues like packet loss, jitter, or full partitions. For those scenarios, a CNI-level tool or a full service mesh is unavoidable. Furthermore, our Zipkin analysis task is still relatively simple. A more advanced version could build a baseline performance profile before the fault is injected and then measure the delta during the experiment, allowing for more nuanced SLOs like “P99 latency must not increase by more than 200%”. The next iteration will likely focus on integrating a more powerful network-level fault injector and building this baseline analysis capability directly into the Tekton Task.

