Constructing a Concurrent Go Tool for Auditing AWS IAM Role Trust Policies


The audit finding was unambiguous: an EC2 instance in a development account, compromised via a Log4j vulnerability, successfully assumed a role granting read access to a production S3 bucket. The attack path wasn’t a flaw in the role’s permissions policy, but in its trust policy. The sts:AssumeRole action was permitted for arn:aws:iam::*:root, a common but dangerous shortcut taken by a developer years ago. Manually auditing thousands of roles across dozens of accounts for this and other subtle trust policy misconfigurations was untenable. Our existing shell scripts were slow, single-threaded, and failed silently on paginated results. A production-grade solution was required.

We decided to build a dedicated, high-concurrency auditing tool in Go. The initial requirements were specific:

  1. Define compliance rules in a simple, declarative format (YAML).
  2. Concurrently scan all IAM roles within a target AWS account.
  3. For each role, parse its trust policy document.
  4. Validate the principals listed in the trust policy against the rules.
  5. Provide a clear, actionable report of non-compliant roles.

The core of the problem lies in the structure of an IAM role’s trust policy. It’s a JSON document defining who can “assume” the role. A simple example looks like this:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "ec2.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

The Principal block can become complex, containing AWS accounts, services, federated identities, or even arrays of different principal types. Our tool needed to parse this structure reliably and validate it against our organizational standards.

Initial Architecture and Configuration

The first step was designing the configuration file. We needed a structure that was expressive enough to capture our security requirements but simple enough for security engineers to manage without code changes. YAML was the obvious choice.

A rule would need to select a set of target roles (e.g., by name pattern or tags) and define the allowed principals for their trust policies.

config/config.yml:

# rules define the compliance checks to be performed.
rules:
  - name: "ProdRolesTrustPolicy"
    description: "Production roles must only be assumed by the Prod Ops account, specific AWS services, or our SAML provider."
    # role_selector defines which roles this rule applies to.
    # An empty selector applies the rule to all roles.
    role_selector:
      tags:
        environment: "production"
    # trust_policy_validation specifies the allowed principals.
    trust_policy_validation:
      # allowed_principals is a list of valid principal identifiers.
      # - AWS Account ARN (e.g., "arn:aws:iam::111122223333:root")
      # - AWS Service Principal (e.g., "ec2.amazonaws.com")
      # - Federated Identity Provider ARN (e.g., "arn:aws:iam::444455556666:saml-provider/OurOkta")
      allowed_principals:
        - "arn:aws:iam::111122223333:root" # Prod Ops Account
        - "ec2.amazonaws.com"
        - "lambda.amazonaws.com"
        - "ecs-tasks.amazonaws.com"
        - "arn:aws:iam::444455556666:saml-provider/OurOkta"

  - name: "NonProdReadOnlyTrusts"
    description: "Non-production read-only roles can be assumed by the developer account."
    role_selector:
      name_pattern: "^ReadOnly-.*"
      tags:
        environment: "development"
    trust_policy_validation:
      allowed_principals:
        - "arn:aws:iam::777788889999:root" # Developer Account
        - "ec2.amazonaws.com"

With the configuration defined, we could lay out the Go structs to represent it.

config/types.go:

package config

type Config struct {
	Rules []Rule `yaml:"rules"`
}

type Rule struct {
	Name                  string                `yaml:"name"`
	Description           string                `yaml:"description"`
	RoleSelector          RoleSelector          `yaml:"role_selector"`
	TrustPolicyValidation TrustPolicyValidation `yaml:"trust_policy_validation"`
}

type RoleSelector struct {
	NamePattern string            `yaml:"name_pattern"`
	Tags        map[string]string `yaml:"tags"`
}

type TrustPolicyValidation struct {
	AllowedPrincipals []string `yaml:"allowed_principals"`
}
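One refinement worth considering at load time: `regexp.MatchString` compiles its pattern on every call, so a rule's `name_pattern` would be recompiled for each role. The sketch below compiles each pattern once and rejects a bad regex before any AWS call is made. The `CompiledRule` type and `CompileRules` function are hypothetical names introduced for illustration, and the structs are trimmed copies of the config types so the example compiles on its own.

```go
package main

import (
	"fmt"
	"regexp"
)

// Trimmed copies of the config types, so the sketch is self-contained.
type RoleSelector struct {
	NamePattern string
	Tags        map[string]string
}

type Rule struct {
	Name         string
	RoleSelector RoleSelector
}

// CompiledRule (hypothetical) pairs a rule with its pre-compiled name pattern.
// NameRegex is nil when the rule has no name_pattern.
type CompiledRule struct {
	Rule
	NameRegex *regexp.Regexp
}

// CompileRules (hypothetical) compiles every name_pattern once and returns
// an error that names the offending rule if a pattern is invalid.
func CompileRules(rules []Rule) ([]CompiledRule, error) {
	compiled := make([]CompiledRule, 0, len(rules))
	for _, r := range rules {
		cr := CompiledRule{Rule: r}
		if r.RoleSelector.NamePattern != "" {
			re, err := regexp.Compile(r.RoleSelector.NamePattern)
			if err != nil {
				return nil, fmt.Errorf("rule %q: invalid name_pattern: %w", r.Name, err)
			}
			cr.NameRegex = re
		}
		compiled = append(compiled, cr)
	}
	return compiled, nil
}

func main() {
	rules := []Rule{{
		Name:         "NonProdReadOnlyTrusts",
		RoleSelector: RoleSelector{NamePattern: "^ReadOnly-.*"},
	}}
	compiled, err := CompileRules(rules)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println(compiled[0].NameRegex.MatchString("ReadOnly-Billing"))
}
```

Failing fast here also means a typo in the YAML surfaces as one configuration error rather than as one error result per role.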

Building the Concurrent Scanning Engine

A naive implementation would list all roles and then, in a single loop, check each one. For an account with 5,000 roles, this would be painfully slow due to the latency of individual GetRole API calls. A concurrent worker pool pattern is essential for performance.

The architecture would look like this:

graph TD
    A[Main Goroutine] -- Start --> B(Role Lister);
    B -- Role ARN --> C{Role ARN Channel};
    A -- Start --> D1[Worker 1];
    A -- Start --> D2[Worker 2];
    A -- Start --> D3[Worker N...];
    C -- Role ARN --> D1;
    C -- Role ARN --> D2;
    C -- Role ARN --> D3;
    D1 -- Validation Result --> E{Results Channel};
    D2 -- Validation Result --> E;
    D3 -- Validation Result --> E;
    A -- Collects from --> E;
    A -- Wait for all workers --> F[Report Generation];

The implementation starts in main.go. It handles loading the application config and the AWS SDK configuration, creating the IAM client, and orchestrating the channels and goroutines.

main.go:

package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"sync"

	awsconfig "github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/iam"
	"github.com/aws/aws-sdk-go-v2/service/iam/types"

	"iam-auditor/auditor"
	"iam-auditor/config"
)

const (
	// Number of concurrent workers to process roles.
	// In a real-world project, this should be configurable.
	numWorkers = 20
)

func main() {
	// --- 1. Initialization ---
	ctx := context.Background()
	cfg, err := awsconfig.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatalf("failed to load AWS configuration: %v", err)
	}

	appConfig, err := config.LoadConfig("config/config.yml")
	if err != nil {
		log.Fatalf("failed to load application config: %v", err)
	}

	iamClient := iam.NewFromConfig(cfg)

	// --- 2. Setup Concurrent Pipeline ---
	// rolesChan will stream roles from the lister to the workers.
	rolesChan := make(chan types.Role, numWorkers)
	// resultsChan will collect validation results from workers.
	resultsChan := make(chan auditor.ValidationResult, 100)
	var wg sync.WaitGroup

	// --- 3. Start Workers ---
	log.Printf("Starting %d workers to audit roles...", numWorkers)
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go auditor.Worker(ctx, &wg, i+1, iamClient, appConfig.Rules, rolesChan, resultsChan)
	}

	// --- 4. Start Role Lister ---
	// This goroutine lists all roles and sends them to the rolesChan.
	// It closes the channel when it's done.
	go func() {
		defer close(rolesChan)
		if err := listAndQueueRoles(ctx, iamClient, rolesChan); err != nil {
			log.Printf("Error listing roles: %v", err)
		}
	}()

	// --- 5. Collect Results ---
	// This goroutine waits for all workers to finish then closes the results channel.
	go func() {
		wg.Wait()
		close(resultsChan)
	}()

	// --- 6. Process and Display Results ---
	var nonCompliantResults []auditor.ValidationResult
	log.Println("Waiting for audit results...")
	for result := range resultsChan {
		if !result.IsCompliant {
			nonCompliantResults = append(nonCompliantResults, result)
		}
	}

	if len(nonCompliantResults) > 0 {
		fmt.Println("\n--- AUDIT FAILED: NON-COMPLIANT ROLES FOUND ---")
		for _, result := range nonCompliantResults {
			fmt.Printf("\n[!] Role: %s (Rule: %s)\n", result.RoleName, result.RuleName)
			fmt.Printf("    Reason: %s\n", result.Reason)
			fmt.Printf("    Violating Principals: %v\n", result.ViolatingPrincipals)
		}
		os.Exit(1) // Exit with a non-zero code for CI/CD integration.
	}

	fmt.Println("\n--- AUDIT PASSED: All checked roles are compliant. ---")
}

// listAndQueueRoles handles pagination for iam.ListRoles and sends each role to a channel.
func listAndQueueRoles(ctx context.Context, client *iam.Client, rolesChan chan<- types.Role) error {
	paginator := iam.NewListRolesPaginator(client, &iam.ListRolesInput{})
	pageCount := 0
	roleCount := 0
	for paginator.HasMorePages() {
		page, err := paginator.NextPage(ctx)
		if err != nil {
			return fmt.Errorf("failed to retrieve page %d of roles: %w", pageCount, err)
		}
		for _, role := range page.Roles {
			rolesChan <- role
			roleCount++
		}
		pageCount++
	}
	log.Printf("Finished listing. Found %d roles across %d pages.", roleCount, pageCount)
	return nil
}
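One gap in the pipeline above is that a listing failure is only logged, so a run that dies on page three can still end with "AUDIT PASSED". A minimal sketch of one way to surface that error follows. It uses plain strings instead of IAM roles so it runs without AWS; `produce` stands in for listAndQueueRoles, and `runPipeline` and the `failAfter` parameter are hypothetical names for illustration only.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// produce stands in for listAndQueueRoles: it queues items and can be told
// to fail before sending the item at index failAfter (-1 means never fail).
func produce(items []string, failAfter int, out chan<- string) error {
	for i, it := range items {
		if failAfter >= 0 && i == failAfter {
			return errors.New("simulated pagination failure")
		}
		out <- it
	}
	return nil
}

// runPipeline (hypothetical) returns how many items the workers processed
// and the lister's error, if any.
func runPipeline(items []string, failAfter, workers int) (int, error) {
	work := make(chan string, workers)
	results := make(chan string, workers)
	listErr := make(chan error, 1) // buffered so the lister never blocks on it

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for w := range work {
				results <- w
			}
		}()
	}

	go func() {
		defer close(work)
		listErr <- produce(items, failAfter, work)
	}()

	go func() {
		wg.Wait()
		close(results)
	}()

	processed := 0
	for range results {
		processed++
	}
	// The lister has already sent its result by the time results is closed.
	return processed, <-listErr
}

func main() {
	n, err := runPipeline([]string{"role-a", "role-b", "role-c"}, 2, 2)
	if err != nil {
		fmt.Printf("audit incomplete after %d roles: %v\n", n, err)
		return
	}
	fmt.Printf("audited %d roles\n", n)
}
```

In the real tool the caller would presumably exit non-zero on a listing error, so that a partial scan can never be mistaken for a clean one.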

The Core Logic: The Auditor and Policy Parsing

This is where the real work happens. For each role it receives, the worker decodes and parses the trust policy, determines which rules (if any) apply to the role, and runs the validation logic for every applicable rule.

A common pitfall is that the AssumeRolePolicyDocument in the types.Role struct returned by ListRoles is a URL-encoded string. It must be decoded before it can be unmarshalled from JSON.

auditor/worker.go:

package auditor

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"net/url"
	"regexp"
	"sync"

	"github.com/aws/aws-sdk-go-v2/service/iam"
	"github.com/aws/aws-sdk-go-v2/service/iam/types"

	"iam-auditor/config"
)

// ValidationResult holds the outcome of a single role audit against a single rule.
type ValidationResult struct {
	RoleName            string
	RuleName            string
	IsCompliant         bool
	Reason              string
	ViolatingPrincipals []string
}

// Worker function executed by each goroutine.
func Worker(ctx context.Context, wg *sync.WaitGroup, id int, client *iam.Client, rules []config.Rule, rolesChan <-chan types.Role, resultsChan chan<- ValidationResult) {
	defer wg.Done()
	log.Printf("Worker %d started", id)
	for role := range rolesChan {
		processRole(ctx, client, role, rules, resultsChan)
	}
	log.Printf("Worker %d finished", id)
}

// processRole contains the logic for auditing a single IAM role.
func processRole(ctx context.Context, client *iam.Client, role types.Role, rules []config.Rule, resultsChan chan<- ValidationResult) {
	roleName := *role.RoleName

	// The AssumeRolePolicyDocument from ListRoles is URL-encoded.
	decodedPolicy, err := url.QueryUnescape(*role.AssumeRolePolicyDocument)
	if err != nil {
		resultsChan <- ValidationResult{
			RoleName:    roleName,
			IsCompliant: false,
			Reason:      fmt.Sprintf("Failed to decode trust policy document: %v", err),
		}
		return
	}

	var policyDocument TrustPolicyDocument
	if err := json.Unmarshal([]byte(decodedPolicy), &policyDocument); err != nil {
		resultsChan <- ValidationResult{
			RoleName:    roleName,
			IsCompliant: false,
			Reason:      fmt.Sprintf("Failed to unmarshal trust policy JSON: %v", err),
		}
		return
	}

	for _, rule := range rules {
		// This check can be expensive, a real implementation might pre-fetch tags for all roles in bulk.
		isApplicable, err := doesRuleApply(ctx, client, &role, &rule)
		if err != nil {
			resultsChan <- ValidationResult{
				RoleName:    roleName,
				RuleName:    rule.Name,
				IsCompliant: false,
				Reason:      fmt.Sprintf("Error checking rule applicability: %v", err),
			}
			continue
		}

		if isApplicable {
			validateTrustPolicy(roleName, &policyDocument, &rule, resultsChan)
		}
	}
}

// doesRuleApply checks if a given rule's selector matches the role.
func doesRuleApply(ctx context.Context, client *iam.Client, role *types.Role, rule *config.Rule) (bool, error) {
	// Match by name pattern if specified
	if rule.RoleSelector.NamePattern != "" {
		matched, err := regexp.MatchString(rule.RoleSelector.NamePattern, *role.RoleName)
		if err != nil {
			return false, fmt.Errorf("invalid regex pattern in rule '%s': %w", rule.Name, err)
		}
		if !matched {
			return false, nil // Does not match, so rule is not applicable
		}
	}

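	// Match by tags if specified
	if len(rule.RoleSelector.Tags) > 0 {
		// ListRoles does not populate Tags, so fetch the full role to read them.
		out, err := client.GetRole(ctx, &iam.GetRoleInput{RoleName: role.RoleName})
		if err != nil {
			return false, fmt.Errorf("failed to get role '%s': %w", *role.RoleName, err)
		}

		roleTags := make(map[string]string, len(out.Role.Tags))
		for _, tag := range out.Role.Tags {
			roleTags[*tag.Key] = *tag.Value
		}

		for key, value := range rule.RoleSelector.Tags {
			if roleTags[key] != value {
				return false, nil // Tag doesn't match, rule is not applicable
			}
		}
	}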

	// If we passed all checks, the rule applies.
	return true, nil
}

The TrustPolicyDocument struct and its validation logic are the most critical parts. The Principal field in an IAM policy is notoriously flexible: it can be the wildcard string "*", or a map from principal type (AWS, Service, Federated) to either a single string or an array of strings. Our parsing logic must handle this polymorphism correctly.

auditor/policy.go:

package auditor

import (
	"encoding/json"
	"fmt"
	"reflect"

	"iam-auditor/config"
)

// TrustPolicyDocument represents the structure of an IAM trust policy.
type TrustPolicyDocument struct {
	Version   string
	Statement []StatementEntry
}

type StatementEntry struct {
	Effect    string
	Principal PrincipalEntry
	Action    interface{} // Action can be a string or an array of strings
}

// PrincipalEntry is the tricky part. It can be the string "*" or a map whose
// values are a single string or an array of strings.
// We use a custom unmarshaler to handle this polymorphism.
type PrincipalEntry struct {
	Principals map[string][]string // e.g., "AWS": ["arn1", "arn2"], "Service": ["ec2.amazonaws.com"]
}

// UnmarshalJSON implements custom unmarshaling for the Principal field.
func (p *PrincipalEntry) UnmarshalJSON(data []byte) error {
	p.Principals = make(map[string][]string)

	var raw interface{}
	if err := json.Unmarshal(data, &raw); err != nil {
		return err
	}

	switch v := raw.(type) {
	case string:
		// e.g. "Principal": "*"
		// This is a wildcard, we represent it with a special key.
		p.Principals["Wildcard"] = []string{v}
	case map[string]interface{}:
		// e.g. "Principal": { "AWS": "arn:...", "Service": "ec2.amazonaws.com" }
		for key, value := range v {
			switch val := value.(type) {
			case string:
				p.Principals[key] = append(p.Principals[key], val)
			case []interface{}:
				for _, item := range val {
					if strItem, ok := item.(string); ok {
						p.Principals[key] = append(p.Principals[key], strItem)
					}
				}
			}
		}
	default:
		return fmt.Errorf("unsupported Principal format: %v", reflect.TypeOf(v))
	}
	return nil
}

// validateTrustPolicy executes the core validation logic.
func validateTrustPolicy(roleName string, policy *TrustPolicyDocument, rule *config.Rule, resultsChan chan<- ValidationResult) {
	allowedSet := make(map[string]struct{}, len(rule.TrustPolicyValidation.AllowedPrincipals))
	for _, p := range rule.TrustPolicyValidation.AllowedPrincipals {
		allowedSet[p] = struct{}{}
	}

	var violatingPrincipals []string
	isCompliant := true

	for _, statement := range policy.Statement {
		// Only "Allow" statements can grant access, so anything else is skipped.
		// A real-world tool would also check that the Action includes sts:AssumeRole.
		// For brevity, we focus on the principal.
		if statement.Effect != "Allow" {
			continue
		}

		for _, principalList := range statement.Principal.Principals {
			for _, principal := range principalList {
				if _, ok := allowedSet[principal]; !ok {
					// Found a principal that is not in the allowed list.
					isCompliant = false
					violatingPrincipals = append(violatingPrincipals, principal)
				}
			}
		}
	}

	// Only failures are reported. Compliant results could also be sent here
	// for verbose logging.
	if !isCompliant {
		resultsChan <- ValidationResult{
			RoleName:            roleName,
			RuleName:            rule.Name,
			IsCompliant:         false,
			Reason:              "Role trust policy contains principals not in the allowed list.",
			ViolatingPrincipals: violatingPrincipals,
		}
	}
}
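The Action check that the comment in validateTrustPolicy defers could look roughly like the sketch below. It flattens the Action field, which decodes as either a string or a list, and looks for anything that lets a principal assume the role. The helper names `actionList` and `grantsAssumeRole` are hypothetical. The matching is deliberately narrow: it recognises `*`, `sts:*` and actions beginning with `sts:AssumeRole` (compared case-insensitively), and it does not expand other wildcard patterns such as `sts:Assume*`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// actionList (hypothetical) flattens the Action field, which JSON decoding
// yields as either a string or a []interface{} of strings.
func actionList(action interface{}) []string {
	switch v := action.(type) {
	case string:
		return []string{v}
	case []interface{}:
		out := make([]string, 0, len(v))
		for _, item := range v {
			if s, ok := item.(string); ok {
				out = append(out, s)
			}
		}
		return out
	}
	return nil
}

// grantsAssumeRole (hypothetical) reports whether any listed action is "*",
// "sts:*", or starts with "sts:AssumeRole", ignoring case.
func grantsAssumeRole(action interface{}) bool {
	for _, a := range actionList(action) {
		a = strings.ToLower(a)
		if a == "*" || a == "sts:*" || strings.HasPrefix(a, "sts:assumerole") {
			return true
		}
	}
	return false
}

func main() {
	var stmt struct{ Action interface{} }
	input := `{"Action": ["sts:TagSession", "sts:AssumeRoleWithSAML"]}`
	if err := json.Unmarshal([]byte(input), &stmt); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(grantsAssumeRole(stmt.Action))
}
```

A statement for which this returns false could then be skipped in the validation loop, in the same way non-Allow statements already are.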

Testing the Logic

For a tool like this, unit tests are not optional. They are the only way to ensure our complex parsing and validation logic works correctly across all the edge cases of IAM policy syntax. A table-driven test is a clean way to cover multiple scenarios.

auditor/policy_test.go:

package auditor

import (
	"encoding/json"
	"testing"
)

func TestPrincipalEntry_UnmarshalJSON(t *testing.T) {
	testCases := []struct {
		name            string
		jsonInput       string
		expectedAWS     []string
		expectedService []string
		expectError     bool
	}{
		{
			name:      "Simple AWS Principal",
			jsonInput: `{ "AWS": "arn:aws:iam::123456789012:root" }`,
			expectedAWS: []string{"arn:aws:iam::123456789012:root"},
		},
		{
			name:      "Simple Service Principal",
			jsonInput: `{ "Service": "ec2.amazonaws.com" }`,
			expectedService: []string{"ec2.amazonaws.com"},
		},
		{
			name:        "Array of AWS Principals",
			jsonInput:   `{ "AWS": ["arn:aws:iam::123456789012:root", "arn:aws:iam::111122223333:root"] }`,
			expectedAWS: []string{"arn:aws:iam::123456789012:root", "arn:aws:iam::111122223333:root"},
		},
		{
			name:      "Mixed Principals",
			jsonInput: `{ "AWS": "arn:aws:iam::123456789012:root", "Service": "lambda.amazonaws.com" }`,
			expectedAWS: []string{"arn:aws:iam::123456789012:root"},
			expectedService: []string{"lambda.amazonaws.com"},
		},
		{
			name:        "Invalid Principal format",
			jsonInput:   `[ "arn:aws:iam::123456789012:root" ]`, // Top-level array not allowed
			expectError: true,
		},
	}

	for _, tc := range testCases {
		t.Run(tc.name, func(t *testing.T) {
			var p PrincipalEntry
			err := json.Unmarshal([]byte(tc.jsonInput), &p)

			if (err != nil) != tc.expectError {
				t.Fatalf("expected error: %v, got: %v", tc.expectError, err)
			}

			if !tc.expectError {
				if !equalSlices(p.Principals["AWS"], tc.expectedAWS) {
					t.Errorf("expected AWS principals %v, got %v", tc.expectedAWS, p.Principals["AWS"])
				}
				if !equalSlices(p.Principals["Service"], tc.expectedService) {
					t.Errorf("expected Service principals %v, got %v", tc.expectedService, p.Principals["Service"])
				}
			}
		})
	}
}

// Helper function to compare string slices
func equalSlices(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i, v := range a {
		if v != b[i] {
			return false
		}
	}
	return true
}
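One more edge case deserves a test of its own. In a Principal element, AWS accepts a bare 12-digit account ID as equivalent to that account's root ARN, so an exact string comparison against the allowed list would flag `111122223333` even when `arn:aws:iam::111122223333:root` is allowed. The sketch below shows one possible normalization step. The helper `normalizeAWSPrincipal` is a hypothetical name, and the sketch assumes the standard `aws` partition rather than GovCloud or China regions.

```go
package main

import (
	"fmt"
	"regexp"
)

// accountIDPattern matches a bare 12-digit AWS account ID.
var accountIDPattern = regexp.MustCompile(`^\d{12}$`)

// normalizeAWSPrincipal (hypothetical) rewrites a bare account ID to its
// root ARN form and returns every other principal unchanged.
// Assumption: the standard "aws" partition.
func normalizeAWSPrincipal(p string) string {
	if accountIDPattern.MatchString(p) {
		return "arn:aws:iam::" + p + ":root"
	}
	return p
}

func main() {
	inputs := []string{
		"111122223333",
		"arn:aws:iam::111122223333:root",
		"ec2.amazonaws.com",
		"*",
	}
	for _, in := range inputs {
		fmt.Printf("%-32s -> %s\n", in, normalizeAWSPrincipal(in))
	}
}
```

Applying the same function to both the allowed list and the principals found in the policy would keep the comparison symmetric.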

This tool, built over a couple of days, transformed our IAM posture management. Integrated into our CI/CD pipeline, it runs nightly against all accounts, failing the pipeline if any new non-compliant roles are detected. It’s a concrete implementation of “Compliance as Code,” shifting security from a reactive, manual audit process to a proactive, automated guardrail.

The current implementation, however, is not without its limitations. It only inspects role trust policies, ignoring the equally critical identity-based policies attached to the role and the resource-based policies on other resources, which together determine the permissions actually granted. Because ListRoles does not return tags, the check for rule applicability relies on a GetRole call to fetch tags for each role individually, which can be inefficient; a more optimized version would list all roles first and then perform a batch tag lookup. Furthermore, the rule engine logic is hardcoded in Go; a future iteration could leverage a dedicated policy language like Rego (from Open Policy Agent) to allow for more complex and dynamic rule definitions without requiring a recompile of the tool itself.

