Engineering Dynamic Service Topology Observability from a Puppet-Managed Monolith Using Consul Connect, Grafana, and a Test-Driven React Interface


The outage began, as they often do, with a seemingly innocuous change. A Puppet-driven configuration update modified a connection string for a tertiary reporting database. What we didn’t know was that a critical, undocumented payment processing service had developed a parasitic dependency on that database’s host for an obscure health check. When the host was decommissioned post-migration, the payment service fell over. The resulting scramble involved three teams, six hours of downtime, and a painful realization: we had no idea how our own systems were truly wired together. Our architecture diagram was a fantasy, a relic of a design meeting from three years ago. The ground truth was locked away in a maze of implicit dependencies and network rules, all managed by a sprawling, decade-old Puppet repository.

The immediate post-mortem goal was clear: achieve real-time, dynamic service topology visibility. Static diagrams were dead to us. We needed a living map, one that reflected reality, not intention. Our initial thought was to instrument everything with a commercial APM tool, but the cost was prohibitive and the integration with our legacy stack, a mix of Java monoliths and Perl scripts, was a nightmare. The second idea was to build a dependency discovery tool ourselves, perhaps by parsing network traffic or Puppet code. This path felt like building a solution as complex as the problem itself.

This led us to the concept of a service mesh. If all service-to-service communication was forced through a proxy that we controlled, we could not only map the traffic but also secure it. HashiCorp’s Consul Connect was the chosen tool. Its integration with our existing Consul service discovery was a natural fit, and its promise of automatic mTLS was a security win we could sell to management. The key challenge, however, was architectural. We couldn’t afford a big-bang migration. We had to introduce the service mesh into our Puppet-managed environment, service by service, without disrupting operations. This is the log of how we did it.

Our first task was deploying the Consul agent to every virtual machine in a consistent, repeatable manner. Puppet was our source of truth for infrastructure configuration, so the integration had to start there. We created a new consul profile in our Puppet codebase to handle the installation, configuration, and service lifecycle of the agent.

A common mistake in such rollouts is to create a monolithic configuration. Instead, we designed the Puppet profile to be highly modular, allowing different node types (e.g., application servers, database nodes) to receive slightly different configurations.

Here is a simplified version of the core consul::install, consul::config, and consul::service classes in our Puppet manifest, along with the profile that ties them together.

# modules/consul/manifests/install.pp
class consul::install {
  # We manage our own package repository, hence the custom source.
  # In a real-world project, ensure the package version is pinned.
  package { 'consul':
    ensure => '1.14.3', # Pinning versions is critical for stability
    source => "https://our-internal-repo.corp/packages/consul-1.14.3-amd64.deb",
    provider => 'dpkg',
  }

  # Ensure the directory structure exists with correct permissions.
  # Puppet's file resource is idempotent by nature.
  file { '/etc/consul.d':
    ensure => directory,
    owner  => 'consul',
    group  => 'consul',
    mode   => '0755',
  }

  file { '/opt/consul/data':
    ensure => directory,
    owner  => 'consul',
    group  => 'consul',
    mode   => '0750',
  }
}

# modules/consul/manifests/config.pp
class consul::config(
  String $node_name = $facts['fqdn'],
  String $datacenter = 'dc1',
  Array[String] $retry_join,
  String $bind_addr = $facts['networking']['ip'],
) {
  # The main Consul agent configuration. This is where we define its identity
  # and how it connects to the cluster. Hiera data will populate the parameters.
  $config_hash = {
    'datacenter'    => $datacenter,
    'data_dir'      => '/opt/consul/data',
    'log_level'     => 'INFO',
    'node_name'     => $node_name,
    'server'        => false, # These are all client agents
    'bind_addr'     => $bind_addr,
    'client_addr'   => '127.0.0.1', # Only listen on localhost for security
    'retry_join'    => $retry_join,
    'enable_script_checks' => true, # needed for script checks; the narrower enable_local_script_checks is safer where possible
    'connect' => {
      'enabled' => true # This is the key to enable Consul Connect
    }
  }

  # Use stdlib's to_json_pretty function to ensure clean, readable output
  file { '/etc/consul.d/consul.json':
    ensure  => file,
    owner   => 'consul',
    group   => 'consul',
    mode    => '0644',
    content => to_json_pretty($config_hash),
    notify  => Service['consul'], # Restart service if this file changes
  }
}

# modules/consul/manifests/service.pp
class consul::service {
  service { 'consul':
    ensure  => running,
    enable  => true,
    require => [
      Class['consul::install'],
      Class['consul::config'],
    ],
  }
}

# The main profile that includes everything.
class profile::consul_agent {
  # Use Hiera to lookup cluster-specific configurations
  $retry_join_servers = hiera('consul::retry_join_servers')

  class { 'consul::install': }
  class { 'consul::config':
    retry_join => $retry_join_servers,
  }
  class { 'consul::service': }
}

This Puppet code ensures that on every agent run, the Consul binary is present, the configuration is exactly as defined, and the service is running. The notify metaparameter is crucial; it creates a dependency so that if we push a change to consul.json via Puppet, the service is automatically restarted to apply it. The connect: { enabled: true } stanza is the explicit opt-in to the service mesh functionality.
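
The hiera('consul::retry_join_servers') call in the profile resolves against our Hiera data, which is not shown above. Here is a minimal sketch of what that data can look like; the file paths and server names are illustrative, and Puppet's automatic class parameter lookup lets individual layers override parameters such as the datacenter, which is how different node groups receive slightly different configurations:

# hieradata/common.yaml (illustrative paths and hostnames)
consul::retry_join_servers:
  - 'consul-server-01.corp'
  - 'consul-server-02.corp'
  - 'consul-server-03.corp'

# hieradata/datacenter/dc2.yaml
# Automatic class parameter lookup allows per-layer overrides of defaults.
consul::config::datacenter: 'dc2'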

With agents deployed, the next step was to make our existing services known to Consul. We couldn’t modify the application code directly, at least not at first. The path of least resistance was using Puppet to drop service definition files into the /etc/consul.d/ directory.

Let’s take our problematic payment processor, payment-api, and its database dependency, payment-db.

# In the profile for the payment-api server
class profile::payment_api {
  # ... other application configuration ...

  # Define the Consul service for the API itself
  $api_service_def = {
    'service' => {
      'name' => 'payment-api',
      'port' => 8080,
      'tags' => ['java', 'api', 'production'],
      'check' => {
        'id'       => 'payment-api-health',
        'name'     => 'HTTP Health Check on port 8080',
        'http'     => 'http://localhost:8080/health',
        'method'   => 'GET',
        'interval' => '10s',
        'timeout'  => '2s',
      }
    }
  }

  file { '/etc/consul.d/payment-api-service.json':
    ensure  => file,
    owner   => 'consul',
    group   => 'consul',
    mode    => '0644',
    content => to_json_pretty($api_service_def),
    notify  => Service['consul'],
  }
}


# In the profile for the database server
class profile::payment_db {
  # ... other database configuration ...

  # Define the Consul service for the database
  $db_service_def = {
    'service' => {
      'name' => 'payment-db',
      'port' => 5432,
      'tags' => ['postgres', 'primary'],
      # A simple TCP check is often the best we can do for a legacy DB
      'check' => {
        'id'       => 'payment-db-tcp-check',
        'name'     => 'TCP check on port 5432',
        'tcp'      => "${facts['networking']['ip']}:5432",
        'interval' => '15s',
        'timeout'  => '3s',
      }
    }
  }

  file { '/etc/consul.d/payment-db-service.json':
    ensure  => file,
    owner   => 'consul',
    group   => 'consul',
    mode    => '0644',
    content => to_json_pretty($db_service_def),
    notify  => Service['consul'],
  }
}

After applying this Puppet code, our Consul UI immediately populated with the new services, and their health checks began reporting status. This was step one of visibility: we now knew what services were supposed to be running and whether they were up or down. But we still didn’t know who was talking to whom.

This is where Consul Connect comes into play. The goal is to force the payment-api to talk to payment-db through a Connect-managed sidecar proxy (Envoy, under the hood), rather than directly over the network.

First, we augment the payment-api service definition to declare its upstream dependency, telling the mesh which other services it needs to reach. (Consul's "intentions", the allow/deny rules that authorize traffic between services, are a separate concept and not shown here.)

# In profile::payment_api, update the service definition
$api_service_def = {
  'service' => {
    'name' => 'payment-api',
    'port' => 8080,
    # ... existing tags and check ...

    # This is the new, critical part
    'connect' => {
      'sidecar_service' => {
        'proxy' => {
          # Declare that this service needs to talk to 'payment-db'
          'upstreams' => [
            {
              'destination_name' => 'payment-db',
              'local_bind_port'  => 15432 # The API will connect to localhost:15432
            }
          ]
        }
      }
    }
  }
}
# ... rest of the file resource definition

When Consul reloads this configuration, it knows payment-api intends to speak with payment-db. The local_bind_port is the magic. Consul will start an Envoy proxy on the payment-api server that listens on port 15432. Any connection to this port will be securely proxied over mTLS to an Envoy proxy on the payment-db server, which then forwards it to the actual PostgreSQL process on localhost:5432.

The second part of the magic is registering a sidecar proxy for the payment-db service so it can accept Connect traffic. For our legacy, non-Connect-aware database, we simply define an empty sidecar_service.

# In profile::payment_db, update the service definition
$db_service_def = {
  'service' => {
    'name' => 'payment-db',
    'port' => 5432,
    # ... existing tags and check ...

    'connect' => {
      # This registers the service with Connect, allowing it to receive inbound connections
      'sidecar_service' => {}
    }
  }
}
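
Registering a sidecar_service only tells Consul that a proxy should exist for the service; the Envoy process itself still has to be launched on each host (and, depending on the Consul version, the agent's gRPC port enabled so Envoy can fetch its configuration). Here is a minimal sketch of how that can be wired up with Puppet and systemd; the unit name, user, and paths are assumptions for this example, and we did the equivalent for the payment-api sidecar:

# Illustrative sketch: unit name, user, and paths are assumptions.
class profile::payment_db_sidecar {
  # Assumes the envoy binary is already installed on the host.
  $unit = @(UNIT)
    [Unit]
    Description=Consul Connect sidecar proxy for payment-db
    After=consul.service
    Requires=consul.service

    [Service]
    User=consul
    ExecStart=/usr/bin/consul connect envoy -sidecar-for payment-db
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target
    | UNIT

  file { '/etc/systemd/system/payment-db-sidecar.service':
    ensure  => file,
    mode    => '0644',
    content => $unit,
    # A systemctl daemon-reload may also be needed when the unit changes.
  }

  service { 'payment-db-sidecar':
    ensure    => running,
    enable    => true,
    require   => Service['consul'],
    subscribe => File['/etc/systemd/system/payment-db-sidecar.service'],
  }
}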

Finally, we needed a small change to the payment-api application's configuration. This was unavoidable. Instead of connecting to a DNS name like payment-db.service.consul, it now had to connect to 127.0.0.1:15432. This change was also managed and deployed via Puppet, ensuring consistency.
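
As a concrete sketch, assuming the API reads a flat properties file (the path, property key, and database name below are illustrative, not our real values), the Puppet side can be as small as a single file_line resource from stdlib:

# Illustrative only: path, key, and database name are assumptions.
class profile::payment_api::db_connection {
  # Point the application at the local Envoy listener; the sidecar
  # forwards the connection to payment-db over mTLS.
  file_line { 'payment-api datasource url':
    path   => '/opt/payment-api/conf/application.properties',
    line   => 'db.url=jdbc:postgresql://127.0.0.1:15432/payments',
    match  => '^db\.url=',
    notify => Service['payment-api'], # assumes the app service is Puppet-managed
  }
}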

The flow now looks like this:

graph TD
    A[Payment API Application] -- "Connect to 127.0.0.1:15432" --> B(Envoy Sidecar on API Host);
    B -- "mTLS Tunnel" --> C(Envoy Sidecar on DB Host);
    C -- "Connect to 127.0.0.1:5432" --> D(PostgreSQL Database);

The moment we deployed this change, we gained two things: all traffic between these services was now encrypted, and more importantly for our goal, the Envoy proxies began emitting a wealth of telemetry.

The Envoy proxies expose a Prometheus-compatible metrics endpoint. We reconfigured our existing Prometheus setup, also managed by Puppet, to scrape these new endpoints.

# Simplified prometheus.yml snippet
# This job discovers Consul-managed sidecar proxies and scrapes them.
- job_name: 'consul-envoy-sidecars'
  consul_sd_configs:
    - server: 'localhost:8500'
  relabel_configs:
    # Find services that are Connect proxies
    - source_labels: ['__meta_consul_service']
      regex: '.*-sidecar-proxy'
      action: keep
    # Use the parent service name as the job label
    - source_labels: ['__meta_consul_service_id']
      regex: '(.*)-sidecar-proxy.*'
      replacement: '$1'
      target_label: 'service'
    # Set the scrape address
    - source_labels: ['__meta_consul_agent_address']
      replacement: '${1}:19001' # The port on which our sidecars expose Envoy's Prometheus metrics
      target_label: '__address__'

With data flowing into Prometheus, we turned to Grafana. We built a new dashboard titled “Service Mesh Topology”. The core of this dashboard was powered by a few key PromQL queries that visualized the connections we’d just created.

Query 1: Upstream Request Rate (RPS)
This query shows how many requests per second payment-api is sending to its upstream, payment-db.

# Rate of all requests from a source service to a destination cluster (upstream)
sum(rate(envoy_cluster_upstream_rq_total{
  envoy_cluster_name="payment-db",
  service="payment-api"
}[5m])) by (envoy_cluster_name, service)

Query 2: Upstream Request Success Rate
This shows the percentage of successful (non-5xx) responses. For a plain TCP upstream like the database, Envoy reports connection-level statistics rather than HTTP status codes, but the principle of deriving a success rate is the same.

# Percentage of non-5xx responses from an upstream
(
  sum(rate(envoy_cluster_upstream_rq_completed{envoy_cluster_name="payment-db", service="payment-api"}[5m]))
  -
  sum(rate(envoy_cluster_upstream_rq_xx{envoy_response_code_class="5", envoy_cluster_name="payment-db", service="payment-api"}[5m]))
)
/
sum(rate(envoy_cluster_upstream_rq_completed{envoy_cluster_name="payment-db", service="payment-api"}[5m]))
* 100

Grafana’s “Node Graph” panel, when fed this data, drew a line from a payment-api node to a payment-db node, with the thickness and color of the line determined by the request rate and success rate. We finally had it: a live, data-driven visualization of a service dependency.

While Grafana was excellent for deep-diving into metrics, our on-call engineers needed something simpler: a high-level, opinionated dashboard to answer “what’s broken?” during an incident. We decided to build a small, internal React application that would directly query the Consul HTTP API for health status and service metadata.

The challenge with building UIs against dynamic, infrastructure-level APIs is testing. How do you reliably test a component that displays service health when the actual health of the service in your CI environment is unpredictable? Mocking is the only sane answer. We used React Testing Library for its philosophy of testing from the user’s perspective and Mock Service Worker (MSW) to intercept API calls and provide controlled, predictable responses.

Here’s the React component for displaying a single service’s status.

// src/components/ServiceStatusCard.js
import React, { useState, useEffect } from 'react';

const ServiceStatusCard = ({ serviceName }) => {
  const [status, setStatus] = useState('loading');
  const [instances, setInstances] = useState([]);
  const [error, setError] = useState(null);

  useEffect(() => {
    const fetchServiceHealth = async () => {
      setStatus('loading');
      setError(null);
      try {
        // In a real app, this URL would be configured.
        const response = await fetch(`/v1/health/service/${serviceName}`);
        if (!response.ok) {
          throw new Error(`Consul API returned ${response.status}`);
        }
        const data = await response.json();
        
        if (data.length === 0) {
            setStatus('empty');
            setInstances([]);
            return;
        }

        const passingChecks = data.filter(node => 
          node.Checks.every(check => check.Status === 'passing')
        );

        setStatus(passingChecks.length === data.length ? 'healthy' : 'unhealthy');
        setInstances(data);
      } catch (e) {
        setError(e.message);
        setStatus('error');
      }
    };

    fetchServiceHealth();
  }, [serviceName]);

  const renderStatusBadge = () => {
    switch (status) {
      case 'loading':
        return <span className="badge loading">Loading...</span>;
      case 'healthy':
        return <span className="badge healthy">Healthy</span>;
      case 'unhealthy':
        return <span className="badge unhealthy">Unhealthy</span>;
      case 'empty':
        return <span className="badge empty">No Instances</span>;
      case 'error':
        return <span className="badge error">Error</span>;
      default:
        return null;
    }
  };

  return (
    <div className="service-card" data-testid={`service-card-${serviceName}`}>
      <div className="card-header">
        <h3>{serviceName}</h3>
        {renderStatusBadge()}
      </div>
      {status === 'error' && <p className="error-message">Failed to fetch status: {error}</p>}
      {instances.length > 0 && (
        <ul>
          {instances.map(instance => (
            <li key={instance.Service.ID}>
              Node: {instance.Node.Node} | Addr: {instance.Service.Address}:{instance.Service.Port}
            </li>
          ))}
        </ul>
      )}
    </div>
  );
};

export default ServiceStatusCard;

Now, the crucial part: the test. We need to verify that this component correctly renders the “healthy,” “unhealthy,” “loading,” and “error” states based on the mocked API response.

First, we set up MSW to mock the Consul API endpoint.

// src/mocks/handlers.js
import { rest } from 'msw';

// Mock data representing a healthy service with two nodes.
const healthyPaymentApiResponse = [
  {
    Node: { Node: 'api-prod-01' },
    Service: { ID: 'payment-api-1', Service: 'payment-api', Address: '10.0.1.10', Port: 8080 },
    Checks: [{ Status: 'passing' }, { Status: 'passing' }]
  },
  {
    Node: { Node: 'api-prod-02' },
    Service: { ID: 'payment-api-2', Service: 'payment-api', Address: '10.0.1.11', Port: 8080 },
    Checks: [{ Status: 'passing' }]
  }
];

// Mock data representing a service where one node is failing.
export const unhealthyPaymentApiResponse = [
    {
      Node: { Node: 'api-prod-01' },
      Service: { ID: 'payment-api-1', Service: 'payment-api', Address: '10.0.1.10', Port: 8080 },
      Checks: [{ Status: 'passing' }]
    },
    {
      Node: { Node: 'api-prod-02' },
      Service: { ID: 'payment-api-2', Service: 'payment-api', Address: '10.0.1.11', Port: 8080 },
      Checks: [{ Status: 'critical' }] // This check is failing
    }
];

export const handlers = [
  rest.get('/v1/health/service/payment-api', (req, res, ctx) => {
    // Scenario switching via a query parameter is handy when exercising the
    // mock in a browser; our unit tests override this handler with server.use().
    const scenario = req.url.searchParams.get('scenario');
    
    if (scenario === 'unhealthy') {
      return res(ctx.status(200), ctx.json(unhealthyPaymentApiResponse));
    }
    
    if (scenario === 'serverError') {
        return res(ctx.status(500), ctx.json({ message: 'Internal Server Error' }));
    }

    if (scenario === 'empty') {
        return res(ctx.status(200), ctx.json([]));
    }

    // Default to healthy
    return res(ctx.status(200), ctx.json(healthyPaymentApiResponse));
  })
];

With the mock handlers defined, we can write a test file that validates every state of our component without ever hitting a real Consul server.

// src/components/ServiceStatusCard.test.js
import React from 'react';
import { render, screen, waitFor } from '@testing-library/react';
import { rest } from 'msw';
import { setupServer } from 'msw/node';
import { handlers, unhealthyPaymentApiResponse } from '../mocks/handlers';
import ServiceStatusCard from './ServiceStatusCard';

// Setup the mock server
const server = setupServer(...handlers);
beforeAll(() => server.listen());
afterEach(() => server.resetHandlers());
afterAll(() => server.close());

describe('ServiceStatusCard', () => {

  test('should initially show a loading state', () => {
    render(<ServiceStatusCard serviceName="payment-api" />);
    // The user sees "Loading..." while the fetch is in progress.
    expect(screen.getByText('Loading...')).toBeInTheDocument();
  });

  test('should display a healthy status and instance details on successful fetch', async () => {
    render(<ServiceStatusCard serviceName="payment-api" />);
    
    // React Testing Library's `findBy` queries are perfect for async operations.
    // It will wait for the "Healthy" badge to appear after the mock API resolves.
    const healthyBadge = await screen.findByText('Healthy');
    expect(healthyBadge).toBeInTheDocument();

    // Verify the instance details are rendered correctly.
    expect(screen.getByText(/Node: api-prod-01/i)).toBeInTheDocument();
    expect(screen.getByText(/Node: api-prod-02/i)).toBeInTheDocument();
  });

  test('should display an unhealthy status if any check is not passing', async () => {
    // We modify the server handler for this specific test case.
    server.use(
        rest.get('/v1/health/service/payment-api', (req, res, ctx) => {
            return res(ctx.json(unhealthyPaymentApiResponse)); // Using the unhealthy mock
        })
    );

    render(<ServiceStatusCard serviceName="payment-api" />);

    // We expect the component to correctly interpret the data and show "Unhealthy".
    expect(await screen.findByText('Unhealthy')).toBeInTheDocument();
    // It should still render the instance details for debugging.
    expect(screen.getByText(/Node: api-prod-01/i)).toBeInTheDocument();
  });
  
  test('should display an error state if the API fetch fails', async () => {
    server.use(
        rest.get('/v1/health/service/payment-api', (req, res, ctx) => {
            return res(ctx.status(500)); // Simulate a server error
        })
    );

    render(<ServiceStatusCard serviceName="payment-api" />);
    
    expect(await screen.findByText('Error')).toBeInTheDocument();
    expect(screen.getByText(/Failed to fetch status/i)).toBeInTheDocument();
  });
});

This test-driven approach gave us immense confidence in our internal tooling. We knew that no matter how Consul’s API responded, our UI would handle it gracefully. This was a far cry from the days of shipping untested internal scripts and hoping for the best.

This entire initiative, born from a painful outage, fundamentally shifted our operational posture. We moved from a state of architectural blindness to one of data-driven clarity. The integration of Consul Connect, managed by Puppet, gave us the mechanism for visibility. Prometheus and Grafana provided the deep analytical lens. And the test-driven React dashboard gave our engineers a reliable, purpose-built tool for incident response.

The journey is far from over. We’ve only migrated a fraction of our services to the mesh. The performance overhead of the Envoy sidecars, while minimal so far, requires constant monitoring as we scale. Our Puppet codebase itself is a form of technical debt that we may eventually replace with a more cloud-native approach. Furthermore, our React dashboard is still basic; future iterations will integrate distributed tracing information and provide operators with controls to shift traffic or view service-level configurations directly from the UI. But the foundation is now solid, built on the principle that observable, secure communication is not a feature, but a prerequisite for building resilient systems.

