The initial system architecture was straightforward, almost deceptively so. A fleet of RESTful API endpoints, built with Spring Boot, received a high volume of sensor event data. Each JSON payload was deserialized, validated, and then written directly to an HBase cluster. In staging, this worked flawlessly. In production, under spiky, unpredictable load, it was a cascading failure waiting to happen. The core pain point manifested during traffic bursts: the HBase `HTable` connections, managed by a connection pool, would become exhausted. API response times skyrocketed from 30ms to over 5000ms as threads waited for a connection. This latency tripped load balancer health checks, causing healthy nodes to be pulled from rotation, which only intensified the load on the remaining nodes. The HBase RegionServers themselves began showing increased GC pressure, and compaction queues backed up. Direct, synchronous coupling between the stateless web tier and the stateful storage tier was the root of the instability.
The decision to decouple was obvious. A message broker was required to act as a durable buffer, absorbing the traffic spikes and allowing the data to be written to HBase at a sustainable, controlled pace. We selected RabbitMQ. While Kafka is a common choice for high-throughput pipelines, our message payloads were relatively small, we didn’t require stream history or re-reading, and RabbitMQ’s per-message acknowledgement model and flexible routing capabilities provided the fine-grained control needed for this particular problem. The new architecture looked like this:
```mermaid
graph TD
    subgraph "API Tier (Stateless)"
        A[RESTful API Endpoint] -->|Publish JSON event| B(RabbitMQ Exchange);
    end
    B --> C{Events Queue};
    subgraph "Ingestion Service (Stateful)"
        D[Batching Consumer] -->|Consume messages| C;
        D -->|Write batched Puts| E(HBase Cluster);
    end
    subgraph "Storage Tier"
        E;
    end
```
The true engineering challenge, however, shifted from the API to the implementation of the Batching Consumer. A naive consumer would simply re-introduce the original problem: consuming one message and immediately writing it to HBase would generate the same write pressure, just asynchronously. The goal was to build a consumer that was not only efficient but also resilient to failure, guaranteeing data integrity without overwhelming the downstream database. This required a deep focus on three specific mechanics: batching, idempotency, and backpressure.
Phase 1: The First Pass - A Naive Batching Implementation
Our initial consumer implementation was built in Java using the official RabbitMQ and HBase client libraries. The core idea was to accumulate messages in an in-memory `List` until it reached a certain size (`BATCH_SIZE`) or a time threshold was met, and then write the entire batch to HBase.
```java
// WARNING: This initial implementation is flawed and for demonstration only.
public class NaiveBatchingConsumer extends DefaultConsumer {

    private final HTable hTable;
    private final List<Put> buffer = new ArrayList<>();
    private static final int BATCH_SIZE = 100;

    public NaiveBatchingConsumer(Channel channel, HTable hTable) {
        super(channel);
        this.hTable = hTable;
    }

    @Override
    public void handleDelivery(String consumerTag, Envelope envelope,
                               AMQP.BasicProperties properties, byte[] body) throws IOException {
        try {
            // Assume parseEvent creates an HBase Put object from the message body
            Put put = parseEvent(body);
            if (put != null) {
                buffer.add(put);
            }
            if (buffer.size() >= BATCH_SIZE) {
                hTable.put(buffer);
                buffer.clear();
                // Acknowledge all messages up to this one. DANGEROUS!
                getChannel().basicAck(envelope.getDeliveryTag(), true);
            }
        } catch (Exception e) {
            // What do we do here? Nack? Reject?
            // If we Nack, the entire batch might be redelivered and reprocessed.
            getChannel().basicNack(envelope.getDeliveryTag(), true, true);
        }
    }

    // A scheduled task would also need to run to flush incomplete batches.
}
```
This approach immediately presented critical flaws under real-world conditions.

- **Data Loss on Crash:** If the consumer process crashes after receiving messages but before the `hTable.put(buffer)` call completes, the entire in-memory `buffer` is gone. With manual acknowledgement, RabbitMQ will requeue the unacknowledged messages once the TCP connection drops, but any partially built batch state vanishes with the process; under auto-acknowledgement, the messages themselves would be lost forever.
- **Poison Pill Messages:** A single malformed message that causes `parseEvent` to throw an unhandled exception could halt processing. If the exception causes a `basicNack` with requeue, the message is put back at the head of the queue, and the consumer immediately fetches it again, creating an infinite loop of failures.
- **Vague Acknowledgement:** The call `basicAck(deliveryTag, true)` is problematic. The `multiple=true` flag acknowledges all messages up to and including the given `deliveryTag`. If the `hTable.put` call succeeded for a batch of 100 messages, this works. But what if the `put` call partially fails? HBase's batch put operation is not atomic. It can succeed for 50 `Put`s and fail for the other 50. A single `basicAck` acknowledges all 100, effectively losing the 50 that failed.
- **Duplicate Processing on Redelivery:** If the `hTable.put` succeeds but the process crashes before `basicAck` is called, RabbitMQ will redeliver the entire batch of messages to another consumer instance upon recovery. This will result in duplicate data being written to HBase.
This naive implementation highlighted that the core of the problem was state management and failure handling. The consumer must be able to handle process restarts and message redeliveries without data loss or duplication.
Phase 2: Engineering for Idempotency and Atomic Batching
To solve the data duplication issue, we needed to make our write operations idempotent. An idempotent operation is one that can be applied multiple times without changing the result beyond the initial application. In the context of HBase, writing the same cell (defined by row key, column family, column qualifier, and timestamp) with the same value multiple times is inherently idempotent. The challenge was ensuring our application logic produced idempotent `Put` operations.
The solution lies in the design of the HBase row key. Each incoming event message must contain a unique identifier. We designed our row key to be a composite of this unique event ID and the event timestamp.
Row Key Strategy: `[eventId:String]_[timestamp:long]`

For example: `evt-a7b3c8f1-e4f1-4f8a-b1c3-2d9f7a6b0c1e_1672531200000`
With this strategy, if a message for `eventId: evt-a7b3c8f1...` is delivered twice due to a network failure, the consumer will generate the exact same `Put` object with the exact same row key. The second write to HBase will simply overwrite the first one with identical data, causing no data corruption.
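To make this concrete, here is a minimal sketch of what the parser (referenced as `EventParser.parseEventToPut` in the consumer below) might look like. The JSON field names (`eventId`, `timestamp`, `value`) and the column family `cf` are illustrative assumptions, not our production schema.

```java
// Hypothetical sketch of EventParser; field names and the "cf" column
// family are illustrative assumptions, not the production schema.
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public final class EventParser {

    private static final ObjectMapper MAPPER = new ObjectMapper();
    private static final byte[] CF = Bytes.toBytes("cf");

    /** Returns null for malformed payloads so the consumer can reject them. */
    public static Put parseEventToPut(byte[] body) {
        try {
            JsonNode event = MAPPER.readTree(body);
            String eventId = event.get("eventId").asText();
            long timestamp = event.get("timestamp").asLong();

            // Deterministic row key: the same message always produces the
            // same row, so a redelivered duplicate overwrites identical data.
            Put put = new Put(Bytes.toBytes(eventId + "_" + timestamp));
            put.add(CF, Bytes.toBytes("value"), timestamp,
                    Bytes.toBytes(event.get("value").asText()));
            return put;
        } catch (Exception e) {
            return null; // Treat anything unparsable as a poison pill.
        }
    }
}
```

Note that pinning the cell timestamp to the event's own timestamp, rather than letting the RegionServer assign the write time, is what makes a redelivered write byte-for-byte identical to the original instead of creating a second cell version.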
Next, we addressed the batching and acknowledgement logic. We needed to move away from the simple in-memory list and create a more robust mechanism that ties acknowledgements directly to successful writes. We refactored the consumer to collect not just the `Put` objects but also their corresponding RabbitMQ delivery tags.
```java
import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.DefaultConsumer;
import com.rabbitmq.client.Envelope;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class IdempotentBatchingConsumer extends DefaultConsumer {

    private static final Logger logger = LoggerFactory.getLogger(IdempotentBatchingConsumer.class);

    // Configuration values, should be externalized
    private static final int BATCH_SIZE = 200;
    private static final long BATCH_TIMEOUT_MS = 5000;

    private final HTable hTable;
    private final List<BatchItem> batch;
    private final ReentrantLock lock = new ReentrantLock();
    private final ScheduledExecutorService scheduler;

    private static class BatchItem {
        final long deliveryTag;
        final Put putOperation;

        BatchItem(long deliveryTag, Put putOperation) {
            this.deliveryTag = deliveryTag;
            this.putOperation = putOperation;
        }
    }

    public IdempotentBatchingConsumer(Channel channel, HTable hTable) {
        super(channel);
        this.hTable = hTable;
        this.batch = new ArrayList<>(BATCH_SIZE);
        // Scheduler to flush batches on a timeout
        this.scheduler = Executors.newSingleThreadScheduledExecutor();
        this.scheduler.scheduleAtFixedRate(this::flushBatchOnTimeout,
                BATCH_TIMEOUT_MS, BATCH_TIMEOUT_MS, TimeUnit.MILLISECONDS);
    }

    @Override
    public void handleDelivery(String consumerTag, Envelope envelope,
                               AMQP.BasicProperties properties, byte[] body) throws IOException {
        long deliveryTag = envelope.getDeliveryTag();
        try {
            // Assume parseEventToPut is a robust method that returns null for poison pills
            Put put = EventParser.parseEventToPut(body);
            if (put == null) {
                logger.warn("Received a malformed message. Rejecting without requeue. DeliveryTag: {}", deliveryTag);
                // Reject poison pills so they don't clog the queue
                getChannel().basicReject(deliveryTag, false);
                return;
            }
            lock.lock();
            try {
                batch.add(new BatchItem(deliveryTag, put));
                if (batch.size() >= BATCH_SIZE) {
                    flushBatch("Batch size reached");
                }
            } finally {
                lock.unlock();
            }
        } catch (Exception e) {
            logger.error("Unexpected error processing message. Nacking for redelivery. DeliveryTag: {}", deliveryTag, e);
            // Nack the single message for redelivery. Don't use multiple=true.
            getChannel().basicNack(deliveryTag, false, true);
        }
    }

    private void flushBatchOnTimeout() {
        lock.lock();
        try {
            if (!batch.isEmpty()) {
                flushBatch("Timeout reached");
            }
        } catch (IOException e) {
            logger.error("Failed to flush batch on timeout", e);
        } finally {
            lock.unlock();
        }
    }

    // Callers must hold the lock; the flush itself runs under the lock,
    // which keeps batch ordering simple at the cost of some throughput.
    private void flushBatch(String reason) throws IOException {
        if (batch.isEmpty()) {
            return;
        }
        // Snapshot the current batch and clear the shared buffer
        List<BatchItem> currentBatch = new ArrayList<>(batch);
        batch.clear();

        logger.info("Flushing batch of {} items. Reason: {}", currentBatch.size(), reason);
        List<Put> puts = new ArrayList<>(currentBatch.size());
        for (BatchItem item : currentBatch) {
            puts.add(item.putOperation);
        }
        try {
            hTable.put(puts);
            // If hTable.put completes without exception, all writes succeeded.
            // Now we can safely ACK all corresponding messages.
            long lastDeliveryTag = currentBatch.get(currentBatch.size() - 1).deliveryTag;
            logger.debug("Successfully wrote batch to HBase. ACKing up to deliveryTag: {}", lastDeliveryTag);
            // multiple=true is safe here because the entire batch was persisted.
            getChannel().basicAck(lastDeliveryTag, true);
        } catch (IOException e) {
            logger.error("Failed to write batch to HBase. Nacking all messages in batch for redelivery.", e);
            // If the batch write fails, Nack all messages individually for redelivery.
            for (BatchItem item : currentBatch) {
                // false for multiple, true to requeue
                getChannel().basicNack(item.deliveryTag, false, true);
            }
        }
    }

    public void stop() {
        // Any batch items not yet flushed remain unacknowledged in RabbitMQ
        // and will be redelivered after a restart; idempotent writes make
        // that reprocessing harmless.
        scheduler.shutdown();
        try {
            if (!scheduler.awaitTermination(5, TimeUnit.SECONDS)) {
                scheduler.shutdownNow();
            }
        } catch (InterruptedException e) {
            scheduler.shutdownNow();
            Thread.currentThread().interrupt();
        }
    }
}
```
This revised implementation is significantly more robust:

- **Idempotency:** The `EventParser.parseEventToPut` method is now responsible for creating a `Put` with a deterministic row key from the message content.
- **Poison Pill Handling:** Malformed messages are identified, logged, and rejected with `requeue=false`, sending them to a Dead Letter Exchange (if configured, as sketched after this list) instead of causing an infinite processing loop.
- **Atomic Acknowledgement Logic:** We only acknowledge messages after the `hTable.put(puts)` operation succeeds. If it fails, we explicitly `nack` each message in the failed batch, ensuring they will be redelivered and retried. Even a partial HBase failure surfaces as an exception, so the whole batch is nacked; the already-persisted half is simply rewritten with identical data on redelivery. Using `multiple=true` for the `basicAck` is an optimization that is now safe because we've confirmed the entire batch was persisted. If the process crashes between the successful `put` and the `basicAck`, the messages will be redelivered, but our idempotent design handles the duplicates gracefully.
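The dead-letter routing itself has to be declared on the broker. The sketch below shows one way to do it with the plain Java client; the exchange and queue names (`events-dlx`, `events-dead-letter`) are assumptions for illustration. Note that whichever process declares `events-queue` must use the same arguments everywhere, since RabbitMQ rejects re-declarations with mismatched arguments.

```java
// Hypothetical topology setup: rejected poison pills are routed to a
// dead-letter queue. Exchange and queue names are illustrative assumptions.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import com.rabbitmq.client.Channel;

public final class TopologySetup {

    public static void declare(Channel channel) throws IOException {
        // Dead-letter exchange and queue for messages rejected with requeue=false
        channel.exchangeDeclare("events-dlx", "fanout", true);
        channel.queueDeclare("events-dead-letter", true, false, false, null);
        channel.queueBind("events-dead-letter", "events-dlx", "");

        // The main queue routes rejected messages to the DLX
        Map<String, Object> args = new HashMap<>();
        args.put("x-dead-letter-exchange", "events-dlx");
        channel.queueDeclare("events-queue", true, false, false, args);
    }
}
```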
Phase 3: Taming the Flood - Backpressure with Prefetch
We now have a reliable consumer, but it's not yet performant or stable under extreme load. A sudden, massive influx of messages on the queue would cause this consumer to pull messages as fast as its network connection allows. In-flight messages would pile up in memory (in the client library's internal delivery queue and in our `batch` list) faster than flushes could drain them, potentially leading to an `OutOfMemoryError` on the consumer JVM. The consumer is drinking from a firehose with no way to signal "slow down."
This is where RabbitMQ's Quality of Service (QoS) setting, specifically `prefetch_count`, becomes the critical tool for implementing backpressure.
```mermaid
sequenceDiagram
    participant Broker
    participant Consumer
    Consumer->>Broker: channel.basicQos(prefetch_count=200)
    Note right of Consumer: Consumer sets a limit on<br/>unacknowledged messages.
    loop 200 times
        Broker->>Consumer: deliverMessage(deliveryTag=N)
    end
    Broker-->>Consumer: Broker stops sending messages.<br/>It waits for an ACK.
    Note right of Consumer: Consumer processes messages,<br/>builds a batch.
    Consumer->>Broker: ...time passes...
    Consumer-->>HBase: writeBatch()
    HBase-->>Consumer: Success
    Consumer->>Broker: basicAck(deliveryTag=200, multiple=true)
    Note left of Broker: Broker receives ACKs,<br/>its internal count of un-ACKed<br/>messages for this consumer drops to 0.
    Broker->>Consumer: Broker is now free to send<br/>up to 200 more messages.
    loop ...
        Broker->>Consumer: deliverMessage(deliveryTag=N+1)
    end
```
The `prefetch_count` tells the RabbitMQ broker the maximum number of unacknowledged messages to send to the consumer at any given time. Once the consumer has `prefetch_count` messages "in-flight" (delivered but not yet acknowledged), the broker will stop sending it new messages until one or more of the in-flight messages are acknowledged.
This single setting creates a powerful, automatic backpressure mechanism. If HBase slows down, our consumer’s batch writes will take longer. This means ACKs will be sent back to the broker more slowly. This, in turn, causes the unacknowledged message count to remain high, preventing the broker from flooding the consumer with more data. The data simply waits safely in the RabbitMQ queue.
The configuration is a one-line change before starting the consumer.
```java
// In the main application setup code
// ... connect to RabbitMQ
ConnectionFactory factory = new ConnectionFactory();
// ... set host, port, etc.
Connection connection = factory.newConnection();
Channel channel = connection.createChannel();

// This is the critical line for backpressure
int prefetchCount = BATCH_SIZE * 2; // A good starting point
channel.basicQos(prefetchCount);

HTable hTable = ... // Initialize HBase connection
Consumer consumer = new IdempotentBatchingConsumer(channel, hTable);

String queueName = "events-queue";
channel.basicConsume(queueName, false, consumer); // autoAck=false: manual acknowledgements
```
A common mistake is setting `prefetch_count` too low or too high.
- **Too low (e.g., 1):** This serializes message processing and completely negates the benefits of batching, destroying throughput.
- **Too high:** This can lead to the original `OutOfMemoryError` problem if the prefetch count multiplied by the average message size exceeds the available heap memory. For example, a prefetch of 250,000 with 4 KB payloads can pin roughly 1 GB of heap in in-flight messages.
A pragmatic starting point is to set `prefetch_count` to a small multiple (e.g., 2x-3x) of your `BATCH_SIZE`. This ensures the consumer always has another full batch of messages ready to be processed as soon as the current one is flushed, keeping the pipeline full without risking memory exhaustion. This value requires tuning based on real-world performance monitoring.
Final Production-Grade Service Structure
A runnable application requires more than just the consumer class. It needs proper configuration management, lifecycle handling, and robust connection management.
A simplified main application class using this consumer might look like this:
```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTable;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.util.concurrent.TimeoutException;

public class IngestionService {

    private static final Logger logger = LoggerFactory.getLogger(IngestionService.class);

    // All these should come from a config file (e.g., application.yml)
    private static final String RABBITMQ_HOST = "localhost";
    private static final String QUEUE_NAME = "events-queue";
    private static final String HBASE_ZOOKEEPER_QUORUM = "localhost";
    private static final String HBASE_TABLE_NAME = "sensor_events";
    private static final int BATCH_SIZE = 200;
    private static final int PREFETCH_COUNT = BATCH_SIZE * 2;

    private Connection rabbitConnection;
    private HConnection hbaseConnection;
    private IdempotentBatchingConsumer consumer;
    private Channel channel;

    public void start() throws Exception {
        logger.info("Starting Ingestion Service...");

        // Setup RabbitMQ Connection
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost(RABBITMQ_HOST);
        // Enable automatic recovery for connections and topology
        factory.setAutomaticRecoveryEnabled(true);
        factory.setNetworkRecoveryInterval(10000); // Attempt recovery every 10 seconds
        rabbitConnection = factory.newConnection();
        channel = rabbitConnection.createChannel();
        logger.info("RabbitMQ connection established.");

        // Setup HBase Connection (using the managed HConnection)
        Configuration hbaseConfig = HBaseConfiguration.create();
        hbaseConfig.set("hbase.zookeeper.quorum", HBASE_ZOOKEEPER_QUORUM);
        hbaseConnection = HConnectionManager.createConnection(hbaseConfig);
        HTable hTable = (HTable) hbaseConnection.getTable(TableName.valueOf(HBASE_TABLE_NAME));
        // Keep auto-flush enabled (the default) so hTable.put() has reached the
        // RegionServers before the consumer ACKs the corresponding messages.
        hTable.setAutoFlush(true, true);
        logger.info("HBase connection established.");

        // Configure QoS for backpressure
        channel.basicQos(PREFETCH_COUNT);

        // Declare queue to ensure it exists
        channel.queueDeclare(QUEUE_NAME, true, false, false, null);

        // Start the consumer
        consumer = new IdempotentBatchingConsumer(channel, hTable);
        channel.basicConsume(QUEUE_NAME, false, consumer); // autoAck=false: manual acknowledgements
        logger.info("Consumer started. Waiting for messages...");
    }

    public void stop() {
        logger.info("Stopping Ingestion Service...");
        if (consumer != null) {
            consumer.stop();
        }
        try {
            if (channel != null && channel.isOpen()) {
                channel.close();
            }
            if (rabbitConnection != null && rabbitConnection.isOpen()) {
                rabbitConnection.close();
            }
            if (hbaseConnection != null && !hbaseConnection.isClosed()) {
                hbaseConnection.close();
            }
        } catch (IOException | TimeoutException e) {
            logger.error("Error during service shutdown.", e);
        }
        logger.info("Ingestion Service stopped.");
    }

    public static void main(String[] args) {
        IngestionService service = new IngestionService();
        try {
            Runtime.getRuntime().addShutdownHook(new Thread(service::stop));
            service.start();
        } catch (Exception e) {
            logger.error("Failed to start ingestion service", e);
            System.exit(1);
        }
    }
}
```
This architecture, centered around an intelligent consumer, transformed an unstable system into a resilient data pipeline. The REST API tier can now handle extreme bursts of traffic, returning `202 Accepted` responses immediately, with the confidence that the data is durably queued in RabbitMQ. The ingestion service, protected by its own backpressure mechanism, writes to HBase at a steady, optimized rate, preventing the storage layer from being overwhelmed during bursts.
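For completeness, the publish path on the API tier can be as small as the sketch below. It assumes the plain Java client and an exchange named `events-exchange` (an illustrative name, not necessarily ours); marking messages persistent is what backs the durability claim, and a production version would also enable publisher confirms (`channel.confirmSelect()`) before reporting success upstream.

```java
// Hypothetical publish path for the API tier. The exchange name is an
// illustrative assumption; connection and channel pooling are omitted.
import java.io.IOException;

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.MessageProperties;

public class EventPublisher {

    private final Channel channel;

    public EventPublisher(Channel channel) {
        this.channel = channel;
    }

    /** Publishes the raw JSON payload; the caller then returns 202 Accepted. */
    public void publish(byte[] jsonPayload) throws IOException {
        channel.basicPublish(
                "events-exchange",                       // exchange (assumed name)
                "",                                      // routing key
                MessageProperties.PERSISTENT_TEXT_PLAIN, // persist messages on the durable queue
                jsonPayload);
    }
}
```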
The current implementation, however, has its own operational boundaries. It runs as a single instance. While RabbitMQ ensures that messages will be processed by a new instance upon restart, a lone consumer is both a single point of failure and a throughput ceiling. A logical next step is to containerize this service and run multiple instances for both high availability and increased processing capacity. This introduces a new challenge: ensuring that the batching timeout mechanism across multiple instances doesn't lead to perpetually half-full, inefficient batches. Further performance tuning would also involve deep analysis of the HBase cluster's write performance, region distribution, and compaction strategies, as optimizing the consumer is only one half of the equation.