The mandate was to architect a backend for a digital evidence management platform. The data access patterns presented an immediate and fundamental conflict. The system required:
- Transactional Integrity: Case metadata, user permissions, and immutable audit trails demanded strict ACID compliance and the ability to perform complex relational queries.
- Extreme Write Throughput: The core function was ingesting terabytes of raw, unstructured evidence data—disk images, network captures, and extensive log files—at a velocity that would saturate a traditional relational database.
- Advanced Search Capability: Investigators needed to perform complex full-text searches across all ingested raw data, with support for faceting, highlighting, and relevance ranking based on structured metadata.
Attempting to solve this with a single data store is a common architectural pitfall. It forces compromises that cripple performance and scalability for at least one of the critical access patterns.
Alternative A: The TiDB-Centric Monolith Approach
A tempting first option is to leverage a modern HTAP (Hybrid Transactional/Analytical Processing) database like TiDB for everything.
Potential Advantages:
- Operational Simplicity: Managing a single distributed database cluster is vastly simpler than managing three. This reduces overhead for deployment, monitoring, backup, and recovery.
- Unified Data Model: All data lives under one roof, eliminating the need for complex data synchronization logic. Transactions can span across different data types, ensuring strong consistency out of the box.
- SQL Interface: A single, well-understood query language (SQL) can be used for all interactions, simplifying development and onboarding.
Critical Deficiencies:
In a real-world project, this approach reveals its weaknesses under load. TiDB is an exceptional distributed SQL database, but it’s not a purpose-built wide-column store or a dedicated search engine.
- Storage Inefficiency for Wide-Column Data: HBase, with its column-family storage model, is intrinsically more efficient for storing sparse, wide, semi-structured data like event logs. Storing terabytes of this data in TiDB’s row-oriented format can lead to significant storage cost overhead and potentially slower scans for specific column subsets.
- Inadequate Full-Text Search: While TiDB has some full-text search capabilities, they are no match for a dedicated engine like Solr or Elasticsearch. It lacks advanced features such as sophisticated text analysis chains (stemming, language-specific tokenization), complex faceting, query-time boosting, and highlighting, all of which are non-negotiable for a forensics platform (see the query sketch after this list).
- Workload Contention: A single TiDB cluster would be forced to handle three wildly different workloads: short, latency-sensitive OLTP transactions; high-throughput, sequential writes for ingestion; and CPU-intensive search queries. This creates resource contention, making it difficult to tune and guarantee performance SLOs for any single workload. An ingestion spike could degrade query performance for investigators, which is unacceptable.
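To make the gap concrete, the sketch below shows the kind of single request a dedicated engine handles natively: full-text matching plus faceting, highlighting, and query-time boosting. The field names (content_txt_en, case_id_s, filename_s) and the collection name are assumptions that match the schema used later in this article.
// Illustrative only: a faceted, highlighted, boosted search in one SolrJ request.
// Field names and the collection name are assumptions, not an existing schema.
package org.acme.forensics.search;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SearchFeatureSketch {

    public QueryResponse facetedSearch(SolrClient solr, String caseId, String text) throws Exception {
        SolrQuery q = new SolrQuery();
        q.setQuery(text);                                   // free-text query from the investigator
        q.set("defType", "edismax");                        // parser with query-time boosting
        q.set("qf", "content_txt_en^1.0 filename_s^3.0");   // boost filename matches
        q.addFilterQuery("case_id_s:" + caseId);            // restrict to one case via structured metadata
        q.addFacetField("filename_s", "case_id_s");         // facet counts for drill-down
        q.setFacetMinCount(1);
        q.setHighlight(true);                               // return matching snippets
        q.addHighlightField("content_txt_en");
        q.setHighlightSnippets(3);
        q.setRows(20);
        return solr.query("evidence_collection", q);
    }
}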
Alternative B: The Polyglot Persistence Gateway Architecture
This approach embraces the principle of using the right tool for each job, creating a heterogeneous persistence layer orchestrated by a stateless service.
graph TD
    subgraph "Quarkus Gateway Service"
        A[API Endpoint: /evidence]
        F[API Endpoint: /search]
    end
    subgraph "Data Persistence Layer"
        C[TiDB Cluster - Metadata & Audit]
        D[HBase Cluster - Raw Evidence Store]
        E[SolrCloud Cluster - Search Index]
    end
    A -- "1. Write Metadata (ACID)" --> C
    A -- "2. Write Raw Data (High Throughput)" --> D
    A -- "3. Index Content (Async)" --> E
    F -- "1. Query" --> E
    E -- "2. Returns RowKeys & Metadata" --> F
    F -- "3. Batch Get Raw Data" --> D
    F -- "4. Enrich with Metadata" --> C
    F -- "5. Aggregate & Return" --> F
Key Components:
- TiDB: The system of record for all structured data. Case information, user roles, evidence metadata (hashes, timestamps, chain of custody), and audit logs. Its role is to guarantee consistency.
- HBase: The bulk storage engine. Raw evidence files and logs are written here. Its wide-column model and high write throughput are ideal for this task. Row keys are carefully designed to preserve data locality (a sketch follows this list).
- Solr: The search and discovery engine. Textual content from data in HBase is indexed here, along with key metadata from TiDB to enable rich, faceted search.
- Quarkus: The stateless, high-performance gateway. It exposes a unified API and encapsulates the complexity of interacting with the three underlying data stores. Its low memory footprint and fast startup via native compilation make it ideal for a scalable microservice architecture.
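To make the row-key point above concrete, here is a minimal sketch of the convention the ingestion code later in this article uses (caseId + "_" + SHA-256). The salted variant is an assumption, relevant only when a single case grows hot enough to overload one region; the bucket count is illustrative.
// Row-key helper for the HBase evidence table. The caseId prefix keeps all evidence
// for a case contiguous, making per-case scans cheap; the fixed-width SHA-256 suffix
// guarantees uniqueness. The salted variant is an assumption and is not used later on.
import java.nio.charset.StandardCharsets;

public final class RowKeys {

    private static final int SALT_BUCKETS = 16; // only relevant for the salted variant

    private RowKeys() {}

    public static byte[] evidenceKey(String caseId, String sha256Hex) {
        return (caseId + "_" + sha256Hex).getBytes(StandardCharsets.UTF_8);
    }

    // Optional: spread a very hot case across SALT_BUCKETS pre-split regions.
    public static byte[] saltedEvidenceKey(String caseId, String sha256Hex) {
        int bucket = Math.floorMod(caseId.hashCode(), SALT_BUCKETS);
        return String.format("%02d_%s_%s", bucket, caseId, sha256Hex)
                .getBytes(StandardCharsets.UTF_8);
    }
}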
The Trade-off:
The primary drawback is a significant increase in architectural and operational complexity. We are now responsible for deploying, managing, and monitoring three separate distributed systems. More importantly, we sacrifice universal strong consistency. Data consistency between the stores, particularly between the primary data in HBase/TiDB and the search index in Solr, becomes a critical application-level concern.
The Decision:
For a system of this nature, where performance, scalability, and feature depth for each access pattern are paramount, the complexity of the polyglot architecture is a necessary and justified trade-off. The deficiencies of the single-database approach present a functional and performance risk that is too high. The Quarkus gateway is the key to managing this complexity effectively.
Core Implementation in the Quarkus Gateway
The implementation must be robust, handling connection management, concurrent operations, and failure scenarios gracefully.
Project Dependencies
The pom.xml must include clients for all three data sources.
<!-- pom.xml -->
<dependencies>
<dependency>
<groupId>io.quarkus</groupId>
<artifactId>quarkus-arc</artifactId>
</dependency>
<dependency>
<groupId>io.quarkus</groupId>
<artifactId>quarkus-resteasy-reactive-jackson</artifactId>
</dependency>
<!-- TiDB/MySQL Connector -->
<dependency>
<groupId>io.quarkus</groupId>
<artifactId>quarkus-jdbc-mysql</artifactId>
</dependency>
<dependency>
<groupId>io.quarkus</groupId>
<artifactId>quarkus-hibernate-orm-panache</artifactId>
</dependency>
<!-- HBase Client -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>2.5.5</version>
<!-- Exclude libraries that conflict with Quarkus -->
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
</exclusion>
<exclusion>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
</exclusion>
<exclusion>
<groupId>javax.servlet</groupId>
<artifactId>servlet-api</artifactId>
</exclusion>
</exclusions>
</dependency>
<!-- Solr Client -->
<dependency>
<groupId>org.apache.solr</groupId>
<artifactId>solr-solrj</artifactId>
<version>9.3.0</version>
</dependency>
<!-- For reactive programming -->
<dependency>
<groupId>io.quarkus</groupId>
<artifactId>quarkus-mutiny</artifactId>
</dependency>
<!-- Testing -->
<dependency>
<groupId>io.quarkus</groupId>
<artifactId>quarkus-junit5</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>io.rest-assured</groupId>
<artifactId>rest-assured</artifactId>
<scope>test</scope>
</dependency>
</dependencies>
Configuration (application.properties)
Centralized configuration is crucial for managing the connections to these heterogeneous systems.
# application.properties
# Quarkus HTTP Port
quarkus.http.port=8080
# --- TiDB Configuration (using MySQL protocol) ---
quarkus.datasource.db-kind=mysql
quarkus.datasource.username=root
quarkus.datasource.password=
quarkus.datasource.jdbc.url=jdbc:mysql://tidb-host:4000/forensics
quarkus.datasource.jdbc.max-size=20
quarkus.hibernate-orm.database.generation=none
# --- HBase Configuration ---
# These properties are typically managed via hbase-site.xml on the classpath,
# but can be specified here for clarity or overrides.
hbase.zookeeper.quorum=zk1,zk2,zk3
hbase.zookeeper.property.clientPort=2181
# --- Solr Configuration ---
solr.zookeeper.hosts=zk1:2181,zk2:2181,zk3:2181/solr
solr.collection=evidence_collection
# --- Application Specific ---
evidence.hbase.table=evidence_raw
evidence.hbase.column-family=d
Client Producers
In a real-world project, directly instantiating clients in business logic is a poor practice. We use Quarkus’s CDI (Contexts and Dependency Injection) to produce managed, application-scoped client instances. This ensures connections are properly initialized, shared, and closed.
// src/main/java/org/acme/forensics/data/ClientProducers.java
package org.acme.forensics.data;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider;
import org.eclipse.microprofile.config.inject.ConfigProperty;
import org.jboss.logging.Logger;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.enterprise.inject.Produces;
import jakarta.inject.Singleton;
import java.io.IOException;
@ApplicationScoped
public class ClientProducers {
private static final Logger LOG = Logger.getLogger(ClientProducers.class);
@ConfigProperty(name = "hbase.zookeeper.quorum")
String hbaseZkQuorum;
@ConfigProperty(name = "hbase.zookeeper.property.clientPort")
String hbaseZkPort;
@ConfigProperty(name = "solr.zookeeper.hosts")
String solrZkHosts;
@ConfigProperty(name = "solr.collection")
String solrCollection;
@Produces
@Singleton
public Connection createHBaseConnection() throws IOException {
Configuration config = HBaseConfiguration.create();
config.set("hbase.zookeeper.quorum", hbaseZkQuorum);
config.set("hbase.zookeeper.property.clientPort", hbaseZkPort);
// Production settings: timeouts, retries, etc.
config.set("hbase.client.retries.number", "5");
config.set("hbase.client.pause", "100");
config.set("zookeeper.session.timeout", "60000");
LOG.info("Initializing HBase Connection...");
Connection connection = ConnectionFactory.createConnection(config);
// Add a shutdown hook to ensure the connection is closed gracefully
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
if (connection != null && !connection.isClosed()) {
try {
connection.close();
LOG.info("HBase connection closed.");
} catch (IOException e) {
LOG.error("Error closing HBase connection", e);
}
}
}));
return connection;
}
@Produces
@Singleton
public CloudSolrClient createSolrClient() {
LOG.info("Initializing Solr Client...");
CloudSolrClient client = new CloudSolrClient.Builder(
new ZkClientClusterStateProvider(solrZkHosts))
.build();
client.setDefaultCollection(solrCollection);
// Production settings
client.setConnectionTimeout(5000); // 5 seconds
client.setSoTimeout(20000); // 20 seconds
// Add a shutdown hook
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
if (client != null) {
try {
client.close();
LOG.info("Solr client closed.");
} catch (IOException e) {
LOG.error("Error closing Solr client", e);
}
}
}));
return client;
}
}
Data Ingestion Logic: The Dual-Write Problem
The ingestion endpoint is where the challenge of data consistency manifests. A naive implementation performs sequential writes, but this is brittle.
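The service below persists an EvidenceMetadata Panache entity that is not shown elsewhere in this article. Here is a minimal sketch, with fields inferred from how the service uses them; the table name and column constraints are assumptions.
// src/main/java/org/acme/forensics/data/model/EvidenceMetadata.java
// Minimal sketch of the entity used by EvidenceService; field names are taken from
// their usage below, everything else (table name, constraints) is assumed.
package org.acme.forensics.data.model;

import io.quarkus.hibernate.orm.panache.PanacheEntity;
import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.Table;
import java.time.Instant;

@Entity
@Table(name = "evidence_metadata")
public class EvidenceMetadata extends PanacheEntity {

    @Column(nullable = false)
    public String caseId;

    @Column(nullable = false, length = 64)
    public String evidenceHash;   // SHA-256 hex digest

    public String fileName;

    public Instant ingestedAt;
}
With the entity in place, the ingestion service itself looks like this: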
// src/main/java/org/acme/forensics/EvidenceService.java
package org.acme.forensics;
import org.acme.forensics.data.model.EvidenceMetadata;
import org.acme.forensics.payload.IngestRequest;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.eclipse.microprofile.config.inject.ConfigProperty;
import org.jboss.logging.Logger;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
import jakarta.transaction.Transactional;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.Instant;
import java.util.Base64;
@ApplicationScoped
public class EvidenceService {
private static final Logger LOG = Logger.getLogger(EvidenceService.class);
@Inject
Connection hbaseConnection; // Injected via our producer
@Inject
CloudSolrClient solrClient; // Injected via our producer
@ConfigProperty(name = "evidence.hbase.table")
String hbaseTableName;
@ConfigProperty(name = "evidence.hbase.column-family")
String hbaseColumnFamily;
@Transactional
public String ingestEvidence(IngestRequest request) {
// --- 1. Pre-processing: Calculate hash and create row key ---
byte[] rawData = Base64.getDecoder().decode(request.getBase64Data());
String sha256 = calculateSHA256(rawData);
String rowKey = request.getCaseId() + "_" + sha256;
// --- 2. Save Metadata to TiDB (Transactional) ---
// This part is wrapped in a JTA transaction by Quarkus.
// If this fails, the whole method rolls back and nothing else happens.
EvidenceMetadata metadata = new EvidenceMetadata();
metadata.caseId = request.getCaseId();
metadata.evidenceHash = sha256;
metadata.fileName = request.getFileName();
metadata.ingestedAt = Instant.now();
metadata.persist(); // Panache ORM method
// --- 3. Write to HBase ---
// HBase does not participate in the JTA transaction, so this write cannot be rolled back.
// This is the core of the dual-write problem.
try (Table table = hbaseConnection.getTable(TableName.valueOf(hbaseTableName))) {
Put put = new Put(Bytes.toBytes(rowKey));
put.addColumn(
Bytes.toBytes(hbaseColumnFamily),
Bytes.toBytes("raw"),
rawData
);
put.addColumn(
Bytes.toBytes(hbaseColumnFamily),
Bytes.toBytes("filename"),
Bytes.toBytes(request.getFileName())
);
table.put(put);
} catch (IOException e) {
// The RuntimeException thrown below marks the surrounding JTA transaction for rollback,
// so the TiDB metadata is NOT committed in this failure mode. The inverse case is the
// real danger: the HBase put succeeds but the TiDB commit fails at method exit,
// leaving raw data in HBase with no metadata. A robust solution requires a
// compensating action (Saga pattern) or a background reconciliation job to find
// such orphans.
LOG.errorf("HBase put failed for rowKey %s; metadata transaction for case %s will be rolled back.", rowKey, metadata.caseId);
throw new RuntimeException("Failed to persist raw evidence to HBase", e);
}
// --- 4. Index in Solr ---
// Also non-transactional and able to fail independently. It runs synchronously here;
// see the note inside indexInSolr about decoupling it via a durable queue.
indexInSolr(rowKey, request, metadata);
return rowKey;
}
private void indexInSolr(String rowKey, IngestRequest request, EvidenceMetadata metadata) {
// In a production system, this should not be done synchronously in the request thread.
// It should be submitted to a durable queue (like Kafka) or at least an internal
// async executor to decouple it from the main ingestion flow.
// For simplicity, we do it directly here.
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", rowKey); // Solr unique key is our HBase rowKey
doc.addField("case_id_s", metadata.caseId);
doc.addField("filename_s", metadata.fileName);
doc.addField("hash_s", metadata.evidenceHash);
doc.addField("ingested_dt", metadata.ingestedAt.toString());
// A common mistake is indexing the entire raw payload; only searchable text belongs in the index.
// For brevity we decode the bytes as UTF-8 directly here; a real implementation would run a
// text-extraction step (e.g., Apache Tika) first.
doc.addField("content_txt_en", new String(Base64.getDecoder().decode(request.getBase64Data()), StandardCharsets.UTF_8));
try {
solrClient.add(doc);
// In a real project, we would batch commits for performance.
solrClient.commit();
} catch (SolrServerException | IOException e) {
// CRITICAL: the raw data is in HBase and the metadata will commit to TiDB, but the
// document is missing from the search index, so the evidence is invisible to
// investigators until it is re-indexed.
// A background process must periodically re-index failed documents.
LOG.errorf("Failed to index document in Solr for rowKey %s; it will be missing from search until re-indexed.", rowKey);
// We deliberately do not re-throw: the primary data is safe, and the ingestion is
// treated as a partial success.
}
}
private String calculateSHA256(byte[] data) {
try {
MessageDigest digest = MessageDigest.getInstance("SHA-256");
byte[] hash = digest.digest(data);
return bytesToHex(hash);
} catch (NoSuchAlgorithmException e) {
throw new RuntimeException("SHA-256 algorithm not available", e);
}
}
private String bytesToHex(byte[] hash) {
StringBuilder hexString = new StringBuilder(2 * hash.length);
for (byte b : hash) {
String hex = Integer.toHexString(0xff & b);
if (hex.length() == 1) {
hexString.append('0');
}
hexString.append(hex);
}
return hexString.toString();
}
}
The comments in the code highlight the core challenge: maintaining consistency. A production-grade system would implement a Saga pattern, where a failure in the HBase or Solr step triggers a compensating action (e.g., deleting the TiDB metadata) or flags the record for reconciliation. A simpler, more common approach is to rely on periodic reconciliation jobs that scan for inconsistencies.
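A minimal sketch of such a reconciliation job is shown below. It assumes the quarkus-scheduler extension (not listed in the pom above), reuses the injected HBase and Solr clients, and uses an illustrative one-hour grace window; a real implementation would track a high-water mark or a per-record flag instead of rescanning all metadata on every pass.
// src/main/java/org/acme/forensics/ReconciliationJob.java
// Sketch only: periodically verify that committed metadata has a matching raw-data row
// in HBase and a document in Solr. Assumes the quarkus-scheduler extension; the schedule
// and the grace window are illustrative.
package org.acme.forensics;

import io.quarkus.scheduler.Scheduled;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
import org.acme.forensics.data.model.EvidenceMetadata;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.eclipse.microprofile.config.inject.ConfigProperty;
import org.jboss.logging.Logger;

import java.time.Duration;
import java.time.Instant;
import java.util.List;

@ApplicationScoped
public class ReconciliationJob {

    private static final Logger LOG = Logger.getLogger(ReconciliationJob.class);

    @Inject Connection hbaseConnection;
    @Inject CloudSolrClient solrClient;

    @ConfigProperty(name = "evidence.hbase.table")
    String hbaseTableName;

    @Scheduled(every = "15m")
    void reconcile() {
        // Only look at records older than a grace window; younger ones may still be in flight.
        // A production job would track a high-water mark instead of rescanning everything.
        Instant cutoff = Instant.now().minus(Duration.ofHours(1));
        List<EvidenceMetadata> candidates = EvidenceMetadata.list("ingestedAt < ?1", cutoff);

        try (Table table = hbaseConnection.getTable(TableName.valueOf(hbaseTableName))) {
            for (EvidenceMetadata m : candidates) {
                String rowKey = m.caseId + "_" + m.evidenceHash;
                if (!table.exists(new Get(Bytes.toBytes(rowKey)))) {
                    // Orphaned metadata: trigger a compensating action or raise an alert.
                    LOG.warnf("Metadata %d has no matching HBase row %s", m.id, rowKey);
                }
                if (solrClient.getById(rowKey) == null) {
                    // Present in TiDB/HBase but not searchable: schedule a re-index.
                    LOG.warnf("RowKey %s is missing from the Solr index", rowKey);
                }
            }
        } catch (Exception e) {
            LOG.error("Reconciliation pass failed", e);
        }
    }
}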
Data Query Logic: The Scatter-Gather Pattern
Querying requires orchestrating calls to multiple systems. The flow is to search in Solr first, then enrich the results from TiDB and HBase. Quarkus Mutiny (reactive programming) is excellent for this, allowing us to parallelize I/O-bound operations.
// Part of EvidenceService.java
import io.smallrye.mutiny.Uni;
import io.smallrye.mutiny.infrastructure.Infrastructure;
import org.acme.forensics.payload.SearchResult;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;
import java.util.List;
import java.util.stream.Collectors;
public Uni<List<SearchResult>> search(String query) {
// Uni represents an asynchronous operation.
return Uni.createFrom().item(() -> {
// 1. Query Solr to get matching document IDs (HBase rowKeys)
SolrQuery solrQuery = new SolrQuery();
solrQuery.setQuery(query);
solrQuery.setFields("id", "case_id_s", "filename_s", "hash_s"); // Get fields stored in Solr
solrQuery.setRows(20);
try {
return solrClient.query(solrQuery).getResults();
} catch (Exception e) {
throw new RuntimeException("Solr query failed", e);
}
})
// Run the Solr query on a worker thread to avoid blocking the event loop.
.runSubscriptionOn(Infrastructure.getDefaultWorkerPool())
// 2. Once we have Solr results, process them
.onItem().transformToUni(solrDocs -> {
if (solrDocs.isEmpty()) {
return Uni.createFrom().item(List.of());
}
// Extract HBase row keys
List<String> rowKeys = solrDocs.stream()
.map(doc -> (String) doc.getFieldValue("id"))
.collect(Collectors.toList());
// Next we fetch from HBase (and, if needed, TiDB). For clarity this is done
// sequentially here; a more advanced implementation would use Mutiny's
// Uni.combine() to run the HBase and TiDB lookups concurrently
// (a sketch follows the code below).
// The multi-get below is a blocking call executed on the worker pool; a fully
// reactive pipeline would use the HBase AsyncTable API
// (ConnectionFactory.createAsyncConnection) instead.
List<Get> gets = rowKeys.stream().map(key -> new Get(Bytes.toBytes(key))).collect(Collectors.toList());
try (Table table = hbaseConnection.getTable(TableName.valueOf(hbaseTableName))) {
Result[] results = table.get(gets);
// This is a simplified mapping.
return Uni.createFrom().item(buildSearchResults(solrDocs, results));
} catch (IOException e) {
return Uni.createFrom().failure(e);
}
});
}
private List<SearchResult> buildSearchResults(SolrDocumentList solrDocs, Result[] hbaseResults) {
// In a real application, you would map the SolrDocument and HBase Result
// into a unified SearchResult DTO. This would also be the place to
// perform a final lookup in TiDB if more metadata is needed that wasn't indexed in Solr.
return solrDocs.stream().map(doc -> {
SearchResult res = new SearchResult();
res.setHbaseRowKey((String) doc.getFieldValue("id"));
res.setCaseId((String) doc.getFieldValue("case_id_s"));
res.setFileName((String) doc.getFieldValue("filename_s"));
res.setSha256Hash((String) doc.getFieldValue("hash_s"));
// Potentially add a snippet from HBase raw data here.
return res;
}).collect(Collectors.toList());
}
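As the comments in the search method note, the HBase fetch and a TiDB enrichment query can run concurrently rather than sequentially. Below is a minimal sketch using Mutiny's Uni.combine(); the helpers fetchFromHBase and fetchMetadataFromTiDB are hypothetical blocking methods, not part of the code above.
// Part of EvidenceService.java (sketch only). fetchFromHBase and fetchMetadataFromTiDB
// are hypothetical blocking helpers; each Uni is subscribed on the worker pool so the
// two lookups proceed in parallel.
private Uni<List<SearchResult>> enrichConcurrently(SolrDocumentList solrDocs, List<String> rowKeys) {
    Uni<Result[]> hbaseUni = Uni.createFrom()
            .item(() -> fetchFromHBase(rowKeys))                 // blocking HBase multi-get
            .runSubscriptionOn(Infrastructure.getDefaultWorkerPool());

    Uni<List<EvidenceMetadata>> tidbUni = Uni.createFrom()
            .item(() -> fetchMetadataFromTiDB(rowKeys))          // blocking JDBC/Panache query
            .runSubscriptionOn(Infrastructure.getDefaultWorkerPool());

    return Uni.combine().all().unis(hbaseUni, tidbUni).asTuple()
            .onItem().transform(tuple ->
                    // tuple.getItem2() (the TiDB metadata) would feed additional enrichment here
                    buildSearchResults(solrDocs, tuple.getItem1()));
}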
Limitations and Future Iterations
The primary limitation of this architecture is its inherent complexity and the application-level responsibility for data consistency. The dual-write approach during ingestion is a known anti-pattern if not managed with care through Sagas or reconciliation. The operational burden of maintaining three distributed stateful systems is significant and requires mature automation and monitoring.
A superior future iteration would move to a Change Data Capture (CDC) pipeline. Instead of the Quarkus service writing to Solr directly, it would write only to TiDB and HBase. A TiCDC changefeed (TiDB's native CDC component, filling the role Debezium plays for other databases) would then stream metadata changes to a Kafka topic. A separate consumer, or a custom sink, would read those events together with the corresponding HBase data and update the Solr index. This decouples the systems entirely, makes the indexing process more resilient, and eliminates the dual-write problem from the API service's perspective, shifting that responsibility onto a more robust, asynchronous data pipeline.
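The consuming side of that pipeline could be a small indexer service built with SmallRye Reactive Messaging. The sketch below assumes the quarkus-smallrye-reactive-messaging-kafka extension, a hypothetical evidence-metadata-changes topic whose record value is the HBase row key of the changed evidence, and the same client producers shown earlier.
// src/main/java/org/acme/forensics/indexer/CdcSolrIndexer.java
// Sketch only: consume CDC events from Kafka and refresh the Solr index. The topic name
// and payload shape are assumptions; error handling is reduced to logging.
package org.acme.forensics.indexer;

import io.smallrye.reactive.messaging.annotations.Blocking;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.eclipse.microprofile.config.inject.ConfigProperty;
import org.eclipse.microprofile.reactive.messaging.Incoming;
import org.jboss.logging.Logger;

import java.nio.charset.StandardCharsets;

@ApplicationScoped
public class CdcSolrIndexer {

    private static final Logger LOG = Logger.getLogger(CdcSolrIndexer.class);

    @Inject Connection hbaseConnection;
    @Inject CloudSolrClient solrClient;

    @ConfigProperty(name = "evidence.hbase.table")
    String hbaseTableName;

    @ConfigProperty(name = "evidence.hbase.column-family")
    String hbaseColumnFamily;

    // Each record's value is assumed to be the HBase row key of the changed evidence.
    @Incoming("evidence-metadata-changes")
    @Blocking
    public void onMetadataChange(String rowKey) {
        try (Table table = hbaseConnection.getTable(TableName.valueOf(hbaseTableName))) {
            Result row = table.get(new Get(Bytes.toBytes(rowKey)));
            if (row.isEmpty()) {
                return; // raw data has not landed yet; a later event or retry will pick it up
            }
            byte[] raw = row.getValue(Bytes.toBytes(hbaseColumnFamily), Bytes.toBytes("raw"));
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", rowKey);
            doc.addField("content_txt_en", new String(raw, StandardCharsets.UTF_8));
            solrClient.add(doc); // rely on commitWithin/autoCommit rather than explicit commits
        } catch (Exception e) {
            // Throwing here would nack the record; for the sketch we log and move on.
            LOG.errorf(e, "Failed to re-index evidence for rowKey %s", rowKey);
        }
    }
}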