Implementing a Custom Solr Operator and Angular UI for Searchable Artifacts within Kubeflow


Managing metadata for hundreds of Kubeflow pipeline runs, datasets, and models across dozens of projects became untenable. The default filtering capabilities, reliant on Kubernetes labels and annotations, were simply not designed for the complex, free-text queries our data science teams needed. Locating a specific experiment from six months ago based on a fragment of a log message or a particular hyperparameter value was a manual, time-consuming trawl through YAML files and logs. This operational friction was a significant drag on productivity. The clear requirement was for a robust, indexed, full-text search capability integrated directly into the Kubeflow ecosystem.

Our initial concept was to deploy a SolrCloud cluster to serve as a centralized search index for all ML artifacts. However, a single, manually managed cluster presented several architectural problems: it was both a single point of failure and a noisy-neighbor risk, with every team's indexing and query load contending on the same shared cluster. A more robust solution would be to provision dedicated, namespaced SolrCloud instances on-demand for different teams or major projects. This led us to the decision to build a Kubernetes Operator. In a production environment, managing stateful applications like Solr requires automating complex lifecycle operations—deployment, scaling, configuration management, and teardown. An Operator, which extends the Kubernetes API with a Custom Resource Definition (CRD), is the canonical pattern for this. It allows us to treat a Solr cluster as a native Kubernetes resource.

For the user interface, the choice was straightforward. The Kubeflow Central Dashboard is primarily built with Angular and Polymer. To ensure a seamless user experience and leverage existing UI component libraries and authentication flows, building our management and search interface as a new Angular component was the path of least resistance. This avoided the complexity of integrating a separate frontend application with its own auth, routing, and styling. The entire solution would thus consist of two main parts: a Go-based Solr Operator handling the backend lifecycle and an Angular component integrated into the Kubeflow dashboard for user interaction.

Phase 1: The Declarative Foundation - CRD and Operator Scaffolding

The first step was to define the API for our new resource. This is the SolrCluster Custom Resource Definition (CRD). The spec section is the contract with our users, defining the desired state of their Solr cluster. We kept it minimal initially to avoid over-engineering.

# config/crd/bases/mlops.example.com_solrclusters.yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: solrclusters.mlops.example.com
spec:
  group: mlops.example.com
  names:
    kind: SolrCluster
    listKind: SolrClusterList
    plural: solrclusters
    singular: solrcluster
  scope: Namespaced
  versions:
  - name: v1alpha1
    schema:
      openAPIV3Schema:
        type: object
        properties:
          apiVersion:
            type: string
          kind:
            type: string
          metadata:
            type: object
          spec:
            type: object
            properties:
              replicas:
                type: integer
                minimum: 1
                description: Number of Solr nodes in the cluster.
              image:
                type: string
                default: "solr:8.11"
                description: The Solr container image to use.
              zookeeperConnectString:
                type: string
                description: The connection string for the external ZooKeeper ensemble.
              configSetName:
                type: string
                description: The name of the Solr config set to be used by collections in this cluster.
            required:
            - replicas
            - zookeeperConnectString
            - configSetName
          status:
            type: object
            properties:
              readyReplicas:
                type: integer
              clusterState:
                type: string
              serviceEndpoint:
                type: string
    served: true
    storage: true
    subresources:
      status: {}

A critical decision here was to rely on an external ZooKeeper. While we could have the operator manage ZooKeeper as well, in a real-world project, a stable, multi-tenant ZooKeeper ensemble is often a piece of shared infrastructure. Forcing each Solr cluster to have its own ZK would add significant overhead. The status subresource is also crucial; it’s how the operator communicates the observed state of the world back to the user and other system components, like our UI.
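For reference, a minimal SolrCluster manifest conforming to this schema might look like the following; the namespace, ZooKeeper connect string, and config set name are illustrative.

# A hypothetical SolrCluster instance for a team namespace
apiVersion: mlops.example.com/v1alpha1
kind: SolrCluster
metadata:
  name: search-team-alpha
  namespace: team-alpha
spec:
  replicas: 3
  image: "solr:8.11"
  zookeeperConnectString: "zk-0.zk-hs.zookeeper:2181,zk-1.zk-hs.zookeeper:2181,zk-2.zk-hs.zookeeper:2181"
  configSetName: ml-artifacts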

With the CRD defined, we scaffolded the operator project using Kubebuilder. The core logic resides in the Reconcile method of the controller. The initial task for the reconciler is to ensure a StatefulSet matching the SolrCluster spec exists. A StatefulSet is non-negotiable for a stateful application like Solr, as it provides stable network identifiers and persistent storage for each pod.

// internal/controller/solrcluster_controller.go

import (
	// ... other imports (context, fmt, ctrl from sigs.k8s.io/controller-runtime,
	// the controller-runtime log package, and the generated mlopsv1alpha1 API types)
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/api/resource" // resource.MustParse for the PVC size
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/intstr" // intstr.FromInt for probe ports
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// Reconcile is part of the main kubernetes reconciliation loop which aims to
// move the current state of the cluster closer to the desired state.
func (r *SolrClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := log.FromContext(ctx)

	// Fetch the SolrCluster instance
	solrCluster := &mlopsv1alpha1.SolrCluster{}
	err := r.Get(ctx, req.NamespacedName, solrCluster)
	if err != nil {
		if errors.IsNotFound(err) {
			// Request object not found, could have been deleted after reconcile request.
			// Owned objects are automatically garbage collected. For additional cleanup logic use finalizers.
			log.Info("SolrCluster resource not found. Ignoring since object must be deleted.")
			return ctrl.Result{}, nil
		}
		log.Error(err, "Failed to get SolrCluster")
		return ctrl.Result{}, err
	}

	// Check if the StatefulSet already exists, if not create a new one
	foundSts := &appsv1.StatefulSet{}
	err = r.Get(ctx, types.NamespacedName{Name: solrCluster.Name, Namespace: solrCluster.Namespace}, foundSts)
	if err != nil && errors.IsNotFound(err) {
		// Define a new StatefulSet
		sts := r.statefulSetForSolrCluster(solrCluster)
		// Set SolrCluster instance as the owner and controller
		if err := controllerutil.SetControllerReference(solrCluster, sts, r.Scheme); err != nil {
			log.Error(err, "Failed to set owner reference on StatefulSet")
			return ctrl.Result{}, err
		}

		log.Info("Creating a new StatefulSet", "StatefulSet.Namespace", sts.Namespace, "StatefulSet.Name", sts.Name)
		if err := r.Create(ctx, sts); err != nil {
			log.Error(err, "Failed to create new StatefulSet", "StatefulSet.Namespace", sts.Namespace, "StatefulSet.Name", sts.Name)
			return ctrl.Result{}, err
		}
		// StatefulSet created successfully - return and requeue
		return ctrl.Result{Requeue: true}, nil
	} else if err != nil {
		log.Error(err, "Failed to get StatefulSet")
		return ctrl.Result{}, err
	}

	// Ensure the StatefulSet size is the same as the spec
	size := *solrCluster.Spec.Replicas
	if *foundSts.Spec.Replicas != size {
		foundSts.Spec.Replicas = &size
		if err = r.Update(ctx, foundSts); err != nil {
			log.Error(err, "Failed to update StatefulSet", "StatefulSet.Namespace", foundSts.Namespace, "StatefulSet.Name", foundSts.Name)
			return ctrl.Result{}, err
		}
		log.Info("Updated StatefulSet replica count", "Replicas", size)
		return ctrl.Result{Requeue: true}, nil
	}

	// ... More reconciliation logic for status updates, etc. will go here ...

	return ctrl.Result{}, nil
}

// statefulSetForSolrCluster returns a Solr StatefulSet object
func (r *SolrClusterReconciler) statefulSetForSolrCluster(s *mlopsv1alpha1.SolrCluster) *appsv1.StatefulSet {
	// A common pitfall is forgetting to define labels for the selector.
	// This will cause the StatefulSet creation to fail validation.
	ls := map[string]string{"app": "solr-node", "solrcluster": s.Name}
	replicas := *s.Spec.Replicas

	sts := &appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{
			Name:      s.Name,
			Namespace: s.Namespace,
		},
		Spec: appsv1.StatefulSetSpec{
			Replicas: &replicas,
			Selector: &metav1.LabelSelector{
				MatchLabels: ls,
			},
			ServiceName: s.Name + "-headless", // Headless service is crucial for peer discovery
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{
					Labels: ls,
				},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Image: s.Spec.Image,
						Name:  "solr",
						Ports: []corev1.ContainerPort{{
							ContainerPort: 8983,
							Name:          "solr-port",
						}},
						Env: []corev1.EnvVar{{
							Name:  "ZK_HOST",
							Value: s.Spec.ZookeeperConnectString,
						}},
						// Liveness and readiness probes are essential for production
						ReadinessProbe: &corev1.Probe{
							ProbeHandler: corev1.ProbeHandler{
								HTTPGet: &corev1.HTTPGetAction{
									Path: "/solr/admin/info/system",
									Port: intstr.FromInt(8983),
								},
							},
							InitialDelaySeconds: 30,
							PeriodSeconds:       10,
						},
					}},
				},
			},
			// Define a VolumeClaimTemplate for persistent storage
			VolumeClaimTemplates: []corev1.PersistentVolumeClaim{
				{
					ObjectMeta: metav1.ObjectMeta{
						Name: "solr-data",
					},
					Spec: corev1.PersistentVolumeClaimSpec{
						AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
						Resources: corev1.ResourceRequirements{
							Requests: corev1.ResourceList{
								corev1.ResourceStorage: resource.MustParse("10Gi"),
							},
						},
					},
				},
			},
		},
	}
	// Mount the volume to the container
	sts.Spec.Template.Spec.Containers[0].VolumeMounts = []corev1.VolumeMount{
		{
			Name:      "solr-data",
			MountPath: "/var/solr",
		},
	}
	return sts
}
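
One wiring detail worth calling out: because we set a controller reference on the StatefulSet, the controller should also watch StatefulSets so that changes to them (for example, readyReplicas moving) re-trigger reconciliation. A minimal SetupWithManager for that, assuming the standard Kubebuilder scaffolding, looks roughly like this:

// SetupWithManager registers the controller with the manager. For() watches the
// primary SolrCluster resource; Owns() watches StatefulSets created with the
// controller reference set in Reconcile, so their status changes requeue the owner.
func (r *SolrClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&mlopsv1alpha1.SolrCluster{}).
		Owns(&appsv1.StatefulSet{}).
		Complete(r)
}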

Phase 2: Configuration Management and Operator Robustness

A running Solr node is useless without a configuration set (containing schema.xml and solrconfig.xml). Baking these into the container image is inflexible. The standard Kubernetes pattern for this is to use a ConfigMap. The operator must ensure this ConfigMap exists and is accessible to the Solr pods. However, SolrCloud needs its configuration uploaded to ZooKeeper. The solution is an initContainer that runs before the main Solr container, responsible for this one-off setup task.

We added logic to the operator to create a ConfigMap and modified the StatefulSet definition to include the initContainer. The diagram below shows both the reconciliation flow and the pod startup sequence that results.

graph TD
    subgraph "Reconciliation Loop"
        A[User applies SolrCluster CR] --> B{Reconciler};
        B --> C{ConfigMap Exists?};
        C -- No --> D[Create ConfigMap];
        D --> E;
        C -- Yes --> E;
        E{StatefulSet Exists?};
        E -- No --> F[Create StatefulSet w/ InitContainer];
        F --> G;
        E -- Yes --> G[Compare Spec and Scale/Update];
        G --> H{Update SolrCluster Status};
    end

    subgraph "Pod Lifecycle"
        P[Pod Starts] --> IC[InitContainer Runs];
        IC -- "solr zk upconfig" --> ZK[ZooKeeper];
        IC -- Success --> SC[Main Solr Container Starts];
        SC -- "reads config from" --> ZK;
    end

    A -- "Defines" --> CR;
    subgraph "Kubernetes Objects"
        CR(SolrCluster CRD);
        CM(ConfigMap with schema.xml);
        SS(StatefulSet);
        Pod(Pod);
        PVC(PersistentVolumeClaim);
    end

    B -- "Manages" --> CM;
    B -- "Manages" --> SS;
    SS -- "Creates" --> Pod;
    Pod -- "Uses" --> PVC;
    Pod -- "Mounts" --> CM;

The init container itself runs a simple script. We place this script inside another ConfigMap and mount it as an executable file.

#!/bin/sh
set -e

# solr-config-uploader.sh

CONFIG_SET_NAME=$1
ZK_HOST=$2
CONFIG_DIR=/tmp/solr_config

echo "Waiting for Solr to be available..."
# Use the headless service for DNS resolution
until solr status -z $ZK_HOST; do
  echo "ZooKeeper not ready yet, sleeping..."
  sleep 2
done

echo "Checking if config set '$CONFIG_SET_NAME' exists in ZooKeeper..."
if solr zk ls /configs -z "$ZK_HOST" | grep -qw "$CONFIG_SET_NAME"; then
  echo "Config set '$CONFIG_SET_NAME' already exists. Exiting."
  exit 0
else
  echo "Config set not found. Uploading from $CONFIG_DIR..."
  solr zk upconfig -n "$CONFIG_SET_NAME" -d "$CONFIG_DIR" -z "$ZK_HOST"
  echo "Config set '$CONFIG_SET_NAME' uploaded successfully."
fi

This script makes the process idempotent. If the pod restarts, the init container runs again but does nothing if the configuration is already in ZooKeeper, which is a crucial behavior for stability.
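
To make the wiring concrete, here is a sketch of how statefulSetForSolrCluster might be extended to add the init container and the two ConfigMap volumes; the ConfigMap names and mount paths are illustrative assumptions, and the script volume must be mounted with an executable file mode.

// Sketch: additions inside statefulSetForSolrCluster (ConfigMap names are illustrative).
scriptMode := int32(0555) // the uploader script must be executable inside the pod

sts.Spec.Template.Spec.Volumes = append(sts.Spec.Template.Spec.Volumes,
	corev1.Volume{
		Name: "solr-config-files", // holds schema.xml and solrconfig.xml
		VolumeSource: corev1.VolumeSource{
			ConfigMap: &corev1.ConfigMapVolumeSource{
				LocalObjectReference: corev1.LocalObjectReference{Name: s.Name + "-configset"},
			},
		},
	},
	corev1.Volume{
		Name: "uploader-script", // holds solr-config-uploader.sh
		VolumeSource: corev1.VolumeSource{
			ConfigMap: &corev1.ConfigMapVolumeSource{
				LocalObjectReference: corev1.LocalObjectReference{Name: s.Name + "-uploader"},
				DefaultMode:          &scriptMode,
			},
		},
	},
)

sts.Spec.Template.Spec.InitContainers = []corev1.Container{{
	Name:    "upload-configset",
	Image:   s.Spec.Image, // reuse the Solr image so the solr zk CLI is available
	Command: []string{"/scripts/solr-config-uploader.sh", s.Spec.ConfigSetName, s.Spec.ZookeeperConnectString},
	VolumeMounts: []corev1.VolumeMount{
		{Name: "uploader-script", MountPath: "/scripts"},
		{Name: "solr-config-files", MountPath: "/tmp/solr_config"}, // matches CONFIG_DIR in the script
	},
}}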

We also enhanced the Reconcile loop to update the status subresource. A common mistake is to calculate status purely from the StatefulSet spec; in reality, you must inspect the observed status of the managed resources.

// internal/controller/solrcluster_controller.go (inside Reconcile)

// ... after ensuring StatefulSet size is correct ...

	// Update the SolrCluster status
	if solrCluster.Status.ReadyReplicas != foundSts.Status.ReadyReplicas {
		solrCluster.Status.ReadyReplicas = foundSts.Status.ReadyReplicas
		// Naive state for now, more complex logic is needed for production
		if foundSts.Status.ReadyReplicas == size {
			solrCluster.Status.ClusterState = "Ready"
		} else {
			solrCluster.Status.ClusterState = "Reconciling"
		}

		// Also update the service endpoint for the UI to use
		// This should point to a non-headless service we also create
		serviceEndpoint := fmt.Sprintf("http://%s.%s.svc.cluster.local:8983", solrCluster.Name, solrCluster.Namespace)
		solrCluster.Status.ServiceEndpoint = serviceEndpoint
		
		if err = r.Status().Update(ctx, solrCluster); err != nil {
			log.Error(err, "Failed to update SolrCluster status")
			return ctrl.Result{}, err
		}
	}

This pattern of Read -> Compare -> Act -> Update Status is the essence of a Kubernetes controller.
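
The serviceEndpoint written into the status above presumes a client-facing ClusterIP Service alongside the headless Service referenced by the StatefulSet. A compressed sketch of the client Service the reconciler would also ensure exists (the create-if-missing handling mirrors the StatefulSet logic):

// serviceForSolrCluster returns the ClusterIP Service that the UI and other clients use.
// The headless peer-discovery Service differs only in its Name (s.Name+"-headless")
// and Spec.ClusterIP set to corev1.ClusterIPNone.
func (r *SolrClusterReconciler) serviceForSolrCluster(s *mlopsv1alpha1.SolrCluster) *corev1.Service {
	ls := map[string]string{"app": "solr-node", "solrcluster": s.Name}
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      s.Name,
			Namespace: s.Namespace,
		},
		Spec: corev1.ServiceSpec{
			Selector: ls,
			Ports: []corev1.ServicePort{{
				Name:       "solr-port",
				Port:       8983,
				TargetPort: intstr.FromInt(8983),
			}},
		},
	}
}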

Phase 3: The Angular UI - Management and Control

With the operator managing the lifecycle, we turned to the UI. The goal was to create a component within the Kubeflow dashboard to list, create, and delete SolrCluster resources.

The first challenge is communication with the Kubernetes API server from a browser. Kubeflow’s architecture includes a backend that proxies requests to in-cluster services, including the main Kubernetes API. This is perfect, as it handles authentication and authorization transparently for the frontend. We just need to hit the right endpoints.

We created an ApiService in Angular to handle these interactions.

// src/app/solr-management/api.service.ts
import { Injectable } from '@angular/core';
import { HttpClient, HttpHeaders } from '@angular/common/http';
import { Observable } from 'rxjs';

// Simplified model of our CRD
export interface SolrCluster {
  apiVersion: string;
  kind: string;
  metadata: {
    name: string;
    namespace: string;
    [key: string]: any;
  };
  spec: {
    replicas: number;
    image: string;
    zookeeperConnectString: string;
    configSetName: string;
  };
  status?: {
    readyReplicas?: number;
    clusterState?: string;
    serviceEndpoint?: string;
  };
}

@Injectable({
  providedIn: 'root',
})
export class SolrApiService {
  // The Kubeflow proxy exposes the K8s API under this path prefix
  private apiPrefix = '/api/k8s';

  constructor(private http: HttpClient) {}

  // Get all SolrClusters in a given namespace
  public getSolrClusters(namespace: string): Observable<{ items: SolrCluster[] }> {
    const url = `${this.apiPrefix}/apis/mlops.example.com/v1alpha1/namespaces/${namespace}/solrclusters`;
    return this.http.get<{ items: SolrCluster[] }>(url);
  }

  // Create a new SolrCluster
  public createSolrCluster(namespace: string, cluster: SolrCluster): Observable<SolrCluster> {
    const url = `${this.apiPrefix}/apis/mlops.example.com/v1alpha1/namespaces/${namespace}/solrclusters`;
    const httpOptions = {
      headers: new HttpHeaders({
        'Content-Type': 'application/json',
      }),
    };
    return this.http.post<SolrCluster>(url, cluster, httpOptions);
  }

  // Delete a SolrCluster
  public deleteSolrCluster(namespace: string, name: string): Observable<any> {
    const url = `${this.apiPrefix}/apis/mlops.example.com/v1alpha1/namespaces/${namespace}/solrclusters/${name}`;
    return this.http.delete(url);
  }
}

The component itself uses Angular’s reactive forms for creating new clusters and displays the list of existing ones, polling the getSolrClusters endpoint to show real-time status updates.
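
The polling is plain RxJS. A trimmed sketch of the component class follows (form handling omitted; how the active namespace is obtained from the dashboard is an assumption):

// src/app/solr-management/solr-management.component.ts (trimmed sketch)
import { Component, OnInit } from '@angular/core';
import { Observable, interval } from 'rxjs';
import { map, startWith, switchMap } from 'rxjs/operators';

import { SolrApiService, SolrCluster } from './api.service';

@Component({
  selector: 'app-solr-management',
  templateUrl: './solr-management.component.html',
})
export class SolrManagementComponent implements OnInit {
  currentNamespace = 'team-alpha'; // in practice, taken from the dashboard's namespace selector
  displayedColumns = ['name', 'status', 'ready', 'actions'];
  clusters$!: Observable<SolrCluster[]>;

  constructor(private api: SolrApiService) {}

  ngOnInit(): void {
    // Re-fetch the cluster list every 5 seconds; the async pipe in the template
    // handles subscription and teardown.
    this.clusters$ = interval(5000).pipe(
      startWith(0),
      switchMap(() => this.api.getSolrClusters(this.currentNamespace)),
      map(response => response.items),
    );
  }

  deleteCluster(cluster: SolrCluster): void {
    this.api.deleteSolrCluster(this.currentNamespace, cluster.metadata.name).subscribe();
  }
}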

<!-- src/app/solr-management/solr-management.component.html -->
<div class="container">
  <h2>Solr Cluster Management for Namespace: {{ currentNamespace }}</h2>

  <!-- Creation Form -->
  <form [formGroup]="createForm" (ngSubmit)="onSubmit()">
    <mat-form-field>
      <mat-label>Cluster Name</mat-label>
      <input matInput formControlName="name">
    </mat-form-field>
    
    <mat-form-field>
      <mat-label>Replicas</mat-label>
      <input matInput type="number" formControlName="replicas">
    </mat-form-field>
    
    <!-- Other form fields for image, zk connect string, etc. -->
    
    <button mat-raised-button color="primary" type="submit" [disabled]="!createForm.valid">
      Create Cluster
    </button>
  </form>

  <!-- Cluster List -->
  <table mat-table [dataSource]="clusters$" class="mat-elevation-z8">
    <!-- Name Column -->
    <ng-container matColumnDef="name">
      <th mat-header-cell *matHeaderCellDef> Name </th>
      <td mat-cell *matCellDef="let element"> {{element.metadata.name}} </td>
    </ng-container>

    <!-- Status Column -->
    <ng-container matColumnDef="status">
      <th mat-header-cell *matHeaderCellDef> Status </th>
      <td mat-cell *matCellDef="let element">
        <span [ngClass]="{'status-ready': element.status?.clusterState === 'Ready', 'status-reconciling': element.status?.clusterState !== 'Ready'}">
          {{ element.status?.clusterState || 'Pending' }}
        </span>
      </td>
    </ng-container>

    <!-- Ready Replicas Column -->
    <ng-container matColumnDef="ready">
      <th mat-header-cell *matHeaderCellDef> Ready </th>
      <td mat-cell *matCellDef="let element">
        {{ element.status?.readyReplicas || 0 }} / {{ element.spec.replicas }}
      </td>
    </ng-container>

    <!-- Actions Column -->
    <ng-container matColumnDef="actions">
        <th mat-header-cell *matHeaderCellDef> Actions </th>
        <td mat-cell *matCellDef="let element">
            <button mat-icon-button color="warn" (click)="deleteCluster(element)">
                <mat-icon>delete</mat-icon>
            </button>
        </td>
    </ng-container>

    <tr mat-header-row *matHeaderRowDef="displayedColumns"></tr>
    <tr mat-row *matRowDef="let row; columns: displayedColumns;"></tr>
  </table>
</div>

Phase 4: The Search Interface - Querying the Data

The final piece was the search interface itself. This requires the UI to communicate directly with the Solr service endpoint. A pitfall here is hardcoding the service URL. The correct approach is to discover it dynamically. Our operator already populates the serviceEndpoint in the SolrCluster status. The UI can read this from the CR it already fetched.

We created a second Angular service, SolrSearchService, dedicated to constructing and sending search queries.

// src/app/solr-search/solr-search.service.ts
import { Injectable } from '@angular/core';
import { HttpClient, HttpParams } from '@angular/common/http';
import { Observable } from 'rxjs';

@Injectable({
  providedIn: 'root',
})
export class SolrSearchService {
  constructor(private http: HttpClient) {}

  /**
   * Performs a search against a specific Solr instance and collection.
   * A common mistake is to proxy this through the Kubernetes API server itself.
   * If the service is exposed via an Ingress or Gateway, calling it directly is simpler
   * and faster; here we assume the Kubeflow backend provides a generic service proxy.
   * @param serviceEndpoint The base URL of the Solr service (from the CR status)
   * @param collection The collection to search
   * @param query The user's query string
   * @returns An observable of the Solr response
   */
  public search(
    serviceEndpoint: string,
    collection: string,
    query: string,
  ): Observable<any> {
    // Construct the proxied URL. The actual implementation detail depends on the
    // specific Kubeflow proxy setup.
    // e.g., /api/service/<namespace>/<service-name>:<port>/...
    const endpoint = serviceEndpoint.replace('http://', '').replace(':8983', '');
    const [serviceName, namespace] = endpoint.split('.');
    
    const proxiedUrl = `/api/service/${namespace}/${serviceName}:8983/solr/${collection}/select`;

    let params = new HttpParams();
    params = params.set('q', query);
    params = params.set('wt', 'json'); // Request JSON response format

    return this.http.get(proxiedUrl, { params });
  }
}

The search component presents a dropdown to select an active SolrCluster (and by extension, its endpoint) and a text box for the query. When the user executes a search, the component calls the SolrSearchService, which uses the dynamically discovered endpoint from the selected cluster’s status. This decouples the UI from the cluster’s network configuration, a key principle for building resilient systems.
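
The glue code in the search component is minimal. A hedged sketch follows (the collection name and component structure are illustrative):

// src/app/solr-search/solr-search.component.ts (illustrative sketch)
import { Component } from '@angular/core';

import { SolrCluster } from '../solr-management/api.service';
import { SolrSearchService } from './solr-search.service';

@Component({
  selector: 'app-solr-search',
  templateUrl: './solr-search.component.html',
})
export class SolrSearchComponent {
  clusters: SolrCluster[] = [];  // populated from the same list call the management view uses
  selectedCluster?: SolrCluster; // bound to the cluster dropdown
  query = '';
  results: any[] = [];

  constructor(private solrSearch: SolrSearchService) {}

  onSearch(): void {
    const endpoint = this.selectedCluster?.status?.serviceEndpoint;
    if (!endpoint || !this.query) {
      return; // no endpoint yet means the operator has not reported the cluster Ready
    }
    // 'ml-artifacts' is an assumed collection name for illustration.
    this.solrSearch.search(endpoint, 'ml-artifacts', this.query).subscribe(resp => {
      this.results = resp?.response?.docs ?? [];
    });
  }
}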

The result is a fully integrated workflow. A data scientist can, from within the Kubeflow dashboard, provision a dedicated search cluster for their project, see when it’s ready, and then immediately use a search bar to query the ML artifact metadata that an external pipeline has indexed into it. All of this happens through declarative, custom Kubernetes resources, without ever needing kubectl or direct access to the cluster’s infrastructure.

This architecture, however, is not without its limitations. The current operator logic for updates is simplistic; a real-world implementation would require a more sophisticated rolling update strategy to handle Solr version changes or configuration updates without downtime. The security model also assumes a trusted environment within the cluster namespace. Exposing the Solr UI or API externally would necessitate integration with an Ingress controller and robust NetworkPolicy objects, which the operator would also need to manage. Furthermore, the indexing of metadata from Kubeflow pipelines into Solr is a critical, separate component that was outside the scope of this implementation but is essential for the system’s utility. Future iterations would focus on building a corresponding “indexing controller” that watches PipelineRun objects and pushes relevant metadata into the appropriate Solr cluster.

