Our team’s adoption of a utility-first CSS methodology, specifically UnoCSS, was meant to accelerate UI development and enforce consistency. In practice, it led to a new class of problems. While developers were fast, the resulting class strings were often inconsistent, verbose, or semantically incorrect. We saw things like p-2 p-4, or flex items-center justify-start where flex items-center would have sufficed. Standard linters fell short; they could catch syntax errors but not stylistic or architectural anti-patterns.
Our first attempt to solve this involved leveraging a Large Language Model (LLM) to generate component markup from natural language prompts. The idea was to create a “golden path” for component creation. The initial prototypes were both impressive and alarming. A prompt like “a primary action button” could produce a perfectly styled button, but a slightly different prompt might yield a monstrosity with conflicting classes and redundant utilities. The LLM’s output was too stochastic for direct inclusion in a production pipeline. It lacked the deterministic rigor required for a design system. This is the log of our journey to build a validation and self-correction layer around the LLM, creating a hybrid system that pairs generative creativity with classical machine learning’s analytical power.
Initial Concept and The Core Problem
The first iteration was a simple FastAPI service in Python that wrapped an OpenAI API call. A developer could send a JSON payload with a prompt, and the service would return an HTML string.
# initial_prototype.py
# WARNING: This is the naive first version and has significant flaws.
import os
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import openai
# In a real-world project, this key would be managed via a secrets manager.
openai.api_key = os.getenv("OPENAI_API_KEY")
app = FastAPI()
class GenerationRequest(BaseModel):
prompt: str
@app.post("/generate")
async def generate_component(request: GenerationRequest):
system_prompt = """
You are an expert front-end developer specializing in UnoCSS.
Generate a single, self-contained HTML snippet based on the user's prompt.
Use only standard UnoCSS utility classes. Do not use custom CSS or style tags.
The output must be only the HTML code, with no explanations.
"""
try:
response = openai.chat.completions.create(
model="gpt-4-turbo",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": request.prompt},
],
temperature=0.5,
max_tokens=500,
)
generated_html = response.choices[0].message.content.strip()
return {"html": generated_html}
    except Exception as e:
        # Basic error handling, will be improved later. Returning a tuple here
        # would not set the status code in FastAPI, so raise an HTTPException instead.
        raise HTTPException(status_code=500, detail=str(e))
The output for “a dismissible warning alert” was often acceptable:
<div class="relative flex items-center p-4 border border-yellow-300 bg-yellow-100 text-yellow-800 rounded-md">
<div class="i-carbon-warning-filled text-xl mr-3"></div>
<span>This is a warning message.</span>
<button class="absolute top-2 right-2 p-1 text-yellow-600 hover:text-yellow-900">
<div class="i-carbon-close"></div>
</button>
</div>
However, for slightly more complex requests, the cracks appeared. “A card with a header, footer, and some padded content” produced this mess on one run:
<div class="border rounded-lg shadow-md m-4 p-0">
<div class="p-4 border-b font-bold p-3">Card Header</div>
<div class="p-6 p-4">Main content goes here. It has some padding.</div>
<div class="p-4 border-t text-sm text-gray-500 p-2">Footer Info</div>
</div>
The immediate anti-patterns are obvious to a human developer: p-4 p-3, p-6 p-4, and p-4 p-2. The LLM understood the concept of padding but failed at consistent application. It was clear we couldn’t just trust the generator. We needed a gatekeeper.
The Turn to Scikit-learn: A Discriminative Approach
Our problem wasn’t that the LLM was incapable, but that its output was unpredictable. The errors, however, fell into recognizable categories: redundancy, conflict, and non-idiomatic patterns. This is not a task for another LLM; it’s a classification problem. We could train a simpler, faster, and more deterministic model to look at a set of UnoCSS classes and classify it as “good” (1) or “bad” (0).

Scikit-learn was the obvious choice. We didn’t need the complexity of a deep learning framework. A robust classifier like a RandomForestClassifier or GradientBoostingClassifier would be perfect, especially given that our main challenge would be feature engineering, not model architecture.
The core task became: how do you convert a string like "p-4 border-t text-sm text-gray-500 p-2" into a vector of numbers that a machine learning model can understand? This feature engineering process was the most critical part of the entire project.
We designed a feature extractor that would analyze the set of classes from the generated HTML.
# features/extractor.py
import re
from collections import Counter
class UnoCSSFeatureExtractor:
def __init__(self, design_system_config):
self.config = design_system_config
self.spacing_regex = re.compile(r'^(p|m|gap)-?([xytrbl])?-(\d+(\.\d+)?|px)$')
self.color_regex = re.compile(r'^(text|bg|border|ring)-(\w+)-(\d+)$')
def _extract_classes(self, html_string):
"""Extracts all unique class attributes from an HTML string."""
class_lists = re.findall(r'class="([^"]+)"', html_string)
all_classes = set()
for class_list in class_lists:
all_classes.update(class_list.split())
return list(all_classes)
def featurize(self, html_string: str) -> dict:
"""
Converts an HTML string into a feature vector for ML model consumption.
        The result is a flat dictionary of numeric features; downstream code turns it into the row the ML model consumes.
"""
classes = self._extract_classes(html_string)
if not classes:
return {
"class_count": 0, "redundancy_score": 1.0, "conflict_score": 1.0,
"palette_adherence": 1.0, "spacing_consistency_score": 1.0,
"idiomatic_flex_score": 0, "total_utility_families": 0
}
features = {}
class_count = len(classes)
features["class_count"] = class_count
# 1. Redundancy Score
prefixes = [c.split('-')[0] for c in classes]
prefix_counts = Counter(prefixes)
# Simple redundancy: multiple p-, m-, etc.
redundant_prefixes = {'p', 'm', 'w', 'h', 'text'}
redundancy_issues = sum(1 for p, c in prefix_counts.items() if p in redundant_prefixes and c > 1)
features["redundancy_score"] = 1.0 - (redundancy_issues / len(redundant_prefixes))
# 2. Conflict Score
# Example: presence of both text-red-500 and text-blue-500
color_classes = [c for c in classes if self.color_regex.match(c)]
color_families = Counter([f"{m.group(1)}-{m.group(2)}" for c in color_classes if (m := self.color_regex.match(c))])
conflict_issues = sum(1 for family, count in color_families.items() if count > 1)
features["conflict_score"] = 1.0 - (conflict_issues / class_count if class_count > 0 else 0)
# 3. Design System Palette Adherence
valid_colors = self.config.get('colors', {})
palette_violations = 0
for c in color_classes:
match = self.color_regex.match(c)
if match:
color_name = match.group(2)
if color_name not in valid_colors:
palette_violations += 1
features["palette_adherence"] = 1.0 - (palette_violations / len(color_classes) if color_classes else 0)
# 4. Spacing Consistency Score
spacing_values = []
for c in classes:
match = self.spacing_regex.match(c)
if match and match.group(3) != 'px':
spacing_values.append(float(match.group(3)))
if len(spacing_values) > 1:
# A simple metric: are all spacing values multiples of the smallest non-zero value?
min_spacing = min(v for v in spacing_values if v > 0) if any(v > 0 for v in spacing_values) else 0
if min_spacing > 0:
inconsistent_spacing = sum(1 for v in spacing_values if v % min_spacing != 0)
features["spacing_consistency_score"] = 1.0 - (inconsistent_spacing / len(spacing_values))
else:
features["spacing_consistency_score"] = 1.0
else:
features["spacing_consistency_score"] = 1.0 # Not enough data to judge
# 5. Idiomatic Pattern Usage (e.g., flexbox)
has_flex = 'flex' in classes
has_alignment = any(c.startswith('items-') or c.startswith('justify-') for c in classes)
features["idiomatic_flex_score"] = 1 if has_flex and has_alignment else 0 if has_flex else -1 # -1 for N/A
# 6. Utility Family Diversity
utility_families = {c.split('-')[0] for c in classes}
features["total_utility_families"] = len(utility_families)
return features
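Running the extractor over even a fragment of the problematic card markup made the signal concrete. A quick illustrative check (the palette config mirrors the mock one used later in the service):
# featurizer_demo.py -- illustrative check, not part of the production service
from features.extractor import UnoCSSFeatureExtractor

extractor = UnoCSSFeatureExtractor({"colors": {"blue", "gray", "red", "green", "yellow"}})
features = extractor.featurize('<div class="p-6 p-4">Main content goes here.</div>')
# Only one of the five tracked prefixes ('p') appears more than once, so 1 - 1/5 = 0.8.
print(features["redundancy_score"])  # 0.8
# 6 is not a multiple of the smallest spacing value (4), so 1 of 2 values is flagged.
print(features["spacing_consistency_score"])  # 0.5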
This extractor became the heart of our validator. We built a small dataset by hand, taking about 200 examples of LLM-generated HTML, running them through the featurizer, and manually labeling them as 0 (bad) or 1 (good).
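Assembling dataset.csv was unglamorous but straightforward. A sketch of the labeling script (the labeled_samples.json file and its structure are assumptions; any storage for hand-labeled snippets works):
# data/build_dataset.py -- sketch of how the labeled CSV was assembled
import json

import pandas as pd

from features.extractor import UnoCSSFeatureExtractor

design_system_config = {"colors": {"blue", "gray", "red", "green", "yellow"}}
extractor = UnoCSSFeatureExtractor(design_system_config)

# labeled_samples.json is assumed to hold entries like:
# {"html": "<div class=\"...\">...</div>", "label": 1}
with open("data/labeled_samples.json") as f:
    samples = json.load(f)

rows = []
for sample in samples:
    row = extractor.featurize(sample["html"])
    row["label"] = sample["label"]
    rows.append(row)

pd.DataFrame(rows).to_csv("dataset.csv", index=False)
print(f"Wrote {len(rows)} rows to dataset.csv")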
The training script was standard scikit-learn procedure.
# models/train.py
import pandas as pd
import joblib
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# Assume 'dataset.csv' is created with columns from the featurizer and a 'label' column.
# 'label' is 1 for good code, 0 for bad code.
def train_validator_model():
"""Trains and saves the classification model."""
try:
data = pd.read_csv('dataset.csv')
except FileNotFoundError:
print("Error: dataset.csv not found. Please generate the dataset first.")
return
# A common mistake is to forget to handle missing values or non-numeric types
data = data.fillna(0)
X = data.drop('label', axis=1)
y = data['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# We create a pipeline to scale features and then classify.
# This prevents data leakage from the test set during scaling.
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced'))
])
print("Starting model training...")
pipeline.fit(X_train, y_train)
print("Training complete.")
# Evaluate the model
y_pred = pipeline.predict(X_test)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Save the trained pipeline (scaler + model)
joblib.dump(pipeline, 'models/unocss_validator.joblib')
print("\nModel saved to models/unocss_validator.joblib")
if __name__ == '__main__':
# This script would be run offline as part of the development/CI process.
train_validator_model()
The results were promising. With just a few hundred labeled examples, the RandomForestClassifier achieved over 95% accuracy in identifying clearly flawed patterns.
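A quick sanity check on what the forest actually learned was to load the saved pipeline and print its feature importances. A sketch; the feature names assume the dataset columns mirror the featurizer's keys, and the ranking will vary with the data:
# models/inspect_importances.py -- optional sanity check on the trained pipeline
import joblib

pipeline = joblib.load('models/unocss_validator.joblib')
forest = pipeline.named_steps['classifier']
feature_names = [
    "class_count", "redundancy_score", "conflict_score", "palette_adherence",
    "spacing_consistency_score", "idiomatic_flex_score", "total_utility_families",
]
for name, importance in sorted(zip(feature_names, forest.feature_importances_),
                               key=lambda pair: pair[1], reverse=True):
    print(f"{name:28s} {importance:.3f}")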
Phase 3: The Integrated Pipeline
With a trained model, we rebuilt the FastAPI service to incorporate a validation step. The workflow changed from a simple request-response to a multi-stage process.
sequenceDiagram
    participant User
    participant API Service
    participant LLM
    participant Validator as Validator (Scikit-learn Model)
    User->>API Service: POST /generate (prompt)
    API Service->>LLM: Generate HTML for prompt
    LLM-->>API Service: Returns generated HTML
    API Service->>Validator: Featurize HTML and predict quality
    Validator-->>API Service: Prediction (Good/Bad) and score
    alt Prediction is Good
        API Service-->>User: 200 OK with validated HTML
    else Prediction is Bad
        API Service->>LLM: Regenerate with feedback (e.g., "Avoid redundant padding")
        LLM-->>API Service: Returns new HTML attempt
        Note right of API Service: (Could loop N times or fail)
        API Service-->>User: 200 OK (if successful) or 422 Unprocessable (if fails)
    end
The new service implementation looked much more robust.
# service/main.py
import os
import logging
import joblib
import pandas as pd
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import openai
from features.extractor import UnoCSSFeatureExtractor # Assuming this is in a module
# Setup logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
# Load configurations and models at startup
openai.api_key = os.getenv("OPENAI_API_KEY")
if not openai.api_key:
raise ValueError("OPENAI_API_KEY environment variable not set.")
try:
validator_model = joblib.load('models/unocss_validator.joblib')
except FileNotFoundError:
raise RuntimeError("Validator model not found. Please run training script.")
# A mock design system config
design_system_config = {
"colors": {"blue", "gray", "red", "green", "yellow"}
}
feature_extractor = UnoCSSFeatureExtractor(design_system_config)
app = FastAPI()
class GenerationRequest(BaseModel):
prompt: str
max_retries: int = 2
class GenerationResponse(BaseModel):
html: str
status: str
validation_score: float
def get_llm_response(prompt: str, system_prompt: str) -> str:
"""Encapsulates the OpenAI API call with error handling."""
try:
response = openai.chat.completions.create(
model="gpt-4-turbo",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": prompt},
],
temperature=0.6,
max_tokens=500,
)
return response.choices[0].message.content.strip()
except Exception as e:
logging.error(f"OpenAI API call failed: {e}")
raise HTTPException(status_code=503, detail="LLM service unavailable.")
@app.post("/generate", response_model=GenerationResponse)
async def generate_and_validate(request: GenerationRequest):
system_prompt = "You are an expert front-end developer specializing in UnoCSS..." # Same as before
current_prompt = request.prompt
for attempt in range(request.max_retries + 1):
logging.info(f"Generation attempt {attempt + 1} for prompt: '{request.prompt}'")
generated_html = get_llm_response(current_prompt, system_prompt)
features_dict = feature_extractor.featurize(generated_html)
        # Build a one-row DataFrame so the column names and order match the training data,
        # rather than relying on dict insertion order.
        feature_frame = pd.DataFrame([features_dict])
        prediction = validator_model.predict(feature_frame)[0]
        # Get the probability of the 'good' class (class 1)
        probability_score = validator_model.predict_proba(feature_frame)[0][1]
if prediction == 1:
logging.info(f"Validation successful with score {probability_score:.2f}")
return GenerationResponse(
html=generated_html,
status="validated",
validation_score=probability_score
)
else:
logging.warning(f"Validation failed with score {probability_score:.2f}. Retrying...")
# A pitfall here is crafting good feedback. It needs to be specific.
# This is a simple version. A better one would map features to feedback.
feedback = "The previous attempt was invalid. It contained redundant or conflicting CSS utilities. Please try again, ensuring each utility type (like padding or color) is defined only once per element."
current_prompt = f"{request.prompt}\n\nPREVIOUS FAILED ATTEMPT:\n{generated_html}\n\nFEEDBACK:\n{feedback}"
logging.error(f"Failed to generate valid HTML after {request.max_retries + 1} attempts.")
raise HTTPException(
status_code=422,
detail="Failed to generate a valid component after multiple retries."
)
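The comment in the retry branch hints at the obvious next refinement: deriving feedback from the same features the validator scored, instead of one generic message. A sketch of what that mapping could look like (the thresholds and messages are illustrative, not tuned values):
# feedback.py -- sketch of feature-aware feedback for the retry prompt
FEEDBACK_RULES = {
    "redundancy_score": "Remove duplicate utilities of the same family (e.g. 'p-4 p-2' should be just 'p-4').",
    "conflict_score": "Do not apply conflicting utilities, such as two text colors, to the same element.",
    "palette_adherence": "Only use colors that exist in the design system palette.",
    "spacing_consistency_score": "Stick to a consistent spacing scale instead of mixing unrelated values.",
}

def build_feedback(features: dict, threshold: float = 0.99) -> str:
    """Collects one feedback line for each feature score that falls below the threshold."""
    issues = [msg for name, msg in FEEDBACK_RULES.items() if features.get(name, 1.0) < threshold]
    if not issues:
        return ("The previous attempt did not meet our design system conventions. "
                "Please regenerate it with minimal, consistent utility usage.")
    return "The previous attempt had these specific issues:\n- " + "\n- ".join(issues)
In the retry branch, feedback = build_feedback(features_dict) would then replace the hard-coded string.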
This closed-loop system was a significant improvement. It automatically corrected the LLM’s most common, low-level mistakes. However, the feedback loop was reactive and could be slow. The ideal scenario is to get a better-quality output from the LLM on the first try.
Phase 4: Improving First-Pass Quality with RAG
The final architectural enhancement was to incorporate Retrieval-Augmented Generation (RAG). The idea is to give the LLM relevant context from our own documentation and best-practice examples before it generates the code. This primes the model to produce output that is already aligned with our standards.
We created a small knowledge base of Markdown documents containing:
- Canonical examples of our core components (buttons, cards, alerts).
- Rules of thumb from our design system docs (“Always use spacing tokens from the theme,” “Flexbox containers must define alignment”).
We used the sentence-transformers library to embed these documents into vectors and stored them in a simple in-memory FAISS index for fast similarity search.
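The retrieval layer itself was thin. A minimal sketch of how the KnowledgeBase referenced in the service could be built with sentence-transformers and FAISS (the Document wrapper, the class interface, and the all-MiniLM-L6-v2 model choice are assumptions here):
# rag_component.py -- minimal sketch of the retrieval layer
from pathlib import Path

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

class Document:
    """Tiny wrapper so callers can read doc.page_content, as the service expects."""
    def __init__(self, page_content: str):
        self.page_content = page_content

class KnowledgeBase:
    def __init__(self, documents_path: str, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self.documents = [Document(p.read_text()) for p in sorted(Path(documents_path).glob("*.md"))]
        embeddings = self.model.encode(
            [doc.page_content for doc in self.documents], normalize_embeddings=True
        )
        # Inner product over L2-normalized vectors is cosine similarity.
        self.index = faiss.IndexFlatIP(embeddings.shape[1])
        self.index.add(np.asarray(embeddings, dtype="float32"))

    def search(self, query: str, top_k: int = 3):
        query_vec = self.model.encode([query], normalize_embeddings=True)
        _, indices = self.index.search(np.asarray(query_vec, dtype="float32"), top_k)
        return [self.documents[i] for i in indices[0]]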
The pipeline was updated one last time:
# Part of the service/main.py, showing the RAG integration
# This would be initialized at startup
# from rag_component import KnowledgeBase
# knowledge_base = KnowledgeBase(documents_path="kb/")
async def generate_and_validate_with_rag(request: GenerationRequest):
# 1. Retrieve relevant context from the knowledge base
context_docs = knowledge_base.search(request.prompt, top_k=3)
context_str = "\n\n".join([doc.page_content for doc in context_docs])
# 2. Build an augmented prompt
augmented_prompt = f"""
Based on the following best practices and examples from our design system:
--- CONTEXT ---
{context_str}
--- END CONTEXT ---
Now, fulfill the user's request: {request.prompt}
"""
# The rest of the generation and validation loop follows...
# ...
This RAG-based approach dramatically improved the quality of the first-pass generation. The LLM, now equipped with concrete examples and rules, was far less likely to produce code that would be rejected by the scikit-learn validator. The validator’s role shifted from being a constant corrector to being a final quality assurance check for edge cases the RAG context didn’t cover.
The feature engineering for the scikit-learn validator remains a critical, and somewhat manual, process. Its effectiveness is directly tied to how well the features capture the nuances of our design system’s rules. A more advanced implementation might explore graph-based features to understand the relationships between utilities applied to nested elements, but this adds significant complexity. Furthermore, the current system is reactive. A truly self-improving system would use the validated or rejected outputs to continuously fine-tune a smaller, domain-specific model, but the MLOps required for such a system is non-trivial. The architecture, while effective, is synchronous and may not scale well under high load without transitioning to an asynchronous task queue model for the generation and validation steps.