Architecture
This document describes the technical architecture and design decisions behind SchemaX.
Overview
SchemaX implements a provider-based, snapshot-driven schema versioning system. The core principle is to maintain an append-only operation log with periodic snapshots, enabling both state-based and change-based workflows across multiple catalog types (Unity Catalog, Hive Metastore, PostgreSQL, etc.).
Design Goals
- Git-Friendly: Store schema definitions in human-readable JSON that produces clean diffs
- Reproducible: Replay operations from a snapshot to reconstruct current state
- Performant: Fast loading even with hundreds of operations
- Auditable: Complete history of who changed what and when
- Migration-Ready: Operations can be converted to SQL migration scripts
- Extensible: Easy to add support for new catalog types via providers
- Multi-Provider: Support multiple catalog systems with unified interface
Architectural Patterns
SchemaX follows several well-established architectural patterns that work together to provide a robust, maintainable, and extensible system.
Primary Pattern: Event Sourcing
The foundation of SchemaX is Event Sourcing - all changes are stored as immutable events (operations) in an append-only log.
Implementation:
// Operations are immutable events
interface Operation {
id: string;
ts: string;
provider: string;
op: string;
target: string;
payload: Record<string, any>;
}
// Current state = replay all operations from a snapshot
state = loadSnapshot(latestSnapshot);
for (const operation of changelog.ops) {
state = provider.applyOperation(state, operation);
}
Key Characteristics:
- ✅ Append-only log (changelog.json)
- ✅ Operations never modified or deleted
- ✅ Complete audit trail
- ✅ Time-travel capability via snapshots
- ✅ State is derived, not stored directly
Benefits:
- Full history of all changes
- Can reconstruct state at any point
- Easy debugging ("what happened?")
- Enables undo/redo capabilities
Snapshot + Delta Pattern
Optimization of Event Sourcing to prevent unbounded operation log growth.
Implementation:
State at v0.3.0 =
load_snapshot("v0.2.0") +
apply_operations(changelog.ops)
.schemax/
├── snapshots/v0.2.0.json ← Full state checkpoint
└── changelog.json ← Only ops since v0.2.0
Benefits:
- Fast state loading (no need to replay 1000s of operations)
- Bounded memory usage
- Clean separation of committed vs uncommitted changes
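The checkpoint-plus-replay mechanic can be sketched in a few lines; the reducer and data shapes below are illustrative stand-ins, not SchemaX's actual storage API:

```python
# Minimal sketch of snapshot + delta loading (hypothetical data shapes).

def apply_operation(state, op):
    """Pure reducer: returns a new state, never mutates the input."""
    new_state = {**state, "catalogs": list(state["catalogs"])}
    if op["op"] == "unity.add_catalog":
        new_state["catalogs"].append(op["payload"]["name"])
    return new_state

def load_current_state(snapshot, changelog_ops):
    """Current state = snapshot checkpoint + replay of ops since it."""
    state = snapshot["state"]
    for op in changelog_ops:
        state = apply_operation(state, op)
    return state

snapshot = {"version": "v0.2.0", "state": {"catalogs": ["main"]}}
ops = [{"op": "unity.add_catalog", "payload": {"name": "analytics"}}]
state = load_current_state(snapshot, ops)
print(state["catalogs"])  # ['main', 'analytics']
```

Only the operations accumulated since the last snapshot are replayed, which is what keeps loading fast regardless of total history length.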
Plugin Architecture (Provider System)
Extensibility through providers - catalog-specific implementations plugged into a common interface.
Implementation:
// Base contract
interface Provider {
info: ProviderInfo;
capabilities: ProviderCapabilities;
applyOperation(state: ProviderState, op: Operation): ProviderState;
getSQLGenerator(state: ProviderState): SQLGenerator;
validateOperation(op: Operation): ValidationResult;
}
// Implementations
class UnityProvider implements Provider { ... }
class HiveProvider implements Provider { ... }
class PostgresProvider implements Provider { ... }
// Registry
ProviderRegistry.register(unityProvider);
Key Characteristics:
- ✅ Open/Closed Principle (open for extension, closed for modification)
- ✅ Each provider is isolated and independent
- ✅ Core system doesn't know provider details
- ✅ New providers added without changing core
Strategy Pattern (Provider Operations)
Different algorithms (SQL generation, state reduction) selected based on provider.
Implementation:
// Context uses provider to select strategy
function generateSQL(ops: Operation[], project: Project) {
const provider = ProviderRegistry.get(project.provider.type);
const generator = provider.getSQLGenerator(state);
return generator.generateSQL(ops);
}
// Concrete strategies
class UnitySQLGenerator extends SQLGenerator {
generateSQL(ops: Operation[]): string {
// Unity Catalog-specific SQL
}
}
class HiveSQLGenerator extends SQLGenerator {
generateSQL(ops: Operation[]): string {
// Hive Metastore-specific SQL
}
}
Benefits:
- Swappable implementations
- Each strategy optimized for its system
- Clean separation of concerns
State Reducer Pattern (Redux-inspired)
Immutable state transformations through pure functions.
Implementation:
function applyOperation(state: ProviderState, operation: Operation): ProviderState {
// Pure function: state + operation → new_state
const newState = deepClone(state);
switch (operation.op) {
case 'unity.add_catalog':
newState.catalogs.push(createCatalog(operation.payload));
break;
case 'unity.add_table':
const schema = findSchema(newState, operation.payload.schemaId);
schema.tables.push(createTable(operation.payload));
break;
}
return newState; // Never mutate input
}
Key Principles:
- ✅ Pure functions (no side effects)
- ✅ Immutable state
- ✅ Predictable transformations
- ✅ Easy to test
- ✅ Time-travel debugging
Redux Comparison:
// Redux
newState = reducer(state, action)
// SchemaX
newState = provider.applyOperation(state, operation)
Registry Pattern (Provider Lookup)
Central registry for service discovery and dependency injection.
Implementation:
class ProviderRegistryClass {
private providers = new Map<string, Provider>();
register(provider: Provider): void {
this.providers.set(provider.info.id, provider);
}
get(providerId: string): Provider | undefined {
return this.providers.get(providerId);
}
}
// Singleton
export const ProviderRegistry = new ProviderRegistryClass();
// Auto-registration on import
ProviderRegistry.register(unityProvider);
Benefits:
- Service discovery
- Loose coupling
- Easy testing (swap implementations)
Repository Pattern (Storage Layer)
Abstraction over file system operations.
Implementation:
// storage_v3.ts/py acts as repository
class StorageRepository {
readProject(workspacePath: Path): ProjectFile;
writeProject(workspacePath: Path, project: ProjectFile): void;
readChangelog(workspacePath: Path): ChangelogFile;
writeChangelog(workspacePath: Path, changelog: ChangelogFile): void;
readSnapshot(workspacePath: Path, version: string): SnapshotFile;
writeSnapshot(workspacePath: Path, snapshot: SnapshotFile): void;
}
// Usage
const project = storage.readProject(workspace);
// Don't care if it's JSON, SQLite, or remote API
Benefits:
- Data access abstraction
- Easy to swap storage backend
- Testability (mock the repository)
Command Pattern (Operations)
Operations as command objects that encapsulate requests.
Implementation:
// Command = Operation
interface Operation {
id: string; // Command ID
op: string; // Command name
target: string; // Receiver
payload: object; // Parameters
ts: string; // Timestamp
}
// Command execution
function execute(state: State, command: Operation): State {
return applyOperation(state, command);
}
Characteristics:
- ✅ Encapsulates request as object
- ✅ Supports queuing and logging
- ✅ Can be serialized
- ✅ Enables undo (store reverse operations)
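Undo can be built by deriving a reverse command for each operation. The inverse-command table below is a hypothetical sketch; the real SchemaX operation vocabulary may differ:

```python
# Hypothetical inverse-command table; illustrative operation names only.
INVERSE_OPS = {
    "unity.add_catalog": "unity.drop_catalog",
    "unity.add_table": "unity.drop_table",
}

def invert(op: dict) -> dict:
    """Build the reverse command that undoes a creating operation."""
    return {
        "id": op["id"] + "_undo",
        "op": INVERSE_OPS[op["op"]],
        "target": op["target"],
        "payload": {},
    }

cmd = {"id": "op_1", "op": "unity.add_catalog", "target": "cat_1", "payload": {}}
print(invert(cmd)["op"])  # unity.drop_catalog
```

Because commands are serializable, the reverse operations can be stored alongside the originals and replayed to roll a workspace back.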
Adapter Pattern (Python ↔ TypeScript)
Translating between language ecosystems while maintaining compatibility.
Implementation:
# Python (Pydantic) - accepts both camelCase and snake_case
from typing import Optional
from pydantic import BaseModel, Field

class Column(BaseModel):
    id: str
    name: str
    mask_id: Optional[str] = Field(None, alias="maskId")

    class Config:
        populate_by_name = True  # Accept both maskId and mask_id
// TypeScript (Zod) - uses camelCase
const Column = z.object({
id: z.string(),
name: z.string(),
maskId: z.string().optional(),
});
Same JSON works in both:
{"id": "col_1", "name": "email", "maskId": "mask_1"}
Benefits:
- Seamless interoperability
- Single source of truth (JSON files)
- No code sharing required
Façade Pattern (CLI)
Simplified interface to complex subsystems.
Implementation:
# cli.py provides simple interface hiding complexity
@cli.command()
def sql(workspace: str):
    # Hides complexity of:
    # - File system operations
    # - Provider lookup
    # - State reconstruction
    # - SQL generation
    state, changelog, provider = load_current_state(Path(workspace))
    generator = provider.get_sql_generator(state)
    sql = generator.generate_sql(changelog.ops)
    console.print(sql)
Benefits:
- Simple API for complex operations
- Easy to use
- Decouples CLI from internal complexity
Pattern Interaction Diagram
┌─────────────────────────────────────────────────────────────────┐
│ CLI / Extension (Façade) │
└────────────┬────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Storage Repository (Repository Pattern) │
│ Reads/writes .schemax/ files │
└────────────┬────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Provider Registry (Registry + Strategy) │
│ ProviderRegistry.get(provider_id) │
└────────────┬────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Provider Instance (Plugin Architecture) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Unity │ │ Hive │ │ PostgreSQL │ │
│ │ Provider │ │ Provider │ │ Provider │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└────────────┬────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ State Reducer (State Reducer Pattern) │
│ state' = applyOperation(state, operation) │
│ (Immutable transformations) │
│ Based on Event Sourcing + Command Pattern │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ SQL Generator (Strategy Pattern) │
│ sql = generator.generateSQL(operations) │
└─────────────────────────────────────────────────────────────────┘
Architectural Style: Functional Core, Imperative Shell
SchemaX follows the Functional Core, Imperative Shell pattern:
Functional Core (Pure Logic):
- State reducers (pure functions)
- Operation validation
- SQL generation algorithms
- State transformations
Imperative Shell (Side Effects):
- File I/O (storage layer)
- CLI output
- VS Code webview communication
- Extension activation
Benefits:
- Easy to test (core is pure functions)
- Easy to reason about (no hidden state)
- Side effects isolated at boundaries
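A minimal sketch of the split, with illustrative file names: the core function is pure and unit-testable without touching disk, while the shell does all the I/O:

```python
# Functional core / imperative shell sketch (file names are illustrative).
import json
from pathlib import Path

def next_state(state: dict, ops: list) -> dict:
    """Functional core: pure, deterministic, trivially unit-testable."""
    catalogs = list(state.get("catalogs", []))
    for op in ops:
        if op["op"].endswith("add_catalog"):
            catalogs.append(op["payload"]["name"])
    return {"catalogs": catalogs}

def run(workspace: Path) -> None:
    """Imperative shell: all file I/O lives at the boundary."""
    state = json.loads((workspace / "snapshot.json").read_text())
    ops = json.loads((workspace / "changelog.json").read_text())["ops"]
    new_state = next_state(state, ops)  # pure call in the middle
    (workspace / "state.json").write_text(json.dumps(new_state))
```

Tests exercise `next_state` directly with plain dictionaries; only integration tests need a real workspace directory.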
Design Patterns Summary
| Pattern | Where Used | Purpose |
|---|---|---|
| Event Sourcing | changelog.json, operations | Core architectural foundation |
| Snapshot + Delta | snapshots/, changelog.json | Performance optimization |
| Plugin Architecture | providers/ system | Extensibility for new catalog types |
| Strategy | Provider implementations | Swappable algorithms |
| State Reducer | state_reducer.ts/py | Immutable state updates |
| Registry | ProviderRegistry | Service discovery |
| Repository | storage_v3.ts/py | Data access abstraction |
| Command | Operation objects | Operation encapsulation |
| Adapter | Pydantic/Zod models | Cross-language compatibility |
| Façade | cli.py, extension.ts | Simple interface to complexity |
Architectural Principles
1. Separation of Concerns
- Storage ≠ Provider ≠ CLI
- Each layer has single responsibility
2. Immutability
- Operations never change
- State transformations create new objects
- Snapshots are read-only
3. Idempotency
- SQL can be run multiple times safely
- Operations produce same result when replayed
4. Extensibility
- Add providers without changing core
- Plugin-based architecture
5. Type Safety
- Zod (TypeScript) and Pydantic (Python)
- Runtime validation
- IDE autocomplete
6. Testability
- Pure functions (state reducers)
- Mockable repositories
- Isolated providers
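Principle 3 can be sketched as a reducer that guards creation, so replaying the same operation is a no-op, mirroring `CREATE ... IF NOT EXISTS` on the SQL side; the data shapes are illustrative:

```python
# Idempotency sketch: guarded creation (illustrative shapes, not real API).

def add_catalog(state: dict, op: dict) -> dict:
    """Replaying the same op leaves the state unchanged."""
    names = {c["name"] for c in state["catalogs"]}
    if op["payload"]["name"] in names:
        return state  # already applied: no-op, like CREATE ... IF NOT EXISTS
    return {"catalogs": state["catalogs"] + [{"name": op["payload"]["name"]}]}

op = {"payload": {"name": "analytics"}}
once = add_catalog({"catalogs": []}, op)
twice = add_catalog(once, op)
print(once == twice)  # True
```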
Anti-Patterns Avoided
✅ No Mutable Global State - All state is passed explicitly
✅ No Tight Coupling - Providers are independent plugins
✅ No Direct File System Access - Goes through repository layer
✅ No Hardcoded Provider Logic - Uses registry + strategy
✅ No Side Effects in Reducers - Pure functions only
Architecture Inspirations
SchemaX's architecture draws inspiration from:
- Redux (State Management)
  - Immutable state
  - Pure reducers
  - Action dispatching → Operations
- Git (Version Control)
  - Commit log → Operations log
  - Branches → Environments
  - Tags → Snapshots
- Terraform (Infrastructure as Code)
  - Desired state → Schema definition
  - Plan → SQL preview
  - Apply → SQL execution
- Liquibase/Flyway (Database Migrations)
  - Version-controlled schema changes
  - Idempotent migrations
  - Rollback support
- Event-Driven Architecture
  - Events → Operations
  - Event store → Changelog
  - Projections → Current state
Provider-Based Architecture
What is a Provider?
A Provider is a plugin that adds support for a specific catalog system. Each provider implements a standard interface that SchemaX uses for:
- State Management - How objects are stored and modified
- Operations - What actions users can perform
- SQL Generation - How to convert operations to DDL statements
- Validation - What constraints and rules to enforce
- UI Metadata - How to display objects in the interface
Provider Registry
class ProviderRegistry {
private static providers = new Map<string, Provider>();
static register(provider: Provider): void;
static get(providerId: string): Provider | undefined;
static getAll(): Provider[];
}
Providers are registered at startup:
// TypeScript
import './providers'; // Auto-registers Unity provider
// Python
from schemax.providers import unity_provider
# Unity provider auto-registered on import
Current Providers
Available:
- ✅ Unity Catalog (unity) - Databricks Unity Catalog with full governance features
Planned (Stage 2):
- 🔜 Hive Metastore (hive) - Apache Hive Metastore
- 🔜 PostgreSQL/Lakebase (postgres) - PostgreSQL with Lakebase extensions
Base Provider Enhancements
Generic SQL Optimization Algorithms
The base provider layer includes reusable optimization algorithms that all providers inherit automatically:
1. ColumnReorderOptimizer - Novel single-column drag detection
- Detects when only one column moved in a reorder operation
- Generates 1 SQL statement instead of N statements
- Time complexity: O(n²) worst case, O(n) average case
- Works for any SQL database with column positioning support
2. OperationBatcher - Groups operations by target object
- Batches table operations to create complete CREATE TABLE statements
- Reduces "empty table + multiple ALTERs" to single CREATE statement
- Consolidates property settings and constraints
- Provider-agnostic algorithm with provider-specific SQL generation
3. Enhanced BaseSQLGenerator - Template pattern for providers
- Abstract methods for provider-specific SQL syntax
- Generic utilities (_build_fqn, escape_identifier, escape_string)
- Dependency-level ordering (catalog → schema → table)
- Optimization components auto-initialized
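The single-move check in ColumnReorderOptimizer can be sketched as follows (an illustrative reimplementation, not the shipped code): if deleting one column from both orders makes them identical, only that column moved, and a single positioned ALTER suffices:

```python
# Single-column-move detection sketch (illustrative, not the shipped code).

def find_single_move(old: list, new: list):
    """Return (column, new_index) if exactly one column moved, else None."""
    if sorted(old) != sorted(new) or old == new:
        return None
    for col in old:
        # Remove one candidate column from both orders; if the remainder
        # matches, that candidate is the only column that moved.
        if [c for c in old if c != col] == [c for c in new if c != col]:
            return col, new.index(col)
    return None  # more than one column moved

move = find_single_move(["id", "name", "email"], ["email", "id", "name"])
print(move)  # ('email', 0)
```

The loop over candidates gives the O(n²) worst case noted above; when the first candidate matches (the common drag-and-drop case), it behaves linearly.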
Benefits for Provider Implementers
When implementing a new provider (e.g., Postgres, Snowflake):
- 60% Less Code - Generic algorithms already implemented
- Automatic Optimizations - Batching and reordering work out-of-the-box
- Consistent Behavior - All providers generate high-quality SQL
- Faster Time-to-Market - Focus on SQL syntax, not algorithms
- Easier Maintenance - Bug fixes in base benefit all providers
Implementation Example
class PostgresSQLGenerator(BaseSQLGenerator):
    """Postgres provider using base optimizations"""

    def __init__(self, state, name_mapping=None):
        super().__init__(state, name_mapping)
        # self.batcher and self.optimizer inherited from base
        self.id_name_map = self._build_id_name_map()

    # Implement abstract methods
    def _get_target_object_id(self, op):
        """Extract table ID from Postgres operation"""
        if op.op == "postgres.create_table":
            return op.target
        return op.payload.get("table_id")

    def _is_create_operation(self, op):
        """Check if operation creates new object"""
        return op.op in ["postgres.create_schema", "postgres.create_table"]

    def _generate_batched_create_sql(self, object_id, batch_info):
        """Generate Postgres CREATE TABLE with columns"""
        # Postgres-specific SQL syntax here
        return f"CREATE TABLE {table_name} ({columns}) ..."

# Base handles: batching, reordering, dependency ordering
Code Reduction Metrics
Unity Catalog provider refactoring results:
- Before: 928 lines (everything Unity-specific + generic algorithms)
- After: 738 lines (only Unity-specific SQL syntax)
- Reduction: 190 lines (20% reduction)
- Moved to Base: 272 lines of generic code
Future providers will start with this reduced baseline.
File Structure
Version 3 Architecture (Current)
workspace-root/
└── .schemax/
├── project.json # Project metadata with provider info
├── changelog.json # Uncommitted operations
├── snapshots/
│ ├── v0.1.0.json # Full state snapshot
│ ├── v0.2.0.json
│ └── v0.3.0.json
└── migrations/ # Generated SQL files
└── migration_*.sql
Key Changes from V2:
- project.json now includes provider metadata
- Operations are prefixed with provider ID (e.g., unity.add_catalog)
- Snapshots include provider context
- Automatic V2 to V3 migration on first load
Project File Schema (V3)
project.json contains metadata and provider selection:
{
"version": 3,
"name": "my-databricks-schemas",
"provider": {
"type": "unity",
"version": "1.0.0"
},
"environments": ["dev", "test", "prod"],
"latestSnapshot": "v0.2.0",
"snapshots": [
{
"id": "snap_uuid",
"version": "v0.1.0",
"name": "Initial schema",
"file": ".schemax/snapshots/v0.1.0.json",
"ts": "2025-10-06T10:00:00Z",
"opsCount": 15,
"hash": "sha256...",
"previousSnapshot": null
}
],
"deployments": [],
"settings": {
"autoIncrementVersion": true,
"versionPrefix": "v",
"requireSnapshotForProd": true,
"allowDrift": false
}
}
New in V3:
- provider field specifies catalog type and version
- Provider info used to load correct reducer and SQL generator
Snapshot File Schema
snapshots/vX.Y.Z.json contains the full state:
{
"id": "snap_uuid",
"version": "v0.1.0",
"name": "Initial schema",
"ts": "2025-10-06T10:00:00Z",
"createdBy": "user@example.com",
"state": {
"catalogs": [
{
"id": "cat_uuid",
"name": "my_catalog",
"schemas": [
{
"id": "sch_uuid",
"name": "my_schema",
"tables": [...]
}
]
}
]
},
"operations": [
{
"id": "op_1",
"ts": "2025-01-01T10:00:00Z",
"provider": "unity",
"op": "unity.add_schema",
"target": "sch_uuid",
"payload": {...}
},
...
],
"previousSnapshot": null,
"hash": "sha256...",
"tags": ["production"],
"comment": "First release"
}
Note: The structure of state varies by provider. Unity Catalog uses catalogs, Hive uses databases, etc.
Changelog File Schema
changelog.json contains uncommitted operations:
{
"version": 1,
"sinceSnapshot": "v0.2.0",
"ops": [
{
"id": "op_abc123",
"ts": "2025-10-07T14:30:00Z",
"provider": "unity",
"op": "unity.add_catalog",
"target": "cat_new",
"payload": {
"catalogId": "cat_new",
"name": "analytics"
}
}
],
"lastModified": "2025-10-07T14:30:00Z"
}
New in V3:
- provider field on each operation
- op field prefixed with provider ID (e.g., unity.add_catalog)
- Operations validated by provider before being saved
Core Concepts
1. Operations
Operations are the fundamental unit of change in SchemaX. Every user action generates one or more operations.
Operation Structure:
interface Operation {
id: string; // Unique operation ID
ts: string; // ISO 8601 timestamp
provider: string; // Provider ID (e.g., 'unity')
op: string; // Operation type with provider prefix
target: string; // ID of object being modified
payload: Record<string, any>; // Operation-specific data
}
Example - Unity Catalog:
{
"id": "op_abc123",
"ts": "2025-10-07T14:30:00Z",
"provider": "unity",
"op": "unity.add_catalog",
"target": "cat_xyz",
"payload": {
"catalogId": "cat_xyz",
"name": "analytics"
}
}
Example - Hive Metastore (Future):
{
"id": "op_def456",
"ts": "2025-10-07T14:35:00Z",
"provider": "hive",
"op": "hive.add_database",
"target": "db_xyz",
"payload": {
"databaseId": "db_xyz",
"name": "analytics"
}
}
2. Snapshots
Snapshots are point-in-time captures of the complete state. They serve as:
- Performance optimization - No need to replay all operations
- Release markers - Tagged versions for deployment
- Rollback points - Revert to known good state
- Audit checkpoints - Verify system integrity
When to Create Snapshots:
- Before deploying to production
- After completing a major feature
- Before risky operations
- On a regular schedule (e.g., weekly)
3. State Loading (Provider-Aware)
Current state is computed as: Latest Snapshot + Changelog Operations
async function loadCurrentState(workspaceUri: Uri): Promise<{
state: ProviderState;
changelog: ChangelogFile;
provider: Provider;
}> {
// 1. Read project file
const project = await readProject(workspaceUri);
// 2. Get provider from registry
const provider = ProviderRegistry.get(project.provider.type);
// 3. Load latest snapshot (or start with empty state)
let state = project.latestSnapshot
? await readSnapshot(workspaceUri, project.latestSnapshot)
: provider.createInitialState();
// 4. Load changelog
const changelog = await readChangelog(workspaceUri);
// 5. Apply changelog operations using provider's reducer
state = provider.applyOperations(state, changelog.ops);
return { state, changelog, provider };
}
Key Points:
- Provider selected based on
project.provider.type - Provider's state reducer applies operations
- Operations validated by provider before applying
- State structure defined by provider
4. SQL Generation (Provider-Specific)
Each provider implements its own SQL generator that converts operations to DDL:
interface Provider {
getSQLGenerator(state: ProviderState): SQLGenerator;
}
interface SQLGenerator {
generateSQL(ops: Operation[]): string;
}
Example - Unity Catalog:
const generator = unityProvider.getSQLGenerator(state);
const sql = generator.generateSQL(changelog.ops);
Output:
-- Operation: unity.add_catalog (op_abc123)
CREATE CATALOG IF NOT EXISTS `analytics`;
-- Operation: unity.add_schema (op_def456)
CREATE SCHEMA IF NOT EXISTS `analytics`.`bronze`;
Example - Hive Metastore (Future):
-- Operation: hive.add_database (op_abc123)
CREATE DATABASE IF NOT EXISTS analytics;
-- Operation: hive.add_table (op_def456)
CREATE TABLE IF NOT EXISTS analytics.bronze_users (...);
Provider Hierarchy
Different providers have different object hierarchies:
Unity Catalog (3 levels)
Catalog
└─ Schema
└─ Table
Hive Metastore (2 levels)
Database
└─ Table
PostgreSQL (3 levels)
Database
└─ Schema
└─ Table
UI Adaptation:
The UI dynamically adapts to the provider's hierarchy:
const hierarchy = provider.capabilities.hierarchy;
const levels = hierarchy.levels; // Array of HierarchyLevel
// Render tree based on hierarchy depth
levels.forEach(level => {
renderLevel(level.name, level.displayName, level.icon);
});
Operation Flow
1. User Action in UI
// User clicks "Add Catalog"
useDesignerStore().addCatalog('analytics');
2. Store Creates Operation
addCatalog: (name) => {
const catalogId = `cat_${uuidv4()}`;
const provider = get().provider; // Get current provider
const op: Operation = {
id: `op_${uuidv4()}`,
ts: new Date().toISOString(),
provider: provider.id,
op: `${provider.id}.add_catalog`, // Provider prefix!
target: catalogId,
payload: { catalogId, name },
};
emitOps([op]); // Send to extension
},
3. Extension Validates and Saves
// Extension receives operation
const ops: Operation[] = message.payload;
// Validate using provider
const provider = ProviderRegistry.get(project.provider.type);
for (const op of ops) {
const validation = provider.validateOperation(op);
if (!validation.valid) {
throw new Error(`Invalid operation: ${validation.errors}`);
}
}
// Append to changelog
await appendOps(workspaceUri, ops);
4. State Updated
// Reload state
const { state, changelog, provider } = await loadCurrentState(workspaceUri);
// Apply operations using provider
const newState = provider.applyOperations(state, changelog.ops);
// Send back to UI
webview.postMessage({
type: 'project-updated',
payload: { project, state: newState, ops: changelog.ops, provider }
});
Code Organization
TypeScript (VSCode Extension)
src/
├── providers/
│ ├── base/
│ │ ├── provider.ts # Provider interface
│ │ ├── models.ts # Base types
│ │ ├── operations.ts # Operation types
│ │ ├── sql-generator.ts # SQL generator base
│ │ └── hierarchy.ts # Hierarchy types
│ ├── registry.ts # Provider registry
│ └── unity/
│ ├── index.ts # Unity provider exports
│ ├── models.ts # Unity-specific models
│ ├── operations.ts # Unity operations
│ ├── sql-generator.ts # Unity SQL generator
│ ├── state-reducer.ts # Unity state reducer
│ └── hierarchy.ts # Unity hierarchy config
├── storage-v3.ts # Provider-aware storage
├── extension.ts # Extension entry point
└── webview/
├── state/
│ └── useDesignerStore.ts # Provider-aware store
└── components/
└── ... UI components
Python (SDK/CLI)
src/schemax/
├── providers/
│ ├── base/
│ │ ├── provider.py
│ │ ├── models.py
│ │ ├── operations.py
│ │ ├── sql_generator.py
│ │ └── hierarchy.py
│ ├── registry.py
│ └── unity/
│ ├── __init__.py
│ ├── models.py
│ ├── operations.py
│ ├── sql_generator.py
│ ├── state_reducer.py
│ └── hierarchy.py
├── storage_v3.py
└── cli.py
Data Models
Unity Catalog Models
interface UnityCatalog {
id: string;
name: string;
schemas: UnitySchema[];
}
interface UnitySchema {
id: string;
name: string;
tables: UnityTable[];
views: UnityView[];
}
interface UnityTable {
id: string;
name: string;
format: 'delta' | 'iceberg';
columns: UnityColumn[];
properties: Record<string, string>;
constraints: UnityConstraint[];
rowFilters?: UnityRowFilter[];
columnMasks?: UnityColumnMask[];
grants: UnityGrant[];
comment?: string;
}
interface UnityColumn {
id: string;
name: string;
type: string;
nullable: boolean;
comment?: string;
tags?: Record<string, string>;
maskId?: string;
}
interface UnityView {
id: string;
name: string;
definition: string;
comment?: string;
properties?: Record<string, string>;
tags?: Record<string, string>;
extractedDependencies?: {
tables?: string[];
views?: string[];
};
}
Governance Features (Unity Catalog)
interface UnityConstraint {
id: string;
type: 'primary_key' | 'foreign_key' | 'check';
name?: string;
columns: string[];
// ... type-specific fields
}
interface UnityRowFilter {
id: string;
name: string;
enabled: boolean;
udfExpression: string;
description?: string;
}
interface UnityColumnMask {
id: string;
columnId: string;
name: string;
enabled: boolean;
maskFunction: string;
description?: string;
}
Migration from V2 to V3
Automatic Migration
When a V2 project is opened, it's automatically migrated to V3:
async function migrateV2ToV3(
workspaceUri: Uri,
v2Project: any,
providerId: string = 'unity'
): Promise<void> {
// 1. Add provider field
const v3Project = {
...v2Project,
version: 3,
provider: {
type: providerId,
version: '1.0.0',
},
};
// 2. Prefix operations with provider
const changelog = await readChangelog(workspaceUri);
const migratedOps = changelog.ops.map(op => ({
...op,
provider: providerId,
op: `${providerId}.${op.op}`, // Add provider prefix
}));
// 3. Save migrated files
await writeProject(workspaceUri, v3Project);
await writeChangelog(workspaceUri, { ...changelog, ops: migratedOps });
}
Migration is:
- ✅ Automatic (no user action required)
- ✅ Non-destructive (preserves all data)
- ✅ One-way (V3 projects don't downgrade to V2)
- ✅ Logged (migration events logged to output)
Benefits of Provider Architecture
For Users
- Unified Experience - Same tool for Unity Catalog, Hive, PostgreSQL
- Provider-Specific Features - Full support for each catalog's unique features
- Easy Migration - Switch providers if needed (future)
- Single Learning Curve - Learn once, use everywhere
For Developers
- Clear Boundaries - Providers are isolated modules
- No Merge Conflicts - Teams work in separate provider directories
- Independent Testing - Each provider has its own test suite
- Easy to Add - Well-documented provider contract
- Type Safety - TypeScript and Pydantic enforce contracts
For the Project
- Maintainable - Clear separation of concerns
- Scalable - Easy to add new providers
- Testable - Provider compliance tests ensure quality
- Future-Proof - Architecture supports any catalog type
Performance Considerations
State Loading
- Bounded Growth: Snapshots prevent unlimited operation replay
- Lazy Loading: Only load latest snapshot, not entire history
- Incremental Updates: Apply only new operations since last load
Example Timings (100 tables):
- Load snapshot: <10ms
- Apply 50 operations: <5ms
- Total: <15ms ⚡
SQL Generation
SQL generation in SchemaX uses a dependency-aware system that ensures DDL statements are generated in the correct execution order.
Dependency-Aware Ordering
Problem: Database objects have dependencies (e.g., views depend on tables, foreign keys reference other tables). Naive SQL generation in operation timestamp order can produce unexecutable SQL.
Solution: Build a dependency graph and use topological sorting to determine correct execution order.
Implementation:
// 1. Build dependency graph from operations
const graph = new DependencyGraph();
for (const op of operations) {
graph.addNode({
id: getTargetObjectId(op),
type: getObjectType(op),
hierarchyLevel: getDependencyLevel(op),
operation: op
});
// Extract dependencies (e.g., view → table)
const deps = extractOperationDependencies(op);
for (const [depId, depType] of deps) {
graph.addEdge(targetId, depId, depType);
}
}
// 2. Detect cycles
const cycles = graph.detectCycles();
if (cycles.length > 0) {
throw new Error('Circular dependencies detected');
}
// 3. Topological sort (dependencies first)
const sortedOps = graph.topologicalSort();
// 4. Generate SQL in dependency order
return sortedOps.map(op => generateSQLForOperation(op));
Key Features:
- Graph-Based Ordering: Uses NetworkX (Python) or custom graph implementation (TypeScript)
- Cycle Detection: Prevents invalid circular dependencies between views
- Breaking Change Detection: Warns when dropping objects with dependents
- Hierarchy-Aware: Respects natural hierarchy (catalog → schema → table/view)
- Provider-Extensible: Each provider defines its own dependency extraction logic
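The ordering step itself can be sketched with Python's stdlib `graphlib` (the real implementations use NetworkX or a custom graph, and the object names here are illustrative):

```python
# Topological ordering sketch using the stdlib; object names illustrative.
from graphlib import TopologicalSorter, CycleError

# graph maps each object to the set of objects it depends on,
# i.e. objects that must be created first
deps = {
    "catalog.schema.view_users": {"catalog.schema.users"},
    "catalog.schema.users": {"catalog.schema"},
    "catalog.schema": {"catalog"},
    "catalog": set(),
}

try:
    order = list(TopologicalSorter(deps).static_order())
except CycleError:
    # circular dependency: fall back to hierarchy-level + timestamp sorting
    order = []

print(order)
# ['catalog', 'catalog.schema', 'catalog.schema.users', 'catalog.schema.view_users']
```

`static_order()` yields dependencies before dependents, so generating SQL in this order guarantees each object's prerequisites already exist.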
Dependency Types
enum DependencyType {
HIERARCHY = 'hierarchy', // Schema-level (catalog → schema → table)
FOREIGN_KEY = 'foreign_key', // FK constraint dependency
VIEW_DEPENDENCY = 'view_dependency', // View depends on table/view
COLUMN_MASK = 'column_mask', // Column mask dependency
ROW_FILTER = 'row_filter' // Row filter dependency
}
enum DependencyEnforcement {
ENFORCED = 'enforced', // Must be satisfied (hierarchical deps)
INFORMATIONAL = 'informational', // Not enforced (Unity Catalog FKs)
WARNING = 'warning' // Only for warnings
}
View and Materialized View Dependency Extraction
Views and materialized views are the primary use case for dependency-aware ordering. SchemaX extracts dependencies from view/MV SQL (FROM and JOIN references). The same mechanism applies to both: each view or materialized view stores extractedDependencies (tables and views it references), and the SQL generator orders creation so that tables and views are created before any view or materialized view that depends on them. Chains of arbitrary depth are supported: e.g. table → view1 → view2 → … → view_n (and materialized views at the end) are ordered correctly via the dependency graph.
Python (SQLGlot):
from sqlglot import exp, parse_one

def extract_dependencies_from_view(view_sql: str) -> dict:
    parsed = parse_one(view_sql, dialect="databricks")
    tables = []
    for table_exp in parsed.find_all(exp.Table):
        fqn = f"{table_exp.catalog}.{table_exp.db}.{table_exp.name}"
        tables.append(fqn)
    return {"tables": tables, "views": [], ...}
TypeScript (Regex-Based):
function extractDependenciesFromView(viewSql: string) {
const tablePattern = /(?:FROM|JOIN)\s+(?:(\w+)\.)?(?:(\w+)\.)?(\w+)/gi;
const tables = [];
let match;
while ((match = tablePattern.exec(viewSql)) !== null) {
const fqn = [match[1], match[2], match[3]].filter(Boolean).join('.');
tables.push(fqn);
}
return { tables, views: [], catalogs: [], schemas: [] };
}
Manual Dependency Override
In the VS Code Designer, views and materialized views show a Dependencies section (tables and views they depend on). Dependencies are filled automatically from the SQL when you save the definition. You can also edit the list manually (add or remove table or view names) to fix extraction gaps or force creation order. Use object names as they appear in the project (e.g. table name or schema.table). Saving the definition again re-extracts from SQL and replaces the dependency list; use Edit dependencies after saving to adjust if needed.
Hierarchy Levels
Operations are assigned hierarchy levels for fallback ordering:
Level 0: Catalog operations (CREATE CATALOG)
Level 1: Schema operations (CREATE SCHEMA)
Level 2: Table/View/Materialized view creation (CREATE TABLE, CREATE VIEW, CREATE MATERIALIZED VIEW)
Level 3: Modifications (ALTER TABLE, ADD COLUMN, SET PROPERTIES)
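When dependency extraction yields no usable edges, these levels combine with operation timestamps to give a deterministic order. A minimal Python sketch of that fallback sort (the operation shape here is simplified for illustration; real operations carry more fields):

```python
# Fallback ordering: sort by hierarchy level, then by operation timestamp.
ops = [
    {"op": "unity.add_column", "level": 3, "ts": "2024-01-02T00:00:00Z"},
    {"op": "unity.add_catalog", "level": 0, "ts": "2024-01-01T00:00:00Z"},
    {"op": "unity.add_table", "level": 2, "ts": "2024-01-01T12:00:00Z"},
    {"op": "unity.add_schema", "level": 1, "ts": "2024-01-01T06:00:00Z"},
]

# ISO-8601 timestamps compare correctly as strings, so a tuple key suffices.
ordered = [o["op"] for o in sorted(ops, key=lambda o: (o["level"], o["ts"]))]
# catalog first, then schema, table, and finally the column modification
```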
Breaking Change Detection
When a DROP operation is detected, the system checks for dependents:
if (opType.startsWith('drop_')) {
  const dependents = graph.getDependents(targetId);
  if (dependents.length > 0) {
    warnings.push(
      `BREAKING CHANGE: Dropping ${targetId} will affect: ${dependents.join(', ')}`
    );
  }
}
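Conceptually, `getDependents` is a reachability walk over the dependency edges: everything downstream of the dropped object is affected. A hedged Python sketch (the edge representation is assumed for illustration, not SchemaX's actual graph class):

```python
from collections import defaultdict

def get_dependents(edges, target):
    """Direct and transitive dependents of `target`, given (prerequisite,
    dependent) edges -- e.g. dropping a table flags every view built on it."""
    dependents_of = defaultdict(list)
    for prereq, dependent in edges:
        dependents_of[prereq].append(dependent)
    found, stack = [], [target]
    while stack:
        for dep in dependents_of[stack.pop()]:
            if dep not in found:
                found.append(dep)
                stack.append(dep)
    return found

# table1 <- view1 <- view2: dropping table1 affects both views.
edges = [("table1", "view1"), ("view1", "view2")]
print(get_dependents(edges, "table1"))
# → ['view1', 'view2']
```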
Fallback Behavior
If dependency analysis fails (e.g., cycles detected, parsing errors):
- Detect Cycle: Report cycle path to user
- Fall Back: Use hierarchy level + timestamp sorting
- Add Warnings: Include warnings in generated SQL comments
Example Warning:
-- WARNINGS:
-- Circular dependencies detected:
-- view_a → view_b → view_a
-- Using level-based sorting as fallback
-- Statement 1
CREATE SCHEMA my_schema;
Automatic FQN Qualification
Problem: View SQL may contain unqualified table references (e.g., SELECT * FROM table4), which fail when executed without a current catalog/schema context (like in SQL Statement Execution API).
Solution: Automatically qualify all table/view references in view definitions with fully-qualified names (FQN).
Python Implementation (SQLGlot):
def _qualify_view_definition(self, definition: str, extracted_deps: dict) -> str:
    """Qualify unqualified table/view references with fully-qualified names"""
    import sqlglot
    from sqlglot import expressions as exp

    # Parse SQL
    parsed = sqlglot.parse_one(definition, dialect="databricks")

    # Build name → FQN mapping from id_name_map
    name_to_fqn = {}
    for object_id, fqn in self.id_name_map.items():
        if "." in fqn:
            parts = fqn.split(".")
            if len(parts) == 3:
                catalog, schema, name = parts
                name_to_fqn[name] = fqn  # unqualified
                name_to_fqn[f"{schema}.{name}"] = fqn  # partially qualified

    # Replace table references with FQNs
    for table_node in parsed.find_all(exp.Table):
        current_ref = ".".join([
            table_node.catalog, table_node.db, table_node.name
        ]).strip(".")
        if current_ref in name_to_fqn:
            fqn = name_to_fqn[current_ref]
            catalog, schema, name = fqn.split(".")
            # Use quoted identifiers for Databricks compatibility
            table_node.set("catalog", exp.to_identifier(catalog, quoted=True))
            table_node.set("db", exp.to_identifier(schema, quoted=True))
            table_node.set("this", exp.to_identifier(name, quoted=True))

    return parsed.sql(dialect="databricks", pretty=True)
Example Transformation:
-- Input (unqualified)
SELECT * FROM table4
-- Output (qualified with backticks)
SELECT * FROM `dev_sales_analytics`.`customer_analytics`.`table4`
Key Benefits:
- ✅ Views work regardless of current catalog/schema context
- ✅ Compatible with SQL Statement Execution API
- ✅ Prevents `TABLE_OR_VIEW_NOT_FOUND` errors
- ✅ Uses SQLGlot for accurate SQL parsing and transformation
- ✅ Supports unqualified (`table4`) and partially qualified (`schema.table4`) references
Performance
- Graph Construction: O(V + E) where V = operations, E = dependencies
- Cycle Detection: O(V + E) using DFS
- Topological Sort: O(V + E) using Kahn's algorithm
- Typical Scale: <10ms for 100 operations with 50 dependencies
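The topological sort can be sketched with Kahn's algorithm. This standalone Python version (node and edge representation assumed for illustration) also reports whether a cycle prevented a complete ordering, which is what triggers the fallback behavior described above:

```python
from collections import defaultdict, deque

def topo_sort(nodes, edges):
    """Kahn's algorithm, O(V + E). `edges` is a list of (prerequisite,
    dependent) pairs; returns (order, cycle_detected)."""
    indegree = {n: 0 for n in nodes}
    adjacent = defaultdict(list)
    for prereq, dependent in edges:
        adjacent[prereq].append(dependent)
        indegree[dependent] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for dep in adjacent[node]:
            indegree[dep] -= 1
            if indegree[dep] == 0:
                queue.append(dep)
    # If some nodes were never emitted, the remainder contains a cycle.
    return order, len(order) != len(nodes)

# table1 -> view1 -> view2 chain: the table sorts before its dependent views.
order, has_cycle = topo_sort(
    ["view2", "view1", "table1"],
    [("table1", "view1"), ("view1", "view2")],
)
```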
Provider-Specific Implementation
Each provider implements:
abstract class BaseSQLGenerator {
  // Required: Define hierarchy levels
  protected abstract _getDependencyLevel(op: Operation): number;

  // Optional: Extract dependencies from operations
  protected _extractOperationDependencies(op: Operation): Dependency[] {
    return []; // Base implementation
  }
}

class UnitySQLGenerator extends BaseSQLGenerator {
  protected _getDependencyLevel(op: Operation): number {
    if (op.op.includes('catalog')) return 0;
    if (op.op.includes('schema')) return 1;
    if (op.op.includes('add_table') || op.op.includes('add_view')) return 2;
    return 3;
  }

  protected _extractOperationDependencies(op: Operation): Dependency[] {
    // Extract from view SQL, foreign keys, etc.
    if (op.op === 'unity.add_view') {
      const deps = extractDependenciesFromView(op.payload.definition);
      return deps.tables.map(tableId => [
        tableId,
        DependencyType.VIEW_DEPENDENCY,
        DependencyEnforcement.INFORMATIONAL
      ]);
    }
    return [];
  }
}
Benefits
- ✅ Correct Execution Order: SQL runs without dependency errors
- ✅ View Support: Enables complex view hierarchies (view → view → table)
- ✅ Breaking Change Warnings: Prevents accidental data loss
- ✅ Scalable: Handles projects with 1000+ objects efficiently
- ✅ Provider-Agnostic: Works with any catalog system
Memory Usage
- Single State: Only one state object in memory
- Immutable Operations: Operations are small and append-only
- Efficient Snapshots: Snapshots compressed when written
Security & Validation
Operation Validation
Every operation is validated before being saved:
const validation = provider.validateOperation(op);
if (!validation.valid) {
throw new Error(`Invalid operation: ${validation.errors}`);
}
Validation Checks:
- Required fields present
- Field types correct
- Provider supports operation
- References valid (IDs exist)
- Constraints satisfied
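The first three checks can be sketched in a few lines. This is an illustrative Python version only; field names come from the Operation interface shown earlier, and the real `provider.validateOperation` additionally performs reference and constraint checks:

```python
REQUIRED_FIELDS = ("id", "ts", "provider", "op", "target", "payload")

def validate_operation(op: dict, supported_ops: set) -> dict:
    """Required-field, type, and provider-support checks (illustrative)."""
    errors = []
    for field in REQUIRED_FIELDS:
        if field not in op:
            errors.append(f"missing required field: {field}")
    if "payload" in op and not isinstance(op["payload"], dict):
        errors.append("payload must be an object")
    if op.get("op") and op["op"] not in supported_ops:
        errors.append(f"provider does not support operation: {op['op']}")
    return {"valid": not errors, "errors": errors}

result = validate_operation(
    {"id": "op_001", "ts": "2024-01-01T00:00:00Z", "provider": "unity",
     "op": "unity.add_table", "target": "tbl_1", "payload": {"name": "orders"}},
    supported_ops={"unity.add_table", "unity.add_view"},
)
```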
State Validation
State can be validated at any time:
const validation = provider.validateState(state);
if (!validation.valid) {
console.error('State validation errors:', validation.errors);
}
Snapshot Integrity
Snapshots include SHA-256 hash for integrity verification:
const expectedHash = snapshot.hash;
const actualHash = calculateHash(snapshot.state, snapshot.operations);
if (expectedHash !== actualHash) {
throw new Error('Snapshot integrity check failed');
}
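One plausible implementation of `calculateHash` is a SHA-256 digest over a canonical JSON serialization. SchemaX's exact canonicalization scheme isn't specified here, so treat this Python sketch as illustrative:

```python
import hashlib
import json

def calculate_hash(state: dict, operations: list) -> str:
    # Serialize deterministically (sorted keys, fixed separators) so that
    # identical content always produces the same digest.
    canonical = json.dumps(
        {"state": state, "operations": operations},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

snapshot_state = {"catalogs": [{"name": "__implicit__"}]}
digest = calculate_hash(snapshot_state, [])
```

Any change to the state or the operation list yields a different digest, which is what makes the integrity check above meaningful.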
Extension Points
Adding a New Provider
- Implement Provider Interface - See Provider Contract
- Register Provider - Add to registry
- Test Compliance - Run provider compliance tests
- Document - Add provider-specific docs
Adding New Operations
- Define Operation - Add to provider's `operations.ts`
- Update State Reducer - Handle in `state-reducer.ts`
- Update SQL Generator - Generate DDL in `sql-generator.ts`
- Add UI - Create UI for operation
- Test - Add operation tests
Extending UI
- Use Provider Capabilities - Check what provider supports
- Adapt to Hierarchy - Use provider's hierarchy definition
- Dynamic Forms - Generate forms based on operation metadata
- Provider-Specific Components - Add provider-specific features
Multi-Environment Support (v4)
SchemaX v4 introduces comprehensive multi-environment support, enabling users to design schemas once and deploy them to multiple environments (dev, test, prod) with different physical catalog names.
Logical vs Physical Naming
Design Pattern:
- Logical names stored in SchemaX state (environment-agnostic)
- Physical names generated at SQL generation time (environment-specific)
- Mapping defined in `project.json` environment configuration
Example:
// Logical state (in changelog.json)
{
  "catalogs": [{ "name": "__implicit__" }],
  "schemas": [{ "name": "customer_360" }]
}

// Environment configuration (in project.json v4)
{
  "provider": {
    "environments": {
      "dev": { "catalog": "dev_my_analytics" },
      "prod": { "catalog": "prod_my_analytics" }
    }
  }
}

// Generated SQL
schemax sql --target dev  → CREATE SCHEMA `dev_my_analytics`.`customer_360`;
schemax sql --target prod → CREATE SCHEMA `prod_my_analytics`.`customer_360`;
Project Schema v4
Environment Configuration:
{
  "version": 4,
  "provider": {
    "type": "unity",
    "environments": {
      "dev": {
        "catalog": "dev_analytics",
        "description": "Development environment",
        "allowDrift": true,
        "requireSnapshot": false,
        "autoCreateCatalog": true,
        "autoCreateSchemaxSchema": true
      },
      "prod": {
        "catalog": "prod_analytics",
        "description": "Production environment",
        "allowDrift": false,
        "requireSnapshot": true,
        "autoCreateCatalog": false
      }
    }
  }
}
Environment Settings:
| Setting | Type | Description |
|---|---|---|
| `catalog` | string | Physical catalog name in target system |
| `description` | string | Human-readable description |
| `allowDrift` | boolean | Allow actual state to differ from SchemaX |
| `requireSnapshot` | boolean | Require snapshot before deployment |
| `autoCreateCatalog` | boolean | Create catalog if it doesn't exist |
| `autoCreateSchemaxSchema` | boolean | Auto-create tracking schema |
Catalog Name Mapping
Implementation (TypeScript):
// extension.ts
function buildCatalogMapping(
  state: any,
  envConfig: EnvironmentConfig
): Record<string, string> {
  const catalogs = state.catalogs || [];
  if (catalogs.length === 1) {
    const logicalName = catalogs[0].name;   // e.g., "__implicit__"
    const physicalName = envConfig.catalog; // e.g., "dev_analytics"
    return { [logicalName]: physicalName };
  }
  if (catalogs.length > 1) {
    throw new Error("Multi-catalog projects not yet supported");
  }
  return {};
}

// UnitySQLGenerator
constructor(state: UnityState, catalogNameMapping?: Record<string, string>) {
  this.catalogNameMapping = catalogNameMapping || {};
  // Mapping applied when building fully-qualified names
}
Benefits:
- ✅ Clean separation: logical names in state, physical names in deployment
- ✅ Automatic mapping for single-catalog projects (90%+ use case)
- ✅ Environment isolation: same schema → different catalogs per environment
- ✅ Git-friendly: logical names in version control, physical names in config
Implicit Catalog Mode
For single-catalog projects (the vast majority), SchemaX uses implicit catalog mode to simplify the user experience.
User Experience:
Instead of:
- Configure environment catalog: `dev_analytics`
- Create catalog in UI: `analytics`
- Map during deployment: `analytics` → `dev_analytics`
Users now experience:
- Configure environment catalog: `dev_analytics`
- Directly add schemas (no catalog creation step)
- Auto-map: `__implicit__` → `dev_analytics`
Implementation:
// Project settings
{
  "settings": {
    "catalogMode": "single" // Default mode
  }
}

// Auto-created implicit catalog
{
  "catalogs": [
    {
      "id": "cat_implicit",
      "name": "__implicit__", // Special marker
      "schemas": [...]
    }
  ]
}
UI Changes:
- ✅ No "+ Catalog" button in single-catalog mode
- ✅ Schemas shown at root level (flat hierarchy)
- ✅ Catalog layer invisible to user
- ✅ Physical catalog names shown in SQL generation prompts
Benefits:
- Simpler mental model (schemas and tables, not catalogs)
- Matches how users think ("I'm designing customer data")
- Physical catalog per environment already configured
- No confusion about logical vs physical names
Deployment Tracking
SchemaX tracks deployments in the target catalog itself using a dedicated `schemax` schema:
-- Auto-created on first deployment
CREATE SCHEMA IF NOT EXISTS <catalog>.schemax;
-- Deployment history
CREATE TABLE <catalog>.schemax.deployments (
  id STRING,
  environment STRING,
  snapshot_version STRING,
  deployed_at TIMESTAMP,
  deployed_by STRING,
  status STRING, -- pending/success/failed
  ops_count INT,
  error_message STRING,
  sql_executed STRING,
  PRIMARY KEY (id)
);
-- Per-operation tracking
CREATE TABLE <catalog>.schemax.deployment_ops (
  deployment_id STRING,
  op_id STRING,
  op_type STRING,
  sql_statement STRING,
  status STRING,
  execution_order INT,
  PRIMARY KEY (deployment_id, op_id)
);
Benefits:
- ✅ Queryable audit trail
- ✅ Multi-user visibility
- ✅ Compliance and governance
- ✅ Tracks partial failures
Apply Command Workflow
# Generate SQL for dev
schemax sql --target dev --output migration.sql
# Preview changes (dry run)
schemax apply --target dev --profile DEV --warehouse-id abc123 --dry-run
# Apply with confirmation
schemax apply --target dev --profile DEV --warehouse-id abc123
# Non-interactive (CI/CD)
schemax apply --target dev --profile DEV --warehouse-id abc123 --no-interaction
Execution Flow:
- Load project and validate environment config
- Build catalog name mapping (logical → physical)
- Generate SQL with mapped names
- Show preview and prompt for confirmation
- Execute SQL statements sequentially
- Record deployment in `<catalog>.schemax.deployments`
- On failure: stop immediately, record error, show status
Future Architecture Plans
Stage 2 Enhancements
- Multi-Catalog Projects - Support explicit multiple catalogs per environment
- Provider Plugins - Load providers from external packages
- Cross-Provider References - Reference objects across providers
- Provider Marketplace - Community-contributed providers
Advanced Features
- Drift Detection - Compare deployed state vs SchemaX state
- Impact Analysis - Show what a change will affect
- Rollback Support - Revert to previous snapshots
- State Diffs - Visual comparison between versions
- Schema Import - Reverse-engineer existing catalogs into SchemaX
- DAB Generation - Export as Databricks Asset Bundles
Testing & Quality Assurance
Test Coverage
SchemaX includes comprehensive test suites for both Python and TypeScript implementations:
Python SDK Test Status:
- ✅ 124 passing tests (91.2%)
- ⏸️ 12 skipped tests (8.8%) - documented in GitHub issues #19, #20
- Test Categories:
- Storage operations (28 tests)
- State reducer (29 tests)
- Provider system (14 tests)
- SQL generator (39 passing, 9 skipped)
- Integration workflows (13 passing, 4 skipped)
Test Helpers:
Tests use OperationBuilder pattern for creating operations:
from tests.utils import OperationBuilder
builder = OperationBuilder()
op = builder.add_catalog("cat_123", "bronze", op_id="op_001")
# Automatically includes: id, ts, provider, op, target, payload
SQL Validation
SchemaX includes optional SQLGlot integration for validating generated SQL:
import sqlglot
# Validate Databricks SQL
sql = "ALTER TABLE `catalog`.`schema`.`table` ADD COLUMN `id` BIGINT"
parsed = sqlglot.parse_one(sql, dialect="databricks")
assert parsed is not None # Valid SQL
Benefits:
- Catches SQL syntax errors early
- Supports multiple SQL dialects (Databricks, PostgreSQL, MySQL, etc.)
- Can be integrated into CI/CD pipelines
Installation:
pip install "sqlglot>=20.0.0"
# or
pip install "schemaxpy[validation]"
Conclusion
SchemaX's provider-based architecture provides:
✅ Extensibility - Easy to add new catalog types
✅ Flexibility - Each provider can have unique features
✅ Type Safety - Strong typing throughout
✅ Performance - Fast state loading and SQL generation
✅ Maintainability - Clear module boundaries
✅ Future-Proof - Ready for new catalog systems
✅ Well-Tested - Comprehensive test coverage with 124+ passing tests
✅ SQL Validation - Optional SQLGlot integration for quality assurance
For more details:
- Provider Development: Provider Contract
- Development Guide: Development
- Getting Started: Quickstart