
Architecture

This document describes the technical architecture and design decisions behind SchemaX.

Overview

SchemaX implements a provider-based, snapshot-driven schema versioning system. The core principle is to maintain an append-only operation log with periodic snapshots, enabling both state-based and change-based workflows across multiple catalog types (Unity Catalog, Hive Metastore, PostgreSQL, etc.).

Design Goals

  1. Git-Friendly: Store schema definitions in human-readable JSON that produces clean diffs
  2. Reproducible: Replay operations from a snapshot to reconstruct current state
  3. Performant: Fast loading even with hundreds of operations
  4. Auditable: Complete history of who changed what and when
  5. Migration-Ready: Operations can be converted to SQL migration scripts
  6. Extensible: Easy to add support for new catalog types via providers
  7. Multi-Provider: Support multiple catalog systems with unified interface

Architectural Patterns

SchemaX follows several well-established architectural patterns that work together to provide a robust, maintainable, and extensible system.

Primary Pattern: Event Sourcing

The foundation of SchemaX is Event Sourcing - all changes are stored as immutable events (operations) in an append-only log.

Implementation:

// Operations are immutable events
interface Operation {
  id: string;
  ts: string;
  provider: string;
  op: string;
  target: string;
  payload: Record<string, any>;
}

// Current state = replay all operations from a snapshot
state = loadSnapshot(latestSnapshot);
for (const operation of changelog.ops) {
  state = provider.applyOperation(state, operation);
}

Key Characteristics:

  • ✅ Append-only log (changelog.json)
  • ✅ Operations never modified or deleted
  • ✅ Complete audit trail
  • ✅ Time-travel capability via snapshots
  • ✅ State is derived, not stored directly

Benefits:

  • Full history of all changes
  • Can reconstruct state at any point
  • Easy debugging ("what happened?")
  • Enables undo/redo capabilities
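The replay loop can be seen end-to-end in a few lines. This is a minimal sketch: the state shape, reducer, and operation names are illustrative, not SchemaX's actual ones.

```typescript
// Hypothetical minimal state and operations to illustrate event replay.
interface CatalogState { catalogs: string[] }
interface Op { op: string; payload: { name: string } }

// Pure reducer: replaying the same ops always yields the same state.
function apply(state: CatalogState, op: Op): CatalogState {
  switch (op.op) {
    case "add_catalog":
      return { catalogs: [...state.catalogs, op.payload.name] };
    case "drop_catalog":
      return { catalogs: state.catalogs.filter(c => c !== op.payload.name) };
    default:
      return state;
  }
}

// The append-only log records every change, including ones later undone.
const log: Op[] = [
  { op: "add_catalog", payload: { name: "raw" } },
  { op: "add_catalog", payload: { name: "analytics" } },
  { op: "drop_catalog", payload: { name: "raw" } },
];

// Current state = fold over the log; the state itself is never stored.
const state = log.reduce(apply, { catalogs: [] });
```

Replaying a prefix of the log reconstructs the state at any earlier point, which is what enables time travel.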

Snapshot + Delta Pattern

An optimization of Event Sourcing that prevents unbounded growth of the operation log.

Implementation:

State at v0.3.0 =
    load_snapshot("v0.2.0") +
    apply_operations(changelog.ops)

.schemax/
├── snapshots/v0.2.0.json   ← Full state checkpoint
└── changelog.json          ← Only ops since v0.2.0

Benefits:

  • Fast state loading (no need to replay 1000s of operations)
  • Bounded memory usage
  • Clean separation of committed vs uncommitted changes
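A sketch of the snapshot-plus-delta load path, with an illustrative state shape (the real snapshot and changelog files are richer):

```typescript
interface State { tables: string[] }
interface Op { op: "add_table"; name: string }

// A snapshot is a serialized full state; the changelog holds only
// operations recorded after that snapshot was taken.
const snapshot: State = { tables: ["users", "orders"] }; // e.g. v0.2.0
const changelog: Op[] = [{ op: "add_table", name: "events" }];

// Start from the checkpoint instead of replaying the full history.
function loadCurrentState(snap: State, ops: Op[]): State {
  return ops.reduce(
    (s, op) => ({ tables: [...s.tables, op.name] }),
    snap
  );
}

const current = loadCurrentState(snapshot, changelog);
```

However long the project's history, load cost is bounded by one snapshot read plus the handful of uncommitted operations.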

Plugin Architecture (Provider System)

Extensibility through providers - catalog-specific implementations plugged into a common interface.

Implementation:

// Base contract
interface Provider {
  info: ProviderInfo;
  capabilities: ProviderCapabilities;
  applyOperation(state: ProviderState, op: Operation): ProviderState;
  getSQLGenerator(state: ProviderState): SQLGenerator;
  validateOperation(op: Operation): ValidationResult;
}

// Implementations
class UnityProvider implements Provider { ... }
class HiveProvider implements Provider { ... }
class PostgresProvider implements Provider { ... }

// Registry
ProviderRegistry.register(unityProvider);

Key Characteristics:

  • ✅ Open/Closed Principle (open for extension, closed for modification)
  • ✅ Each provider is isolated and independent
  • ✅ Core system doesn't know provider details
  • ✅ New providers added without changing core
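A hedged sketch of the plug-in idea. The shapes below are illustrative, far smaller than the real Provider interface, but they show how a new catalog type registers without any change to core code:

```typescript
// Illustrative mini-interface; not the real Provider contract.
interface MiniProvider {
  id: string;
  applyOperation(state: string[], op: string): string[];
}

// The core only knows this map and the interface above.
const registry = new Map<string, MiniProvider>();

// A new catalog type plugs in without modifying core code:
const toyProvider: MiniProvider = {
  id: "toy",
  applyOperation: (state, op) => [...state, op],
};
registry.set(toyProvider.id, toyProvider);

// Core code resolves a provider by id and works purely via the interface.
const provider = registry.get("toy")!;
const nextState = provider.applyOperation([], "toy.add_thing");
```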

Strategy Pattern (Provider Operations)

Different algorithms (SQL generation, state reduction) selected based on provider.

Implementation:

// Context uses provider to select strategy
function generateSQL(ops: Operation[], project: Project) {
  const provider = ProviderRegistry.get(project.provider.type);
  const generator = provider.getSQLGenerator(state);
  return generator.generateSQL(ops);
}

// Concrete strategies
class UnitySQLGenerator extends SQLGenerator {
  generateSQL(ops: Operation[]): string {
    // Unity Catalog-specific SQL
  }
}

class HiveSQLGenerator extends SQLGenerator {
  generateSQL(ops: Operation[]): string {
    // Hive Metastore-specific SQL
  }
}

Benefits:

  • Swappable implementations
  • Each strategy optimized for its system
  • Clean separation of concerns

State Reducer Pattern (Redux-inspired)

Immutable state transformations through pure functions.

Implementation:

function applyOperation(state: ProviderState, operation: Operation): ProviderState {
  // Pure function: state + operation → new_state
  const newState = deepClone(state);

  switch (operation.op) {
    case 'unity.add_catalog':
      newState.catalogs.push(createCatalog(operation.payload));
      break;
    case 'unity.add_table': {
      // Block scope so the `const` is legal inside the case
      const schema = findSchema(newState, operation.payload.schemaId);
      schema.tables.push(createTable(operation.payload));
      break;
    }
  }

  return newState; // Never mutate input
}

Key Principles:

  • ✅ Pure functions (no side effects)
  • ✅ Immutable state
  • ✅ Predictable transformations
  • ✅ Easy to test
  • ✅ Time-travel debugging

Redux Comparison:

// Redux
newState = reducer(state, action)

// SchemaX
newState = provider.applyOperation(state, operation)
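The reducer shown earlier relies on helpers such as deepClone and createCatalog. A self-contained, spread-based variant (types simplified for illustration) behaves the same way and makes the immutability guarantee checkable:

```typescript
interface Catalog { id: string; name: string }
interface State { catalogs: Catalog[] }
interface Op { op: string; payload: { catalogId: string; name: string } }

// Pure reducer: returns a new state object and never mutates the input.
function applyOperation(state: State, op: Op): State {
  switch (op.op) {
    case "unity.add_catalog":
      return {
        ...state,
        catalogs: [
          ...state.catalogs,
          { id: op.payload.catalogId, name: op.payload.name },
        ],
      };
    default:
      return state;
  }
}

const before: State = { catalogs: [] };
const after = applyOperation(before, {
  op: "unity.add_catalog",
  payload: { catalogId: "cat_1", name: "analytics" },
});
```

Because `before` is untouched, both states can be kept side by side, which is what makes time-travel debugging cheap.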

Registry Pattern (Provider Lookup)

Central registry for service discovery and dependency injection.

Implementation:

class ProviderRegistryClass {
  private providers = new Map<string, Provider>();

  register(provider: Provider): void {
    this.providers.set(provider.info.id, provider);
  }

  get(providerId: string): Provider | undefined {
    return this.providers.get(providerId);
  }
}

// Singleton
export const ProviderRegistry = new ProviderRegistryClass();

// Auto-registration on import
ProviderRegistry.register(unityProvider);

Benefits:

  • Service discovery
  • Loose coupling
  • Easy testing (swap implementations)

Repository Pattern (Storage Layer)

Abstraction over file system operations.

Implementation:

// storage_v3.ts/py acts as repository (contract shown as an interface)
interface StorageRepository {
  readProject(workspacePath: Path): ProjectFile;
  writeProject(workspacePath: Path, project: ProjectFile): void;
  readChangelog(workspacePath: Path): ChangelogFile;
  writeChangelog(workspacePath: Path, changelog: ChangelogFile): void;
  readSnapshot(workspacePath: Path, version: string): SnapshotFile;
  writeSnapshot(workspacePath: Path, snapshot: SnapshotFile): void;
}

// Usage
const project = storage.readProject(workspace);
// Callers don't care if it's JSON, SQLite, or a remote API

Benefits:

  • Data access abstraction
  • Easy to swap storage backend
  • Testability (mock the repository)
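The testability claim is easy to demonstrate with an in-memory double. This is a hedged sketch: the two-method contract below is a stand-in for the real storage interface.

```typescript
// Hypothetical minimal repository contract.
interface ProjectFile { name: string }

interface Repository {
  readProject(path: string): ProjectFile | undefined;
  writeProject(path: string, project: ProjectFile): void;
}

// A test double backed by a Map. Code depending only on Repository
// works unchanged against files, a database, or this mock.
class InMemoryRepository implements Repository {
  private files = new Map<string, ProjectFile>();
  readProject(path: string) { return this.files.get(path); }
  writeProject(path: string, project: ProjectFile) { this.files.set(path, project); }
}

const repo: Repository = new InMemoryRepository();
repo.writeProject("/ws", { name: "demo" });
const project = repo.readProject("/ws");
```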

Command Pattern (Operations)

Operations as command objects that encapsulate requests.

Implementation:

// Command = Operation
interface Operation {
  id: string;      // Command ID
  op: string;      // Command name
  target: string;  // Receiver
  payload: object; // Parameters
  ts: string;      // Timestamp
}

// Command execution
function execute(state: State, command: Operation): State {
  return applyOperation(state, command);
}

Characteristics:

  • ✅ Encapsulates request as object
  • ✅ Supports queuing and logging
  • ✅ Can be serialized
  • ✅ Enables undo (store reverse operations)
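The "store reverse operations" point can be sketched concretely. The operation names and the inverse table here are illustrative, not SchemaX's actual op set:

```typescript
interface Cmd { op: string; target: string }

// Each forward command maps to a reverse command that undoes it.
function reverse(cmd: Cmd): Cmd {
  const inverses: Record<string, string> = {
    add_catalog: "drop_catalog",
    drop_catalog: "add_catalog",
  };
  return { op: inverses[cmd.op] ?? cmd.op, target: cmd.target };
}

const executed: Cmd[] = [
  { op: "add_catalog", target: "cat_1" },
  { op: "add_catalog", target: "cat_2" },
];

// Undo replays reverse commands in LIFO order (most recent first).
const undoLog = [...executed].reverse().map(reverse);
```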

Adapter Pattern (Python ↔ TypeScript)

Translating between language ecosystems while maintaining compatibility.

Implementation:

# Python (Pydantic) - accepts both camelCase and snake_case
class Column(BaseModel):
    id: str
    name: str
    mask_id: Optional[str] = Field(None, alias="maskId")

    class Config:
        populate_by_name = True  # Accept both maskId and mask_id

// TypeScript (Zod) - uses camelCase
const Column = z.object({
  id: z.string(),
  name: z.string(),
  maskId: z.string().optional(),
});

Same JSON works in both:

{"id": "col_1", "name": "email", "maskId": "mask_1"}

Benefits:

  • Seamless interoperability
  • Single source of truth (JSON files)
  • No code sharing required
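If JSON ever arrives with snake_case keys (for example from a Python tool that serializes without aliases), a small normalization shim on the TypeScript side keeps the adapter seamless. This is a hedged sketch, not part of SchemaX; the key names are illustrative:

```typescript
// Convert snake_case keys to camelCase so the same JSON matches
// a camelCase schema on the TypeScript side.
function snakeToCamel(key: string): string {
  return key.replace(/_([a-z])/g, (_m, c: string) => c.toUpperCase());
}

function normalizeKeys(obj: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [k, v] of Object.entries(obj)) out[snakeToCamel(k)] = v;
  return out;
}

// "mask_id" from the snake_case side becomes "maskId" for the Zod schema.
const fromPython = { id: "col_1", name: "email", mask_id: "mask_1" };
const normalized = normalizeKeys(fromPython);
```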

Façade Pattern (CLI)

Simplified interface to complex subsystems.

Implementation:

# cli.py provides simple interface hiding complexity
@cli.command()
def sql(workspace: str):
    # Hides complexity of:
    # - File system operations
    # - Provider lookup
    # - State reconstruction
    # - SQL generation
    state, changelog, provider = load_current_state(Path(workspace))
    generator = provider.get_sql_generator(state)
    sql = generator.generate_sql(changelog.ops)
    console.print(sql)

Benefits:

  • Simple API for complex operations
  • Easy to use
  • Decouples CLI from internal complexity

Pattern Interaction Diagram

┌─────────────────────────────────────────────┐
│          CLI / Extension (Façade)           │
└──────────────────────┬──────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────┐
│   Storage Repository (Repository Pattern)   │
│        Reads/writes .schemax/ files         │
└──────────────────────┬──────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────┐
│   Provider Registry (Registry + Strategy)   │
│      ProviderRegistry.get(provider_id)      │
└──────────────────────┬──────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────┐
│   Provider Instance (Plugin Architecture)   │
│     Unity · Hive · PostgreSQL providers     │
└──────────────────────┬──────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────┐
│    State Reducer (State Reducer Pattern)    │
│  state' = applyOperation(state, operation)  │
│         (Immutable transformations)         │
│  Based on Event Sourcing + Command Pattern  │
└──────────────────────┬──────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────┐
│       SQL Generator (Strategy Pattern)      │
│   sql = generator.generateSQL(operations)   │
└─────────────────────────────────────────────┘

Architectural Style: Functional Core, Imperative Shell

SchemaX follows the Functional Core, Imperative Shell pattern:

Functional Core (Pure Logic):

  • State reducers (pure functions)
  • Operation validation
  • SQL generation algorithms
  • State transformations

Imperative Shell (Side Effects):

  • File I/O (storage layer)
  • CLI output
  • VS Code webview communication
  • Extension activation

Benefits:

  • Easy to test (core is pure functions)
  • Easy to reason about (no hidden state)
  • Side effects isolated at boundaries
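The split is easiest to see in miniature. In this sketch, `nextVersion` is an illustrative helper (not SchemaX's actual API): the core computes, the shell performs the one side effect:

```typescript
// Functional core: a pure function, trivially unit-testable.
function nextVersion(version: string): string {
  const [major, minor] = version.replace(/^v/, "").split(".").map(Number);
  return `v${major}.${minor + 1}.0`;
}

// Imperative shell: the only place where side effects happen.
function bumpAndReport(current: string): string {
  const next = nextVersion(current);             // pure core does the work
  console.log(`Snapshot ${current} -> ${next}`); // side effect at the edge
  return next;
}

const bumped = bumpAndReport("v0.2.0");
```

Tests exercise `nextVersion` directly; only thin integration tests need to touch the shell.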

Design Patterns Summary

Pattern              Where Used                  Purpose
Event Sourcing       changelog.json, operations  Core architectural foundation
Snapshot + Delta     snapshots/, changelog.json  Performance optimization
Plugin Architecture  providers/ system           Extensibility for new catalog types
Strategy             Provider implementations    Swappable algorithms
State Reducer        state_reducer.ts/py         Immutable state updates
Registry             ProviderRegistry            Service discovery
Repository           storage_v3.ts/py            Data access abstraction
Command              Operation objects           Operation encapsulation
Adapter              Pydantic/Zod models         Cross-language compatibility
Façade               cli.py, extension.ts        Simple interface to complexity

Architectural Principles

1. Separation of Concerns

  • Storage ≠ Provider ≠ CLI
  • Each layer has single responsibility

2. Immutability

  • Operations never change
  • State transformations create new objects
  • Snapshots are read-only

3. Idempotency

  • SQL can be run multiple times safely
  • Operations produce same result when replayed

4. Extensibility

  • Add providers without changing core
  • Plugin-based architecture

5. Type Safety

  • Zod (TypeScript) and Pydantic (Python)
  • Runtime validation
  • IDE autocomplete

6. Testability

  • Pure functions (state reducers)
  • Mockable repositories
  • Isolated providers

Anti-Patterns Avoided

No Mutable Global State - All state is passed explicitly

No Tight Coupling - Providers are independent plugins

No Direct File System Access - Goes through repository layer

No Hardcoded Provider Logic - Uses registry + strategy

No Side Effects in Reducers - Pure functions only


Architecture Inspirations

SchemaX's architecture draws inspiration from:

  1. Redux (State Management)

    • Immutable state
    • Pure reducers
    • Action dispatching → Operations
  2. Git (Version Control)

    • Commit log → Operations log
    • Branches → Environments
    • Tags → Snapshots
  3. Terraform (Infrastructure as Code)

    • Desired state → Schema definition
    • Plan → SQL preview
    • Apply → SQL execution
  4. Liquibase/Flyway (Database Migrations)

    • Version-controlled schema changes
    • Idempotent migrations
    • Rollback support
  5. Event-Driven Architecture

    • Events → Operations
    • Event store → Changelog
    • Projections → Current state

Provider-Based Architecture

What is a Provider?

A Provider is a plugin that adds support for a specific catalog system. Each provider implements a standard interface that SchemaX uses for:

  • State Management - How objects are stored and modified
  • Operations - What actions users can perform
  • SQL Generation - How to convert operations to DDL statements
  • Validation - What constraints and rules to enforce
  • UI Metadata - How to display objects in the interface

Provider Registry

class ProviderRegistry {
  private static providers = new Map<string, Provider>();

  static register(provider: Provider): void;
  static get(providerId: string): Provider | undefined;
  static getAll(): Provider[];
}

Providers are registered at startup:

// TypeScript
import './providers'; // Auto-registers Unity provider

# Python
from schemax.providers import unity_provider
# Unity provider auto-registered on import

Current Providers

Available:

  • Unity Catalog (unity) - Databricks Unity Catalog with full governance features

Planned (Stage 2):

  • 🔜 Hive Metastore (hive) - Apache Hive Metastore
  • 🔜 PostgreSQL/Lakebase (postgres) - PostgreSQL with Lakebase extensions

Base Provider Enhancements

Generic SQL Optimization Algorithms

The base provider layer includes reusable optimization algorithms that all providers inherit automatically:

1. ColumnReorderOptimizer - Novel single-column drag detection

  • Detects when only one column moved in a reorder operation
  • Generates 1 SQL statement instead of N statements
  • Time complexity: O(n²) worst case, O(n) average case
  • Works for any SQL database with column positioning support

2. OperationBatcher - Groups operations by target object

  • Batches table operations to create complete CREATE TABLE statements
  • Reduces "empty table + multiple ALTERs" to single CREATE statement
  • Consolidates property settings and constraints
  • Provider-agnostic algorithm with provider-specific SQL generation

3. Enhanced BaseSQLGenerator - Template pattern for providers

  • Abstract methods for provider-specific SQL syntax
  • Generic utilities (_build_fqn, escape_identifier, escape_string)
  • Dependency-level ordering (catalog → schema → table)
  • Optimization components auto-initialized
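The single-move idea behind ColumnReorderOptimizer can be sketched in a few lines. This is an illustration of the technique, not the actual implementation: if removing one column makes the old and new orders identical, that column is the only thing that moved, so one ALTER statement suffices instead of N.

```typescript
// Returns the single moved column, or null if zero or multiple columns moved.
// O(n²) worst case (one filter pass per candidate), matching the doc's bound.
function findSingleMove(oldOrder: string[], newOrder: string[]): string | null {
  for (const candidate of newOrder) {
    const oldRest = oldOrder.filter(c => c !== candidate);
    const newRest = newOrder.filter(c => c !== candidate);
    const restsMatch =
      oldRest.length === newRest.length &&
      oldRest.every((c, i) => c === newRest[i]);
    // Orders must actually differ, otherwise nothing moved.
    if (restsMatch && oldOrder.join(",") !== newOrder.join(",")) {
      return candidate; // one SQL statement instead of N
    }
  }
  return null;
}

// "email" was dragged to the front: detected as a single move.
const moved = findSingleMove(["id", "email", "name"], ["email", "id", "name"]);
```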

Benefits for Provider Implementers

When implementing a new provider (e.g., Postgres, Snowflake):

  • 60% Less Code - Generic algorithms already implemented
  • Automatic Optimizations - Batching and reordering work out-of-the-box
  • Consistent Behavior - All providers generate high-quality SQL
  • Faster Time-to-Market - Focus on SQL syntax, not algorithms
  • Easier Maintenance - Bug fixes in base benefit all providers

Implementation Example

class PostgresSQLGenerator(BaseSQLGenerator):
    """Postgres provider using base optimizations"""

    def __init__(self, state, name_mapping=None):
        super().__init__(state, name_mapping)
        # self.batcher and self.optimizer inherited from base
        self.id_name_map = self._build_id_name_map()

    # Implement abstract methods
    def _get_target_object_id(self, op):
        """Extract table ID from Postgres operation"""
        if op.op == "postgres.create_table":
            return op.target
        return op.payload.get("table_id")

    def _is_create_operation(self, op):
        """Check if operation creates new object"""
        return op.op in ["postgres.create_schema", "postgres.create_table"]

    def _generate_batched_create_sql(self, object_id, batch_info):
        """Generate Postgres CREATE TABLE with columns"""
        # Postgres-specific SQL syntax here
        return f"CREATE TABLE {table_name} ({columns}) ..."

    # Base handles: batching, reordering, dependency ordering

Code Reduction Metrics

Unity Catalog provider refactoring results:

  • Before: 928 lines (everything Unity-specific + generic algorithms)
  • After: 738 lines (only Unity-specific SQL syntax)
  • Reduction: 190 lines (20% reduction)
  • Moved to Base: 272 lines of generic code

Future providers will start with this reduced baseline.


File Structure

Version 3 Architecture (Current)

workspace-root/
└── .schemax/
    ├── project.json       # Project metadata with provider info
    ├── changelog.json     # Uncommitted operations
    ├── snapshots/
    │   ├── v0.1.0.json    # Full state snapshot
    │   ├── v0.2.0.json
    │   └── v0.3.0.json
    └── migrations/        # Generated SQL files
        └── migration_*.sql

Key Changes from V2:

  • project.json now includes provider metadata
  • Operations are prefixed with provider ID (e.g., unity.add_catalog)
  • Snapshots include provider context
  • Automatic V2 to V3 migration on first load

Project File Schema (V3)

project.json contains metadata and provider selection:

{
  "version": 3,
  "name": "my-databricks-schemas",
  "provider": {
    "type": "unity",
    "version": "1.0.0"
  },
  "environments": ["dev", "test", "prod"],
  "latestSnapshot": "v0.2.0",
  "snapshots": [
    {
      "id": "snap_uuid",
      "version": "v0.1.0",
      "name": "Initial schema",
      "file": ".schemax/snapshots/v0.1.0.json",
      "ts": "2025-10-06T10:00:00Z",
      "opsCount": 15,
      "hash": "sha256...",
      "previousSnapshot": null
    }
  ],
  "deployments": [],
  "settings": {
    "autoIncrementVersion": true,
    "versionPrefix": "v",
    "requireSnapshotForProd": true,
    "allowDrift": false
  }
}

New in V3:

  • provider field specifies catalog type and version
  • Provider info used to load correct reducer and SQL generator

Snapshot File Schema

snapshots/vX.Y.Z.json contains the full state:

{
  "id": "snap_uuid",
  "version": "v0.1.0",
  "name": "Initial schema",
  "ts": "2025-10-06T10:00:00Z",
  "createdBy": "user@example.com",
  "state": {
    "catalogs": [
      {
        "id": "cat_uuid",
        "name": "my_catalog",
        "schemas": [
          {
            "id": "sch_uuid",
            "name": "my_schema",
            "tables": [...]
          }
        ]
      }
    ]
  },
  "operations": [
    {
      "id": "op_1",
      "ts": "2025-01-01T10:00:00Z",
      "provider": "unity",
      "op": "unity.add_schema",
      "target": "sch_uuid",
      "payload": {...}
    },
    ...
  ],
  "previousSnapshot": null,
  "hash": "sha256...",
  "tags": ["production"],
  "comment": "First release"
}

Note: The structure of state varies by provider. Unity Catalog uses catalogs, Hive uses databases, etc.

Changelog File Schema

changelog.json contains uncommitted operations:

{
  "version": 1,
  "sinceSnapshot": "v0.2.0",
  "ops": [
    {
      "id": "op_abc123",
      "ts": "2025-10-07T14:30:00Z",
      "provider": "unity",
      "op": "unity.add_catalog",
      "target": "cat_new",
      "payload": {
        "catalogId": "cat_new",
        "name": "analytics"
      }
    }
  ],
  "lastModified": "2025-10-07T14:30:00Z"
}

New in V3:

  • provider field on each operation
  • op field prefixed with provider ID (e.g., unity.add_catalog)
  • Operations validated by provider before being saved
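The provider-prefix invariant is simple to enforce. A hedged sketch of one such check (the real provider validation covers much more, e.g. payload shape):

```typescript
interface Operation { id: string; provider: string; op: string }
interface ValidationResult { valid: boolean; errors: string[] }

// Every V3 op name must carry its provider's prefix, e.g. "unity.add_catalog".
function validatePrefix(op: Operation): ValidationResult {
  const errors: string[] = [];
  if (!op.op.startsWith(`${op.provider}.`)) {
    errors.push(`op "${op.op}" must be prefixed with "${op.provider}."`);
  }
  return { valid: errors.length === 0, errors };
}

const ok = validatePrefix({ id: "op_1", provider: "unity", op: "unity.add_catalog" });
const bad = validatePrefix({ id: "op_2", provider: "unity", op: "add_catalog" });
```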

Core Concepts

1. Operations

Operations are the fundamental unit of change in SchemaX. Every user action generates one or more operations.

Operation Structure:

interface Operation {
  id: string;                   // Unique operation ID
  ts: string;                   // ISO 8601 timestamp
  provider: string;             // Provider ID (e.g., 'unity')
  op: string;                   // Operation type with provider prefix
  target: string;               // ID of object being modified
  payload: Record<string, any>; // Operation-specific data
}

Example - Unity Catalog:

{
  "id": "op_abc123",
  "ts": "2025-10-07T14:30:00Z",
  "provider": "unity",
  "op": "unity.add_catalog",
  "target": "cat_xyz",
  "payload": {
    "catalogId": "cat_xyz",
    "name": "analytics"
  }
}

Example - Hive Metastore (Future):

{
  "id": "op_def456",
  "ts": "2025-10-07T14:35:00Z",
  "provider": "hive",
  "op": "hive.add_database",
  "target": "db_xyz",
  "payload": {
    "databaseId": "db_xyz",
    "name": "analytics"
  }
}

2. Snapshots

Snapshots are point-in-time captures of the complete state. They serve as:

  • Performance optimization - No need to replay all operations
  • Release markers - Tagged versions for deployment
  • Rollback points - Revert to known good state
  • Audit checkpoints - Verify system integrity

When to Create Snapshots:

  • Before deploying to production
  • After completing a major feature
  • Before risky operations
  • On a regular schedule (e.g., weekly)

3. State Loading (Provider-Aware)

Current state is computed as: Latest Snapshot + Changelog Operations

async function loadCurrentState(workspaceUri: Uri): Promise<{
  state: ProviderState;
  changelog: ChangelogFile;
  provider: Provider;
}> {
  // 1. Read project file
  const project = await readProject(workspaceUri);

  // 2. Get provider from registry
  const provider = ProviderRegistry.get(project.provider.type);

  // 3. Load latest snapshot (or start with empty state)
  let state = project.latestSnapshot
    ? await readSnapshot(workspaceUri, project.latestSnapshot)
    : provider.createInitialState();

  // 4. Load changelog
  const changelog = await readChangelog(workspaceUri);

  // 5. Apply changelog operations using provider's reducer
  state = provider.applyOperations(state, changelog.ops);

  return { state, changelog, provider };
}

Key Points:

  • Provider selected based on project.provider.type
  • Provider's state reducer applies operations
  • Operations validated by provider before applying
  • State structure defined by provider

4. SQL Generation (Provider-Specific)

Each provider implements its own SQL generator that converts operations to DDL:

interface Provider {
  getSQLGenerator(state: ProviderState): SQLGenerator;
}

interface SQLGenerator {
  generateSQL(ops: Operation[]): string;
}

Example - Unity Catalog:

const generator = unityProvider.getSQLGenerator(state);
const sql = generator.generateSQL(changelog.ops);

Output:

-- Operation: unity.add_catalog (op_abc123)
CREATE CATALOG IF NOT EXISTS `analytics`;

-- Operation: unity.add_schema (op_def456)
CREATE SCHEMA IF NOT EXISTS `analytics`.`bronze`;

Example - Hive Metastore (Future):

-- Operation: hive.add_database (op_abc123)
CREATE DATABASE IF NOT EXISTS analytics;

-- Operation: hive.add_table (op_def456)
CREATE TABLE IF NOT EXISTS analytics.bronze_users (...);

Provider Hierarchy

Different providers have different object hierarchies:

Unity Catalog (3 levels)

Catalog
└─ Schema
   └─ Table

Hive Metastore (2 levels)

Database
└─ Table

PostgreSQL (3 levels)

Database
└─ Schema
   └─ Table

UI Adaptation:

The UI dynamically adapts to the provider's hierarchy:

const hierarchy = provider.capabilities.hierarchy;
const levels = hierarchy.levels; // Array of HierarchyLevel

// Render tree based on hierarchy depth
levels.forEach(level => {
  renderLevel(level.name, level.displayName, level.icon);
});

Operation Flow

1. User Action in UI

// User clicks "Add Catalog"
useDesignerStore().addCatalog('analytics');

2. Store Creates Operation

addCatalog: (name) => {
  const catalogId = `cat_${uuidv4()}`;
  const provider = get().provider; // Get current provider

  const op: Operation = {
    id: `op_${uuidv4()}`,
    ts: new Date().toISOString(),
    provider: provider.id,
    op: `${provider.id}.add_catalog`, // Provider prefix!
    target: catalogId,
    payload: { catalogId, name },
  };

  emitOps([op]); // Send to extension
},

3. Extension Validates and Saves

// Extension receives operation
const ops: Operation[] = message.payload;

// Validate using provider
const provider = ProviderRegistry.get(project.provider.type);
for (const op of ops) {
  const validation = provider.validateOperation(op);
  if (!validation.valid) {
    throw new Error(`Invalid operation: ${validation.errors}`);
  }
}

// Append to changelog
await appendOps(workspaceUri, ops);

4. State Updated

// Reload state
const { state, changelog, provider } = await loadCurrentState(workspaceUri);

// Apply operations using provider
const newState = provider.applyOperations(state, changelog.ops);

// Send back to UI
webview.postMessage({
  type: 'project-updated',
  payload: { project, state: newState, ops: changelog.ops, provider }
});

Code Organization

TypeScript (VSCode Extension)

src/
├── providers/
│   ├── base/
│   │   ├── provider.ts        # Provider interface
│   │   ├── models.ts          # Base types
│   │   ├── operations.ts      # Operation types
│   │   ├── sql-generator.ts   # SQL generator base
│   │   └── hierarchy.ts       # Hierarchy types
│   ├── registry.ts            # Provider registry
│   └── unity/
│       ├── index.ts           # Unity provider exports
│       ├── models.ts          # Unity-specific models
│       ├── operations.ts      # Unity operations
│       ├── sql-generator.ts   # Unity SQL generator
│       ├── state-reducer.ts   # Unity state reducer
│       └── hierarchy.ts       # Unity hierarchy config
├── storage-v3.ts              # Provider-aware storage
├── extension.ts               # Extension entry point
└── webview/
    ├── state/
    │   └── useDesignerStore.ts  # Provider-aware store
    └── components/
        └── ... UI components

Python (SDK/CLI)

src/schemax/
├── providers/
│   ├── base/
│   │   ├── provider.py
│   │   ├── models.py
│   │   ├── operations.py
│   │   ├── sql_generator.py
│   │   └── hierarchy.py
│   ├── registry.py
│   └── unity/
│       ├── __init__.py
│       ├── models.py
│       ├── operations.py
│       ├── sql_generator.py
│       ├── state_reducer.py
│       └── hierarchy.py
├── storage_v3.py
└── cli.py

Data Models

Unity Catalog Models

interface UnityCatalog {
  id: string;
  name: string;
  schemas: UnitySchema[];
}

interface UnitySchema {
  id: string;
  name: string;
  tables: UnityTable[];
  views: UnityView[];
}

interface UnityTable {
  id: string;
  name: string;
  format: 'delta' | 'iceberg';
  columns: UnityColumn[];
  properties: Record<string, string>;
  constraints: UnityConstraint[];
  rowFilters?: UnityRowFilter[];
  columnMasks?: UnityColumnMask[];
  grants: UnityGrant[];
  comment?: string;
}

interface UnityColumn {
  id: string;
  name: string;
  type: string;
  nullable: boolean;
  comment?: string;
  tags?: Record<string, string>;
  maskId?: string;
}

interface UnityView {
  id: string;
  name: string;
  definition: string;
  comment?: string;
  properties?: Record<string, string>;
  tags?: Record<string, string>;
  extractedDependencies?: {
    tables?: string[];
    views?: string[];
  };
}

Governance Features (Unity Catalog)

interface UnityConstraint {
  id: string;
  type: 'primary_key' | 'foreign_key' | 'check';
  name?: string;
  columns: string[];
  // ... type-specific fields
}

interface UnityRowFilter {
  id: string;
  name: string;
  enabled: boolean;
  udfExpression: string;
  description?: string;
}

interface UnityColumnMask {
  id: string;
  columnId: string;
  name: string;
  enabled: boolean;
  maskFunction: string;
  description?: string;
}

Migration from V2 to V3

Automatic Migration

When a V2 project is opened, it's automatically migrated to V3:

async function migrateV2ToV3(
  workspaceUri: Uri,
  v2Project: any,
  providerId: string = 'unity'
): Promise<void> {
  // 1. Add provider field
  const v3Project = {
    ...v2Project,
    version: 3,
    provider: {
      type: providerId,
      version: '1.0.0',
    },
  };

  // 2. Prefix operations with provider
  const changelog = await readChangelog(workspaceUri);
  const migratedOps = changelog.ops.map(op => ({
    ...op,
    provider: providerId,
    op: `${providerId}.${op.op}`, // Add provider prefix
  }));

  // 3. Save migrated files
  await writeProject(workspaceUri, v3Project);
  await writeChangelog(workspaceUri, { ...changelog, ops: migratedOps });
}

Migration is:

  • ✅ Automatic (no user action required)
  • ✅ Non-destructive (preserves all data)
  • ✅ One-way (V3 projects don't downgrade to V2)
  • ✅ Logged (migration events logged to output)

Benefits of Provider Architecture

For Users

  1. Unified Experience - Same tool for Unity Catalog, Hive, PostgreSQL
  2. Provider-Specific Features - Full support for each catalog's unique features
  3. Easy Migration - Switch providers if needed (future)
  4. Single Learning Curve - Learn once, use everywhere

For Developers

  1. Clear Boundaries - Providers are isolated modules
  2. No Merge Conflicts - Teams work in separate provider directories
  3. Independent Testing - Each provider has its own test suite
  4. Easy to Add - Well-documented provider contract
  5. Type Safety - TypeScript and Pydantic enforce contracts

For the Project

  1. Maintainable - Clear separation of concerns
  2. Scalable - Easy to add new providers
  3. Testable - Provider compliance tests ensure quality
  4. Future-Proof - Architecture supports any catalog type

Performance Considerations

State Loading

  • Bounded Growth: Snapshots prevent unlimited operation replay
  • Lazy Loading: Only load latest snapshot, not entire history
  • Incremental Updates: Apply only new operations since last load

Example Timings (100 tables):

  • Load snapshot: <10ms
  • Apply 50 operations: <5ms
  • Total: <15ms

SQL Generation

SQL generation in SchemaX uses a dependency-aware system that ensures DDL statements are generated in the correct execution order.

Dependency-Aware Ordering

Problem: Database objects have dependencies (e.g., views depend on tables, foreign keys reference other tables). Naive SQL generation in operation timestamp order can produce unexecutable SQL.

Solution: Build a dependency graph and use topological sorting to determine correct execution order.

Implementation:

// 1. Build dependency graph from operations
const graph = new DependencyGraph();
for (const op of operations) {
  const targetId = getTargetObjectId(op);
  graph.addNode({
    id: targetId,
    type: getObjectType(op),
    hierarchyLevel: getDependencyLevel(op),
    operation: op
  });

  // Extract dependencies (e.g., view → table)
  const deps = extractOperationDependencies(op);
  for (const [depId, depType] of deps) {
    graph.addEdge(targetId, depId, depType);
  }
}

// 2. Detect cycles
const cycles = graph.detectCycles();
if (cycles.length > 0) {
  throw new Error('Circular dependencies detected');
}

// 3. Topological sort (dependencies first)
const sortedOps = graph.topologicalSort();

// 4. Generate SQL in dependency order
return sortedOps.map(op => generateSQLForOperation(op));

Key Features:

  • Graph-Based Ordering: Uses NetworkX (Python) or custom graph implementation (TypeScript)
  • Cycle Detection: Prevents invalid circular dependencies between views
  • Breaking Change Detection: Warns when dropping objects with dependents
  • Hierarchy-Aware: Respects natural hierarchy (catalog → schema → table/view)
  • Provider-Extensible: Each provider defines its own dependency extraction logic
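The core graph mechanics fit in a short, self-contained sketch. This is an illustration of the technique (DFS-based cycle detection plus topological sort), not SchemaX's actual graph implementation:

```typescript
// Minimal dependency graph: each node maps to the nodes it depends on.
class DepGraph {
  private edges = new Map<string, string[]>();

  add(node: string, deps: string[] = []): void {
    this.edges.set(node, deps);
  }

  // Returns nodes with dependencies first, or null if a cycle exists.
  sort(): string[] | null {
    const order: string[] = [];
    const state = new Map<string, "visiting" | "done">();
    const visit = (n: string): boolean => {
      if (state.get(n) === "done") return true;
      if (state.get(n) === "visiting") return false; // back edge = cycle
      state.set(n, "visiting");
      for (const d of this.edges.get(n) ?? []) if (!visit(d)) return false;
      state.set(n, "done");
      order.push(n); // emitted only after all dependencies
      return true;
    };
    for (const n of this.edges.keys()) if (!visit(n)) return null;
    return order;
  }
}

const g = new DepGraph();
g.add("view_sales", ["tbl_orders"]); // view depends on table
g.add("tbl_orders", []);
const order = g.sort(); // table is emitted before the view
```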

Dependency Types

enum DependencyType {
  HIERARCHY = 'hierarchy',             // Schema-level (catalog → schema → table)
  FOREIGN_KEY = 'foreign_key',         // FK constraint dependency
  VIEW_DEPENDENCY = 'view_dependency', // View depends on table/view
  COLUMN_MASK = 'column_mask',         // Column mask dependency
  ROW_FILTER = 'row_filter'            // Row filter dependency
}

enum DependencyEnforcement {
  ENFORCED = 'enforced',               // Must be satisfied (hierarchical deps)
  INFORMATIONAL = 'informational',     // Not enforced (Unity Catalog FKs)
  WARNING = 'warning'                  // Only for warnings
}

View and Materialized View Dependency Extraction

Views and materialized views are the primary use case for dependency-aware ordering. SchemaX extracts dependencies from view/MV SQL (FROM and JOIN references). The same mechanism applies to both: each view or materialized view stores extractedDependencies (tables and views it references), and the SQL generator orders creation so that tables and views are created before any view or materialized view that depends on them. Chains of arbitrary depth are supported: e.g. table → view1 → view2 → … → view_n (and materialized views at the end) are ordered correctly via the dependency graph.

Python (SQLGlot):

from sqlglot import exp, parse_one

def extract_dependencies_from_view(view_sql: str) -> dict:
    parsed = parse_one(view_sql, dialect="databricks")
    tables = []
    for table_exp in parsed.find_all(exp.Table):
        # sqlglot exposes the parts as .catalog, .db (schema), and .name
        fqn = f"{table_exp.catalog}.{table_exp.db}.{table_exp.name}"
        tables.append(fqn)
    return {"tables": tables, "views": [], ...}

TypeScript (Regex-Based):

function extractDependenciesFromView(viewSql: string) {
  const tablePattern = /(?:FROM|JOIN)\s+(?:(\w+)\.)?(?:(\w+)\.)?(\w+)/gi;
  const tables = [];
  let match;
  while ((match = tablePattern.exec(viewSql)) !== null) {
    const fqn = [match[1], match[2], match[3]].filter(Boolean).join('.');
    tables.push(fqn);
  }
  return { tables, views: [], catalogs: [], schemas: [] };
}

Manual Dependency Override

In the VS Code Designer, views and materialized views show a Dependencies section (tables and views they depend on). Dependencies are filled automatically from the SQL when you save the definition. You can also edit the list manually (add or remove table or view names) to fix extraction gaps or force creation order. Use object names as they appear in the project (e.g. table name or schema.table). Saving the definition again re-extracts from SQL and replaces the dependency list; use Edit dependencies after saving to adjust if needed.

Hierarchy Levels

Operations are assigned hierarchy levels for fallback ordering:

Level 0: Catalog operations (CREATE CATALOG)
Level 1: Schema operations (CREATE SCHEMA)
Level 2: Table/View/Materialized view creation (CREATE TABLE, CREATE VIEW, CREATE MATERIALIZED VIEW)
Level 3: Modifications (ALTER TABLE, ADD COLUMN, SET PROPERTIES)
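A minimal sketch of this level-then-timestamp fallback ordering (the level mapping below is illustrative; the real mapping lives in each provider's `_getDependencyLevel`):

```python
# Illustrative hierarchy levels; unknown operations default to level 3.
LEVELS = {"add_catalog": 0, "add_schema": 1, "add_table": 2, "add_view": 2, "alter_table": 3}

def fallback_order(ops: list) -> list:
    """Order operations by hierarchy level, breaking ties by timestamp."""
    return sorted(ops, key=lambda op: (LEVELS.get(op["op"], 3), op["ts"]))

ops = [
    {"op": "alter_table", "ts": "2024-01-03T00:00:00Z"},
    {"op": "add_schema", "ts": "2024-01-02T00:00:00Z"},
    {"op": "add_catalog", "ts": "2024-01-01T00:00:00Z"},
]
ordered = fallback_order(ops)  # catalog, then schema, then modification
```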

Breaking Change Detection

When a DROP operation is detected, the system checks for dependents:

if (opType.startsWith('drop_')) {
  const dependents = graph.getDependents(targetId);
  if (dependents.length > 0) {
    warnings.push(
      `BREAKING CHANGE: Dropping ${targetId} will affect: ${dependents.join(', ')}`
    );
  }
}

Fallback Behavior

If dependency analysis fails (e.g., cycles detected, parsing errors):

  1. Detect Cycle: Report cycle path to user
  2. Fall Back: Use hierarchy level + timestamp sorting
  3. Add Warnings: Include warnings in generated SQL comments

Example Warning:

-- WARNINGS:
-- Circular dependencies detected:
-- view_a → view_b → view_a
-- Using level-based sorting as fallback

-- Statement 1
CREATE SCHEMA my_schema;

Automatic FQN Qualification

Problem: View SQL may contain unqualified table references (e.g., SELECT * FROM table4), which fail when executed without a current catalog/schema context (for example, via the SQL Statement Execution API).

Solution: Automatically qualify all table/view references in view definitions with fully-qualified names (FQN).

Python Implementation (SQLGlot):

def _qualify_view_definition(self, definition: str, extracted_deps: dict) -> str:
    """Qualify unqualified table/view references with fully-qualified names"""
    import sqlglot
    from sqlglot import expressions as exp

    # Parse SQL
    parsed = sqlglot.parse_one(definition, dialect="databricks")

    # Build name → FQN mapping from id_name_map
    name_to_fqn = {}
    for object_id, fqn in self.id_name_map.items():
        if "." in fqn:
            parts = fqn.split(".")
            if len(parts) == 3:
                catalog, schema, name = parts
                name_to_fqn[name] = fqn                # unqualified
                name_to_fqn[f"{schema}.{name}"] = fqn  # partially qualified

    # Replace table references with FQNs
    for table_node in parsed.find_all(exp.Table):
        current_ref = ".".join([
            table_node.catalog, table_node.db, table_node.name
        ]).strip(".")

        if current_ref in name_to_fqn:
            fqn = name_to_fqn[current_ref]
            catalog, schema, name = fqn.split(".")
            # Use quoted identifiers for Databricks compatibility
            table_node.set("catalog", exp.to_identifier(catalog, quoted=True))
            table_node.set("db", exp.to_identifier(schema, quoted=True))
            table_node.set("this", exp.to_identifier(name, quoted=True))

    return parsed.sql(dialect="databricks", pretty=True)

Example Transformation:

-- Input (unqualified)
SELECT * FROM table4

-- Output (qualified with backticks)
SELECT * FROM `dev_sales_analytics`.`customer_analytics`.`table4`

Key Benefits:

  • ✅ Views work regardless of current catalog/schema context
  • ✅ Compatible with SQL Statement Execution API
  • ✅ Prevents TABLE_OR_VIEW_NOT_FOUND errors
  • ✅ Uses SQLGlot for accurate SQL parsing and transformation
  • ✅ Supports unqualified (table4) and partially qualified (schema.table4) references

Performance

  • Graph Construction: O(V + E) where V = operations, E = dependencies
  • Cycle Detection: O(V + E) using DFS
  • Topological Sort: O(V + E) using Kahn's algorithm
  • Typical Scale: <10ms for 100 operations with 50 dependencies
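The O(V + E) ordering and cycle detection described above can both be obtained from a single pass of Kahn's algorithm; a self-contained sketch (node and edge names hypothetical, not the actual SchemaX API):

```python
from collections import deque

def topological_sort(nodes, edges):
    """Kahn's algorithm. `edges` are (dependency, dependent) pairs.
    Returns (ordered, cycle); `cycle` lists nodes stuck in a cycle (empty if none)."""
    indegree = {n: 0 for n in nodes}
    adjacent = {n: [] for n in nodes}
    for dep, node in edges:
        adjacent[dep].append(node)
        indegree[node] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    ordered = []
    while queue:
        n = queue.popleft()
        ordered.append(n)
        for m in adjacent[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)
    # Any node never reaching indegree 0 is part of (or behind) a cycle
    cycle = [n for n in nodes if n not in ordered]
    return ordered, cycle
```

For a chain table → view1 → view2, the table sorts first and each view follows its dependency, which is exactly the creation order the SQL generator needs.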

Provider-Specific Implementation

Each provider implements:

abstract class BaseSQLGenerator {
  // Required: Define hierarchy levels
  protected abstract _getDependencyLevel(op: Operation): number;

  // Optional: Extract dependencies from operations
  protected _extractOperationDependencies(op: Operation): Dependency[] {
    return []; // Base implementation
  }
}

class UnitySQLGenerator extends BaseSQLGenerator {
  protected _getDependencyLevel(op: Operation): number {
    if (op.op.includes('catalog')) return 0;
    if (op.op.includes('schema')) return 1;
    if (op.op.includes('add_table') || op.op.includes('add_view')) return 2;
    return 3;
  }

  protected _extractOperationDependencies(op: Operation): Dependency[] {
    // Extract from view SQL, foreign keys, etc.
    if (op.op === 'unity.add_view') {
      const deps = extractDependenciesFromView(op.payload.definition);
      return deps.tables.map(tableId => [
        tableId,
        DependencyType.VIEW_DEPENDENCY,
        DependencyEnforcement.INFORMATIONAL
      ]);
    }
    return [];
  }
}

Benefits

  • Correct Execution Order: SQL runs without dependency errors
  • View Support: Enables complex view hierarchies (view → view → table)
  • Breaking Change Warnings: Prevents accidental data loss
  • Scalable: Handles projects with 1000+ objects efficiently
  • Provider-Agnostic: Works with any catalog system

Memory Usage

  • Single State: Only one state object in memory
  • Immutable Operations: Operations are small and append-only
  • Efficient Snapshots: Snapshots compressed when written

Security & Validation

Operation Validation

Every operation is validated before being saved:

const validation = provider.validateOperation(op);
if (!validation.valid) {
  throw new Error(`Invalid operation: ${validation.errors}`);
}

Validation Checks:

  • Required fields present
  • Field types correct
  • Provider supports operation
  • References valid (IDs exist)
  • Constraints satisfied
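A minimal sketch of the first three checks (field names follow the Operation shape from the changelog; the helper itself is hypothetical, not the actual provider API):

```python
# Required fields per the Operation interface (id, ts, provider, op, target, payload)
REQUIRED_FIELDS = ("id", "ts", "provider", "op", "target", "payload")

def validate_operation(op: dict, supported_ops: set) -> list:
    """Return a list of validation errors; an empty list means the operation is valid."""
    errors = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in op]
    if not errors:
        if not isinstance(op["payload"], dict):
            errors.append("payload must be an object")
        if op["op"] not in supported_ops:
            errors.append(f"unsupported operation: {op['op']}")
    return errors
```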

State Validation

State can be validated at any time:

const validation = provider.validateState(state);
if (!validation.valid) {
  console.error('State validation errors:', validation.errors);
}

Snapshot Integrity

Snapshots include SHA-256 hash for integrity verification:

const expectedHash = snapshot.hash;
const actualHash = calculateHash(snapshot.state, snapshot.operations);
if (expectedHash !== actualHash) {
  throw new Error('Snapshot integrity check failed');
}
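One way `calculateHash` can be made deterministic is to hash a canonical JSON serialization of the snapshot contents; a sketch under that assumption (the exact serialization SchemaX uses may differ):

```python
import hashlib
import json

def calculate_hash(state: dict, operations: list) -> str:
    """SHA-256 over a canonical JSON serialization of state + operations.
    Sorted keys and compact separators make the digest deterministic."""
    canonical = json.dumps(
        {"state": state, "operations": operations},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Any change to the state or the operation list produces a different digest, so a mismatch against the stored hash reliably signals tampering or corruption.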

Extension Points

Adding a New Provider

  1. Implement Provider Interface - See Provider Contract
  2. Register Provider - Add to registry
  3. Test Compliance - Run provider compliance tests
  4. Document - Add provider-specific docs

Adding New Operations

  1. Define Operation - Add to provider's operations.ts
  2. Update State Reducer - Handle in state-reducer.ts
  3. Update SQL Generator - Generate DDL in sql-generator.ts
  4. Add UI - Create UI for operation
  5. Test - Add operation tests

Extending UI

  1. Use Provider Capabilities - Check what provider supports
  2. Adapt to Hierarchy - Use provider's hierarchy definition
  3. Dynamic Forms - Generate forms based on operation metadata
  4. Provider-Specific Components - Add provider-specific features

Multi-Environment Support (v4)

SchemaX v4 introduces comprehensive multi-environment support, enabling users to design schemas once and deploy them to multiple environments (dev, test, prod) with different physical catalog names.

Logical vs Physical Naming

Design Pattern:

  • Logical names stored in SchemaX state (environment-agnostic)
  • Physical names generated at SQL generation time (environment-specific)
  • Mapping defined in project.json environment configuration

Example:

// Logical state (in changelog.json)
{
  "catalogs": [{ "name": "__implicit__" }],
  "schemas": [{ "name": "customer_360" }]
}

// Environment configuration (in project.json v4)
{
  "provider": {
    "environments": {
      "dev": { "catalog": "dev_my_analytics" },
      "prod": { "catalog": "prod_my_analytics" }
    }
  }
}

// Generated SQL
schemax sql --target dev  → CREATE SCHEMA `dev_my_analytics`.`customer_360`;
schemax sql --target prod → CREATE SCHEMA `prod_my_analytics`.`customer_360`;

Project Schema v4

Environment Configuration:

{
  "version": 4,
  "provider": {
    "type": "unity",
    "environments": {
      "dev": {
        "catalog": "dev_analytics",
        "description": "Development environment",
        "allowDrift": true,
        "requireSnapshot": false,
        "autoCreateCatalog": true,
        "autoCreateSchemaxSchema": true
      },
      "prod": {
        "catalog": "prod_analytics",
        "description": "Production environment",
        "allowDrift": false,
        "requireSnapshot": true,
        "autoCreateCatalog": false
      }
    }
  }
}

Environment Settings:

| Setting                 | Type    | Description                                |
|-------------------------|---------|--------------------------------------------|
| catalog                 | string  | Physical catalog name in target system     |
| description             | string  | Human-readable description                 |
| allowDrift              | boolean | Allow actual state to differ from SchemaX  |
| requireSnapshot         | boolean | Require snapshot before deployment         |
| autoCreateCatalog       | boolean | Create catalog if it doesn't exist         |
| autoCreateSchemaxSchema | boolean | Auto-create tracking schema                |
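The settings above can be loaded into a typed structure before deployment; a minimal Python sketch (the `EnvironmentConfig` fields mirror the table, the loader function is hypothetical):

```python
from dataclasses import dataclass

@dataclass
class EnvironmentConfig:
    catalog: str
    description: str = ""
    allow_drift: bool = False
    require_snapshot: bool = False
    auto_create_catalog: bool = False
    auto_create_schemax_schema: bool = False

def load_environment(project: dict, name: str) -> EnvironmentConfig:
    """Read one environment from a project.json v4 dict; keys are camelCase in the file."""
    envs = project["provider"]["environments"]
    if name not in envs:
        raise KeyError(f"unknown environment: {name} (have: {sorted(envs)})")
    e = envs[name]
    return EnvironmentConfig(
        catalog=e["catalog"],
        description=e.get("description", ""),
        allow_drift=e.get("allowDrift", False),
        require_snapshot=e.get("requireSnapshot", False),
        auto_create_catalog=e.get("autoCreateCatalog", False),
        auto_create_schemax_schema=e.get("autoCreateSchemaxSchema", False),
    )
```

Only `catalog` is required; every other setting falls back to its safe default (no drift, no auto-creation).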

Catalog Name Mapping

Implementation (TypeScript):

// extension.ts
function buildCatalogMapping(
  state: any,
  envConfig: EnvironmentConfig
): Record<string, string> {
  const catalogs = state.catalogs || [];

  if (catalogs.length === 1) {
    const logicalName = catalogs[0].name;   // e.g., "__implicit__"
    const physicalName = envConfig.catalog; // e.g., "dev_analytics"
    return { [logicalName]: physicalName };
  }

  if (catalogs.length > 1) {
    throw new Error("Multi-catalog projects not yet supported");
  }

  return {};
}

// UnitySQLGenerator
constructor(state: UnityState, catalogNameMapping?: Record<string, string>) {
  this.catalogNameMapping = catalogNameMapping || {};
  // Mapping applied when building fully-qualified names
}

Benefits:

  • ✅ Clean separation: logical names in state, physical names in deployment
  • ✅ Automatic mapping for single-catalog projects (90%+ use case)
  • ✅ Environment isolation: same schema → different catalogs per environment
  • ✅ Git-friendly: logical names in version control, physical names in config

Implicit Catalog Mode

For single-catalog projects (the vast majority), SchemaX uses implicit catalog mode to simplify the user experience.

User Experience:

Instead of:

  1. Configure environment catalog: dev_analytics
  2. Create catalog in UI: analytics
  3. Map during deployment: analytics → dev_analytics

Users now experience:

  1. Configure environment catalog: dev_analytics
  2. Directly add schemas (no catalog creation step)
  3. Auto-map: __implicit__ → dev_analytics

Implementation:

// Project settings
{
  "settings": {
    "catalogMode": "single" // Default mode
  }
}

// Auto-created implicit catalog
{
  "catalogs": [
    {
      "id": "cat_implicit",
      "name": "__implicit__", // Special marker
      "schemas": [...]
    }
  ]
}

UI Changes:

  • ✅ No "+ Catalog" button in single-catalog mode
  • ✅ Schemas shown at root level (flat hierarchy)
  • ✅ Catalog layer invisible to user
  • ✅ Physical catalog names shown in SQL generation prompts

Benefits:

  • Simpler mental model (schemas and tables, not catalogs)
  • Matches how users think ("I'm designing customer data")
  • Physical catalog per environment already configured
  • No confusion about logical vs physical names

Deployment Tracking

SchemaX tracks deployments in the target catalog itself using a dedicated schemax schema:

-- Auto-created on first deployment
CREATE SCHEMA IF NOT EXISTS <catalog>.schemax;

-- Deployment history
CREATE TABLE <catalog>.schemax.deployments (
  id STRING,
  environment STRING,
  snapshot_version STRING,
  deployed_at TIMESTAMP,
  deployed_by STRING,
  status STRING, -- pending/success/failed
  ops_count INT,
  error_message STRING,
  sql_executed STRING,
  PRIMARY KEY (id)
);

-- Per-operation tracking
CREATE TABLE <catalog>.schemax.deployment_ops (
  deployment_id STRING,
  op_id STRING,
  op_type STRING,
  sql_statement STRING,
  status STRING,
  execution_order INT,
  PRIMARY KEY (deployment_id, op_id)
);

Benefits:

  • ✅ Queryable audit trail
  • ✅ Multi-user visibility
  • ✅ Compliance and governance
  • ✅ Tracks partial failures

Apply Command Workflow

# Generate SQL for dev
schemax sql --target dev --output migration.sql

# Preview changes (dry run)
schemax apply --target dev --profile DEV --warehouse-id abc123 --dry-run

# Apply with confirmation
schemax apply --target dev --profile DEV --warehouse-id abc123

# Non-interactive (CI/CD)
schemax apply --target dev --profile DEV --warehouse-id abc123 --no-interaction

Execution Flow:

  1. Load project and validate environment config
  2. Build catalog name mapping (logical → physical)
  3. Generate SQL with mapped names
  4. Show preview and prompt for confirmation
  5. Execute SQL statements sequentially
  6. Record deployment in <catalog>.schemax.deployments
  7. On failure: stop immediately, record error, show status

Future Architecture Plans

Stage 2 Enhancements

  1. Multi-Catalog Projects - Support explicit multiple catalogs per environment
  2. Provider Plugins - Load providers from external packages
  3. Cross-Provider References - Reference objects across providers
  4. Provider Marketplace - Community-contributed providers

Advanced Features

  1. Drift Detection - Compare deployed state vs SchemaX state
  2. Impact Analysis - Show what a change will affect
  3. Rollback Support - Revert to previous snapshots
  4. State Diffs - Visual comparison between versions
  5. Schema Import - Reverse-engineer existing catalogs into SchemaX
  6. DAB Generation - Export as Databricks Asset Bundles

Testing & Quality Assurance

Test Coverage

SchemaX includes comprehensive test suites for both Python and TypeScript implementations:

Python SDK Test Status:

  • ✅ 124 passing tests (91.2%)
  • ⏸️ 12 skipped tests (8.8%) - documented in GitHub issues #19, #20
  • Test Categories:
    • Storage operations (28 tests)
    • State reducer (29 tests)
    • Provider system (14 tests)
    • SQL generator (39 passing, 9 skipped)
    • Integration workflows (13 passing, 4 skipped)

Test Helpers:

Tests use OperationBuilder pattern for creating operations:

from tests.utils import OperationBuilder

builder = OperationBuilder()
op = builder.add_catalog("cat_123", "bronze", op_id="op_001")
# Automatically includes: id, ts, provider, op, target, payload

SQL Validation

SchemaX includes optional SQLGlot integration for validating generated SQL:

import sqlglot

# Validate Databricks SQL
sql = "ALTER TABLE `catalog`.`schema`.`table` ADD COLUMN `id` BIGINT"
parsed = sqlglot.parse_one(sql, dialect="databricks")
assert parsed is not None # Valid SQL

Benefits:

  • Catches SQL syntax errors early
  • Supports multiple SQL dialects (Databricks, PostgreSQL, MySQL, etc.)
  • Can be integrated into CI/CD pipelines

Installation:

pip install "sqlglot>=20.0.0"
# or
pip install "schemaxpy[validation]"

Conclusion

SchemaX's provider-based architecture provides:

Extensibility - Easy to add new catalog types
Flexibility - Each provider can have unique features
Type Safety - Strong typing throughout
Performance - Fast state loading and SQL generation
Maintainability - Clear module boundaries
Future-Proof - Ready for new catalog systems
Well-Tested - Comprehensive test coverage with 124+ passing tests
SQL Validation - Optional SQLGlot integration for quality assurance

For more details: