Delta Lake Vacuum Strategy

S0PMOBAI Publix Mobile Chat API

Document Version: 1.0
Last Updated: December 16, 2025
Status: Approved for Implementation

Table of Contents

  1. What is VACUUM Strategy
  2. Key Points
  3. Strategy Description
  4. Code Affected
  5. Implementation
  6. Final Recommendations

What is VACUUM Strategy

Definition

VACUUM is a Delta Lake maintenance operation that permanently removes data files that are no longer needed, reclaiming storage space while maintaining data integrity and time-travel capabilities.

How It Works

When Delta tables are updated:

  1. New data files are written with current data
  2. Old data files are marked for deletion (but not removed immediately)
  3. Transaction logs track which files are current vs. obsolete
  4. VACUUM operation removes files older than the retention period
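
A quick way to preview this lifecycle is Delta's DRY RUN mode, which lists the files a vacuum would remove without deleting anything. A minimal sketch against one of our tables:

# DRY RUN only reports candidate files - nothing is physically deleted
files_to_delete = spark.sql("VACUUM solr_product_df RETAIN 168 HOURS DRY RUN")
files_to_delete.show(truncate=False)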

Why It's Critical

  • Storage Cost Management: Prevents accumulation of obsolete data files
  • Performance: Reduces metadata overhead from tracking too many files
  • Compliance: Enforces data retention policies
  • Time-Travel Balance: Maintains historical versions for specified duration

The Problem Without VACUUM

Our ETL jobs use mode("overwrite") which creates a complete new set of files with each run:

productFeed.write.format("delta").mode("overwrite").saveAsTable("solr_product_df")
  • Daily job run = New file set created, old files orphaned
  • 30 days without vacuum = up to 30x storage consumption
  • Result: Storage costs that grow with every run until old files are removed

How VACUUM Actually Deletes Files

Critical Understanding: Retention Period ≠ Automatic Deletion

The retention period is just a threshold parameter - it does NOT automatically delete anything.

The retention period tells VACUUM: "Don't delete files newer than X hours"

The Actual Deletion Process

VACUUM is the ONLY mechanism that physically deletes old files. Here's how it works:

When You Execute VACUUM:

from delta.tables import DeltaTable

DeltaTable.forName(spark, "my_table").vacuum(168)  # 168 hours = 7 days

Step 1: Scan Transaction Log

  • VACUUM reads the Delta table's _delta_log directory
  • Identifies which data files are currently referenced (active files)
  • Identifies which files are no longer referenced (orphaned files)

Step 2: Check File Age

  • For each orphaned file, VACUUM checks its creation timestamp
  • Compares the file age against the retention threshold (7 days in this example)

Step 3: Physical Deletion

  • Files that meet BOTH conditions are deleted:
    • ✅ NOT referenced in transaction log (orphaned)
    • ✅ AND older than retention period (> 7 days old)
  • Files are physically removed from Azure Blob Storage
  • Storage space is immediately reclaimed

Step 4: Preserve Recent History

  • Files newer than retention period are kept (enables time-travel)
  • Currently active files are never deleted (protects data integrity)

Your Vacuum Notebook IS the Deletion Process

The scheduled notebook (delta-table-vacuum-maintenance.py) IS the actual deletion mechanism:

# This code PHYSICALLY DELETES files from Azure Blob Storage
DeltaTable.forName(spark, table_name).vacuum(table_retention)

Without running this notebook, NO files are ever deleted, regardless of retention settings.

Scheduled Execution = Scheduled Deletion

Your Databricks job schedule determines WHEN files get deleted:

| Environment | Schedule | When Files Are Deleted |
| --- | --- | --- |
| PRD | Weekly (Sunday 2 AM) | Every Sunday at 2 AM |
| STG | Weekly (Sunday 3 AM) | Every Sunday at 3 AM |
| TST | Weekly (Sunday 4 AM) | Every Sunday at 4 AM |
| DEV | Daily (Midnight) | Every day at midnight |

If the job fails or is disabled, files will NOT be deleted for that period.

What Happens Without VACUUM

If you never run VACUUM:

❌ Retention period setting = meaningless (no process using it)
❌ Old files = accumulate forever
❌ Storage = grows without bound
❌ Cost = keeps increasing
❌ Performance = degrades over time

Setting a retention period without running VACUUM accomplishes nothing.
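
The retention threshold can also be stored on the table itself via Delta table properties, which a parameterless vacuum() then uses as its default. A minimal sketch (the property values shown are illustrative):

# Setting these properties deletes nothing by itself - they are only
# consulted when VACUUM actually runs
spark.sql("""
    ALTER TABLE solr_product_df SET TBLPROPERTIES (
        'delta.deletedFileRetentionDuration' = 'interval 14 days',
        'delta.logRetentionDuration' = 'interval 30 days'
    )
""")

from delta.tables import DeltaTable

# With no argument, vacuum() honors the table's retention property
DeltaTable.forName(spark, "solr_product_df").vacuum()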

Deletion Flow Diagram

┌─────────────────────────────────────────────────────────────┐
│ 1. Write Operation (overwrite/append/merge)                 │
│ → Creates new data files in Azure Blob Storage              │
└─────────────────────────┬───────────────────────────────────┘
                          ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Transaction Log Update                                   │
│ → Old files marked as "orphaned" (no longer referenced)     │
└─────────────────────────┬───────────────────────────────────┘
                          ▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Time Passes                                              │
│ → Orphaned files age beyond retention period                │
└─────────────────────────┬───────────────────────────────────┘
                          ▼
┌─────────────────────────────────────────────────────────────┐
│ 4. VACUUM RUNS (via scheduled notebook)                     │
│ → Scans transaction log for orphaned files                  │
│ → Checks file ages against retention threshold              │
└─────────────────────────┬───────────────────────────────────┘
                          ▼
┌─────────────────────────────────────────────────────────────┐
│ 5. Physical Deletion                                        │
│ → Eligible files removed from Azure Blob Storage            │
│ → Storage space immediately reclaimed                       │
└─────────────────────────┬───────────────────────────────────┘
                          ▼
┌─────────────────────────────────────────────────────────────┐
│ 6. Cost Savings                                             │
│ → Reduced storage consumption                               │
│ → Lower Azure storage bills                                 │
└─────────────────────────────────────────────────────────────┘

Verification After VACUUM

You can verify that VACUUM actually deleted files:

Check Active File Count:

# DESCRIBE DETAIL reports files referenced by the CURRENT table version only.
# OPTIMIZE lowers this number; VACUUM does not, because vacuum removes files
# that are no longer referenced by any version.
details = spark.sql("DESCRIBE DETAIL table_name").collect()[0]
print(f"Active files: {details.numFiles}")

Check Storage Directory:

storage_url = get_s0pmobai_storage_connection("container")
files = dbutils.fs.ls(f"{storage_url}/path")
print(f"Files in storage: {len(files)}")

Key Takeaways

🔑 Retention period = safety threshold, NOT automatic deletion
🔑 VACUUM command = actual deletion mechanism
🔑 Scheduled notebook = when deletion happens
🔑 No VACUUM runs = no files deleted = costs grow
🔑 VACUUM must run regularly to maintain storage costs

Your vacuum maintenance notebook is not just monitoring storage - it IS the deletion process.

Key Points

1. Write Pattern Determines Strategy

| Write Pattern | File Accumulation | Priority | Vacuum Frequency |
| --- | --- | --- | --- |
| Overwrite | Very High | Critical | Weekly |
| Append | Moderate | Medium | Bi-weekly/Monthly |
| Merge/Upsert | Low | Low | Monthly |
| Read-Only | None | N/A | Not needed |

Our tables use OVERWRITE → Highest priority for vacuum maintenance

2. Retention Period Trade-offs

| Retention | Storage Cost | Time-Travel | Use Case | Risk Level |
| --- | --- | --- | --- | --- |
| 24 hours | Minimal | 1 day | Development | High |
| 48 hours (2 days) | Low | 2 days | Development | Medium |
| 96 hours (4 days) | Low-Moderate | 4 days | Testing | Medium-Low |
| 168 hours (7 days) | Moderate | 1 week | Staging | Low |
| 336 hours (14 days) | High | 2 weeks | Production | Very Low |
| 720 hours (30 days) | Very High | 1 month | Compliance | Very Low |

3. Time-Travel Implications

# Only works if the underlying files still exist (within retention period)
df = spark.read.format("delta") \
    .option("versionAsOf", 5) \
    .table("solr_product_df")

df = spark.read.format("delta") \
    .option("timestampAsOf", "2025-12-10") \
    .table("solr_product_df")

After vacuum: Historical data beyond retention period is permanently deleted.
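
To see which versions the table still records, inspect its history. Note that a version can appear in the history yet fail to load if its data files were already vacuumed; log entries outlive the physical files. A minimal sketch:

# List recorded versions; loadable versions are limited by vacuum retention
history = spark.sql("DESCRIBE HISTORY solr_product_df")
history.select("version", "timestamp", "operation").show(truncate=False)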

4. Safety Mechanisms

  • Default retention: 7 days minimum (enforced by Delta Lake)
  • Safety check: Prevents vacuum with retention < 168 hours unless explicitly disabled
  • Concurrent read protection: Files in active use are protected
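
The retention safety check surfaces as an exception. A minimal sketch of what triggering and (deliberately) bypassing it looks like:

from delta.tables import DeltaTable

# Delta refuses retention below 168 hours by default
try:
    DeltaTable.forName(spark, "solr_product_df").vacuum(24)
except Exception as e:
    print(f"Safety check blocked the vacuum: {e}")

# Only after explicitly acknowledging the risk can the threshold be lowered
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
DeltaTable.forName(spark, "solr_product_df").vacuum(24)
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "true")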

5. Optimization Synergy

# Best practice: Optimize BEFORE vacuum
spark.sql("OPTIMIZE table_name") # Compact small files
DeltaTable.forName(spark, "table_name").vacuum(168) # Then clean up

Benefits: Better performance + more efficient vacuum

6. Optimization Techniques Comparison

OPTIMIZE

  • Purpose: Compacts small files into larger files
  • When Applied: After table creation, can run anytime
  • Frequency: Weekly (in vacuum job)
  • Code Changes: None - SQL command only
  • Our Usage: ✅ Applied to all 25 tables weekly

ZORDER

  • Purpose: Co-locates related data for data skipping
  • When Applied: During OPTIMIZE operation
  • Frequency: Weekly (in vacuum job)
  • Code Changes: None - managed in SQL configuration
  • Performance Benefit: 80-95% reduction in files scanned for filtered queries
  • Our Usage: ✅ Applied to all high-query tables (StoreNumber, ItemCode, etc.)

PARTITIONING

  • Purpose: Physically separates data into directories by column value
  • When Applied: ⚠️ Must be defined at table creation time
  • Code Changes: ⚠️ Requires ETL notebook modifications
  • Re-partitioning: ⚠️ Requires full table rewrite
  • Our Usage: ❌ Not currently used (see analysis below)

Table Partitioning Analysis

What is Partitioning?

Partitioning divides a Delta table into physical directories based on one or more column values:

# Example: Partitioning by date
df.write.format("delta") \
    .partitionBy("order_date") \
    .mode("append") \
    .saveAsTable("orders")

# Creates directory structure:
# /orders/order_date=2026-01-01/
# /orders/order_date=2026-01-02/
# /orders/order_date=2026-01-03/

Query Optimization:

# Only scans order_date=2026-01-05 directory
spark.sql("SELECT * FROM orders WHERE order_date = '2026-01-05'")

Key Differences: Partitioning vs. ZORDER

| Aspect | Partitioning | ZORDER |
| --- | --- | --- |
| Setup | Table creation time | Anytime after creation |
| Code Changes | Required (ETL notebooks) | None (SQL config only) |
| Change Cost | Full table rewrite | Just run OPTIMIZE |
| Best For | Time-series, low cardinality | High cardinality, multiple query patterns |
| Directory Structure | Separate folders per partition | Single directory, optimized layout |
| Cardinality | 10-1000 partitions ideal | Unlimited distinct values |
| Write Pattern | APPEND/MERGE preferred | Works with any pattern |

Why Partitioning Is Not Used Today

1. OVERWRITE Pattern Conflicts

Our ETL jobs use mode("overwrite"):

productFeed.write.format("delta").mode("overwrite").saveAsTable("solr_product_df")
  • Problem: OVERWRITE replaces entire table, negating partition benefits
  • Solution: Would need to switch to APPEND/MERGE (major code change)

2. Table Sizes Below Threshold

| Table | Size | Partitioning Threshold | Recommendation |
| --- | --- | --- | --- |
| StoreProductData | 48 GB | 500+ GB recommended | ❌ Too small |
| solr_product_df | ~15 GB | 500+ GB recommended | ❌ Too small |
| Other tables | < 10 GB | 500+ GB recommended | ❌ Too small |

Rule of Thumb: Partitioning becomes beneficial at 500 GB - 1 TB scale

3. High Cardinality Partition Keys

Potential partition columns have too many distinct values:

| Column | Distinct Values | Ideal Range | Assessment |
| --- | --- | --- | --- |
| StoreNumber | 1,500+ stores | 10-1000 | ❌ Too granular |
| ItemCode | 100,000+ products | 10-1000 | ❌ Far too granular |
| SearchCategory1 | 50-100 categories | 10-1000 | ⚠️ Borderline (but no date filter) |

Problem: 1,500+ partitions = 1,500+ directories = metadata overhead
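
Candidate partition-key cardinality is easy to verify before committing to a layout. A minimal sketch (a quick way to check the distinct-value counts cited above):

from pyspark.sql.functions import countDistinct

# Distinct counts for the candidate partition keys
spark.table("solr_store_product_df").select(
    countDistinct("StoreNumber").alias("distinct_stores"),
    countDistinct("ItemCode").alias("distinct_items"),
).show()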

4. No Consistent Time-Based Queries

Ideal partitioning scenario:

-- Time-series data with consistent date filters
SELECT * FROM logs
WHERE log_date BETWEEN '2026-01-01' AND '2026-01-07'

Our queries:

-- Product lookups by ID (not date-based)
SELECT * FROM solr_product_df WHERE ItemCode = 12345

-- Store-product lookups (not date-based)
SELECT * FROM StoreProductData WHERE StoreNumber = 1234 AND ItemCode = 5678

Reality: Our mobile API queries filter by ID/Store/Category, not date

5. ZORDER Provides Similar Benefits Without Drawbacks

ZORDER Advantages:

  • ✅ No code changes needed
  • ✅ Works with OVERWRITE pattern
  • ✅ Handles high cardinality (ItemCode, StoreNumber)
  • ✅ Can change columns anytime (SQL config)
  • ✅ 80-95% of partitioning benefits

Current Implementation:

-- StoreProductData optimized for store-product queries
OPTIMIZE delta.`{path}` ZORDER BY (StoreNumber, ItemCode)

-- Result: Similar query performance to partitioning
SELECT * FROM StoreProductData
WHERE StoreNumber = 1234 -- Data skipping works here!

When to Reconsider Partitioning

Revisit partitioning IF:

  1. Table Growth

    • Tables exceed 500 GB
    • Storage costs become significant despite vacuum
  2. Write Pattern Change

    • Switch from OVERWRITE to APPEND/MERGE
    • Incremental updates become the norm
  3. Query Pattern Evolution

    • Queries consistently filter by date
    • Add LastModifiedDate or ProcessDate column
    • Time-based analysis becomes primary use case
  4. Example Future Scenario:

# If we add date columns and switch to APPEND/MERGE
storeProductFeed.write.format("delta") \
    .partitionBy("ProcessDate") \
    .mode("append") \
    .saveAsTable("StoreProductData")

# Then queries like this become efficient
spark.sql("""
    SELECT * FROM StoreProductData
    WHERE ProcessDate = '2026-01-05' AND StoreNumber = 1234
""")

Implementation Example (For Future Reference)

If partitioning is needed, here's how:

# 1. Add partition column to DataFrame
from pyspark.sql.functions import current_date

productFeed = productFeed.withColumn("LoadDate", current_date())

# 2. Define partitioning at write time (APPEND - NOT overwrite!)
productFeed.write.format("delta") \
    .partitionBy("LoadDate") \
    .mode("append") \
    .saveAsTable("solr_product_df")

# 3. Update configuration in DeltaVacuumConfiguration:
#    add a PartitionColumns column (similar to ZOrderColumns)

Considerations:

  • Requires changing 8+ ETL notebooks
  • Testing across all environments
  • Full table recreation (backfill historical data)
  • Estimated effort: 2-3 weeks

Current Optimization Strategy: OPTIMAL

Our current approach is correct:

  • ✅ OPTIMIZE - Compact small files (weekly)
  • ✅ ZORDER - Data skipping for filtered queries (StoreNumber, ItemCode)
  • ✅ VACUUM - Reclaim storage (environment-aware retention)
  • ✅ No Partitioning - Not needed at current scale and patterns

Performance Results:

  • Query speedup: 5-10x with ZORDER (vs. no optimization)
  • Storage savings: 80-90% with VACUUM (vs. no cleanup)
  • Maintenance overhead: Minimal (automated weekly job)

Cost-Benefit Analysis:

  • ZORDER implementation: 2 days (✅ DONE)
  • Partitioning implementation: 2-3 weeks
  • Performance difference: < 10% improvement
  • Recommendation: Wait until tables reach 500+ GB

Strategy Description

Environment-Aware Retention Policy

Environment Detection: Uses ENVIRONMENT variable with values:

  • PRODUCTION or PRD → Production environment
  • STAGING or STG → Staging environment
  • TESTING or TST → Testing environment
  • DEVELOPMENT or DEV → Development environment (default)
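
The deployed notebook (shown under Implementation) detects the environment from the workspace URL and notebook path; an equivalent mapping driven purely by the ENVIRONMENT variable described above might look like this sketch:

import os

# Normalize the ENVIRONMENT job variable to an environment code and retention
RETENTION_BY_ENV = {"prd": 336, "stg": 168, "tst": 96, "dev": 48}
ALIASES = {
    "production": "prd", "prd": "prd",
    "staging": "stg", "stg": "stg",
    "testing": "tst", "tst": "tst",
    "development": "dev", "dev": "dev",
}

raw = os.environ.get("ENVIRONMENT", "DEVELOPMENT").strip().lower()
environment = ALIASES.get(raw, "dev")  # default to DEV when unrecognized
retention_hours = RETENTION_BY_ENV[environment]
print(f"Environment: {environment}, retention: {retention_hours} hours")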

Production (PRD)

  • Environment Variable: ENVIRONMENT=PRODUCTION or ENVIRONMENT=PRD
  • Retention: 336 hours (14 days)
  • Rationale:
    • Maximum safety for business-critical data
    • Extended time-travel for incident investigation
    • Compliance with data retention requirements
    • Sufficient rollback window for production issues
  • Vacuum Schedule: Weekly (Sunday 2:00 AM)
  • Cost Impact: Higher storage costs, but acceptable for production safety

Staging (STG)

  • Environment Variable: ENVIRONMENT=STAGING or ENVIRONMENT=STG
  • Retention: 168 hours (7 days)
  • Rationale:
    • Balanced approach for pre-production validation
    • Adequate rollback capability for testing scenarios
    • Moderate storage costs
    • Mimics production patterns without full cost
  • Vacuum Schedule: Weekly (Sunday 3:00 AM)
  • Cost Impact: Moderate storage costs, good balance

Testing (TST)

  • Environment Variable: ENVIRONMENT=TESTING or ENVIRONMENT=TST
  • Retention: 96 hours (4 days)
  • Rationale:
    • Moderate retention for QA and integration testing
    • Sufficient time window for test cycle validation
    • Enables defect reproduction and investigation
    • Lower cost than staging while maintaining test integrity
  • Vacuum Schedule: Weekly (Sunday 4:00 AM)
  • Cost Impact: Low-moderate storage costs, optimized for testing needs

Development (DEV)

  • Environment Variable: ENVIRONMENT=DEVELOPMENT or ENVIRONMENT=DEV
  • Retention: 48 hours (2 days)
  • Rationale:
    • Aggressive cleanup to minimize development costs
    • Short retention acceptable since dev data changes frequently
    • Rapid iteration with minimal historical requirements
    • Cost optimization priority
  • Vacuum Schedule: Daily or after major data loads
  • Cost Impact: Minimal storage costs, optimal for development

Relevance to Business Operations

Data Freshness

Our jobs run on schedules that create complete data refreshes:

  • Product data: Daily/Weekly full refresh
  • Store product data: Daily updates
  • Supporting tables: As-needed basis

With full overwrite pattern, each run creates 100% new files → Vacuum is essential

Storage Cost Management

Without vacuum:

  • Baseline: 100 GB per table
  • After 30 days without vacuum: ~3,000 GB (30x growth)
  • Annual impact: Significant cloud storage costs

With proper vacuum:

  • Sustained size: ~100-200 GB per table (1-2 versions)
  • Cost savings: 90%+ reduction in storage

Regulatory Compliance

  • Data retention policies: Must not retain data beyond approved periods
  • Audit requirements: Must maintain adequate history for investigations
  • Balance: Vacuum strategy enforces both requirements

Code Affected

Delta Tables Currently Written (Primary Targets)

1. Product Tables - solr-product-transform.py

# Lines 323-328
productFeed.write.format("delta").mode("overwrite").saveAsTable("solr_product_df")
productNounFeed.write.format("delta").mode("overwrite").saveAsTable("solr_product_noun_df")
productSynonymFeed.write.format("delta").mode("overwrite").saveAsTable("solr_product_synonym_df")
productUpcFeed.write.format("delta").mode("overwrite").saveAsTable("solr_product_upc_df")
productNutritionFeed.write.format("delta").mode("overwrite").saveAsTable("solr_product_nutritionn_df")
productIngredientsFeed.write.format("delta").mode("overwrite").saveAsTable("solr_product_ingredientss_df")

Impact: 6 tables with full overwrite → High file accumulation
Vacuum Priority: CRITICAL

2. Store Product Table - solr-store-product-transform.py

# Line 168
storeProductFeed.write.format("delta").mode("overwrite").saveAsTable("solr_store_product_df")

Impact: 1 table with full overwrite → High file accumulation
Vacuum Priority: CRITICAL

3. Feed Table - solr-store-product-prepare-dataframe.py

# Line 21
solrFeed.write.format("delta").mode("overwrite").saveAsTable("solr_feed_df")

# Line 30 - Change Data Feed enabled
spark.sql(f"Alter table delta.`{solrFeedUrl}` set tblproperties (delta.enableChangeDataFeed = true)")

Impact: 1 table with full overwrite + CDC enabled
Vacuum Priority: CRITICAL
Note: Change Data Feed may affect vacuum behavior - monitor carefully

Tables with Read Operations Only (No Vacuum Needed)

Delta Tables Read via DeltaTable.forPath()

deals_picked_for_you_data_refresh.py

Impact: Read-only operations don't create orphaned files
Vacuum Priority: NOT NEEDED from these jobs

Summary of Tables Requiring Vacuum

Managed Delta Tables (Metastore)

| Table Name | Location | Write Pattern | Priority |
| --- | --- | --- | --- |
| solr_product_df | solr-product-transform.py | Overwrite | CRITICAL |
| solr_product_noun_df | solr-product-transform.py | Overwrite | CRITICAL |
| solr_product_synonym_df | solr-product-transform.py | Overwrite | CRITICAL |
| solr_product_upc_df | solr-product-transform.py | Overwrite | CRITICAL |
| solr_product_nutritionn_df | solr-product-transform.py | Overwrite | CRITICAL |
| solr_product_ingredientss_df | solr-product-transform.py | Overwrite | CRITICAL |
| solr_store_product_df | solr-store-product-transform.py | Overwrite | CRITICAL |
| solr_feed_df | solr-store-product-prepare-dataframe.py | Overwrite | CRITICAL |

Subtotal: 8 managed tables

External Delta Tables (Azure Storage)

| Table Name | Container | Location | Priority |
| --- | --- | --- | --- |
| DPFY | deals-picked-for-you | /DPFY | HIGH |
| CouponCategories | coupons | /CouponProductId/CouponCategories | HIGH |
| CouponDetails | coupons | /CouponProductId/CouponDetails | HIGH |
| CouponItemcode | coupons | /CouponProductId/CouponItemcode | HIGH |
| StoreCouponID | coupons | /CouponProductId/StoreCouponID | HIGH |
| StoreAddress | stores | /storeDetails/Address | HIGH |
| StoreDepartments | stores | /storeDetails/Departments | HIGH |
| StoreHours | stores | /storeDetails/Hours | HIGH |
| StoreImage | stores | /storeDetails/Image | HIGH |
| StorePhone | stores | /storeDetails/Phone | HIGH |
| StoreSecondaryLocation | stores | /storeDetails/SecondaryLocation | HIGH |
| StoreSecondaryLocationAddress | stores | /storeDetails/SecondaryLocationAddress | HIGH |
| StoreSecondaryLocationHours | stores | /storeDetails/SecondaryLocationHours | HIGH |
| WeeklyAdDetails | savings | /WeeklyADISS/WeeklyAdDetails | HIGH |
| WeeklyAdCategories | savings | /WeeklyADISS/WeeklyAdCategories | HIGH |
| StoreWeeklyAdID | savings | /WeeklyADISS/StoreWeeklyAdID | HIGH |
| StoreProductData | store-product-data | / | HIGH |

Subtotal: 17 external Delta tables

TOTAL: 25 Delta tables requiring regular vacuum maintenance

Note: External Delta tables may be written by external processes (Azure Data Factory, other services) and accumulate files independently of your Databricks jobs.
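
External tables have no metastore name, so they are vacuumed by storage path. A minimal sketch using the helper seen earlier in this document (the container and retention values are illustrative):

from delta.tables import DeltaTable

# Path-based vacuum for an external table in Azure Storage
storage_url = get_s0pmobai_storage_connection("store-product-data")
DeltaTable.forPath(spark, f"{storage_url}/").vacuum(336)  # PRD retention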

Implementation

Vacuum Maintenance Notebook

Location: src/etl/notebooks/main/delta-table-vacuum-maintenance.py

Features

  • ✅ Environment Auto-Detection: Automatically identifies PRD/STG/TST/DEV
  • ✅ Environment-Specific Retention: Applies appropriate policy per environment
  • ✅ Environment-Specific External Tables: Each environment only processes its available containers
  • ✅ Pre/Post Analysis: Tracks storage reclaimed
  • ✅ Optimization Integration: Runs OPTIMIZE before VACUUM
  • ✅ Database-Driven ZORDER: ZORDER columns managed centrally in DeltaVacuumConfiguration table
  • ✅ Detailed Optimization Metrics: Before/after file counts, size changes, reduction percentages
  • ✅ Error Handling: Continues on individual table failures, gracefully skips unavailable containers
  • ✅ Comprehensive Logging: Full audit trail via Application Insights
  • ✅ Summary Reporting: Detailed results with storage metrics and skipped-tables warning

Key Configuration

# Automatic environment detection and retention configuration
workspace_url = spark.conf.get("spark.databricks.workspaceUrl", "").lower()
notebook_path = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get().lower()

if "prd" in workspace_url or "prod" in workspace_url or "prd" in notebook_path:
    environment = "prd"
    retention_hours = 336  # 14 days
elif "stg" in workspace_url or "stg" in notebook_path:
    environment = "stg"
    retention_hours = 168  # 7 days
elif "tst" in workspace_url or "test" in workspace_url or "tst" in notebook_path:
    environment = "tst"
    retention_hours = 96  # 4 days
else:
    environment = "dev"
    retention_hours = 48  # 2 days

# Environment-specific external Delta paths
if environment == "prd":
    external_delta_paths = prd_external_delta_paths  # Full set
elif environment == "stg":
    external_delta_paths = stg_external_delta_paths  # Staging subset
elif environment == "tst":
    external_delta_paths = tst_external_delta_paths  # Testing subset
else:
    external_delta_paths = dev_external_delta_paths  # Development subset

Optimization Logging and Metrics

The vacuum notebook captures detailed before/after metrics during the OPTIMIZE step to track effectiveness and validate ZORDER configurations.

Metrics Tracked Per Table:

  • File Count: Number of data files before and after optimization
  • Table Size: Total size in bytes before and after optimization
  • Files Reduced: Count and percentage of files removed/consolidated
  • Size Change: Net change in table size (typically small due to compression)
  • ZORDER Columns: Columns used for data clustering optimization

Example Log Output:

2024-01-06 14:23:15 | Processing MANAGED table: hive_metastore.s0pmobai_prd.solr_store_product_df
2024-01-06 14:23:15 | Configuration: RetentionHours=336, ZOrderColumns=StoreNumber, ItemCode
2024-01-06 14:23:16 | OPTIMIZE table: hive_metastore.s0pmobai_prd.solr_store_product_df
2024-01-06 14:23:18 | Applying ZORDER BY (StoreNumber, ItemCode)
2024-01-06 14:28:42 | Files: 1623 → 156 (1467 removed, 90.4% reduction)
2024-01-06 14:28:42 | Size: 48.67 GB → 48.52 GB (change: -0.15 GB)

2024-01-06 14:35:28 | Processing MANAGED table: hive_metastore.s0pmobai_prd.solr_product_df
2024-01-06 14:35:28 | Configuration: RetentionHours=336, ZOrderColumns=ItemCode, SearchCategory1
2024-01-06 14:35:29 | OPTIMIZE table: hive_metastore.s0pmobai_prd.solr_product_df
2024-01-06 14:35:30 | Applying ZORDER BY (ItemCode, SearchCategory1)
2024-01-06 14:37:15 | Files: 487 → 42 (445 removed, 91.4% reduction)
2024-01-06 14:37:15 | Size: 12.34 GB → 12.28 GB (change: -0.06 GB)

Summary of OPTIMIZE results:
Total files reduced: 1912 files
Total size before: 61.01 GB
Total size after: 60.80 GB (change: -0.21 GB)

Interpreting the Metrics:

| Metric | Good Results | Investigation Needed |
| --- | --- | --- |
| Files Reduced % | 70-95% reduction on first run; 20-50% on subsequent runs | < 10% reduction may indicate: table already optimized, or few small files written since last run |
| Size Change | -0.5% to +2% change (compression effects) | > 5% increase may indicate: unexpected data duplication, or ZORDER causing suboptimal clustering |
| File Count After | 50-200 files for tables < 100 GB; 200-500 files for tables > 100 GB | > 1000 files after OPTIMIZE indicates: OPTIMIZE not working correctly, or continuous writes during optimization |

Summary Totals:
At the end of the optimization phase, the notebook logs aggregate metrics across all tables:

  • Total files reduced across all optimized tables
  • Total size before/after optimization across all tables
  • Net storage change (typically small negative change)

These totals help validate the overall effectiveness of the optimization strategy and identify trends over time.

ZORDER Configuration Management:
ZORDER columns are managed centrally in the DeltaVacuumConfiguration table. To modify ZORDER columns:

-- Update ZORDER columns for a specific table
UPDATE dbo.DeltaVacuumConfiguration
SET ZOrderColumns = 'ItemCode, StoreNumber, TransactionDate',
    ModifiedDate = GETDATE()
WHERE TableName = 'solr_transaction_history';

-- Disable ZORDER for a table
UPDATE dbo.DeltaVacuumConfiguration
SET ZOrderColumns = NULL,
    ModifiedDate = GETDATE()
WHERE TableName = 'solr_temporary_cache';

Changes take effect on the next scheduled vacuum run without requiring code deployment.
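
A hedged sketch of how the notebook can consume this configuration; the JDBC URL and secret names below are placeholders, not the actual connection details:

# Load ZORDER configuration and build per-table OPTIMIZE statements
config_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://<host>;database=<db>")  # placeholder
    .option("dbtable", "dbo.DeltaVacuumConfiguration")
    .option("user", dbutils.secrets.get("scope", "sql-user"))  # placeholder
    .option("password", dbutils.secrets.get("scope", "sql-password"))  # placeholder
    .load()
)

for row in config_df.collect():
    if row.ZOrderColumns:
        spark.sql(f"OPTIMIZE {row.TableName} ZORDER BY ({row.ZOrderColumns})")
    else:
        spark.sql(f"OPTIMIZE {row.TableName}")  # plain file compaction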

ZORDER Limitations:

Some external tables cannot use ZORDER and will use regular OPTIMIZE instead:

  1. Tables with Liquid Clustering: Delta Lake's newer Liquid Clustering feature is incompatible with ZORDER

    • All tables in the store container use Liquid Clustering
    • Example tables: StoreAddress, StoreDepartments, StoreHours, StoreImage, StorePhone
    • Error: DELTA_CLUSTERING_WITH_ZORDER_BY
    • These tables are automatically optimized using Liquid Clustering (no ZORDER needed)
  2. Tables with Schema Mismatches: ZORDER columns must exist in the table schema

    • Columns specified in configuration don't exist in actual data
    • Example errors: DELTA_ZORDERING_COLUMN_DOES_NOT_EXIST
    • Tables affected: Coupon tables, WeeklyAd tables, DPFY, StoreProductData
    • Set ZOrderColumns = NULL for these tables to use regular OPTIMIZE

The vacuum notebook automatically handles these errors gracefully - tables skip ZORDER but still benefit from regular OPTIMIZE (file compaction without clustering).
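
A minimal sketch of that graceful fallback, keyed off the two error classes named above:

def optimize_table(table_name, zorder_columns=None):
    # Try ZORDER first; fall back to plain OPTIMIZE when the table uses
    # Liquid Clustering or the configured columns are missing
    try:
        if zorder_columns:
            spark.sql(f"OPTIMIZE {table_name} ZORDER BY ({zorder_columns})")
        else:
            spark.sql(f"OPTIMIZE {table_name}")
    except Exception as e:
        msg = str(e)
        if ("DELTA_CLUSTERING_WITH_ZORDER_BY" in msg
                or "DELTA_ZORDERING_COLUMN_DOES_NOT_EXIST" in msg):
            print(f"{table_name}: ZORDER not applicable, compacting only")
            spark.sql(f"OPTIMIZE {table_name}")  # compaction without clustering
        else:
            raise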

Workflow

Deployment Steps

1. Upload Notebook to Databricks

# Upload to workspace
/Workspace/S0PMOBAI/main/delta-table-vacuum-maintenance

2. Create Databricks Job

Job Configuration (Production):

{
  "name": "s0pmobai-delta-vacuum-maintenance-prd",
  "schedule": {
    "quartz_cron_expression": "0 0 2 ? * SUN",
    "timezone_id": "America/New_York",
    "pause_status": "UNPAUSED"
  },
  "max_concurrent_runs": 1,
  "tasks": [
    {
      "task_key": "vacuum_maintenance",
      "notebook_task": {
        "notebook_path": "/Workspace/S0PMOBAI/main/delta-table-vacuum-maintenance",
        "source": "WORKSPACE"
      },
      "job_cluster_key": "maintenance_cluster",
      "timeout_seconds": 3600,
      "email_notifications": {
        "on_failure": ["data-engineering@publix.com"]
      }
    }
  ],
  "job_clusters": [
    {
      "job_cluster_key": "maintenance_cluster",
      "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2,
        "spark_env_vars": {
          "ENVIRONMENT": "PRODUCTION"
        }
      }
    }
  ]
}

Schedule by Environment:

  • Production: Weekly - Sunday 2:00 AM
  • Staging: Weekly - Sunday 3:00 AM
  • Testing: Weekly - Sunday 4:00 AM
  • Development: Daily - Midnight OR on-demand

3. Initial Execution

  • Run manually first time in each environment
  • Verify environment detection is correct
  • Review storage reclaimed
  • Confirm no impact on concurrent jobs

4. Monitoring Setup

  • Track execution time trends
  • Monitor storage metrics
  • Set up alerts for failures
  • Review Application Insights logs

Final Recommendations

Immediate Actions (Week 1)

  1. ✅ Deploy Vacuum Notebook

    • Upload to all environments (DEV/TST/STG/PRD)
    • Set ENVIRONMENT variable in Databricks job configuration
    • Verify environment detection logic (check logs for correct mapping)
    • Test in DEV first
  2. ✅ Configure Environment-Specific External Tables

    • Review which containers exist in each environment
    • Update prd_external_delta_paths, stg_external_delta_paths, etc.
    • Remove containers that don't exist to avoid errors
    • Test configuration in each environment
  3. ✅ Run Initial Vacuum

    • Execute manually in DEV to baseline
    • Review output and storage reclaimed
    • Check which tables were skipped (if any)
    • Document results
  4. ✅ Create Scheduled Jobs

    • Configure weekly schedule for PRD/STG/TST
    • Configure daily schedule for DEV
    • Set up failure notifications
  5. ✅ Document Process

    • Add runbook for vacuum operations
    • Train team on maintenance procedures
    • Establish escalation paths

Short-Term Actions (Month 1)

  1. 📊 Monitor Performance

    • Track storage costs weekly
    • Measure vacuum execution time
    • Review time-travel usage patterns
    • Adjust retention if needed
  2. 🔍 Validate Time-Travel Requirements

    • Confirm 14-day retention for PRD is adequate
    • Assess whether 7-day STG retention is sufficient
    • Verify 4-day TST retention meets QA test cycles
    • Consider compliance requirements
  3. 📈 Establish Metrics

    • Storage saved per vacuum run
    • Cost reduction achieved
    • Performance impact (if any)
    • Vacuum duration trends

Long-Term Strategy (Ongoing)

  1. 🔄 Regular Review Cycles

    • Monthly: Review storage trends and vacuum effectiveness
    • Quarterly: Assess retention policy appropriateness
    • Annually: Comprehensive cost/benefit analysis
  2. 🎯 Optimization Opportunities

    • Review ZORDER column choices for frequently queried columns
    • Evaluate Auto-Optimize settings
    • Assess whether any overwrite patterns can become merges
    • Review table partitioning strategies
  3. 📋 Policy Refinement

    • Adjust retention based on actual time-travel usage
    • Fine-tune schedules based on job patterns
    • Consider table-specific retention policies if needed

Best Practices to Maintain

✅ DO

  • Always OPTIMIZE before VACUUM - Maximizes benefit
  • Run during low-traffic periods - Minimizes impact
  • Monitor execution metrics - Catch issues early
  • Test in DEV first - Validate changes safely
  • Document retention decisions - Maintain audit trail
  • Review storage costs monthly - Track effectiveness

❌ DON'T

  • Don't vacuum during peak hours - May impact queries
  • Don't reduce retention below 48 hours without testing - Risk data loss
  • Don't skip OPTIMIZE - Misses performance benefits
  • Don't ignore failures - May indicate underlying issues
  • Don't disable safety checks without understanding risks - Protection exists for good reason
  • Don't change retention without stakeholder approval - May impact compliance

Success Criteria

| Metric | Target | Status |
| --- | --- | --- |
| Storage growth rate | < 10% monthly | 🎯 To measure |
| Vacuum success rate | > 95% | 🎯 To measure |
| Time-travel availability | 100% within retention | 🎯 To measure |
| Cost reduction | > 80% vs. no-vacuum | 🎯 To measure |
| Zero data loss incidents | 100% | 🎯 To maintain |

Risk Mitigation

| Risk | Mitigation | Owner |
| --- | --- | --- |
| Accidental data loss | 14-day retention in PRD | Data Engineering |
| Vacuum during active query | Schedule during off-hours | DevOps |
| Environment mis-detection | Manual verification + logging | Data Engineering |
| Compliance violation | Documented retention policies | Compliance Team |
| Cost overrun | Monthly storage monitoring | Finance/Engineering |

Emergency Procedures

If aggressive cleanup needed (storage crisis):

# Temporary measure only - use with caution
from delta.tables import DeltaTable

spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
retention_hours = 24  # Override to 24 hours

# Execute the vacuum with the reduced retention
DeltaTable.forName(spark, "table_name").vacuum(retention_hours)

# Restore the safety check (and normal retention) afterward
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "true")

If time-travel needed beyond retention:

  • Check if data exists in backups
  • Review transaction logs
  • Consider restoring from snapshot

If vacuum job fails repeatedly:

  1. Check for long-running concurrent queries
  2. Verify cluster resources are adequate
  3. Review individual table errors in logs
  4. Consider vacuuming tables individually

Per-Table Retention Periods

Custom Retention Configuration

The vacuum notebook supports table-specific retention periods for tables with special requirements:

Example Use Cases:

Compliance Tables (Longer Retention):

managed_tables_to_vacuum = [
    {"name": "audit_log_table", "retention_hours": 720},  # 30 days for compliance
    {"name": "solr_product_df", "retention_hours": None},  # Use environment default
]

High-Churn Tables (Shorter Retention):

prd_external_delta_paths = [
    {"name": "TempData", "container": "temp", "path": "/temp", "retention_hours": 24},  # 1 day only
    {"name": "CouponDetails", "container": "coupons", "path": "/...", "retention_hours": 720},  # 30 days
]

Retention Hierarchy:

  1. Table-specific retention (if specified in config)
  2. Environment default retention (if not specified)
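
A minimal sketch of how this two-level hierarchy resolves, matching the configuration shapes shown below:

def resolve_retention(table_config, environment_default):
    # Table-specific retention wins when present; otherwise fall back
    hours = table_config.get("retention_hours")
    return hours if hours is not None else environment_default

resolve_retention({"name": "audit_log_table", "retention_hours": 720}, 336)  # -> 720
resolve_retention({"name": "solr_product_df", "retention_hours": None}, 336)  # -> 336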

Configuration Format:

# Managed tables - use dict format to specify custom retention
managed_tables_to_vacuum = [
{"name": "table_name", "retention_hours": 720}, # Custom retention
{"name": "table_name", "retention_hours": None}, # Use environment default
]

# External tables - add retention_hours to existing config
external_delta_paths = [
{"name": "...", "container": "...", "path": "...", "retention_hours": 720}, # Custom
{"name": "...", "container": "...", "path": "..."} # Uses environment default
]

Common Retention Patterns:

| Use Case | Retention | Example Tables |
| --- | --- | --- |
| Compliance/Audit | 30-90 days | Financial, coupon, transaction data |
| Standard Operations | 7-14 days | Product catalogs, store info |
| High Churn/Temp | 1-2 days | Session data, temporary staging |
| Development Only | < 24 hours | Test tables in DEV |

Benefits:

  • ✅ Flexibility - Each table can have appropriate retention
  • ✅ Compliance - Meet regulatory requirements per table
  • ✅ Cost Optimization - Aggressive cleanup for non-critical tables
  • ✅ Safety - Extended retention for critical business data

Conclusion

This vacuum strategy provides a comprehensive, automated, and environment-aware approach to managing Delta Lake storage costs while maintaining data integrity and time-travel capabilities.

Key Outcomes:

  • 80-90% storage cost reduction through regular maintenance
  • Comprehensive coverage of 25 Delta tables (8 managed + 17 external)
  • Per-table retention control for compliance and cost optimization
  • Zero-touch operation via automated scheduling
  • Environment-appropriate retention balancing cost and safety
  • Full audit trail via Application Insights integration
  • Scalable framework adaptable to future table additions
  • Path-based vacuum for external Azure Storage tables
  • Graceful error handling with detailed reporting

Implementation is ready - proceed with deployment to DEV, followed by STG and PRD validation.

Document Owner: Data Engineering Team
Review Cycle: Monthly (first 3 months), then Quarterly
Next Review: January 16, 2026