Delta Lake Vacuum Strategy

S0PMOBAI Publix Mobile Chat API

Document Version: 1.0
Last Updated: December 16, 2025
Status: Approved for Implementation

Table of Contents

  1. What is VACUUM Strategy
  2. Key Points
  3. Strategy Description
  4. Code Affected
  5. Implementation
  6. Final Recommendations

What is VACUUM Strategy

Definition

VACUUM is a Delta Lake maintenance operation that permanently removes data files that are no longer needed, reclaiming storage space while maintaining data integrity and time-travel capabilities.

How It Works

When Delta tables are updated:

  1. New data files are written with current data
  2. Old data files are marked for deletion (but not removed immediately)
  3. Transaction logs track which files are current vs. obsolete
  4. VACUUM operation removes files older than the retention period
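
A quick way to preview this lifecycle is Delta's DRY RUN mode, which lists the files a vacuum would remove without deleting anything. A minimal sketch against one of our tables:

# DRY RUN only reports candidate files - nothing is physically deleted
files_to_delete = spark.sql("VACUUM solr_product_df RETAIN 168 HOURS DRY RUN")
files_to_delete.show(truncate=False)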

Why It's Critical

  • Storage Cost Management: Prevents accumulation of obsolete data files
  • Performance: Reduces metadata overhead from tracking too many files
  • Compliance: Enforces data retention policies
  • Time-Travel Balance: Maintains historical versions for specified duration

The Problem Without VACUUM

Our ETL jobs use mode("overwrite") which creates a complete new set of files with each run:

productFeed.write.format("delta").mode("overwrite").saveAsTable("solr_product_df")
  • Daily job run = New file set created, old files orphaned
  • 30 days without vacuum = up to 30x storage consumption
  • Result: Storage costs that grow with every run until old files are removed

How VACUUM Actually Deletes Files

Critical Understanding: Retention Period ≠ Automatic Deletion

The retention period is just a threshold parameter - it does NOT automatically delete anything.

The retention period tells VACUUM: "Don't delete files newer than X hours"

The Actual Deletion Process

VACUUM is the ONLY mechanism that physically deletes old files. Here's how it works:

When You Execute VACUUM:

from delta.tables import DeltaTable

DeltaTable.forName(spark, "my_table").vacuum(168)  # 168 hours = 7 days

Step 1: Scan Transaction Log

  • VACUUM reads the Delta table's _delta_log directory
  • Identifies which data files are currently referenced (active files)
  • Identifies which files are no longer referenced (orphaned files)

Step 2: Check File Age

  • For each orphaned file, VACUUM checks its creation timestamp
  • Compares the file age against the retention threshold (7 days in this example)

Step 3: Physical Deletion

  • Files that meet BOTH conditions are deleted:
    • ✅ NOT referenced in transaction log (orphaned)
    • ✅ AND older than retention period (> 7 days old)
  • Files are physically removed from Azure Blob Storage
  • Storage space is immediately reclaimed

Step 4: Preserve Recent History

  • Files newer than retention period are kept (enables time-travel)
  • Currently active files are never deleted (protects data integrity)

Your Vacuum Notebook IS the Deletion Process

The scheduled notebook (delta-table-vacuum-maintenance.py) IS the actual deletion mechanism:

# This code PHYSICALLY DELETES files from Azure Blob Storage
DeltaTable.forName(spark, table_name).vacuum(table_retention)

Without running this notebook, NO files are ever deleted, regardless of retention settings.

Scheduled Execution = Scheduled Deletion

Your Databricks job schedule determines WHEN files get deleted:

| Environment | Schedule | When Files Are Deleted |
| --- | --- | --- |
| PRD | Weekly (Sunday 2 AM) | Every Sunday at 2 AM |
| STG | Weekly (Sunday 3 AM) | Every Sunday at 3 AM |
| TST | Weekly (Sunday 4 AM) | Every Sunday at 4 AM |
| DEV | Daily (Midnight) | Every day at midnight |

If the job fails or is disabled, files will NOT be deleted for that period.

What Happens Without VACUUM

If you never run VACUUM:

❌ Retention period setting = meaningless (no process using it)
❌ Old files = accumulate forever
❌ Storage = grows without bound
❌ Cost = keeps increasing
❌ Performance = degrades over time

Setting a retention period without running VACUUM accomplishes nothing.
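
The retention threshold can also be stored on the table itself via Delta table properties, which a parameterless vacuum() then uses as its default. A minimal sketch (the property values shown are illustrative):

# Setting these properties deletes nothing by itself - they are only
# consulted when VACUUM actually runs
spark.sql("""
    ALTER TABLE solr_product_df SET TBLPROPERTIES (
        'delta.deletedFileRetentionDuration' = 'interval 14 days',
        'delta.logRetentionDuration' = 'interval 30 days'
    )
""")

from delta.tables import DeltaTable

# With no argument, vacuum() honors the table's retention property
DeltaTable.forName(spark, "solr_product_df").vacuum()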

Deletion Flow Diagram

┌─────────────────────────────────────────────────────────────┐
│ 1. Write Operation (overwrite/append/merge)                 │
│ → Creates new data files in Azure Blob Storage              │
└─────────────────────────┬───────────────────────────────────┘
                          ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Transaction Log Update                                   │
│ → Old files marked as "orphaned" (no longer referenced)     │
└─────────────────────────┬───────────────────────────────────┘
                          ▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Time Passes                                              │
│ → Orphaned files age beyond retention period                │
└─────────────────────────┬───────────────────────────────────┘
                          ▼
┌─────────────────────────────────────────────────────────────┐
│ 4. VACUUM RUNS (via scheduled notebook)                     │
│ → Scans transaction log for orphaned files                  │
│ → Checks file ages against retention threshold              │
└─────────────────────────┬───────────────────────────────────┘
                          ▼
┌─────────────────────────────────────────────────────────────┐
│ 5. Physical Deletion                                        │
│ → Eligible files removed from Azure Blob Storage            │
│ → Storage space immediately reclaimed                       │
└─────────────────────────┬───────────────────────────────────┘
                          ▼
┌─────────────────────────────────────────────────────────────┐
│ 6. Cost Savings                                             │
│ → Reduced storage consumption                               │
│ → Lower Azure storage bills                                 │
└─────────────────────────────────────────────────────────────┘

Verification After VACUUM

You can verify that VACUUM actually deleted files:

Check Active File Count:

# DESCRIBE DETAIL reports files referenced by the CURRENT table version only.
# OPTIMIZE lowers this number; VACUUM does not, because vacuum removes files
# that are no longer referenced by any version.
details = spark.sql("DESCRIBE DETAIL table_name").collect()[0]
print(f"Active files: {details.numFiles}")

Check Storage Directory:

storage_url = get_s0pmobai_storage_connection("container")
files = dbutils.fs.ls(f"{storage_url}/path")
print(f"Files in storage: {len(files)}")

Key Takeaways

🔑 Retention period = safety threshold, NOT automatic deletion
🔑 VACUUM command = actual deletion mechanism
🔑 Scheduled notebook = when deletion happens
🔑 No VACUUM runs = no files deleted = costs grow
🔑 VACUUM must run regularly to maintain storage costs

Your vacuum maintenance notebook is not just monitoring storage - it IS the deletion process.

Key Points

1. Write Pattern Determines Strategy

| Write Pattern | File Accumulation | Priority | Vacuum Frequency |
| --- | --- | --- | --- |
| Overwrite | Very High | Critical | Weekly |
| Append | Moderate | Medium | Bi-weekly/Monthly |
| Merge/Upsert | Low | Low | Monthly |
| Read-Only | None | N/A | Not needed |

Our tables use OVERWRITE → Highest priority for vacuum maintenance

2. Retention Period Trade-offs

| Retention | Storage Cost | Time-Travel | Use Case | Risk Level |
| --- | --- | --- | --- | --- |
| 24 hours | Minimal | 1 day | Development | High |
| 48 hours (2 days) | Low | 2 days | Development | Medium |
| 96 hours (4 days) | Low-Moderate | 4 days | Testing | Medium-Low |
| 168 hours (7 days) | Moderate | 1 week | Staging | Low |
| 336 hours (14 days) | High | 2 weeks | Production | Very Low |
| 720 hours (30 days) | Very High | 1 month | Compliance | Very Low |

3. Time-Travel Implications

# Only works if the underlying files still exist (within retention period)
df = spark.read.format("delta") \
    .option("versionAsOf", 5) \
    .table("solr_product_df")

df = spark.read.format("delta") \
    .option("timestampAsOf", "2025-12-10") \
    .table("solr_product_df")

After vacuum: Historical data beyond retention period is permanently deleted.
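
To see which versions the table still records, inspect its history. Note that a version can appear in the history yet fail to load if its data files were already vacuumed; log entries outlive the physical files. A minimal sketch:

# List recorded versions; loadable versions are limited by vacuum retention
history = spark.sql("DESCRIBE HISTORY solr_product_df")
history.select("version", "timestamp", "operation").show(truncate=False)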

4. Safety Mechanisms

  • Default retention: 7 days minimum (enforced by Delta Lake)
  • Safety check: Prevents vacuum with retention < 168 hours unless explicitly disabled
  • Concurrent read protection: Files in active use are protected
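
The retention safety check surfaces as an exception. A minimal sketch of what triggering and (deliberately) bypassing it looks like:

from delta.tables import DeltaTable

# Delta refuses retention below 168 hours by default
try:
    DeltaTable.forName(spark, "solr_product_df").vacuum(24)
except Exception as e:
    print(f"Safety check blocked the vacuum: {e}")

# Only after explicitly acknowledging the risk can the threshold be lowered
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
DeltaTable.forName(spark, "solr_product_df").vacuum(24)
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "true")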

5. Optimization Synergy

# Best practice: Optimize BEFORE vacuum
spark.sql("OPTIMIZE table_name") # Compact small files
DeltaTable.forName(spark, "table_name").vacuum(168) # Then clean up

Benefits: Better performance + more efficient vacuum

6. Optimization Techniques Comparison

OPTIMIZE

  • Purpose: Compacts small files into larger files
  • When Applied: After table creation, can run anytime
  • Frequency: Weekly (in vacuum job)
  • Code Changes: None - SQL command only
  • Our Usage: ✅ Applied to all 25 tables weekly

ZORDER

  • Purpose: Co-locates related data for data skipping
  • When Applied: During OPTIMIZE operation
  • Frequency: Weekly (in vacuum job)
  • Code Changes: None - managed in SQL configuration
  • Performance Benefit: 80-95% reduction in files scanned for filtered queries
  • Our Usage: ✅ Applied to all high-query tables (StoreNumber, ItemCode, etc.)

PARTITIONING

  • Purpose: Physically separates data into directories by column value
  • When Applied: ⚠️ Must be defined at table creation time
  • Code Changes: ⚠️ Requires ETL notebook modifications
  • Re-partitioning: ⚠️ Requires full table rewrite
  • Our Usage: ❌ Not currently used (see analysis below)

Table Partitioning Analysis

What is Partitioning?

Partitioning divides a Delta table into physical directories based on one or more column values:

# Example: Partitioning by date
df.write.format("delta") \
    .partitionBy("order_date") \
    .mode("append") \
    .saveAsTable("orders")

# Creates directory structure:
# /orders/order_date=2026-01-01/
# /orders/order_date=2026-01-02/
# /orders/order_date=2026-01-03/

Query Optimization:

# Only scans order_date=2026-01-05 directory
spark.sql("SELECT * FROM orders WHERE order_date = '2026-01-05'")

Key Differences: Partitioning vs. ZORDER

| Aspect | Partitioning | ZORDER |
| --- | --- | --- |
| Setup | Table creation time | Anytime after creation |
| Code Changes | Required (ETL notebooks) | None (SQL config only) |
| Change Cost | Full table rewrite | Just run OPTIMIZE |
| Best For | Time-series, low cardinality | High cardinality, multiple query patterns |
| Directory Structure | Separate folders per partition | Single directory, optimized layout |
| Cardinality | 10-1000 partitions ideal | Unlimited distinct values |
| Write Pattern | APPEND/MERGE preferred | Works with any pattern |

Why Partitioning Is Not Used Today

1. OVERWRITE Pattern Conflicts

Our ETL jobs use mode("overwrite"):

productFeed.write.format("delta").mode("overwrite").saveAsTable("solr_product_df")
  • Problem: OVERWRITE replaces entire table, negating partition benefits
  • Solution: Would need to switch to APPEND/MERGE (major code change)

2. Table Sizes Below Threshold

| Table | Size | Partitioning Threshold | Recommendation |
| --- | --- | --- | --- |
| StoreProductData | 48 GB | 500+ GB recommended | ❌ Too small |
| solr_product_df | ~15 GB | 500+ GB recommended | ❌ Too small |
| Other tables | < 10 GB | 500+ GB recommended | ❌ Too small |

Rule of Thumb: Partitioning becomes beneficial at 500 GB - 1 TB scale

3. High Cardinality Partition Keys

Potential partition columns have too many distinct values:

| Column | Distinct Values | Ideal Range | Assessment |
| --- | --- | --- | --- |
| StoreNumber | 1,500+ stores | 10-1000 | ❌ Too granular |
| ItemCode | 100,000+ products | 10-1000 | ❌ Far too granular |
| SearchCategory1 | 50-100 categories | 10-1000 | ⚠️ Borderline (but no date filter) |

Problem: 1,500+ partitions = 1,500+ directories = metadata overhead
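
Candidate partition-key cardinality is easy to verify before committing to a layout. A minimal sketch (a quick way to check the distinct-value counts cited above):

from pyspark.sql.functions import countDistinct

# Distinct counts for the candidate partition keys
spark.table("solr_store_product_df").select(
    countDistinct("StoreNumber").alias("distinct_stores"),
    countDistinct("ItemCode").alias("distinct_items"),
).show()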

4. No Consistent Time-Based Queries

Ideal partitioning scenario:

-- Time-series data with consistent date filters
SELECT * FROM logs
WHERE log_date BETWEEN '2026-01-01' AND '2026-01-07'

Our queries:

-- Product lookups by ID (not date-based)
SELECT * FROM solr_product_df WHERE ItemCode = 12345

-- Store-product lookups (not date-based)
SELECT * FROM StoreProductData WHERE StoreNumber = 1234 AND ItemCode = 5678

Reality: Our mobile API queries filter by ID/Store/Category, not date

5. ZORDER Provides Similar Benefits Without Drawbacks

ZORDER Advantages:

  • ✅ No code changes needed
  • ✅ Works with OVERWRITE pattern
  • ✅ Handles high cardinality (ItemCode, StoreNumber)
  • ✅ Can change columns anytime (SQL config)
  • ✅ 80-95% of partitioning benefits

Current Implementation:

-- StoreProductData optimized for store-product queries
OPTIMIZE delta.`{path}` ZORDER BY (StoreNumber, ItemCode)

-- Result: Similar query performance to partitioning
SELECT * FROM StoreProductData
WHERE StoreNumber = 1234 -- Data skipping works here!

When to Reconsider Partitioning

Revisit partitioning IF:

  1. Table Growth

    • Tables exceed 500 GB
    • Storage costs become significant despite vacuum
  2. Write Pattern Change

    • Switch from OVERWRITE to APPEND/MERGE
    • Incremental updates become the norm
  3. Query Pattern Evolution

    • Queries consistently filter by date
    • Add LastModifiedDate or ProcessDate column
    • Time-based analysis becomes primary use case
  4. Example Future Scenario:

# If we add date columns and switch to APPEND/MERGE
storeProductFeed.write.format("delta") \
    .partitionBy("ProcessDate") \
    .mode("append") \
    .saveAsTable("StoreProductData")

# Then queries like this become efficient
spark.sql("""
    SELECT * FROM StoreProductData
    WHERE ProcessDate = '2026-01-05' AND StoreNumber = 1234
""")

Implementation Example (For Future Reference)

If partitioning is needed, here's how:

# 1. Add partition column to DataFrame
from pyspark.sql.functions import current_date

productFeed = productFeed.withColumn("LoadDate", current_date())

# 2. Define partitioning at write time (APPEND - NOT overwrite!)
productFeed.write.format("delta") \
    .partitionBy("LoadDate") \
    .mode("append") \
    .saveAsTable("solr_product_df")

# 3. Update configuration in DeltaVacuumConfiguration:
#    add a PartitionColumns column (similar to ZOrderColumns)

Considerations:

  • Requires changing 8+ ETL notebooks
  • Testing across all environments
  • Full table recreation (backfill historical data)
  • Estimated effort: 2-3 weeks

Current Optimization Strategy: OPTIMAL

Our current approach is correct:

  • ✅ OPTIMIZE - Compact small files (weekly)
  • ✅ ZORDER - Data skipping for filtered queries (StoreNumber, ItemCode)
  • ✅ VACUUM - Reclaim storage (environment-aware retention)
  • ✅ No Partitioning - Not needed at current scale and patterns

Performance Results:

  • Query speedup: 5-10x with ZORDER (vs. no optimization)
  • Storage savings: 80-90% with VACUUM (vs. no cleanup)
  • Maintenance overhead: Minimal (automated weekly job)

Cost-Benefit Analysis:

  • ZORDER implementation: 2 days (✅ DONE)
  • Partitioning implementation: 2-3 weeks
  • Performance difference: < 10% improvement
  • Recommendation: Wait until tables reach 500+ GB

Strategy Description

Environment-Aware Retention Policy

Environment Detection: Uses ENVIRONMENT variable with values:

  • PRODUCTION or PRD → Production environment
  • STAGING or STG → Staging environment
  • TESTING or TST → Testing environment
  • DEVELOPMENT or DEV → Development environment (default)
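
The deployed notebook (shown under Implementation) detects the environment from the workspace URL and notebook path; an equivalent mapping driven purely by the ENVIRONMENT variable described above might look like this sketch:

import os

# Normalize the ENVIRONMENT job variable to an environment code and retention
RETENTION_BY_ENV = {"prd": 336, "stg": 168, "tst": 96, "dev": 48}
ALIASES = {
    "production": "prd", "prd": "prd",
    "staging": "stg", "stg": "stg",
    "testing": "tst", "tst": "tst",
    "development": "dev", "dev": "dev",
}

raw = os.environ.get("ENVIRONMENT", "DEVELOPMENT").strip().lower()
environment = ALIASES.get(raw, "dev")  # default to DEV when unrecognized
retention_hours = RETENTION_BY_ENV[environment]
print(f"Environment: {environment}, retention: {retention_hours} hours")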

Production (PRD)

  • Environment Variable: ENVIRONMENT=PRODUCTION or ENVIRONMENT=PRD
  • Retention: 336 hours (14 days)
  • Rationale:
    • Maximum safety for business-critical data
    • Extended time-travel for incident investigation
    • Compliance with data retention requirements
    • Sufficient rollback window for production issues
  • Vacuum Schedule: Weekly (Sunday 2:00 AM)
  • Cost Impact: Higher storage costs, but acceptable for production safety

Staging (STG)

  • Environment Variable: ENVIRONMENT=STAGING or ENVIRONMENT=STG
  • Retention: 168 hours (7 days)
  • Rationale:
    • Balanced approach for pre-production validation
    • Adequate rollback capability for testing scenarios
    • Moderate storage costs
    • Mimics production patterns without full cost
  • Vacuum Schedule: Weekly (Sunday 3:00 AM)
  • Cost Impact: Moderate storage costs, good balance

Testing (TST)

  • Environment Variable: ENVIRONMENT=TESTING or ENVIRONMENT=TST
  • Retention: 96 hours (4 days)
  • Rationale:
    • Moderate retention for QA and integration testing
    • Sufficient time window for test cycle validation
    • Enables defect reproduction and investigation
    • Lower cost than staging while maintaining test integrity
  • Vacuum Schedule: Weekly (Sunday 4:00 AM)
  • Cost Impact: Low-moderate storage costs, optimized for testing needs

Development (DEV)

  • Environment Variable: ENVIRONMENT=DEVELOPMENT or ENVIRONMENT=DEV
  • Retention: 48 hours (2 days)
  • Rationale:
    • Aggressive cleanup to minimize development costs
    • Short retention acceptable since dev data changes frequently
    • Rapid iteration with minimal historical requirements
    • Cost optimization priority
  • Vacuum Schedule: Daily or after major data loads
  • Cost Impact: Minimal storage costs, optimal for development

Relevance to Business Operations

Data Freshness

Our jobs run on schedules that create complete data refreshes:

  • Product data: Daily/Weekly full refresh
  • Store product data: Daily updates
  • Supporting tables: As-needed basis

With full overwrite pattern, each run creates 100% new files → Vacuum is essential

Storage Cost Management

Without vacuum:

  • Baseline: 100 GB per table
  • After 30 days without vacuum: ~3,000 GB (30x growth)
  • Annual impact: Significant cloud storage costs

With proper vacuum:

  • Sustained size: ~100-200 GB per table (1-2 versions)
  • Cost savings: 90%+ reduction in storage

Regulatory Compliance

  • Data retention policies: Must not retain data beyond approved periods
  • Audit requirements: Must maintain adequate history for investigations
  • Balance: Vacuum strategy enforces both requirements

Code Affected

Delta Tables Currently Written (Primary Targets)

1. Product Tables - solr-product-transform.py

# Lines 323-328
productFeed.write.format("delta").mode("overwrite").saveAsTable("solr_product_df")
productNounFeed.write.format("delta").mode("overwrite").saveAsTable("solr_product_noun_df")
productSynonymFeed.write.format("delta").mode("overwrite").saveAsTable("solr_product_synonym_df")
productUpcFeed.write.format("delta").mode("overwrite").saveAsTable("solr_product_upc_df")
productNutritionFeed.write.format("delta").mode("overwrite").saveAsTable("solr_product_nutritionn_df")
productIngredientsFeed.write.format("delta").mode("overwrite").saveAsTable("solr_product_ingredientss_df")

Impact: 6 tables with full overwrite → High file accumulation
Vacuum Priority: CRITICAL

2. Store Product Table - solr-store-product-transform.py

# Line 168
storeProductFeed.write.format("delta").mode("overwrite").saveAsTable("solr_store_product_df")

Impact: 1 table with full overwrite → High file accumulation
Vacuum Priority: CRITICAL

3. Feed Table - solr-store-product-prepare-dataframe.py

# Line 21
solrFeed.write.format("delta").mode("overwrite").saveAsTable("solr_feed_df")

# Line 30 - Change Data Feed enabled
spark.sql(f"Alter table delta.`{solrFeedUrl}` set tblproperties (delta.enableChangeDataFeed = true)")

Impact: 1 table with full overwrite + CDC enabled
Vacuum Priority: CRITICAL
Note: Change Data Feed may affect vacuum behavior - monitor carefully

Tables with Read Operations Only (No Vacuum Needed)

Delta Tables Read via DeltaTable.forPath()

deals_picked_for_you_data_refresh.py

Impact: Read-only operations don't create orphaned files
Vacuum Priority: NOT NEEDED from these jobs

Summary of Tables Requiring Vacuum

Managed Delta Tables (Metastore)

| Table Name | Location | Write Pattern | Priority |
| --- | --- | --- | --- |
| solr_product_df | solr-product-transform.py | Overwrite | CRITICAL |
| solr_product_noun_df | solr-product-transform.py | Overwrite | CRITICAL |
| solr_product_synonym_df | solr-product-transform.py | Overwrite | CRITICAL |
| solr_product_upc_df | solr-product-transform.py | Overwrite | CRITICAL |
| solr_product_nutritionn_df | solr-product-transform.py | Overwrite | CRITICAL |
| solr_product_ingredientss_df | solr-product-transform.py | Overwrite | CRITICAL |
| solr_store_product_df | solr-store-product-transform.py | Overwrite | CRITICAL |
| solr_feed_df | solr-store-product-prepare-dataframe.py | Overwrite | CRITICAL |

Subtotal: 8 managed tables

External Delta Tables (Azure Storage)

| Table Name | Container | Location | Priority |
| --- | --- | --- | --- |
| DPFY | deals-picked-for-you | /DPFY | HIGH |
| CouponCategories | coupons | /CouponProductId/CouponCategories | HIGH |
| CouponDetails | coupons | /CouponProductId/CouponDetails | HIGH |
| CouponItemcode | coupons | /CouponProductId/CouponItemcode | HIGH |
| StoreCouponID | coupons | /CouponProductId/StoreCouponID | HIGH |
| StoreAddress | stores | /storeDetails/Address | HIGH |
| StoreDepartments | stores | /storeDetails/Departments | HIGH |
| StoreHours | stores | /storeDetails/Hours | HIGH |
| StoreImage | stores | /storeDetails/Image | HIGH |
| StorePhone | stores | /storeDetails/Phone | HIGH |
| StoreSecondaryLocation | stores | /storeDetails/SecondaryLocation | HIGH |
| StoreSecondaryLocationAddress | stores | /storeDetails/SecondaryLocationAddress | HIGH |
| StoreSecondaryLocationHours | stores | /storeDetails/SecondaryLocationHours | HIGH |
| WeeklyAdDetails | savings | /WeeklyADISS/WeeklyAdDetails | HIGH |
| WeeklyAdCategories | savings | /WeeklyADISS/WeeklyAdCategories | HIGH |
| StoreWeeklyAdID | savings | /WeeklyADISS/StoreWeeklyAdID | HIGH |
| StoreProductData | store-product-data | / | HIGH |

Subtotal: 17 external Delta tables

TOTAL: 25 Delta tables requiring regular vacuum maintenance

Note: External Delta tables may be written by external processes (Azure Data Factory, other services) and accumulate files independently of your Databricks jobs.
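
External tables have no metastore name, so they are vacuumed by storage path. A minimal sketch using the helper seen earlier in this document (the container and retention values are illustrative):

from delta.tables import DeltaTable

# Path-based vacuum for an external table in Azure Storage
storage_url = get_s0pmobai_storage_connection("store-product-data")
DeltaTable.forPath(spark, f"{storage_url}/").vacuum(336)  # PRD retention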

Implementation

Vacuum Maintenance Notebook

Location: src/etl/notebooks/main/delta-table-vacuum-maintenance.py

Features

  • ✅ Environment Auto-Detection: Automatically identifies PRD/STG/TST/DEV
  • ✅ Environment-Specific Retention: Applies appropriate policy per environment
  • ✅ Environment-Specific External Tables: Each environment only processes its available containers
  • ✅ Pre/Post Analysis: Tracks storage reclaimed
  • ✅ Optimization Integration: Runs OPTIMIZE before VACUUM
  • ✅ Database-Driven ZORDER: ZORDER columns managed centrally in DeltaVacuumConfiguration table
  • ✅ Detailed Optimization Metrics: Before/after file counts, size changes, reduction percentages
  • ✅ Error Handling: Continues on individual table failures, gracefully skips unavailable containers
  • ✅ Comprehensive Logging: Full audit trail via Application Insights
  • ✅ Summary Reporting: Detailed results with storage metrics and skipped-tables warning

Key Configuration

# Automatic environment detection and retention configuration
workspace_url = spark.conf.get("spark.databricks.workspaceUrl", "").lower()
notebook_path = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get().lower()

if "prd" in workspace_url or "prod" in workspace_url or "prd" in notebook_path:
    environment = "prd"
    retention_hours = 336  # 14 days
elif "stg" in workspace_url or "stg" in notebook_path:
    environment = "stg"
    retention_hours = 168  # 7 days
elif "tst" in workspace_url or "test" in workspace_url or "tst" in notebook_path:
    environment = "tst"
    retention_hours = 96  # 4 days
else:
    environment = "dev"
    retention_hours = 48  # 2 days

# Environment-specific external Delta paths
if environment == "prd":
    external_delta_paths = prd_external_delta_paths  # Full set
elif environment == "stg":
    external_delta_paths = stg_external_delta_paths  # Staging subset
elif environment == "tst":
    external_delta_paths = tst_external_delta_paths  # Testing subset
else:
    external_delta_paths = dev_external_delta_paths  # Development subset

Optimization Logging and Metrics

The vacuum notebook captures detailed before/after metrics during the OPTIMIZE step to track effectiveness and validate ZORDER configurations.

Metrics Tracked Per Table:

  • File Count: Number of data files before and after optimization
  • Table Size: Total size in bytes before and after optimization
  • Files Reduced: Count and percentage of files removed/consolidated
  • Size Change: Net change in table size (typically small due to compression)
  • ZORDER Columns: Columns used for data clustering optimization

Example Log Output:

2024-01-06 14:23:15 | Processing MANAGED table: hive_metastore.s0pmobai_prd.solr_store_product_df
2024-01-06 14:23:15 | Configuration: RetentionHours=336, ZOrderColumns=StoreNumber, ItemCode
2024-01-06 14:23:16 | OPTIMIZE table: hive_metastore.s0pmobai_prd.solr_store_product_df
2024-01-06 14:23:18 | Applying ZORDER BY (StoreNumber, ItemCode)
2024-01-06 14:28:42 | Files: 1623 → 156 (1467 removed, 90.4% reduction)
2024-01-06 14:28:42 | Size: 48.67 GB → 48.52 GB (change: -0.15 GB)

2024-01-06 14:35:28 | Processing MANAGED table: hive_metastore.s0pmobai_prd.solr_product_df
2024-01-06 14:35:28 | Configuration: RetentionHours=336, ZOrderColumns=ItemCode, SearchCategory1
2024-01-06 14:35:29 | OPTIMIZE table: hive_metastore.s0pmobai_prd.solr_product_df
2024-01-06 14:35:30 | Applying ZORDER BY (ItemCode, SearchCategory1)
2024-01-06 14:37:15 | Files: 487 → 42 (445 removed, 91.4% reduction)
2024-01-06 14:37:15 | Size: 12.34 GB → 12.28 GB (change: -0.06 GB)

Summary of OPTIMIZE results:
Total files reduced: 1912 files
Total size before: 61.01 GB
Total size after: 60.80 GB (change: -0.21 GB)

Interpreting the Metrics:

| Metric | Good Results | Investigation Needed |
| --- | --- | --- |
| Files Reduced % | 70-95% reduction on first run; 20-50% on subsequent runs | < 10% reduction may indicate: table already optimized, or few small files written since last run |
| Size Change | -0.5% to +2% change (compression effects) | > 5% increase may indicate: unexpected data duplication, or ZORDER causing suboptimal clustering |
| File Count After | 50-200 files for tables < 100 GB; 200-500 files for tables > 100 GB | > 1000 files after OPTIMIZE indicates: OPTIMIZE not working correctly, or continuous writes during optimization |

Summary Totals:
At the end of the optimization phase, the notebook logs aggregate metrics across all tables:

  • Total files reduced across all optimized tables
  • Total size before/after optimization across all tables
  • Net storage change (typically small negative change)

These totals help validate the overall effectiveness of the optimization strategy and identify trends over time.

ZORDER Configuration Management:
ZORDER columns are managed centrally in the DeltaVacuumConfiguration table. To modify ZORDER columns:

-- Update ZORDER columns for a specific table
UPDATE dbo.DeltaVacuumConfiguration
SET ZOrderColumns = 'ItemCode, StoreNumber, TransactionDate',
    ModifiedDate = GETDATE()
WHERE TableName = 'solr_transaction_history';

-- Disable ZORDER for a table
UPDATE dbo.DeltaVacuumConfiguration
SET ZOrderColumns = NULL,
    ModifiedDate = GETDATE()
WHERE TableName = 'solr_temporary_cache';

Changes take effect on the next scheduled vacuum run without requiring code deployment.
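
A hedged sketch of how the notebook can consume this configuration; the JDBC URL and secret names below are placeholders, not the actual connection details:

# Load ZORDER configuration and build per-table OPTIMIZE statements
config_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://<host>;database=<db>")  # placeholder
    .option("dbtable", "dbo.DeltaVacuumConfiguration")
    .option("user", dbutils.secrets.get("scope", "sql-user"))  # placeholder
    .option("password", dbutils.secrets.get("scope", "sql-password"))  # placeholder
    .load()
)

for row in config_df.collect():
    if row.ZOrderColumns:
        spark.sql(f"OPTIMIZE {row.TableName} ZORDER BY ({row.ZOrderColumns})")
    else:
        spark.sql(f"OPTIMIZE {row.TableName}")  # plain file compaction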

ZORDER Limitations:

Some external tables cannot use ZORDER and will use regular OPTIMIZE instead:

  1. Tables with Liquid Clustering: Delta Lake's newer Liquid Clustering feature is incompatible with ZORDER

    • All tables in the store container use Liquid Clustering
    • Example tables: StoreAddress, StoreDepartments, StoreHours, StoreImage, StorePhone
    • Error: DELTA_CLUSTERING_WITH_ZORDER_BY
    • These tables are automatically optimized using Liquid Clustering (no ZORDER needed)
  2. Tables with Schema Mismatches: ZORDER columns must exist in the table schema

    • Columns specified in configuration don't exist in actual data
    • Example errors: DELTA_ZORDERING_COLUMN_DOES_NOT_EXIST
    • Tables affected: Coupon tables, WeeklyAd tables, DPFY, StoreProductData
    • Set ZOrderColumns = NULL for these tables to use regular OPTIMIZE

The vacuum notebook automatically handles these errors gracefully - tables skip ZORDER but still benefit from regular OPTIMIZE (file compaction without clustering).
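
A minimal sketch of that graceful fallback, keyed off the two error classes named above:

def optimize_table(table_name, zorder_columns=None):
    # Try ZORDER first; fall back to plain OPTIMIZE when the table uses
    # Liquid Clustering or the configured columns are missing
    try:
        if zorder_columns:
            spark.sql(f"OPTIMIZE {table_name} ZORDER BY ({zorder_columns})")
        else:
            spark.sql(f"OPTIMIZE {table_name}")
    except Exception as e:
        msg = str(e)
        if ("DELTA_CLUSTERING_WITH_ZORDER_BY" in msg
                or "DELTA_ZORDERING_COLUMN_DOES_NOT_EXIST" in msg):
            print(f"{table_name}: ZORDER not applicable, compacting only")
            spark.sql(f"OPTIMIZE {table_name}")  # compaction without clustering
        else:
            raise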

Workflow

Deployment Steps

1. Upload Notebook to Databricks

# Upload to workspace
/Workspace/S0PMOBAI/main/delta-table-vacuum-maintenance

2. Create Databricks Job

Job Configuration (Production):

{
  "name": "s0pmobai-delta-vacuum-maintenance-prd",
  "schedule": {
    "quartz_cron_expression": "0 0 2 ? * SUN",
    "timezone_id": "America/New_York",
    "pause_status": "UNPAUSED"
  },
  "max_concurrent_runs": 1,
  "tasks": [
    {
      "task_key": "vacuum_maintenance",
      "notebook_task": {
        "notebook_path": "/Workspace/S0PMOBAI/main/delta-table-vacuum-maintenance",
        "source": "WORKSPACE"
      },
      "job_cluster_key": "maintenance_cluster",
      "timeout_seconds": 3600,
      "email_notifications": {
        "on_failure": ["data-engineering@publix.com"]
      }
    }
  ],
  "job_clusters": [
    {
      "job_cluster_key": "maintenance_cluster",
      "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2,
        "spark_env_vars": {
          "ENVIRONMENT": "PRODUCTION"
        }
      }
    }
  ]
}

Schedule by Environment:

  • Production: Weekly - Sunday 2:00 AM
  • Staging: Weekly - Sunday 3:00 AM
  • Testing: Weekly - Sunday 4:00 AM
  • Development: Daily - Midnight OR on-demand

3. Initial Execution

  • Run manually first time in each environment
  • Verify environment detection is correct
  • Review storage reclaimed
  • Confirm no impact on concurrent jobs

4. Monitoring Setup

  • Track execution time trends
  • Monitor storage metrics
  • Set up alerts for failures
  • Review Application Insights logs

Final Recommendations

Immediate Actions (Week 1)

  1. ✅ Deploy Vacuum Notebook

    • Upload to all environments (DEV/TST/STG/PRD)
    • Set ENVIRONMENT variable in Databricks job configuration
    • Verify environment detection logic (check logs for correct mapping)
    • Test in DEV first
  2. ✅ Configure Environment-Specific External Tables

    • Review which containers exist in each environment
    • Update prd_external_delta_paths, stg_external_delta_paths, etc.
    • Remove containers that don't exist to avoid errors
    • Test configuration in each environment
  3. ✅ Run Initial Vacuum

    • Execute manually in DEV to baseline
    • Review output and storage reclaimed
    • Check which tables were skipped (if any)
    • Document results
  4. ✅ Create Scheduled Jobs

    • Configure weekly schedule for PRD/STG/TST
    • Configure daily schedule for DEV
    • Set up failure notifications
  5. ✅ Document Process

    • Add runbook for vacuum operations
    • Train team on maintenance procedures
    • Establish escalation paths

Short-Term Actions (Month 1)

  1. 📊 Monitor Performance

    • Track storage costs weekly
    • Measure vacuum execution time
    • Review time-travel usage patterns
    • Adjust retention if needed
  2. 🔍 Validate Time-Travel Requirements

    • Confirm 14-day retention for PRD is adequate
    • Assess whether 7-day STG retention is sufficient
    • Verify 4-day TST retention meets QA test cycles
    • Consider compliance requirements
  3. 📈 Establish Metrics

    • Storage saved per vacuum run
    • Cost reduction achieved
    • Performance impact (if any)
    • Vacuum duration trends

Long-Term Strategy (Ongoing)

  1. 🔄 Regular Review Cycles

    • Monthly: Review storage trends and vacuum effectiveness
    • Quarterly: Assess retention policy appropriateness
    • Annually: Comprehensive cost/benefit analysis
  2. 🎯 Optimization Opportunities

    • Review ZORDER column choices for frequently queried columns
    • Evaluate Auto-Optimize settings
    • Assess whether any overwrite patterns can become merges
    • Review table partitioning strategies
  3. 📋 Policy Refinement

    • Adjust retention based on actual time-travel usage
    • Fine-tune schedules based on job patterns
    • Consider table-specific retention policies if needed

Best Practices to Maintain

✅ DO

  • Always OPTIMIZE before VACUUM - Maximizes benefit
  • Run during low-traffic periods - Minimizes impact
  • Monitor execution metrics - Catch issues early
  • Test in DEV first - Validate changes safely
  • Document retention decisions - Maintain audit trail
  • Review storage costs monthly - Track effectiveness

❌ DON'T

  • Don't vacuum during peak hours - May impact queries
  • Don't reduce retention below 48 hours without testing - Risk data loss
  • Don't skip OPTIMIZE - Misses performance benefits
  • Don't ignore failures - May indicate underlying issues
  • Don't disable safety checks without understanding risks - Protection exists for good reason
  • Don't change retention without stakeholder approval - May impact compliance

Success Criteria

| Metric | Target | Status |
| --- | --- | --- |
| Storage growth rate | < 10% monthly | 🎯 To measure |
| Vacuum success rate | > 95% | 🎯 To measure |
| Time-travel availability | 100% within retention | 🎯 To measure |
| Cost reduction | > 80% vs. no-vacuum | 🎯 To measure |
| Zero data loss incidents | 100% | 🎯 To maintain |

Risk Mitigation

| Risk | Mitigation | Owner |
| --- | --- | --- |
| Accidental data loss | 14-day retention in PRD | Data Engineering |
| Vacuum during active query | Schedule during off-hours | DevOps |
| Environment mis-detection | Manual verification + logging | Data Engineering |
| Compliance violation | Documented retention policies | Compliance Team |
| Cost overrun | Monthly storage monitoring | Finance/Engineering |

Emergency Procedures

If aggressive cleanup needed (storage crisis):

# Temporary measure only - use with caution
from delta.tables import DeltaTable

spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
retention_hours = 24  # Override to 24 hours

# Execute the vacuum with the reduced retention
DeltaTable.forName(spark, "table_name").vacuum(retention_hours)

# Restore the safety check (and normal retention) afterward
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "true")

If time-travel needed beyond retention:

  • Check if data exists in backups
  • Review transaction logs
  • Consider restoring from snapshot

If vacuum job fails repeatedly:

  1. Check for long-running concurrent queries
  2. Verify cluster resources are adequate
  3. Review individual table errors in logs
  4. Consider vacuuming tables individually

Per-Table Retention Periods

Custom Retention Configuration

The vacuum notebook supports table-specific retention periods for tables with special requirements:

Example Use Cases:

Compliance Tables (Longer Retention):

managed_tables_to_vacuum = [
    {"name": "audit_log_table", "retention_hours": 720},  # 30 days for compliance
    {"name": "solr_product_df", "retention_hours": None},  # Use environment default
]

High-Churn Tables (Shorter Retention):

prd_external_delta_paths = [
    {"name": "TempData", "container": "temp", "path": "/temp", "retention_hours": 24},  # 1 day only
    {"name": "CouponDetails", "container": "coupons", "path": "/...", "retention_hours": 720},  # 30 days
]

Retention Hierarchy:

  1. Table-specific retention (if specified in config)
  2. Environment default retention (if not specified)
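
A minimal sketch of how this two-level hierarchy resolves, matching the configuration shapes shown below:

def resolve_retention(table_config, environment_default):
    # Table-specific retention wins when present; otherwise fall back
    hours = table_config.get("retention_hours")
    return hours if hours is not None else environment_default

resolve_retention({"name": "audit_log_table", "retention_hours": 720}, 336)  # -> 720
resolve_retention({"name": "solr_product_df", "retention_hours": None}, 336)  # -> 336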

Configuration Format:

# Managed tables - use dict format to specify custom retention
managed_tables_to_vacuum = [
{"name": "table_name", "retention_hours": 720}, # Custom retention
{"name": "table_name", "retention_hours": None}, # Use environment default
]

# External tables - add retention_hours to existing config
external_delta_paths = [
{"name": "...", "container": "...", "path": "...", "retention_hours": 720}, # Custom
{"name": "...", "container": "...", "path": "..."} # Uses environment default
]

Common Retention Patterns:

| Use Case | Retention | Example Tables |
| --- | --- | --- |
| Compliance/Audit | 30-90 days | Financial, coupon, transaction data |
| Standard Operations | 7-14 days | Product catalogs, store info |
| High Churn/Temp | 1-2 days | Session data, temporary staging |
| Development Only | < 24 hours | Test tables in DEV |

Benefits:

  • ✅ Flexibility - Each table can have appropriate retention
  • ✅ Compliance - Meet regulatory requirements per table
  • ✅ Cost Optimization - Aggressive cleanup for non-critical tables
  • ✅ Safety - Extended retention for critical business data

Conclusion

This vacuum strategy provides a comprehensive, automated, and environment-aware approach to managing Delta Lake storage costs while maintaining data integrity and time-travel capabilities.

Key Outcomes:

  • 80-90% storage cost reduction through regular maintenance
  • Comprehensive coverage of 25 Delta tables (8 managed + 17 external)
  • Per-table retention control for compliance and cost optimization
  • Zero-touch operation via automated scheduling
  • Environment-appropriate retention balancing cost and safety
  • Full audit trail via Application Insights integration
  • Scalable framework adaptable to future table additions
  • Path-based vacuum for external Azure Storage tables
  • Graceful error handling with detailed reporting

Implementation is ready - proceed with deployment to DEV, followed by STG and PRD validation.

Document Owner: Data Engineering Team
Review Cycle: Monthly (first 3 months), then Quarterly
Next Review: January 16, 2026