Enriching Data with Databricks

Learn how to use Databricks notebooks to transform and enrich your Lakehouse data.

Overview

Data enrichment allows you to transform raw data into valuable insights using Databricks notebooks. In this guide, you'll learn how to:

  • Use the enrichment notebook template
  • Pass parameters to notebooks
  • Return metrics from notebook execution
  • Handle errors and logging

Prerequisites

Step-by-Step Guide

1. Understanding the Enrichment Template

Insight Factory provides a notebook template that standardises enrichment patterns. The template includes:

  • Parameter handling
  • Connection to the Lakehouse
  • Metric reporting
  • Error handling
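
The template's source is not reproduced in this guide, so the following is only a rough sketch of how those four responsibilities might fit together. The names `run_enrichment`, `get_param`, and `enrich` are hypothetical and are not part of Insight Factory or Databricks. The Lakehouse work is passed in as a callable so the outline stays independent of Spark.

```python
import json
import logging

logger = logging.getLogger("enrichment")

# Parameter names taken from the examples later in this guide.
PARAM_NAMES = ["source_schema", "source_table", "target_schema", "target_table"]


def run_enrichment(get_param, enrich):
    """Hypothetical skeleton: read parameters, enrich, return metrics as JSON.

    get_param: callable returning a parameter value by name
               (inside a notebook this could be dbutils.widgets.get).
    enrich:    callable that takes the parameter dict, does the Lakehouse
               read/transform/write, and returns the number of rows processed.
    """
    try:
        # Parameter handling
        params = {name: get_param(name) for name in PARAM_NAMES}
        # Connection to the Lakehouse and transformation happen inside enrich()
        rows = enrich(params)
        metrics = {"status": "success", "rows_processed": rows}
    except Exception as exc:
        # Error handling: log the traceback and report the failure as a metric
        logger.exception("Enrichment failed")
        metrics = {"status": "failed", "error": str(exc)}
    # Metric reporting: a JSON string that could be handed to dbutils.notebook.exit
    return json.dumps(metrics)
```

This sketch reports a failure in the returned payload instead of raising. Whether Insight Factory treats such a payload as a failed task is not stated here; if the task itself must fail, re-raise the exception after logging.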

2. Create an enrichment task

  1. Open your production line and navigate to the Graph view
  2. Add a new task using one of these methods:
    • Click the + button in the graph side menu
    • Right-click on an existing node and select Add Task from the context menu
  3. Enter a unique Code and Name for your task
  4. Select "Run Databricks Notebook" from the Activity dropdown
  5. Configure the task properties:
    • Select your Databricks connection
    • Choose the notebook path
    • Configure parameters

3. Passing parameters

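Parameters make a notebook reusable across tasks. Values configured on the task are read inside the notebook with dbutils.widgets.get. Common parameters include the source and target schema and table: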

# Access parameters in your notebook
source_schema = dbutils.widgets.get("source_schema")
source_table = dbutils.widgets.get("source_table")
target_schema = dbutils.widgets.get("target_schema")
target_table = dbutils.widgets.get("target_table")
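
A missing or empty parameter is easier to diagnose when it is reported up front instead of surfacing later as a failed table lookup. The helper below is a minimal sketch of that idea. `read_params` and the `write_mode` parameter are hypothetical names used for illustration only, and the getter is passed in so the function can be exercised outside a notebook. In a notebook, `dbutils.widgets.get` would typically be supplied as `get_param`; it raises an error when the widget was never defined, which the sketch treats as "missing".

```python
def read_params(get_param, required, optional=None):
    """Collect notebook parameters into a dict, failing fast on missing ones.

    get_param: callable returning a value by name, e.g. dbutils.widgets.get
    required:  names that must be present and non-empty
    optional:  mapping of name -> default used when the value is missing or empty
    """
    def fetch(name):
        # Treat lookup errors, None, and blank strings uniformly as "not provided".
        try:
            value = get_param(name)
        except Exception:
            return None
        text = "" if value is None else str(value).strip()
        return text or None

    params = {name: fetch(name) for name in required}
    missing = [name for name, value in params.items() if value is None]
    if missing:
        raise ValueError("Missing required parameters: " + ", ".join(missing))

    for name, default in (optional or {}).items():
        value = fetch(name)
        params[name] = default if value is None else value
    return params
```

Inside a notebook this might be called as `read_params(dbutils.widgets.get, ["source_schema", "source_table", "target_schema", "target_table"])`.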

4. Writing transformation logic

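The notebook body can contain any Spark transformation logic. A minimal example reads the source table, adds a column, and writes the result as a Delta table: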

from pyspark.sql.functions import current_date

# Read source data
df = spark.table(f"{source_schema}.{source_table}")

# Apply transformations: stamp each row with the processing date
df_enriched = df.withColumn("processed_date", current_date())

# Write to target, replacing any existing table contents
df_enriched.write.format("delta").mode("overwrite").saveAsTable(f"{target_schema}.{target_table}")
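
Because the table names above are built by interpolating parameter values into a string, a malformed parameter can point the read or write at an unintended object. One possible safeguard is to validate each part before use. The sketch below assumes plain unquoted identifiers (letters, digits, underscores), which is stricter than what Spark itself accepts; `qualified_name` is a hypothetical helper, not part of the template.

```python
import re

# Assumption: schema and table names are simple unquoted identifiers.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def qualified_name(schema, table):
    """Return 'schema.table' after checking both parts look like identifiers."""
    for part in (schema, table):
        if not _IDENTIFIER.fullmatch(part):
            raise ValueError(f"Invalid identifier: {part!r}")
    return f"{schema}.{table}"
```

The result can then be passed to `spark.table(...)` or `saveAsTable(...)` in place of the inline f-string.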

5. Returning metrics

Report metrics back to Insight Factory for monitoring:

# Report metrics as a JSON string (dbutils.notebook.exit accepts a single string value)
import json

dbutils.notebook.exit(json.dumps({
    "rows_processed": df_enriched.count(),
    "status": "success"
}))
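
`dbutils.notebook.exit` takes a single string, so structured metrics are usually serialised to JSON before being returned. Metrics derived from Spark aggregates often contain dates or decimals, which `json.dumps` cannot encode by default. The helper below is one way to handle that; `to_exit_value` and the `max_order_date` metric are hypothetical, and how Insight Factory parses the returned string is not covered in this guide.

```python
import json
from datetime import date, datetime
from decimal import Decimal


def to_exit_value(metrics):
    """Serialise a metrics dict to a JSON string suitable for a notebook exit value."""
    def encode(value):
        # Called by json.dumps only for values it cannot encode natively.
        if isinstance(value, (date, datetime)):
            return value.isoformat()
        if isinstance(value, Decimal):
            return float(value)
        raise TypeError(f"Cannot serialise {type(value).__name__}")

    return json.dumps(metrics, default=encode)


payload = to_exit_value({
    "rows_processed": 10,
    "max_order_date": date(2024, 1, 31),
    "status": "success",
})
```

In a notebook, `payload` would be passed to `dbutils.notebook.exit(payload)`, and a caller could recover the dict with `json.loads`.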

6. Run and monitor

  1. Save your task configuration
  2. Run the task
  3. Monitor execution in the task details
  4. Review returned metrics

Key Concepts

Term | Definition
Enrichment | The process of transforming and adding value to raw data
Notebook | A Databricks document containing code and documentation
Parameters | Values passed to notebooks at runtime
Metrics | Measurements returned from notebook execution