Enriching Data with Databricks
Learn how to use Databricks notebooks to transform and enrich your Lakehouse data.
Overview
Data enrichment allows you to transform raw data into valuable insights using Databricks notebooks. In this guide, you'll learn how to:
- Use the enrichment notebook template
- Pass parameters to notebooks
- Return metrics from notebook execution
- Handle errors and logging
Prerequisites
- Data already ingested into the Lakehouse (see Ingesting Data from a Database)
- Access to a Databricks workspace
- Basic understanding of Python or SQL
Step-by-Step Guide
1. Understanding the enrichment template
Insight Factory provides a notebook template that standardises enrichment patterns. The template includes:
- Parameter handling
- Connection to the Lakehouse
- Metric reporting
- Error handling
2. Create an enrichment task
- Open your production line and navigate to the Graph view
- Add a new task using one of these methods:
  - Click the + button in the graph side menu
  - Right-click on an existing node and select Add Task from the context menu
- Enter a unique Code and Name for your task
- Select "Run Databricks Notebook" from the Activity dropdown
- Configure the task properties:
  - Select your Databricks connection
  - Choose the notebook path
  - Configure parameters
3. Passing parameters
Parameters make a notebook reusable across different sources and targets. Inside the notebook, read each one with dbutils.widgets.get. Common parameters include:
# Access parameters in your notebook
source_schema = dbutils.widgets.get("source_schema")
source_table = dbutils.widgets.get("source_table")
target_schema = dbutils.widgets.get("target_schema")
target_table = dbutils.widgets.get("target_table")
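A blank or missing parameter tends to surface only later, as a confusing table-not-found error, so it can help to read and validate all parameters in one place. The following is a minimal sketch and not part of the Insight Factory template: `read_params` and `REQUIRED_PARAMS` are hypothetical names, and the getter is passed in as an argument so the function can be exercised outside a notebook.

```python
from typing import Callable, Dict, Iterable

# Hypothetical list of parameter names; it mirrors the widgets read above.
REQUIRED_PARAMS = ("source_schema", "source_table", "target_schema", "target_table")


def read_params(
    get: Callable[[str], str],
    names: Iterable[str] = REQUIRED_PARAMS,
) -> Dict[str, str]:
    """Read each named parameter via `get`; fail fast if any is missing or blank."""
    values: Dict[str, str] = {}
    missing = []
    for name in names:
        try:
            value = get(name)
        except Exception:
            # Assumption: the getter raises when the parameter was never defined.
            value = ""
        value = (value or "").strip()
        if value:
            values[name] = value
        else:
            missing.append(name)
    if missing:
        raise ValueError(f"Missing required parameters: {', '.join(missing)}")
    return values
```

In a notebook this would be called as `params = read_params(dbutils.widgets.get)`, after which `params["source_table"]` and the other values are guaranteed to be non-empty strings.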
4. Writing transformation logic
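Your notebook can include any transformation logic. The example below reads the source table, adds a processed_date column, and overwrites the target Delta table: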
from pyspark.sql.functions import current_date

# Read source data
df = spark.table(f"{source_schema}.{source_table}")
# Apply transformations: stamp each row with the processing date
df_enriched = df.withColumn("processed_date", current_date())
# Write to target, replacing any existing contents of the table
df_enriched.write.format("delta").mode("overwrite").saveAsTable(f"{target_schema}.{target_table}")
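The f-strings above interpolate parameter values straight into a table name. As an optional guard, a small helper can reject unexpected values before they reach `spark.table` or `saveAsTable`. This is a sketch only: `qualified_table_name` is a hypothetical helper, and the rule that names contain only letters, digits, and underscores is an assumption you may need to relax for your own naming conventions.

```python
import re

# Assumption: schema and table names use letters, digits and underscores only.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def qualified_table_name(schema: str, table: str) -> str:
    """Validate both parts and return them joined as 'schema.table'."""
    for label, value in (("schema", schema), ("table", table)):
        # fullmatch ensures the whole string is a valid identifier.
        if not _IDENTIFIER.fullmatch(value):
            raise ValueError(f"Invalid {label} name: {value!r}")
    return f"{schema}.{table}"
```

With this in place, the read becomes `spark.table(qualified_table_name(source_schema, source_table))`, and a value such as `orders; DROP TABLE x` fails immediately with a clear message.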
5. Returning metrics
Report metrics back to Insight Factory for monitoring:
# Report metrics as a JSON string (dbutils.notebook.exit takes a single string value)
import json
dbutils.notebook.exit(json.dumps({
    "rows_processed": df_enriched.count(),
    "status": "success"
}))
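Error handling and logging can be combined with metric reporting. One possible pattern, sketched here under stated assumptions, wraps the transformation so that a failure is logged and re-raised, while a success produces a JSON string suitable for `dbutils.notebook.exit`. `run_with_metrics`, the `enrichment` logger name, and the `duration_seconds` metric are hypothetical; this guide does not specify which metric keys Insight Factory expects beyond the ones shown above.

```python
import json
import logging
import time
from typing import Callable

logger = logging.getLogger("enrichment")  # hypothetical logger name


def run_with_metrics(transform: Callable[[], int]) -> str:
    """Run `transform`, which must return the number of rows it processed.

    Returns a JSON string of metrics on success; logs and re-raises on failure.
    """
    started = time.monotonic()
    try:
        rows = transform()
    except Exception:
        # Log the full traceback, then re-raise so the failure is not
        # masked by a "success" payload.
        logger.exception("Enrichment failed")
        raise
    return json.dumps({
        "rows_processed": rows,
        "status": "success",
        "duration_seconds": round(time.monotonic() - started, 3),
    })
```

In a notebook, the read, transform, and write steps would go into a function that returns the row count, and the final cell would call `dbutils.notebook.exit(run_with_metrics(that_function))`. Re-raising rather than returning a failure payload is a design choice: it lets the notebook run itself fail, so the task status reflects the error.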
6. Run and monitor
- Save your task configuration
- Run the task
- Monitor execution in the task details
- Review returned metrics
Key Concepts
| Term | Definition |
|---|---|
| Enrichment | The process of transforming and adding value to raw data |
| Notebook | A Databricks document containing code and documentation |
| Parameters | Values passed to notebooks at runtime |
| Metrics | Measurements returned from notebook execution |