Incremental Loading Patterns
Learn how to implement incremental loading to efficiently process only new or changed data.
Overview
Incremental loading processes only new or modified data since the last run, significantly improving performance and reducing resource usage. In this guide, you'll learn about:
- Extract control queries and high-water marks
- Different delta update types
- Choosing the right incremental strategy
- Handling edge cases
Prerequisites
- Understanding of data ingestion (see Ingesting Data from a Database)
- Knowledge of your source data's change tracking mechanism
- Familiarity with task configuration
Understanding Incremental Loading
Why use incremental loading?
| Full Load | Incremental Load |
|---|---|
| Extracts all data every time | Extracts only new/changed data |
| Simple but slow for large tables | Faster and more efficient |
| High resource usage | Lower resource usage |
| Handles deletes naturally | Can miss deletes without change tracking |
Prerequisites for incremental loading
Your source data must have a way to identify changes:
- Timestamp column: LastModifiedDate, UpdatedAt
- Sequence/ID column: Auto-incrementing ID
- Change tracking: Native CDC or change tracking
Step-by-Step Guide
1. Configure extract control
Extract control queries allow Insight Factory to track what data has been processed:
- Open your ingestion task
- Navigate to Extract Control settings
- Configure:
  - Control column (e.g., LastModifiedDate)
  - Control variable name
  - Initial value for first run
2. Understanding high-water marks
The high-water mark is the maximum value of the control column processed in the last run:
```sql
-- Example: Extract control query
SELECT * FROM source_table
WHERE LastModifiedDate > @LastLoadDate
```
After each run, Insight Factory stores the new high-water mark for the next execution.
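The read-extract-advance cycle can be sketched with a small, self-contained example. This is not Insight Factory's internal implementation; the table and column names (`extract_control`, `high_water_mark`) are hypothetical, chosen only to illustrate the pattern:

```python
import sqlite3

# In-memory stand-ins for a source table and a control table that
# persists the high-water mark between runs (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_table (id INTEGER, LastModifiedDate TEXT);
    CREATE TABLE extract_control (control_name TEXT PRIMARY KEY, high_water_mark TEXT);
    INSERT INTO source_table VALUES
        (1, '2024-01-01'), (2, '2024-02-01'), (3, '2024-03-01');
    INSERT INTO extract_control VALUES ('source_table', '2024-01-15');
""")

def incremental_extract(conn):
    # Read the high-water mark stored by the previous run.
    (hwm,) = conn.execute(
        "SELECT high_water_mark FROM extract_control WHERE control_name = 'source_table'"
    ).fetchone()
    # Extract only rows modified after the high-water mark.
    rows = conn.execute(
        "SELECT id, LastModifiedDate FROM source_table WHERE LastModifiedDate > ?",
        (hwm,),
    ).fetchall()
    if rows:
        # Advance the high-water mark for the next run.
        new_hwm = max(r[1] for r in rows)
        conn.execute(
            "UPDATE extract_control SET high_water_mark = ? WHERE control_name = 'source_table'",
            (new_hwm,),
        )
    return rows

first = incremental_extract(conn)   # picks up rows 2 and 3
second = incremental_extract(conn)  # nothing new since the mark advanced
```

The second call returns nothing because the first call moved the mark past every existing row, which is exactly the behaviour that makes repeated incremental runs cheap.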
3. Delta update types
Configure how data is written to the destination:
| Update Type | Behaviour | Use Case |
|---|---|---|
| Append | Add new rows only | Log/event data |
| Merge | Insert new, update existing | Dimension tables |
| Replace | Replace matching partitions | Partitioned fact tables |
| Dimension (SCD2) | Track historical changes | Slowly changing dimensions |
4. Configure merge settings
For merge operations, configure:
- Business key columns: Columns that identify unique records
- Compare columns: Columns to check for changes (optional)
- Update behaviour: What happens when records match
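A merge keyed on a business key amounts to "insert if new, update if matched". A minimal sketch using SQLite's upsert syntax (the `dim_customer` table and its columns are assumptions for illustration, not Insight Factory configuration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
conn.execute("INSERT INTO dim_customer VALUES (1, 'Ada', 'London')")

# Incoming delta: one changed record (id 1) and one new record (id 2).
incoming = [(1, "Ada", "Paris"), (2, "Grace", "New York")]

for row in incoming:
    # Business key: customer_id. Insert new rows, update existing ones.
    conn.execute(
        """
        INSERT INTO dim_customer (customer_id, name, city) VALUES (?, ?, ?)
        ON CONFLICT (customer_id) DO UPDATE
            SET name = excluded.name, city = excluded.city
        """,
        row,
    )

result = conn.execute("SELECT * FROM dim_customer ORDER BY customer_id").fetchall()
```

After the merge, `result` holds the updated row for customer 1 and the newly inserted row for customer 2.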
5. Handle initial load
The first run needs special handling:
- Set an appropriate initial high-water mark value
- Consider running initial load during off-peak hours
- Monitor the first execution for performance
Advanced patterns
Handling deletes
Soft deletes (recommended):
- Source has an IsDeleted flag
- Include deleted records in extract
- Mark as deleted in destination
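The soft-delete flow can be sketched as follows. Including flagged rows in the extract lets the destination mark them without physically removing anything (table and flag names here are assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (id INTEGER PRIMARY KEY, val TEXT, IsDeleted INTEGER);
    CREATE TABLE dst (id INTEGER PRIMARY KEY, val TEXT, is_deleted INTEGER DEFAULT 0);
    INSERT INTO src VALUES (1, 'a', 0), (2, 'b', 1);
""")

# Extract includes deleted records so the flag propagates downstream.
for id_, val, is_deleted in conn.execute("SELECT id, val, IsDeleted FROM src"):
    conn.execute(
        """
        INSERT INTO dst (id, val, is_deleted) VALUES (?, ?, ?)
        ON CONFLICT (id) DO UPDATE
            SET val = excluded.val, is_deleted = excluded.is_deleted
        """,
        (id_, val, is_deleted),
    )

# Downstream queries filter on the flag instead of relying on row absence.
active = conn.execute("SELECT id FROM dst WHERE is_deleted = 0").fetchall()
```

Keeping the flagged row in the destination preserves history and avoids the full-table comparison that hard deletes would require.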
Hard deletes (complex):
- Requires full comparison periodically
- Or use CDC/change tracking at source
Partitioned incremental loading
For very large tables:
- Partition by date or category
- Only process recently modified partitions
- Combine with high-water mark for efficiency
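Partition pruning reduces the work to just the partitions touched since the last run. A sketch under the assumption that the source exposes per-partition last-modified metadata (the partition keys below are hypothetical):

```python
from datetime import date

# Hypothetical partition metadata: partition key -> last modification date.
partitions = {
    "2024-01": date(2024, 1, 31),
    "2024-02": date(2024, 2, 29),
    "2024-03": date(2024, 3, 15),
}
high_water_mark = date(2024, 2, 15)  # stored by the previous run

# Only partitions modified after the high-water mark need reprocessing.
to_process = [key for key, modified in partitions.items() if modified > high_water_mark]
```

Here only the February and March partitions are reprocessed; January is skipped entirely, which is where the efficiency gain comes from on very large tables.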
Multiple control columns
Some scenarios need multiple controls:
```sql
-- Example: Multiple control columns
WHERE ModifiedDate > @LastModifiedDate
   OR (ModifiedDate = @LastModifiedDate AND ID > @LastID)
```
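The two-column predicate is a lexicographic comparison on the pair (ModifiedDate, ID): rows are "new" if their date is later, or the date ties and the ID is higher. In Python this is exactly tuple comparison, which makes the logic easy to verify:

```python
# Rows as (ModifiedDate, ID) pairs; values are illustrative.
rows = [
    ("2024-03-01", 5),
    ("2024-03-01", 9),
    ("2024-03-02", 2),
]
# Composite high-water mark from the previous run.
last = ("2024-03-01", 5)

# (date > last_date) OR (date = last_date AND id > last_id)
# is precisely Python's lexicographic tuple comparison.
new_rows = [r for r in rows if r > last]
```

This composite mark prevents the "duplicate records" failure mode in the table below: a timestamp tie alone can no longer re-select an already-processed row.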
Troubleshooting
| Issue | Possible Cause | Solution |
|---|---|---|
| Missing records | Timestamp precision | Use higher precision timestamp |
| Duplicate records | Non-unique high-water mark | Add secondary sort column |
| Slow performance | Too much history | Adjust initial high-water mark |
| Data gaps | Source updates after extract | Add buffer time to query |
Key Concepts
| Term | Definition |
|---|---|
| High-Water Mark | The maximum value of the control column from the last run |
| Extract Control | Configuration for tracking incremental progress |
| Delta Update | Processing only changed data instead of all data |
| Business Key | Column(s) that uniquely identify a record for merge operations |