Incremental Loading Patterns

Learn how to implement incremental loading to efficiently process only new or changed data.

Overview

Incremental loading processes only new or modified data since the last run, significantly improving performance and reducing resource usage. In this guide, you'll learn about:

  • Extract control queries and high-water marks
  • Different delta update types
  • Choosing the right incremental strategy
  • Handling edge cases

Prerequisites

  • Understanding of data ingestion (see Ingesting Data from a Database)
  • Knowledge of your source data's change tracking mechanism
  • Familiarity with task configuration

Understanding Incremental Loading

Why use incremental loading?

| Full Load | Incremental Load |
| --- | --- |
| Extracts all data every time | Extracts only new/changed data |
| Simple but slow for large tables | Faster and more efficient |
| High resource usage | Lower resource usage |
| Captures deletes automatically | Requires change tracking to detect deletes |

Prerequisites for incremental loading

Your source data must have a way to identify changes:

  • Timestamp column: LastModifiedDate, UpdatedAt
  • Sequence/ID column: Auto-incrementing ID
  • Change tracking: Native CDC or change tracking

Step-by-Step Guide

1. Configure extract control

Extract control queries allow Insight Factory to track what data has been processed:

  1. Open your ingestion task
  2. Navigate to Extract Control settings
  3. Configure:
    • Control column (e.g., LastModifiedDate)
    • Control variable name
    • Initial value for first run
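The three settings above can be pictured as a small configuration record. A minimal sketch (the field names here are illustrative, not Insight Factory's actual setting names):

```python
# Hypothetical extract-control configuration; check your task's
# Extract Control settings for the real field names.
extract_control = {
    "control_column": "LastModifiedDate",   # column used to detect changes
    "control_variable": "LastLoadDate",     # variable referenced in the query
    "initial_value": "1900-01-01T00:00:00", # watermark used on the first run
}
```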

2. Understanding high-water marks

The high-water mark is the maximum value of the control column processed in the last run:

```sql
-- Example: Extract control query
SELECT * FROM source_table
WHERE LastModifiedDate > @LastLoadDate
```

After each run, Insight Factory stores the new high-water mark for the next execution.
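The read-extract-advance cycle can be sketched in a few lines of Python (function names and the in-memory rows are hypothetical; Insight Factory manages this state internally):

```python
from datetime import datetime

def build_extract_query(control_column, last_value):
    # Build the extract control query from the stored high-water mark.
    return f"SELECT * FROM source_table WHERE {control_column} > '{last_value}'"

def next_high_water_mark(rows, control_column, current_mark):
    # After a successful run, the new mark is the largest control value seen;
    # if no rows arrived, the previous mark is kept.
    return max((row[control_column] for row in rows), default=current_mark)

rows = [
    {"ID": 1, "LastModifiedDate": datetime(2024, 5, 1, 8, 0)},
    {"ID": 2, "LastModifiedDate": datetime(2024, 5, 3, 9, 30)},
]
mark = next_high_water_mark(rows, "LastModifiedDate", datetime(2024, 4, 30))
```

Note that an empty extract must not reset the mark, which is why `max` falls back to `current_mark`.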

3. Delta update types

Configure how data is written to the destination:

| Update Type | Behaviour | Use Case |
| --- | --- | --- |
| Append | Add new rows only | Log/event data |
| Merge | Insert new, update existing | Dimension tables |
| Replace | Replace matching partitions | Partitioned fact tables |
| Dimension (SCD2) | Track historical changes | Slowly changing dimensions |

4. Configure merge settings

For merge operations, configure:

  • Business key columns: Columns that identify unique records
  • Compare columns: Columns to check for changes (optional)
  • Update behaviour: What happens when records match
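How the business key and compare columns interact can be sketched as follows (an in-memory illustration with hypothetical rows, not Insight Factory's merge implementation):

```python
def merge_rows(target, incoming, business_keys, compare_columns=None):
    # Merge: insert rows whose business key is new, update existing rows
    # only when a compared column actually changed.
    index = {tuple(row[k] for k in business_keys): i
             for i, row in enumerate(target)}
    for row in incoming:
        key = tuple(row[k] for k in business_keys)
        if key not in index:
            target.append(row)  # new business key: insert
        else:
            existing = target[index[key]]
            cols = compare_columns or [c for c in row if c not in business_keys]
            if any(existing[c] != row[c] for c in cols):
                existing.update(row)  # matched and changed: update
    return target

target = [{"CustomerID": 1, "Name": "Ada"}]
incoming = [{"CustomerID": 1, "Name": "Ada L."}, {"CustomerID": 2, "Name": "Bob"}]
merged = merge_rows(target, incoming, ["CustomerID"])
```

Leaving compare columns unset falls back to comparing every non-key column, which is simpler but can trigger unnecessary updates.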

5. Handle initial load

The first run needs special handling:

  1. Set an appropriate initial high-water mark value
  2. Consider running initial load during off-peak hours
  3. Monitor the first execution for performance

Advanced patterns

Handling deletes

Soft deletes (recommended):

  • Source has an IsDeleted flag
  • Include deleted records in extract
  • Mark as deleted in destination
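The soft-delete flow above amounts to flagging rows rather than removing them. A minimal sketch, assuming an `ID` business key and an `IsDeleted` flag (both names illustrative):

```python
def apply_soft_deletes(target, extracted, business_key="ID", flag="IsDeleted"):
    # Propagate the source's IsDeleted flag into destination rows
    # instead of physically removing them.
    deleted_keys = {row[business_key] for row in extracted if row.get(flag)}
    for row in target:
        if row[business_key] in deleted_keys:
            row[flag] = True
    return target

target = [{"ID": 1, "IsDeleted": False}, {"ID": 2, "IsDeleted": False}]
extracted = [{"ID": 2, "IsDeleted": True}]
result = apply_soft_deletes(target, extracted)
```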

Hard deletes (complex):

  • Requires full comparison periodically
  • Or use CDC/change tracking at source

Partitioned incremental loading

For very large tables:

  1. Partition by date or category
  2. Only process recently modified partitions
  3. Combine with high-water mark for efficiency
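The partition-selection step can be sketched like this (partition metadata shape is hypothetical; substitute whatever your catalogue exposes):

```python
from datetime import date

def partitions_to_process(partitions, last_load):
    # Only partitions touched since the last load need reprocessing;
    # untouched partitions are skipped entirely.
    return [p for p in partitions if p["last_modified"] > last_load]

parts = [
    {"name": "2024-04", "last_modified": date(2024, 5, 1)},
    {"name": "2024-05", "last_modified": date(2024, 5, 20)},
]
todo = partitions_to_process(parts, date(2024, 5, 10))
```

Within each selected partition, the usual high-water-mark query then limits the rows extracted.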

Multiple control columns

Some scenarios need multiple controls:

```sql
-- Example: Multiple control columns
WHERE ModifiedDate > @LastModifiedDate
   OR (ModifiedDate = @LastModifiedDate AND ID > @LastID)
```
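This compound predicate is a lexicographic comparison on the pair (ModifiedDate, ID), which Python's tuple ordering expresses directly (hypothetical rows for illustration):

```python
from datetime import datetime

def passes_compound_mark(row, last_date, last_id):
    # Tuple comparison gives the same ordering as the SQL predicate:
    # ModifiedDate > @LastModifiedDate
    #   OR (ModifiedDate = @LastModifiedDate AND ID > @LastID)
    return (row["ModifiedDate"], row["ID"]) > (last_date, last_id)

last_date, last_id = datetime(2024, 5, 3), 42
rows = [
    {"ID": 41, "ModifiedDate": datetime(2024, 5, 3)},  # same date, older ID: skip
    {"ID": 43, "ModifiedDate": datetime(2024, 5, 3)},  # same date, newer ID: take
    {"ID": 7,  "ModifiedDate": datetime(2024, 5, 4)},  # newer date: take
]
selected = [r["ID"] for r in rows if passes_compound_mark(r, last_date, last_id)]
```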

Troubleshooting

| Issue | Possible Cause | Solution |
| --- | --- | --- |
| Missing records | Timestamp precision | Use a higher-precision timestamp |
| Duplicate records | Non-unique high-water mark | Add a secondary sort column |
| Slow performance | Too much history | Adjust the initial high-water mark |
| Data gaps | Source updates after extract | Add buffer time to the query |
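For the data-gap case, "buffer time" means re-reading a short window before the stored mark so late-arriving updates are not skipped. A minimal sketch (the 15-minute default is an assumption, tune it to your source's lateness):

```python
from datetime import datetime, timedelta

def buffered_mark(high_water_mark, buffer_minutes=15):
    # Extract from slightly before the stored mark; the merge step
    # de-duplicates any rows re-read in the overlap window.
    return high_water_mark - timedelta(minutes=buffer_minutes)

mark = buffered_mark(datetime(2024, 5, 3, 12, 0))
```

This only works with an Append-free update type such as Merge, since the overlap window re-extracts rows already loaded.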

Key Concepts

| Term | Definition |
| --- | --- |
| High-Water Mark | The maximum value of the control column from the last run |
| Extract Control | Configuration for tracking incremental progress |
| Delta Update | Processing only changed data instead of all data |
| Business Key | Column(s) that uniquely identify a record for merge operations |