Convert a Text file into a Lakehouse Table
Creates or updates a Lakehouse Table from a Data Lake file.
Category: Copy to Lakehouse Table | Tags: Data-Delta
Update Delta Table '<<DeltaSchemaName>>.<<DeltaTableName>>' (using Update Type '<<DeltaTableUpdateType>>') from Data Lake location 'raw/<<DataLakeSystemFolder>>/<<DataLakeDatasetFolder>>'
To use this activity within the API, use an ActivityCode of TEXT-FILE-TO-DELTA-TABLE.
Example JSON
An example of the Task Config for a task using this activity. In practice, some of these variables would be set at the group level to avoid duplication between tasks.
{
  "DataLakeSystemFolder": "my_folder",
  "DataLakeDatasetFolder": "data",
  "DeltaSchemaName": "example_schema",
  "DeltaTableName": "my_table",
  "DeltaTableUpdateType": "Replace"
}
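Before submitting a Task Config, it can be useful to check that every variable this activity marks as Required is present. The sketch below is illustrative only: `check_task_config` is a hypothetical helper, not part of the product API; the required-variable set is taken from the Variable Reference section.

```python
# Hypothetical helper: verify a Task Config dict supplies every variable
# the TEXT-FILE-TO-DELTA-TABLE activity marks as Required.
# This is a documentation sketch, not part of the product API.

REQUIRED_VARIABLES = {
    "DataLakeSystemFolder",
    "DataLakeDatasetFolder",
    "DeltaSchemaName",
    "DeltaTableName",
    "DeltaTableUpdateType",
}

def check_task_config(config: dict) -> list[str]:
    """Return the names of required variables missing from the config."""
    return sorted(REQUIRED_VARIABLES - config.keys())

task_config = {
    "DataLakeSystemFolder": "my_folder",
    "DataLakeDatasetFolder": "data",
    "DeltaSchemaName": "example_schema",
    "DeltaTableName": "my_table",
    "DeltaTableUpdateType": "Replace",
}
print(check_task_config(task_config))  # [] - nothing missing
```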
Variable Reference
The following variables are supported:
- DatabricksClusterId (Optional) - The Databricks Cluster to use for this task.
- DataLakeDatasetFolder (Required) - Name of the folder in the Data Lake containing the dataset.
- DataLakeSystemFolder (Required) - Name of the parent (System) folder in the Data Lake containing the dataset.
- DaysToRetainInRawFolderAfterSuccessfulProcessing (Optional) - The number of days of raw files to retain in the raw folder once the file has been successfully processed.
- DeltaSchemaName (Required) - The name of the Schema this transformation lives in.
- DeltaTableBusinessKeyColumnList (Optional) - Comma-separated list of Business Key columns in the Lakehouse Table. This is required if 'Lakehouse Table Update Type' is 'Dimension' or 'Merge'. If a value is specified, a uniqueness test is performed against this (composite) key for both the result of the Enrichment and the Lakehouse Table.
- DeltaTableComments (Optional) - Comments to add to the Lakehouse Table.
- DeltaTableName (Required) - The name of the Table representing this transformation.
- DeltaTablePartitionColumnList (Optional) - Comma-separated ordered list of columns forming the partitioning strategy of the Lakehouse Table.
- DeltaTableUpdateType (Required) - Indicates what type of update (if any) is to be performed on the Lakehouse Table.
- FailTaskIfNoDataToProcess (Optional) - If there is no data to process (either the raw file does not exist or the high-water mark is beyond the maximum load date of raw files), should the Task FAIL?
- FileSourceIsIncremental (Optional) - Do the source files represent an incremental set of data (i.e. not a full set)?
What This Setting Means
- YES (Incremental): Files contain only new or changed records, not the complete dataset
- NO (Full): Each file contains the complete dataset
How Data Is Processed
When Set to YES (Incremental)
- Without Business Key: All records from all ingested files are processed
- With Business Key: Only the most recent version of each record is kept (based on load timestamp)
When Set to NO (Full)
- Only records from the most recent file are processed
- All previous files are ignored
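The record-selection rules above can be sketched in a few lines of plain Python. This is an illustration of the described behaviour only, not the engine's actual implementation: records are modelled as (business key, load timestamp, value) tuples grouped by the file they arrived in, and `select_records` is a hypothetical name.

```python
# Illustrative sketch of the FileSourceIsIncremental rules described above.
# Not product code: select_records is a hypothetical helper.

def select_records(files, incremental, business_key=False):
    """files: ordered list of files, each a list of (key, load_ts, value)."""
    if not incremental:
        # Full: only the most recent file is processed; earlier files are ignored.
        return list(files[-1])
    records = [r for f in files for r in f]
    if not business_key:
        # Incremental without a Business Key: all records from all files.
        return records
    # Incremental with a Business Key: keep only the most recent
    # version of each record, based on the load timestamp.
    latest = {}
    for key, load_ts, value in records:
        if key not in latest or load_ts > latest[key][1]:
            latest[key] = (key, load_ts, value)
    return list(latest.values())

day1 = [("c1", 1, "Alice"), ("c2", 1, "Bob")]
day2 = [("c1", 2, "Alicia")]  # an update to c1 only

print(select_records([day1, day2], incremental=False))
# Full: [('c1', 2, 'Alicia')] - day1 is ignored, so c2 is lost
print(sorted(select_records([day1, day2], incremental=True, business_key=True)))
# Incremental with key: [('c1', 2, 'Alicia'), ('c2', 1, 'Bob')]
```

Note how the Full setting silently drops `c2`: each file is expected to be a complete snapshot, which is why incremental feeds must be flagged as such.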
When To Use Each Option
Choose YES (Incremental) for:
- Partial updates (new transactions, changes, additions)
- Building your dataset over time across multiple file loads
- Examples: Daily sales transactions, customer updates, record changes
Choose NO (Full) for:
- Complete data refreshes where each file replaces previous data
- Examples: Weekly product catalogs, monthly directories, complete snapshots
- IncludeFileMetadataColumn (Optional) - Should the metadata of the source data be included as a column in the Lakehouse Table?
- IncludeSourceFileName (Optional) - Should the filename the source data comes from be included as a column in the table?
- IncludeSourceRecordOrder (Optional) - Should the record order in the source data be included as a column in the Lakehouse Table?
- IsFederated (Optional) - Makes the task available to other Insight Factories within this organisation.
- LineSeparator (Optional) - The line separator used in the file. If omitted, all of \r, \r\n and \n are used.
- MaximumNumberOfAttemptsAllowed (Optional) - The total number of times the running of this Task can be attempted.
- MinutesToWaitBeforeNextAttempt (Optional) - If a Task run fails, the number of minutes to wait before re-attempting the Task.
- PartitionDepthToReplace (Optional) - The number of columns in 'Lakehouse Table Partition Column List' (counting from the first column in order) to use in a Partition Replacement. NOTE: This cannot be greater than the number of columns defined in 'Lakehouse Table Partition Column List'. Defaults to 1 if only one column has been specified in 'Lakehouse Table Partition Column List'.
- PartitionDiscoveryBasePath (Optional) - The base path that partition discovery should start from, if this occurs higher in the folder hierarchy. For example, if you are reading a data file from the folder sys1/dataset1/site=site1 and want a partition-inferred column of site in your dataset, specify the base path as sys1/dataset1.
- ProcessEmptyDataFile (Optional) - Should processing continue if the source data file is empty?
- ReadFileAsSingleRow (Optional) - If True, read the entire contents of the file as a single row.
- RelativeFilePathFromDatasetFolder (Optional) - The relative file path under the Dataset folder. Only use this if you are targeting a specific file from a list of files in a sub-folder.
- SaveRawFilesToHistoryInThisTask (Optional) - Raw files are normally saved to History in the ingestion task. However, if the ingestion task is not capable of doing this, you can request that the raw files be saved to History in this task.
- SkipCreateVolumeAndSchema (Optional) - If a Schema and/or Volume has already been created, you can opt to skip this check; it will lead to better performance.
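The PartitionDiscoveryBasePath example can be made concrete with a small sketch of Hive-style partition inference: folder segments below the base path of the form name=value become columns. This mirrors the behaviour described above in plain Python; `infer_partitions` is an illustrative helper, not the engine's actual code.

```python
# Illustrative sketch of partition discovery relative to a base path,
# mirroring the PartitionDiscoveryBasePath example above.
# infer_partitions is a hypothetical helper, not product code.

def infer_partitions(file_folder: str, base_path: str) -> dict:
    """Infer name=value partition columns from folders below base_path."""
    relative = file_folder[len(base_path):].strip("/")
    partitions = {}
    for segment in relative.split("/"):
        if "=" in segment:
            name, _, value = segment.partition("=")
            partitions[name] = value
    return partitions

# Reading from sys1/dataset1/site=site1 with base path sys1/dataset1
# yields a partition-inferred column 'site' with value 'site1'.
print(infer_partitions("sys1/dataset1/site=site1", "sys1/dataset1"))
# {'site': 'site1'}
```

Without the base path, discovery would start at the file's own folder and the site column would not be inferred, which is why the base path must point at the level above the partition folders.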