Convert a Delimited File into a Lakehouse Table
Creates or updates a Lakehouse Table from a delimited file in the Lakehouse.
To use this activity within the API, use an ActivityCode of CSV-FILE-TO-DELTA-TABLE.
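As a rough sketch only, a task definition sent through the API might reference the activity as follows. The surrounding field names here (TaskName, TaskConfig) are hypothetical placeholders rather than the actual API contract; only the ActivityCode value is taken from this page.

{
    "TaskName": "load_climate_csv",
    "ActivityCode": "CSV-FILE-TO-DELTA-TABLE",
    "TaskConfig": { ... }
}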
Example JSON
Below is an example of what the Task Config would look like for a task using this activity. Some of these variables would typically be set at the group level to avoid duplication between tasks.
{
    "DatasetName": "climate_data",
    "DeltaTableName": "climate",
    "DataLakeFileFormat": "parquet"
}
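As a more complete, purely illustrative sketch, a Task Config that supplies all of the required variables from the Variable Reference below might look like the following; the folder, schema, table, and column names are placeholder values rather than recommended settings.

{
    "DataLakeSystemFolder": "weather_system",
    "DataLakeDatasetFolder": "climate_data",
    "DeltaSchemaName": "staging",
    "DeltaTableName": "climate",
    "DeltaTableUpdateType": "Merge",
    "DeltaTableBusinessKeyColumnList": "station_id,reading_date"
}

Because the update type in this sketch is 'Merge', DeltaTableBusinessKeyColumnList is also supplied, as required by the Variable Reference below.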
Variable Reference
The following variables are supported:
- DataLakeSystemFolder - (Required) Name of the parent (System) folder in the Data Lake containing the dataset.
- DataLakeDatasetFolder - (Required) Name of the folder in the Data Lake containing the dataset.
- FileIsMultiLine - (Optional) Is there a line-break in any double-quote-escaped columns?
- FileSourceIsIncremental - (Optional) Do the source files represent an incremental set of data (i.e. not full)?
- DeltaSchemaName - (Required) The name of the Schema this transformation lives in.
- DeltaTableName - (Required) The name of the Lakehouse Table representing this transformation.
- DeltaTableComments - (Optional) Comments to add to the Lakehouse Table.
- DeltaTableUpdateType - (Required) Indicates what type of update (if any) is to be performed on the Lakehouse Table.
- DeltaTableBusinessKeyColumnList - (Optional) Comma-separated list of Business Key columns in the Lakehouse Table. This is required if 'Lakehouse Table Update Type' is 'Dimension' or 'Merge'. If a value is specified, a uniqueness test is performed against this (composite) key for both the result of the Enrichment and the Lakehouse Table.
- DeltaTablePartitionColumnList - (Optional) Comma-separated ordered list of columns forming the Partitioning strategy of the Lakehouse Table.
- DatabricksClusterId - (Optional) The Id of the Databricks Cluster to use to perform the transformation.
- Encoding - (Optional) The encoding type used to read/write text files.
- CsvFileColumnDelimiter - (Optional) The column delimiter used if the source file is delimited.
- CsvFileColumnQuoteCharacter - (Optional) The character used to quote columns that contain the column delimiter if the source file is delimited.
- CsvFileEscapeCharacter - (Optional) Character used to escape the character that immediately follows it. Usually backslash or double-quote.
- CaseSensitiveColumnNames - (Optional) Should the column names in the source file be treated in a case-sensitive way?
- NumberOfRowsToSkip - (Optional) How many rows should be skipped before reading data?
- FileHasHeaderRow - (Optional) Does the delimited file contain a header row?
- InferSchema - (Optional) Should the Schema be inferred from the source file? This applies mainly to csv and json source files.
- MergeSchema - (Optional) Should the Schemas from the possibly numerous source files be merged? It is recommended to leave this as False unless you are catering for schema drift.
- IncludeSourceFileName - (Optional) Should the filename the Source data comes from be included as a column in the Lakehouse Table?
- IncludeSourceRecordOrder - (Optional) Should the record order in the Source data be included as a column in the Lakehouse Table?
- IncludeFileMetadataColumn - (Optional) Should the metadata of the Source data be included as a column in the Lakehouse Table?
- ProcessEmptyDataFile - (Optional) Should processing continue if the source data file is empty?
- FailTaskIfNoDataToProcess - (Optional) If there is no data to process (either the raw file does not exist or the high-water mark is beyond the maximum load date of raw files), should the Task FAIL?
- PartitionDepthToReplace - (Optional) The number of columns in 'Lakehouse Table Partition Column List' (counting from the first column in order) to use in a Partition Replacement. NOTE: This cannot be greater than the number of columns defined in the 'Lakehouse Table Partition Column List'. Defaults to 1 if only one column has been specified in 'Lakehouse Table Partition Column List'. See the partition replacement sketch after this list.
- SchemaEnforcementColumnList - (Optional) An array of JSON structs that will enforce a schema on a data file as it is being converted to a Lakehouse Table. See help info for more details.
- RelativeFilePathFromDatasetFolder - (Optional) The relative file path under the Dataset folder. Only use this if you are targeting a specific file from a list of files in a sub-folder.
- PartitionDiscoveryBasePath - (Optional) The base path that partition discovery should start from, if partitioning occurs higher in the folder hierarchy. For example, if you are reading a data file from the folder sys1/dataset1/site=site1 and you want a partition-inferred column of site in your dataset, specify the base path as sys1/dataset1. See the sketch after this list.
- SkipCreateVolumeAndSchema - (Optional) If a Schema and/or Volume has already been created, you can opt to skip this check; doing so leads to better performance.
- DaysToRetainInRawFolderAfterSuccessfulProcessing - (Optional) The number of days of raw files to retain in the raw folder once the file has been successfully processed.
- SaveRawFilesToHistoryInThisTask - (Optional) Raw files are normally saved to History in the ingestion task. However, if the ingestion task is not capable of doing this, you can request that the raw files be saved to History in this task.
- MaximumNumberOfAttemptsAllowed - (Optional) The total number of times the running of this Task can be attempted.
- MinutesToWaitBeforeNextAttempt - (Optional) If a Task run fails, the number of minutes to wait before re-attempting the Task.
- IsFederated - (Optional) Makes the task available to other Insight Factories within this organisation.
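As a hypothetical illustration of the partitioning variables above (see PartitionDepthToReplace), the fragment below partitions the Lakehouse Table by two placeholder columns and uses both of them for Partition Replacement; the column names, and whether the depth is supplied as a number or a string, are assumptions.

{
    "DeltaTablePartitionColumnList": "load_year,load_month",
    "PartitionDepthToReplace": 2
}

With a depth of 2, both columns in the list are used when replacing partitions; as noted above, the depth cannot exceed the number of columns in 'Lakehouse Table Partition Column List'.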
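Similarly, building on the sys1/dataset1/site=site1 example given for PartitionDiscoveryBasePath, the relevant variables might be combined as follows; the folder names come from that example, but pairing them with DataLakeSystemFolder and DataLakeDatasetFolder in this way is an assumption.

{
    "DataLakeSystemFolder": "sys1",
    "DataLakeDatasetFolder": "dataset1",
    "PartitionDiscoveryBasePath": "sys1/dataset1"
}

Files read from sys1/dataset1/site=site1 would then surface site as a partition-inferred column in the resulting dataset.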