Convert a Delimited File into a Lakehouse Table
Creates or updates a Lakehouse Table from a delimited file in the Lakehouse.
To use this activity within the API, use an ActivityCode of CSV-FILE-TO-DELTA-TABLE.
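As a rough sketch only, a task definition sent through the API might reference the activity as follows. The surrounding field names here (TaskName, TaskConfig) are hypothetical placeholders rather than the actual API contract; only the ActivityCode value is taken from this page.

{
    "TaskName": "load_climate_csv",
    "ActivityCode": "CSV-FILE-TO-DELTA-TABLE",
    "TaskConfig": { ... }
}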
Example JSON
Below is an example of what the Task Config would look like for a task using this activity. Some of these variables would typically be set at the group level to avoid duplication between tasks.
{
    "DatasetName": "climate_data",
    "DeltaTableName": "climate",
    "DataLakeFileFormat": "parquet"
}
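As a more complete, purely illustrative sketch, a Task Config that supplies all of the required variables from the Variable Reference below might look like the following; the folder, schema, table, and column names are placeholder values rather than recommended settings.

{
    "DataLakeSystemFolder": "weather_system",
    "DataLakeDatasetFolder": "climate_data",
    "DeltaSchemaName": "staging",
    "DeltaTableName": "climate",
    "DeltaTableUpdateType": "Merge",
    "DeltaTableBusinessKeyColumnList": "station_id,reading_date"
}

Because the update type in this sketch is 'Merge', DeltaTableBusinessKeyColumnList is also supplied, as required by the Variable Reference below.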
Variable Reference
The following variables are supported:
- DataLakeSystemFolder - (Required) Name of the parent (System) folder in the Data Lake containing the dataset.
- DataLakeDatasetFolder - (Required) Name of the folder in the Data Lake containing the dataset.
- FileIsMultiLine - (Optional) Is there a line-break in any double-quote-escaped columns?
- FileSourceIsIncremental - (Optional) Do the source files represent an incremental set of data (i.e. not full)?
- DeltaSchemaName - (Required) The name of the Schema this transformation lives in.
- DeltaTableName - (Required) The name of the Lakehouse Table representing this transformation.
- DeltaTableComments - (Optional) Comments to add to the Lakehouse Table.
- DeltaTableUpdateType - (Required) Indicates what type of update (if any) is to be performed on the Lakehouse Table.
- DeltaTableBusinessKeyColumnList - (Optional) Comma-separated list of Business Key columns in the Lakehouse Table. This is required if 'Lakehouse Table Update Type' is 'Dimension' or 'Merge'. If a value is specified, a uniqueness test is performed against this (composite) key for both the result of the Enrichment and the Lakehouse Table.
- DeltaTablePartitionColumnList - (Optional) Comma-separated ordered list of columns forming the Partitioning strategy of the Lakehouse Table.
- DatabricksClusterId - (Optional) The Id of the Databricks Cluster to use to perform the transformation.
- Encoding - (Optional) The encoding type used to read/write text files.
- CsvFileColumnDelimiter - (Optional) The column delimiter used if the source file is delimited.
- CsvFileColumnQuoteCharacter - (Optional) The character used to quote columns that contain the column delimiter if the source file is delimited.
- CsvFileEscapeCharacter - (Optional) Character used to escape the character that immediately follows it. Usually backslash or double-quote.
- CaseSensitiveColumnNames - (Optional) Should the column names in the source file be treated in a case-sensitive way?
- NumberOfRowsToSkip - (Optional) How many rows should be skipped before reading data?
- FileHasHeaderRow - (Optional) Does the delimited file contain a header row?
- InferSchema - (Optional) Should the Schema be inferred from the source file? This applies mainly to csv and json source files.
- MergeSchema - (Optional) Should the Schemas from the possibly numerous source files be merged? It is recommended to leave this as False unless you are catering for schema drift.
- IncludeSourceFileName - (Optional) Should the filename the Source data comes from be included as a column in the Lakehouse Table?
- IncludeSourceRecordOrder - (Optional) Should the record order in the Source data be included as a column in the Lakehouse Table?
- IncludeFileMetadataColumn - (Optional) Should the metadata of the Source data be included as a column in the Lakehouse Table?
- ProcessEmptyDataFile - (Optional) Should processing continue if the source data file is empty?
- FailTaskIfNoDataToProcess - (Optional) If there is no data to process (either the raw file does not exist or the high-water mark is beyond the maximum load date of raw files), should the Task FAIL?
- PartitionDepthToReplace - (Optional) The number of columns in 'Lakehouse Table Partition Column List' (counting from the first column in order) to use in a Partition Replacement. NOTE: This cannot be greater than the number of columns defined in the 'Lakehouse Table Partition Column List'. Defaults to 1 if only one column has been specified in 'Lakehouse Table Partition Column List'. See the partition replacement sketch after this list.
- SchemaEnforcementColumnList - (Optional) An array of JSON structs that will enforce a schema on a data file as it is being converted to a Lakehouse Table. See help info for more details.
- RelativeFilePathFromDatasetFolder - (Optional) The relative file path under the Dataset folder. Only use this if you are targeting a specific file from a list of files in a sub-folder.
- PartitionDiscoveryBasePath - (Optional) The base path that partition discovery should start from, if partitioning occurs higher in the folder hierarchy. For example, if you are reading a data file from the folder sys1/dataset1/site=site1 and you want a partition-inferred column of site in your dataset, specify the base path as sys1/dataset1. See the sketch after this list.
- SkipCreateVolumeAndSchema - (Optional) If a Schema and/or Volume has already been created, you can opt to skip this check; doing so leads to better performance.
- DaysToRetainInRawFolderAfterSuccessfulProcessing - (Optional) The number of days of raw files to retain in the raw folder once the file has been successfully processed.
- SaveRawFilesToHistoryInThisTask - (Optional) Raw files are normally saved to History in the ingestion task. However, if the ingestion task is not capable of doing this, you can request that the raw files be saved to History in this task.
- MaximumNumberOfAttemptsAllowed - (Optional) The total number of times the running of this Task can be attempted.
- MinutesToWaitBeforeNextAttempt - (Optional) If a Task run fails, the number of minutes to wait before re-attempting the Task.
- IsFederated - (Optional) Makes the task available to other Insight Factories within this organisation.
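As a hypothetical illustration of the partitioning variables above (see PartitionDepthToReplace), the fragment below partitions the Lakehouse Table by two placeholder columns and uses both of them for Partition Replacement; the column names, and whether the depth is supplied as a number or a string, are assumptions.

{
    "DeltaTablePartitionColumnList": "load_year,load_month",
    "PartitionDepthToReplace": 2
}

With a depth of 2, both columns in the list are used when replacing partitions; as noted above, the depth cannot exceed the number of columns in 'Lakehouse Table Partition Column List'.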
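Similarly, building on the sys1/dataset1/site=site1 example given for PartitionDiscoveryBasePath, the relevant variables might be combined as follows; the folder names come from that example, but pairing them with DataLakeSystemFolder and DataLakeDatasetFolder in this way is an assumption.

{
    "DataLakeSystemFolder": "sys1",
    "DataLakeDatasetFolder": "dataset1",
    "PartitionDiscoveryBasePath": "sys1/dataset1"
}

Files read from sys1/dataset1/site=site1 would then surface site as a partition-inferred column in the resulting dataset.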