Build an ML Model and update ML Catalog

Build an ML Model and add it to the ML Catalog.

Category: Enrich Lakehouse Table | Tags: Enrichment

To use this activity within the API, use an ActivityCode of ML-BUILD-MODEL.

Example JSON

The following is an example of the Task Config for a task using this activity. Some of these variables would be set at the group level to avoid duplication between tasks.

{
  "NotebookPath": "/Users/fred.nurks@example.com/MyRepo/My Notebook",
  "ModelSchemaName": "example_schema",
  "ModelName": "",
  "NotebookParameters": {
    "Param1": "Value1",
    "Param2": "Value2"
  }
}
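
A Task Config like the one above can also be assembled in code before it is submitted. The following is a minimal sketch in Python, not part of the product: the helper name build_task_config and the model name "example_model" are hypothetical, and DatabricksJobClusterSpec is assumed to be set at the group level.

```python
import json

# Variables marked (Required) in the Variable Reference that this sketch sets
# at the task level. DatabricksJobClusterSpec is assumed to come from the group.
REQUIRED_KEYS = ("NotebookPath", "ModelSchemaName", "ModelName")


def build_task_config(notebook_path, model_schema_name, model_name,
                      notebook_parameters=None):
    """Return a Task Config dict for ML-BUILD-MODEL (hypothetical helper)."""
    config = {
        "NotebookPath": notebook_path,
        "ModelSchemaName": model_schema_name,
        "ModelName": model_name,
    }
    # NotebookParameters is optional, so only include it when supplied.
    if notebook_parameters:
        config["NotebookParameters"] = dict(notebook_parameters)

    # Reject empty or missing values for the required variables.
    missing = [key for key in REQUIRED_KEYS if not config[key]]
    if missing:
        raise ValueError("Missing required variables: " + ", ".join(missing))
    return config


task_config = build_task_config(
    "/Users/fred.nurks@example.com/MyRepo/My Notebook",
    "example_schema",
    "example_model",  # placeholder name, chosen for illustration only
    {"Param1": "Value1", "Param2": "Value2"},
)
print(json.dumps(task_config, indent=2))
```

Validating the required variables before submission surfaces an empty ModelName early rather than at run time.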

Variable Reference

The following variables are supported:

AdditionalNotebooks (Optional) - The path to other notebooks, Python files etc., referenced by the main notebook.
DatabricksComputeId (Optional) - The Id of the Databricks compute resource to use to run the Notebook.
DatabricksJobClusterSpec (Required) - JSON configuration for the job cluster. See help for examples.
Job Cluster Spec
JSON configuration for the ephemeral job cluster. A new cluster is created for each run and terminated on completion.
Required properties
- RuntimeVersion — Databricks Runtime version (e.g. "17.3")
- NodeTypeId or InstancePoolId — one must be specified
Optional properties
- NumWorkers — number of worker nodes (omit or set to 0 for single-node)
- DriverNodeTypeId — override the driver node type
- DriverInstancePoolId — use a separate pool for the driver node
- SparkVersionFull — override the full Spark version string (e.g. "17.3.x-gpu-ml-scala2.13")
- Libraries — additional libraries to install on the cluster
azure-keyvault-secrets is always installed automatically.
Example — Standard job cluster
{ "RuntimeVersion": "17.3", "NodeTypeId": "Standard_DS3_v2", "NumWorkers": 4 }
Example — Using instance pools
{ "RuntimeVersion": "17.3", "InstancePoolId": "0101-120000-pool123", "NumWorkers": 0, "Libraries": [{ "pypi": { "package": "pandas==2.0.0" } }] }
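The property rules above can be checked before a spec is saved. This is a minimal sketch in Python under stated assumptions: the function names validate_job_cluster_spec and is_single_node are hypothetical, and the check accepts a spec that supplies both NodeTypeId and InstancePoolId, because the rules above only say that one must be specified.

```python
import json


def validate_job_cluster_spec(spec):
    """Return a list of problems found in a Job Cluster Spec dict (hypothetical helper)."""
    problems = []
    if not spec.get("RuntimeVersion"):
        problems.append("RuntimeVersion is required")
    if not (spec.get("NodeTypeId") or spec.get("InstancePoolId")):
        problems.append("one of NodeTypeId or InstancePoolId is required")
    # NumWorkers is optional; omitted is treated the same as 0.
    num_workers = spec.get("NumWorkers", 0)
    if isinstance(num_workers, bool) or not isinstance(num_workers, int) or num_workers < 0:
        problems.append("NumWorkers must be a non-negative integer")
    return problems


def is_single_node(spec):
    """True when NumWorkers is omitted or 0, which the rules above describe as single-node."""
    return spec.get("NumWorkers", 0) == 0


standard = json.loads('{ "RuntimeVersion": "17.3", "NodeTypeId": "Standard_DS3_v2", "NumWorkers": 4 }')
pooled = json.loads('{ "RuntimeVersion": "17.3", "InstancePoolId": "0101-120000-pool123", "NumWorkers": 0 }')

for spec in (standard, pooled):
    print(validate_job_cluster_spec(spec), "single-node" if is_single_node(spec) else "multi-node")
```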
DatabricksServerlessSpec (Optional) - JSON configuration for serverless job runs. See help for examples.
Serverless Job Spec
Optional JSON configuration for serverless job runs. Allows specifying Python dependencies and an explicit environment version for serverless compute. All fields are optional.
Required properties
None.
Optional properties
- Libraries — list of pip requirement strings to install
- EnvironmentVersion — Databricks environment version. If omitted, Databricks uses its default.
Example — With libraries only (Databricks uses its default environment version)
{ "Libraries": ["pandas", "numpy>=1.21.0"] }
Example — With libraries and a pinned environment version
{ "Libraries": ["pandas==2.0.0"], "EnvironmentVersion": "5" }
Notes
- The spec is only applied to serverless runs (cluster IDs starting with SERVERLESS). It is ignored for existing clusters and job clusters.
- Libraries map to environments[].spec.dependencies in the Databricks Jobs API.
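The notes above can be illustrated with a short sketch in Python. It is not the product's implementation: the function names are hypothetical, the environment key "default" is an assumption, and mapping EnvironmentVersion to spec.environment_version is an assumption that the notes above do not state. Only the Libraries mapping to environments[].spec.dependencies comes from the notes.

```python
def is_serverless(compute_id):
    """True when the compute Id starts with SERVERLESS, as described in the notes."""
    return bool(compute_id) and compute_id.startswith("SERVERLESS")


def to_jobs_api_environments(serverless_spec, environment_key="default"):
    """Translate a Serverless Job Spec into a Jobs API style environments list.

    The environment_key value and the environment_version field are
    assumptions made for illustration.
    """
    spec = {}
    libraries = serverless_spec.get("Libraries")
    if libraries:
        # Libraries map to environments[].spec.dependencies.
        spec["dependencies"] = list(libraries)
    version = serverless_spec.get("EnvironmentVersion")
    if version is not None:
        # Assumed mapping; when omitted, Databricks uses its default.
        spec["environment_version"] = str(version)
    return [{"environment_key": environment_key, "spec": spec}]


serverless_spec = {"Libraries": ["pandas==2.0.0"], "EnvironmentVersion": "5"}
compute_id = "SERVERLESS"  # placeholder Id for illustration

# The spec is only applied to serverless runs and ignored otherwise.
environments = to_jobs_api_environments(serverless_spec) if is_serverless(compute_id) else None
print(environments)
```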
ExtractControlVariableName (Optional) - For incremental loads only, the name of the State Config variable that stores the Extract Control value derived from the Extract Control Query.
ExtractControlVariableSeedValue (Optional) - The initial value to set for the Extract Control variable in State Config. It is used only when the variable is first seeded and has no effect afterwards.
IsFederated (Optional) - Makes the task available to other Insight Factories within this organisation.
Links (Optional)
MaximumNumberOfAttemptsAllowed (Optional) - The total number of times a run of this Task can be attempted, including the first attempt.
MaximumRunTimeHours (Optional) - The maximum number of hours a task run is expected to take. If a run exceeds this duration it will be flagged as stuck.
MinutesToWaitBeforeNextAttempt (Optional) - If a Task run fails, the number of minutes to wait before re-attempting the Task.
ModelName (Required) - Name of the ML Model.
ModelSchemaName (Required) - The Schema the ML Model resides in.
NotebookParameters (Optional) - Parameters for use in the Databricks Notebook, supplied as JSON, e.g. { "Param1": "Value1", "Param2": "Value2" }.
NotebookPath (Required) - The relative path to the Databricks Notebook.
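
The way MaximumNumberOfAttemptsAllowed and MinutesToWaitBeforeNextAttempt interact can be sketched as a simple retry loop. This Python sketch only illustrates the semantics described above and is not how the platform schedules runs; the function run_with_retries and its sleep parameter are hypothetical.

```python
import time


def run_with_retries(task, maximum_number_of_attempts_allowed=1,
                     minutes_to_wait_before_next_attempt=0, sleep=time.sleep):
    """Call task() until it succeeds or the total attempt limit is reached."""
    if maximum_number_of_attempts_allowed < 1:
        raise ValueError("at least one attempt must be allowed")
    last_error = None
    for attempt in range(1, maximum_number_of_attempts_allowed + 1):
        try:
            return task()
        except Exception as error:
            last_error = error
            # Wait only if another attempt remains.
            if attempt < maximum_number_of_attempts_allowed:
                sleep(minutes_to_wait_before_next_attempt * 60)
    raise last_error


# Usage: a task that fails twice, then succeeds on the third attempt.
calls = {"count": 0}
waits = []


def flaky_task():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated failure")
    return "built"


# Record the waits instead of sleeping so the example finishes immediately.
result = run_with_retries(flaky_task, 3, 5, sleep=waits.append)
print(result, calls["count"], waits)
```

With a limit of 3, the first run plus two re-attempts are allowed, so two waits occur at most.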
