Map Databricks jobs to the tables they read from and write to, using ADLS Diagnostic Logs.
Databricks jobs read and write various tables. Historically, these tables have been managed by a Hive Metastore (HMS), either the Databricks-provided HMS that is the default for each workspace, or an externally managed HMS. In a project to upgrade a Databricks deployment to Unity Catalog, it sometimes becomes necessary to build a "lineage graph" of which jobs read and write which tables in order to sequence the upgrade properly, since some tables depend on others. While UC provides this lineage capability out of the box, it is possible (if complex) to reconstruct aspects of it with Databricks and Azure primitives.
To successfully follow the notebooks in this repo, the following must be true:
- You have enabled ADLS Diagnostic Logs on the storage locations for the tables for which you want to construct lineage.
- You have run the External Metastore Inventory script to "dump" HMS information to a queryable Delta table.
- You have run the script (provided by your Databricks representative or manually constructed) to create a queryable Delta table with all job run information available from the `jobs/runs/list` API (a minimal sketch follows this list).
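If you need to build the job-runs table yourself, the last prerequisite can be sketched roughly as below. It pages through the `jobs/runs/list` REST API and writes the relevant fields to Delta; the workspace URL, secret scope, and target table name (`hms_upgrade.job_runs`) are placeholders rather than names from this repo, and `spark` / `dbutils` are the handles Databricks notebooks provide.

```python
import requests
from pyspark.sql import Row

# Placeholders -- adjust to your workspace and secret scope.
DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"
TOKEN = dbutils.secrets.get(scope="upgrade", key="jobs-api-token")  # hypothetical secret

# Page through the Jobs Runs List API.
runs, offset = [], 0
while True:
    resp = requests.get(
        f"{DATABRICKS_HOST}/api/2.0/jobs/runs/list",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"limit": 25, "offset": offset},
    )
    resp.raise_for_status()
    page = resp.json()
    runs.extend(page.get("runs", []))
    if not page.get("has_more"):
        break
    offset += len(page.get("runs", []))

# Keep only the fields needed downstream: the job, the run, and the ephemeral
# cluster that executed it (cluster_instance.cluster_id).
rows = [
    Row(
        job_id=r.get("job_id"),
        run_id=r.get("run_id"),
        cluster_id=(r.get("cluster_instance") or {}).get("cluster_id"),
        start_time=r.get("start_time"),
    )
    for r in runs
]
spark.createDataFrame(rows).write.mode("overwrite").saveAsTable("hms_upgrade.job_runs")
```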
In upgrading to Unity Catalog, it is important to identify the dependencies between automated jobs and the data tables they read from and write to. Rather than asking each developer or job owner to enumerate the inputs and outputs of every production job, in Azure it is usually possible to determine this by mapping ADLS logs to cluster IDs.
For each job run (i.e. a single occurrence of a job), a new ephemeral job cluster is created to run the job's tasks. As this cluster accesses cloud storage, the ADLS StorageBlobLogs record which cluster used a given credential (e.g. a Service Principal) to access a table at a specific storage path.
By building the mapping job run → job cluster ID → storage access event → storage path → external HMS table entry, we can determine the specific tables associated with each job.
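The joins behind that chain can be sketched roughly as follows, assuming three Delta tables produced by the prerequisites and step 1 below: `hms_upgrade.job_runs` (job_id, run_id, cluster_id), `hms_upgrade.adls_logs` (cluster_id, operation_name, uri), and `hms_upgrade.hms_tables` (database, table_name, location). All table and column names are illustrative; the second notebook in this repo contains the actual logic.

```python
from pyspark.sql import functions as F

job_runs = spark.table("hms_upgrade.job_runs")      # job run -> ephemeral cluster id
adls_logs = spark.table("hms_upgrade.adls_logs")    # cluster id -> storage access events
hms_tables = spark.table("hms_upgrade.hms_tables")  # HMS table -> storage location

# Classify storage operations as reads or writes. The exact operation names depend on
# the diagnostic categories you enabled; adjust this mapping to what your logs contain.
access = adls_logs.withColumn(
    "access_type",
    F.when(F.col("operation_name").isin("ReadFile", "GetBlob"), "read")
     .when(F.col("operation_name").isin("AppendFile", "FlushFile", "PutBlob"), "write")
     .otherwise("other"),
)

# Normalize both sides to "<container>/<path>" so a prefix match works across the
# https:// form in the logs and the abfss:// form in HMS locations (assumes ADLS Gen2
# paths throughout; adjust the patterns for your layout).
access = access.withColumn(
    "norm_path", F.regexp_extract("uri", r"dfs\.core\.windows\.net/([^?]+)", 1)
)
tables = hms_tables.withColumn(
    "norm_location",
    F.concat_ws(
        "/",
        F.regexp_extract("location", r"abfss://([^@]+)@", 1),               # container
        F.regexp_extract("location", r"dfs\.core\.windows\.net/(.+)$", 1),  # path
    ),
)

# job run -> job cluster id -> storage access event -> storage path -> HMS table entry
lineage = (
    job_runs
    .join(access, "cluster_id")
    .join(tables, F.col("norm_path").startswith(F.col("norm_location")))
    .select("job_id", "run_id", "database", "table_name", "access_type")
    .distinct()
)

lineage.write.mode("overwrite").saveAsTable("hms_upgrade.job_table_lineage")
```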
Assuming all prerequisites are satisfied, you are ready to proceed with the following notebooks in this repo:
1. ADLS Log Dump to Delta Table: this script creates a queryable Delta table with ADLS access logs (a minimal sketch follows this list).
2. Job to Cluster to Location Mapping: this script joins the ADLS access logs (step 1), the HMS table definitions (prerequisite), and the job run instances (prerequisite).
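If you are assembling step 1 from scratch, the ingestion can be sketched roughly as below. It assumes the diagnostic logs are archived to a storage account as JSON-lines files (for example under the `insights-logs-storageread` / `insights-logs-storagewrite` containers) and that the Databricks cluster ID can be recovered from the request's user agent header; the storage account name, field names, extraction pattern, and target table are assumptions to verify against your own logs.

```python
from pyspark.sql import functions as F

# Read the archived diagnostic log JSON lines. The account name and container are
# placeholders; point this at wherever your diagnostic settings deliver logs, and
# union in the write/delete categories if you archived those separately.
raw = (
    spark.read
    .option("recursiveFileLookup", "true")
    .json("abfss://insights-logs-storageread@<log-storage-account>.dfs.core.windows.net/")
)

logs = raw.select(
    F.col("time").alias("event_time"),
    F.col("operationName").alias("operation_name"),
    F.col("uri"),
    F.col("properties.userAgentHeader").alias("user_agent"),
)

# Assumption: the Databricks cluster ID shows up in the user agent (for example via a
# fs.azure.user.agent.prefix configured on your clusters). Adjust the pattern to
# whatever actually appears in your logs.
logs = logs.withColumn(
    "cluster_id", F.regexp_extract("user_agent", r"(\d{4}-\d{6}-[a-z0-9]+)", 1)
)

logs.write.mode("overwrite").saveAsTable("hms_upgrade.adls_logs")
```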
Once these complete successfully, the result is a table that maps job IDs to the tables they read from and write to. This mapping can then be used during the Unity Catalog upgrade to sequence the migration of data applications.
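As one example of how the mapping can drive sequencing, the query below (against the hypothetical `hms_upgrade.job_table_lineage` table from the sketches above) lists, for each table, the jobs that write it and the jobs that read it, so that producers can be upgraded before their consumers.

```python
from pyspark.sql import functions as F

lineage = spark.table("hms_upgrade.job_table_lineage")

# For each table, collect the distinct writer and reader jobs; upgrading writers
# before readers is one simple way to order the migration.
ordering = (
    lineage.groupBy("database", "table_name")
    .agg(
        F.collect_set(F.when(F.col("access_type") == "write", F.col("job_id"))).alias("writer_jobs"),
        F.collect_set(F.when(F.col("access_type") == "read", F.col("job_id"))).alias("reader_jobs"),
    )
)
ordering.show(truncate=False)
```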