Map Databricks jobs to the tables they read from and write to, using ADLS Diagnostic Logs.
Databricks jobs read and write various tables. Historically, these tables have been managed by a Hive Metastore (HMS), either the Databricks-provided HMS that is the default for each workspace, or an externally managed HMS. In a project to upgrade a Databricks deployment to Unity Catalog, it sometimes becomes necessary to build a "lineage graph" of which jobs read and write which tables in order to sequence the upgrade properly, since some tables depend on others. While UC provides this lineage capability out of the box, it is possible (if complex) to reconstruct aspects of it with Databricks and Azure primitives.
To successfully follow the notebooks in this repo, the following must be true:
- You have enabled ADLS Diagnostic Logs on the storage locations for the tables for which you want to construct lineage.
- You have run the External Metastore Inventory script to "dump" HMS information to a queryable Delta table.
- You have run the script (provided by your Databricks representative or manually constructed) to create a queryable Delta table with all job run information available from the `jobs/runs/list` API (a minimal sketch follows this list).
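If you need to build the job-runs table yourself, the last prerequisite can be sketched roughly as below. It pages through the `jobs/runs/list` REST API and writes the relevant fields to Delta; the workspace URL, secret scope, and target table name (`hms_upgrade.job_runs`) are placeholders rather than names from this repo, and `spark` / `dbutils` are the handles Databricks notebooks provide.

```python
import requests
from pyspark.sql import Row

# Placeholders -- adjust to your workspace and secret scope.
DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"
TOKEN = dbutils.secrets.get(scope="upgrade", key="jobs-api-token")  # hypothetical secret

# Page through the Jobs Runs List API.
runs, offset = [], 0
while True:
    resp = requests.get(
        f"{DATABRICKS_HOST}/api/2.0/jobs/runs/list",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"limit": 25, "offset": offset},
    )
    resp.raise_for_status()
    page = resp.json()
    runs.extend(page.get("runs", []))
    if not page.get("has_more"):
        break
    offset += len(page.get("runs", []))

# Keep only the fields needed downstream: the job, the run, and the ephemeral
# cluster that executed it (cluster_instance.cluster_id).
rows = [
    Row(
        job_id=r.get("job_id"),
        run_id=r.get("run_id"),
        cluster_id=(r.get("cluster_instance") or {}).get("cluster_id"),
        start_time=r.get("start_time"),
    )
    for r in runs
]
spark.createDataFrame(rows).write.mode("overwrite").saveAsTable("hms_upgrade.job_runs")
```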
In upgrading to Unity Catalog, it is important to identify the dependencies between automated jobs and the data tables they read from and write to. Rather than asking each developer or job owner to enumerate the inputs and outputs of every production job, in Azure it is usually possible to determine this by mapping ADLS logs to cluster IDs.
For each job run (i.e. a single occurrence of a job), a new ephemeral job cluster is created to run the job's tasks. As this cluster accesses cloud storage, the ADLS StorageBlobLogs record which cluster used a given credential (e.g. a Service Principal) to access a table at a specific storage path.
By building the mapping job run → job cluster ID → storage access event → storage path → external HMS table entry, we can determine the specific tables associated with each job.
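The joins behind that chain can be sketched roughly as follows, assuming three Delta tables produced by the prerequisites and step 1 below: `hms_upgrade.job_runs` (job_id, run_id, cluster_id), `hms_upgrade.adls_logs` (cluster_id, operation_name, uri), and `hms_upgrade.hms_tables` (database, table_name, location). All table and column names are illustrative; the second notebook in this repo contains the actual logic.

```python
from pyspark.sql import functions as F

job_runs = spark.table("hms_upgrade.job_runs")      # job run -> ephemeral cluster id
adls_logs = spark.table("hms_upgrade.adls_logs")    # cluster id -> storage access events
hms_tables = spark.table("hms_upgrade.hms_tables")  # HMS table -> storage location

# Classify storage operations as reads or writes. The exact operation names depend on
# the diagnostic categories you enabled; adjust this mapping to what your logs contain.
access = adls_logs.withColumn(
    "access_type",
    F.when(F.col("operation_name").isin("ReadFile", "GetBlob"), "read")
     .when(F.col("operation_name").isin("AppendFile", "FlushFile", "PutBlob"), "write")
     .otherwise("other"),
)

# Normalize both sides to "<container>/<path>" so a prefix match works across the
# https:// form in the logs and the abfss:// form in HMS locations (assumes ADLS Gen2
# paths throughout; adjust the patterns for your layout).
access = access.withColumn(
    "norm_path", F.regexp_extract("uri", r"dfs\.core\.windows\.net/([^?]+)", 1)
)
tables = hms_tables.withColumn(
    "norm_location",
    F.concat_ws(
        "/",
        F.regexp_extract("location", r"abfss://([^@]+)@", 1),               # container
        F.regexp_extract("location", r"dfs\.core\.windows\.net/(.+)$", 1),  # path
    ),
)

# job run -> job cluster id -> storage access event -> storage path -> HMS table entry
lineage = (
    job_runs
    .join(access, "cluster_id")
    .join(tables, F.col("norm_path").startswith(F.col("norm_location")))
    .select("job_id", "run_id", "database", "table_name", "access_type")
    .distinct()
)

lineage.write.mode("overwrite").saveAsTable("hms_upgrade.job_table_lineage")
```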
Assuming all prerequisites are satisfied, you are ready to proceed with the following notebooks in this repo:
1. ADLS Log Dump to Delta Table: this script creates a queryable Delta table with ADLS access logs (a minimal sketch follows this list).
2. Job to Cluster to Location Mapping: this script joins the ADLS access logs (step 1), the HMS table definitions (prerequisite), and the job run instances (prerequisite).
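If you are assembling step 1 from scratch, the ingestion can be sketched roughly as below. It assumes the diagnostic logs are archived to a storage account as JSON-lines files (for example under the `insights-logs-storageread` / `insights-logs-storagewrite` containers) and that the Databricks cluster ID can be recovered from the request's user agent header; the storage account name, field names, extraction pattern, and target table are assumptions to verify against your own logs.

```python
from pyspark.sql import functions as F

# Read the archived diagnostic log JSON lines. The account name and container are
# placeholders; point this at wherever your diagnostic settings deliver logs, and
# union in the write/delete categories if you archived those separately.
raw = (
    spark.read
    .option("recursiveFileLookup", "true")
    .json("abfss://insights-logs-storageread@<log-storage-account>.dfs.core.windows.net/")
)

logs = raw.select(
    F.col("time").alias("event_time"),
    F.col("operationName").alias("operation_name"),
    F.col("uri"),
    F.col("properties.userAgentHeader").alias("user_agent"),
)

# Assumption: the Databricks cluster ID shows up in the user agent (for example via a
# fs.azure.user.agent.prefix configured on your clusters). Adjust the pattern to
# whatever actually appears in your logs.
logs = logs.withColumn(
    "cluster_id", F.regexp_extract("user_agent", r"(\d{4}-\d{6}-[a-z0-9]+)", 1)
)

logs.write.mode("overwrite").saveAsTable("hms_upgrade.adls_logs")
```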
Once these complete successfully, the result is a table that maps job IDs to the tables they read from and write to. This mapping can then be used during the Unity Catalog upgrade to sequence the migration of data applications.
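As one example of how the mapping can drive sequencing, the query below (against the hypothetical `hms_upgrade.job_table_lineage` table from the sketches above) lists, for each table, the jobs that write it and the jobs that read it, so that producers can be upgraded before their consumers.

```python
from pyspark.sql import functions as F

lineage = spark.table("hms_upgrade.job_table_lineage")

# For each table, collect the distinct writer and reader jobs; upgrading writers
# before readers is one simple way to order the migration.
ordering = (
    lineage.groupBy("database", "table_name")
    .agg(
        F.collect_set(F.when(F.col("access_type") == "write", F.col("job_id"))).alias("writer_jobs"),
        F.collect_set(F.when(F.col("access_type") == "read", F.col("job_id"))).alias("reader_jobs"),
    )
)
ordering.show(truncate=False)
```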