This repository contains the implementation of enforcement UDFs.
- `core`: modules for field path representation and the engine-agnostic enforcement implementation
- `extensions`: engine-specific UDF implementations
- `transport`: extensions to the generic type system implementations in the transport library
- `benchmark`: code for benchmarking UDF performance (both microbenchmarks and cluster benchmarks)
```
./gradlew build
```
## Core Modules

The `dataguard-fieldpaths` module in the `core` directory contains reusable logic for redacting fields, implemented in an engine-agnostic way.
The grammar for representing field paths is described in `VirtualFieldPath.g4` and processed using the ANTLR library. The code for parsing, semantic analysis, and enforcement corresponding to this grammar lives in the `com.linkedin.dataguard.runtime.fieldpaths.virtual` package.
A non-trivial number of field paths in LinkedIn's data catalog are expressed in a slightly different, less expressive legacy language. This module also contains code for processing those field paths, whose grammar is defined in `TMSPath.g4`; the corresponding code lives in the `com.linkedin.dataguard.runtime.fieldpaths.tms` package.
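To give a flavor of what field path resolution involves, the sketch below walks a simplified dotted path (e.g. `address.city`) down a nested, map-based record. This is an illustration only: the dotted syntax and the `resolve` helper are assumptions made for this example, and the actual `VirtualFieldPath.g4` grammar is considerably more expressive.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: a simplified dotted field path resolved against
// nested map-based data. The real ANTLR-based grammar is much richer.
class FieldPathDemo {

    // Walk the path segments down the nested structure; return null
    // if any segment is missing or the current value is not a record.
    static Object resolve(Map<String, Object> record, String path) {
        Object current = record;
        for (String segment : path.split("\\.")) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<?, ?>) current).get(segment);
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, Object> address = new HashMap<>();
        address.put("city", "Sunnyvale");
        Map<String, Object> member = new HashMap<>();
        member.put("name", "alice");
        member.put("address", address);

        System.out.println(resolve(member, "address.city")); // Sunnyvale
        System.out.println(resolve(member, "address.zip"));  // null (missing field)
    }
}
```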
The enforcement code leverages the generic type system (`StdType` and `StdData` objects) defined in the transport library in order to implement engine-agnostic enforcement logic. This common code is then reused in engine-specific UDFs by providing engine-specific implementations of the generic objects.
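The pattern can be sketched as follows. Note that the interface and class names here are hypothetical simplifications invented for this example, not the actual `StdType`/`StdData` APIs from the transport library: the enforcement logic is written once against a generic row abstraction, and each engine supplies its own binding (the plain-Java binding below is analogous in spirit to `dataguard-transport-java`).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the pattern only. The names GenericRow/JavaRow
// are hypothetical; the real abstraction is StdType/StdData in transport.
class EnforcementSketch {

    // Engine-agnostic view of a row: enforcement logic depends only on
    // this interface, never on a concrete engine's row format.
    interface GenericRow {
        Object getField(String name);
        void setField(String name, Object value);
    }

    // Engine-agnostic enforcement: null out a field when policy denies it.
    // The same logic can back a Spark, Trino, or plain-Java UDF.
    static void redactField(GenericRow row, String field, boolean policyDenies) {
        if (policyDenies) {
            row.setField(field, null);
        }
    }

    // A plain-Java binding, useful for testing the logic without any engine.
    static class JavaRow implements GenericRow {
        private final List<String> names = new ArrayList<>();
        private final List<Object> values = new ArrayList<>();
        JavaRow put(String name, Object value) {
            names.add(name);
            values.add(value);
            return this;
        }
        public Object getField(String name) { return values.get(names.indexOf(name)); }
        public void setField(String name, Object value) { values.set(names.indexOf(name), value); }
    }

    public static void main(String[] args) {
        JavaRow row = new JavaRow().put("name", "alice").put("ssn", "123-45-6789");
        redactField(row, "ssn", true);
        System.out.println(row.getField("name")); // alice
        System.out.println(row.getField("ssn"));  // null (redacted)
    }
}
```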
However, the implementation in the transport library has some gaps; for example, the objects are non-nullable. It also lacks APIs relevant for enforcement (e.g. `FormatSpecificTypeDataProvider`) and restricts the ability to perform optimizations. We therefore extend the implementations in modules under the `transport/` directory to adapt the transport implementations and add APIs as needed.
- `dataguard-transport-common`: defines common APIs used in enforcement
- `dataguard-transport-java`: a Java implementation of the transport API, useful for testing enforcement code in a type-agnostic way
- `dataguard-transport-spark`: an extension of the Spark type-system implementation from the transport library
- `dataguard-transport-trino`: an extension of the Trino type-system implementation from the transport library
The `extensions/` directory contains engine-specific implementations of the enforcement code.
The Spark extension contains the following UDFs to perform enforcement:
As described in the [Core Modules](#core-modules) section, LinkedIn's metadata ecosystem contains field paths in two languages, one of them legacy and less expressive. The two UDFs implement redaction logic for field paths expressed in the respective languages. They could, however, be combined into a single UDF, as described in the accompanying VLDB submission.
Note that the UDFs are implemented using Spark's Expression API. This lets us define generic enforcement UDFs whose input and output column types do not need to be pre-defined, as opposed to the Scalar UDF API. The other approach to defining such generic UDFs is the Hive GenericUDF API, but we go with the Spark-native implementation to avoid the overhead of data conversion between Hive and Spark during UDF execution.
For the Trino UDF implementation, the plugin SPI is used; the implementation can be found in `RedactFieldIf.java`. Note that we implement the UDF for only one of the two languages, since all Trino use cases for policy enforcement so far have been limited to datasets with metadata defined in only one of those languages.
The `dataguard-enforcement-udfs-microbenchmark` module contains micro-benchmarking code and scenarios implemented using the JMH framework. The microbenchmarks target the Spark UDFs.
```
./gradlew :benchmark:dataguard-enforcement-udfs-microbenchmark:jmhExec
```
This also generates a CPU profile by default under the `build/` directory.
The `dataguard-enforcement-udfs-benchmark-impl-cluster` module contains code to benchmark queries for various scenarios on a Spark cluster.

To run the cluster benchmark locally:
- [one-time] Download and extract Apache Spark, and update the `PATH` environment variable:

  ```
  export SPARK_HOME=/Users/padesai/Downloads/spark-3.1.1-bin-hadoop2.7
  export PATH=$SPARK_HOME/bin:$PATH
  ```
- Build the project with the `shadowJar` configuration:

  ```
  ./gradlew :benchmark:dataguard-enforcement-udfs-benchmark-impl-cluster:shadowJar
  ```
- Execute the `spark-submit` command to launch the `DataGenerator` or `QueryRunner` class:

  ```
  spark-submit --class com.linkedin.dataguard.benchmark.enforcement.DataGenerator --master local[*] benchmark/dataguard-enforcement-udfs-benchmark-impl-cluster/build/libs/dataguard-enforcement-udfs-benchmark-impl-cluster-1.0-all.jar
  ```
To run this on your YARN cluster, use `--master yarn` instead of `local[*]`, and refer to the instructions here.