This repository contains the implementation of enforcement UDFs.
- `core`: modules for field path representation and the engine-agnostic enforcement implementation
- `extensions`: engine-specific UDF implementations
- `transport`: extensions to the generic type system implementations in the transport library
- `benchmark`: code for benchmarking UDF performance (both microbenchmarks and cluster benchmarks)
```
./gradlew build
```
## Core Modules

The `dataguard-fieldpaths` module in the `core` directory contains reusable logic for redacting fields, implemented in an engine-agnostic way.
The grammar for representing field paths is described in `VirtualFieldPath.g4` and processed using the ANTLR library. The code for parsing, semantic analysis, and enforcement corresponding to this grammar lives in the `com.linkedin.dataguard.runtime.fieldpaths.virtual` package.
A non-trivial number of field paths in LinkedIn's data catalog are expressed in a slightly different, less expressive legacy language. This module also contains code for processing those field paths, whose grammar is defined in `TMSPath.g4`; the corresponding code lives in the `com.linkedin.dataguard.runtime.fieldpaths.tms` package.
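To give a flavor of what field path resolution involves, the sketch below walks a simplified dotted path (e.g. `address.city`) down a nested, map-based record. This is an illustration only: the dotted syntax and the `resolve` helper are assumptions made for this example, and the actual `VirtualFieldPath.g4` grammar is considerably more expressive.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: a simplified dotted field path resolved against
// nested map-based data. The real ANTLR-based grammar is much richer.
class FieldPathDemo {

    // Walk the path segments down the nested structure; return null
    // if any segment is missing or the current value is not a record.
    static Object resolve(Map<String, Object> record, String path) {
        Object current = record;
        for (String segment : path.split("\\.")) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<?, ?>) current).get(segment);
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, Object> address = new HashMap<>();
        address.put("city", "Sunnyvale");
        Map<String, Object> member = new HashMap<>();
        member.put("name", "alice");
        member.put("address", address);

        System.out.println(resolve(member, "address.city")); // Sunnyvale
        System.out.println(resolve(member, "address.zip"));  // null (missing field)
    }
}
```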
The enforcement code leverages the generic type system (`StdType` and `StdData` objects) defined in the transport library in order to implement engine-agnostic enforcement logic. This common code is then reused in engine-specific UDFs by providing engine-specific implementations of the generic objects.
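The pattern can be sketched as follows. Note that the interface and class names here are hypothetical simplifications invented for this example, not the actual `StdType`/`StdData` APIs from the transport library: the enforcement logic is written once against a generic row abstraction, and each engine supplies its own binding (the plain-Java binding below is analogous in spirit to `dataguard-transport-java`).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the pattern only. The names GenericRow/JavaRow
// are hypothetical; the real abstraction is StdType/StdData in transport.
class EnforcementSketch {

    // Engine-agnostic view of a row: enforcement logic depends only on
    // this interface, never on a concrete engine's row format.
    interface GenericRow {
        Object getField(String name);
        void setField(String name, Object value);
    }

    // Engine-agnostic enforcement: null out a field when policy denies it.
    // The same logic can back a Spark, Trino, or plain-Java UDF.
    static void redactField(GenericRow row, String field, boolean policyDenies) {
        if (policyDenies) {
            row.setField(field, null);
        }
    }

    // A plain-Java binding, useful for testing the logic without any engine.
    static class JavaRow implements GenericRow {
        private final List<String> names = new ArrayList<>();
        private final List<Object> values = new ArrayList<>();
        JavaRow put(String name, Object value) {
            names.add(name);
            values.add(value);
            return this;
        }
        public Object getField(String name) { return values.get(names.indexOf(name)); }
        public void setField(String name, Object value) { values.set(names.indexOf(name), value); }
    }

    public static void main(String[] args) {
        JavaRow row = new JavaRow().put("name", "alice").put("ssn", "123-45-6789");
        redactField(row, "ssn", true);
        System.out.println(row.getField("name")); // alice
        System.out.println(row.getField("ssn"));  // null (redacted)
    }
}
```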
However, the implementation in the transport library has some gaps; for example, the objects are non-nullable. It also lacks APIs relevant for enforcement (e.g. `FormatSpecificTypeDataProvider`) and restricts the ability to perform optimizations. We therefore extend the implementations in modules under the `transport/` directory to adapt the transport implementations and add APIs as needed.
- `dataguard-transport-common`: defines common APIs used in enforcement
- `dataguard-transport-java`: a Java implementation of the transport API, useful for testing enforcement code in a type-agnostic way
- `dataguard-transport-spark`: an extension of the Spark type-system implementation from the transport library
- `dataguard-transport-trino`: an extension of the Trino type-system implementation from the transport library
The `extensions/` directory contains engine-specific implementations of the enforcement code.
The Spark extension contains the following UDFs to perform enforcement:
As described in the [Core Modules](#core-modules) section, LinkedIn's metadata ecosystem contains field paths in two languages, one of them legacy and less expressive. The two UDFs implement redaction logic for field paths expressed in the respective languages. They could, however, be combined into a single UDF, as described in the accompanying VLDB submission.
Note that the UDFs are implemented using Spark's Expression API. This lets us define generic enforcement UDFs whose input and output column types do not need to be pre-defined, as opposed to the Scalar UDF API. The other approach to defining such generic UDFs is the Hive GenericUDF API, but we go with the Spark-native implementation to avoid the overhead of data conversion between Hive and Spark during UDF execution.
For the Trino UDF implementation, the plugin SPI is used; the implementation can be found in `RedactFieldIf.java`. Note that we implement the UDF for only one of the two languages, since all Trino use cases for policy enforcement so far have been limited to datasets with metadata defined in only one of those languages.
The `dataguard-enforcement-udfs-microbenchmark` module contains micro-benchmarking code and scenarios implemented using the JMH framework. The microbenchmarks target the Spark UDFs.
```
./gradlew :benchmark:dataguard-enforcement-udfs-microbenchmark:jmhExec
```
This also generates a CPU profile by default under the `build/` directory.
The `dataguard-enforcement-udfs-benchmark-impl-cluster` module contains code to benchmark queries for various scenarios on a Spark cluster.

To run the cluster benchmark locally:
- [one-time] Download and extract Apache Spark, and update the `PATH` environment variable:

  ```
  export SPARK_HOME=/Users/padesai/Downloads/spark-3.1.1-bin-hadoop2.7
  export PATH=$SPARK_HOME/bin:$PATH
  ```
- Build the project with the `shadowJar` configuration:

  ```
  ./gradlew :benchmark:dataguard-enforcement-udfs-benchmark-impl-cluster:shadowJar
  ```
- Execute the `spark-submit` command to launch the `DataGenerator` or `QueryRunner` class:

  ```
  spark-submit --class com.linkedin.dataguard.benchmark.enforcement.DataGenerator --master local[*] benchmark/dataguard-enforcement-udfs-benchmark-impl-cluster/build/libs/dataguard-enforcement-udfs-benchmark-impl-cluster-1.0-all.jar
  ```
To run this on your YARN cluster, use `--master yarn` instead of `local[*]`, and refer to the instructions here.