PythonUDF - Trifacta User Defined Functions Using Python

2020-02-18 Vijay Balasubramaniam

Why PythonUDF?

Trifacta offers a rich library of data transformations and functions out of the box. They cover a wide variety of use cases, and can be combined through nesting and macros to perform higher-level functions.

Every once in a while, you might encounter a situation where there is no built-in function available to solve your problem. In such cases, you can leverage Trifacta's user defined function (UDF) framework to write your own function in Java. This UDF can then be used just like you would use any other function.

PythonUDF gives you the ability to write Trifacta UDFs in Python. PythonUDF leverages Jython to combine the power of Java and Python, and provides the following advantages:

- Functions are written in Python, which is generally simpler to learn and use than Java
- Python code is typically 3x to 10x more compact than the equivalent Java. Fewer lines of code translate directly into improved developer productivity and fewer bugs
- No compilation is necessary. Python is an interpreted language, so you can deploy Python source simply by copying a .py file
- Python has a 'batteries included' philosophy and comes with a rich standard library that provides many capabilities out of the box. You can use the Python standard library as well as third-party pure-Python libraries from PyPI when developing your Trifacta UDF
- You also have access to the Java ecosystem and can easily import Java libraries

Limitations

PythonUDFs inherit some limitations from Java UDFs:

- Previews are not available
- Functions are 'map only' and apply to each row of data. Aggregate functions are not supported
- You cannot retain state between function invocations

Additionally, the following limitations apply:

- PythonUDFs typically run slower than Java UDFs
- PythonUDFs are currently limited to a single string input column ¹
- No support for numpy and pandas ²
- No support for Python 3 ²
- Not tested on Spark ³
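
The single-string limitation is less restrictive than it sounds. As footnote 1 notes, the input string can carry structured data that the function parses itself. The sketch below assumes a column holding JSON objects. The function name full_name and the keys "first" and "last" are hypothetical and chosen only for illustration.

```python
import json

def full_name(s):
    """Hypothetical UDF: s is a JSON object string with 'first' and 'last' keys."""
    record = json.loads(s)  # parse the single string input into a dict
    return record["first"] + " " + record["last"]

print(full_name('{"first": "Ada", "last": "Lovelace"}'))  # Ada Lovelace
```

The same pattern should apply to CSV or XML input using the csv or xml modules from the standard library.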

Examples

To illustrate the simplicity and power of PythonUDF, here are a few examples. By leveraging the Python standard library, many complex operations can be accomplished with 1-2 lines of code. The following code snippet implements 6 functions!

PythonUDF.py

def base64encode(s):
    # Base64-encode the input string
    import base64
    return base64.b64encode(s)

def base64decode(s):
    # Decode a Base64-encoded string
    import base64
    return base64.b64decode(s)

def sha256(s):
    # Hex-encoded SHA-256 digest of the input
    import hashlib
    return hashlib.sha256(s).hexdigest()

def sha512(s):
    # Hex-encoded SHA-512 digest of the input
    import hashlib
    return hashlib.sha512(s).hexdigest()

def mask(s):
    # Replace all but the last 4 characters with "*"
    return s[-4:].rjust(len(s), "*")

def test(*args):
    # Simply converts to uppercase, but illustrates that multiple string arguments are possible
    return ' '.join(args).upper()
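
To make the behaviour of mask concrete, the sketch below copies the function and calls it directly on sample values. The sample inputs are made up. Note that an input of four or fewer characters comes back unmasked, because the slice s[-4:] already covers the whole string.

```python
def mask(s):
    # Keep the last 4 characters and pad on the left with "*" to the original length
    return s[-4:].rjust(len(s), "*")

print(mask("4111111111111111"))  # ************1111
print(mask("1234"))              # 1234 (nothing left to mask)
```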

Usage

Developing a Python UDF

The workflow for developing a Python UDF is as follows:

1. Write the Python function. It should take a string parameter and return a string
2. Test the function locally. Jython is recommended for testing where possible; otherwise CPython or Jupyter is fine
3. Add your function code to PythonUDF.py
4. Deploy the updated PythonUDF.py (including your code) to the /udfs folder on the Trifacta server
5. Restart Trifacta services: sudo service trifacta restart
6. Refresh the browser. Clear the browser cache if necessary

The above assumes that the PythonUDF jar file is already installed on the Trifacta server. If not, please see installation instructions further below.
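
The first two steps of the workflow might look like the sketch below: a string-in, string-out function plus a couple of quick assertions that can be run locally before the function is added to PythonUDF.py. The function normalize_phone is a hypothetical example and is not part of the repository.

```python
def normalize_phone(s):
    """Hypothetical UDF: keep only the digits of a phone number string."""
    return "".join(ch for ch in s if ch.isdigit())

# Quick local checks before deploying
assert normalize_phone("(415) 555-0100") == "4155550100"
assert normalize_phone("") == ""
print("local checks passed")
```

The code avoids Python 3 only syntax so that the same file should run under both CPython and Jython 2.7.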

Using a Python UDF

Using a PythonUDF is very similar to using Java UDFs:

1. In the recipe editor, click on 'New Step' and choose 'Invoke External Function'
2. From the Function drop-down list, select 'PythonUDF'
3. Choose the column to apply the function to. As of now, only a single column is supported
4. In the Arguments section, enter the function name (e.g. mask) as the first argument
5. Optionally, give the new column a name

This is illustrated in the screenshots below

Screenshots

[Screenshot: Search for the UDF transformation]

[Screenshot: PythonUDF: mask]

[Screenshot: More examples]
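
Conceptually, the first argument selects which function in PythonUDF.py is applied to each row of the chosen column. The sketch below imitates this in plain Python. The names apply_udf, FUNCTIONS and shout are illustrative assumptions, and the lookup shown here is not taken from the jar's actual implementation.

```python
def mask(s):
    return s[-4:].rjust(len(s), "*")

def shout(s):
    # Hypothetical second function, included only to show name-based selection
    return s.upper()

# Assumed registry mapping the name typed in the Arguments section to a function
FUNCTIONS = {"mask": mask, "shout": shout}

def apply_udf(function_name, column_values):
    """Apply the named function to every value independently (map only, no state)."""
    func = FUNCTIONS[function_name]
    return [func(value) for value in column_values]

print(apply_udf("mask", ["123456789", "abcdef"]))  # ['*****6789', '**cdef']
```

Each value is processed on its own, which mirrors the 'map only' and stateless limitations listed earlier.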

Installing PythonUDF jar in a Trifacta system

To install PythonUDF on a Trifacta server, follow these steps:

1. Connect to the Trifacta server using SSH
2. (Optional) Clone the GitHub repository to your home directory: git clone git@github.com:vbalasu/PythonUDF.git
3. Create a new directory to hold the UDFs: sudo mkdir /udfs
4. Copy PythonUDF.jar, PythonUDF.py and PythonUDF.sh to the /udfs directory
5. Make the trifacta user the owner of this folder and its contents: sudo chown -R trifacta:trifacta /udfs
6. Make sure you can run /udfs/PythonUDF.sh from the command line. This is a hello world script that tests the jar
7. Log into the Trifacta web application as an Administrator and go to Admin Settings
8. Modify the following admin settings:
   - udf-service.udfPackages: com.trifacta.trifactaudfs
   - udf-service.udfCommunicationTimeout: 100000 (increase timeout to 100 seconds)
   - udf-service.additionalJars: /udfs/PythonUDF.jar (this is a comma-separated list if you are adding to existing jars)
   - feature.enableUDFTransform.enabled: true
   - Leave the remaining udf-service settings at their defaults. See the Trifacta Developer Guide for more details
   - Save the admin settings. This will restart the Trifacta server
9. Remember to refresh the browser, otherwise your UDF won't appear. Clear the browser cache if required
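
If the UDF does not appear after installation, one thing worth checking is whether the three files actually reached the /udfs directory. The sketch below is an optional helper, not part of the repository. The function name missing_udf_files is hypothetical, and the script only checks that the files exist, not their ownership or contents.

```python
import os

# The three files the installation steps copy into /udfs
REQUIRED_FILES = ["PythonUDF.jar", "PythonUDF.py", "PythonUDF.sh"]

def missing_udf_files(udf_dir="/udfs"):
    """Return the names of required files that are not present in udf_dir."""
    return [name for name in REQUIRED_FILES
            if not os.path.isfile(os.path.join(udf_dir, name))]

missing = missing_udf_files()
if missing:
    print("Missing from /udfs: " + ", ".join(missing))
else:
    print("All PythonUDF files are in place")
```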

Resources

All the jar files, Python files and source code referenced on this page are available on GitHub in the vbalasu/PythonUDF repository

¹ Although you are limited to a single string input, the input can be a complex string made up of multiple objects and arrays (e.g. CSV, JSON, XML, YAML, etc.), which you can parse using Python. Support for multiple inputs could be added in the future.
² Support for numpy, pandas and Python 3 is lacking at the moment because Jython does not support them. However, the Jython community is actively working on adding this support. See here and here
³ All testing was performed on the Photon (Trifacta Server) runtime. Spark should also work in theory, but this was not tested
