This document walks through how the profiler profiles the data. The end user is expected to pass in a configuration that drives the utility. The configuration is provided as JSON when the program runs, and this JSON is then translated into language-specific variables. An example of the JSON configuration passed when running this program is shown below.
"dataFormat":"Whats the format of the data ? Currently supported formats include JSON, CSV and PARQUET"
"inputDataLocation":"s3 or hdfs location of data"
"appName":"Meaningful name to this application"
"schemaRepoUrl":"host name of schema repository"
"scheRepoSubjectName":"name of the subject for which data is being validated"
"schemaVersionId":"numerical version of the schema"
"customQ1":"custom sql, make sure that this returns Long value"
"customQ1ResultThreshold": 0
"customQ1Operator":" = | > | < | <= | >= ",
"customQ2":"custom sql, make sure that this returns Long value",
"customQ2ResultThreshold": 0,
"customQ2Operator":" = | > | < | <= | >= ",
"customQ3":"custom sql, make sure that this returns Long value",
"customQ3ResultThreshold": 0,
"customQ3Operator":" = | > | < | <= | >= "
- First we launch the Jupyter notebook, which is Scala based.
- We configure the notebook to submit a Spark job to an external Spark cluster. This is where we set details like the number of cores required, the memory, and dependencies such as `spark-avro` and the `datadog-statsd` client. You should be able to use any `spark-submit` configuration here (a configuration sketch follows this list).
- Next we define default values for the changing parameters. These values will be updated by `papermill` every time the notebook is run, based on the parameters passed in through the `papermill` script (the parameters sketch above shows how these defaults might look).
- Now we will initialize the `datadog` statsd client and forward metrics to the local statsd port.
- We will now read the data that needs to be profiled using Apache Spark, selecting reader options based on the data format (see the read-and-report sketch after this list).
- Once we have read the data in as a `dataframe`, we will report some basic stats about the dataframe to `datadog`.
- Now we will query the `schema repository` and fetch the registered schema. The schema is defined in the `Avro` format.
- This `Avro` schema will be converted to a `spark sql` schema.
- We will infer the `spark sql` schema from the incoming data and then compare the registered schema with the schema inferred from the data (see the schema-comparison sketch after this list).
- We will publish the number of matches and mismatches to datadog.
- Next, we are going to perform `custom data quality` checks based on `sql` statements fired against the dataset.
- We will assert that the results of the `sql` statements meet the thresholds set (see the custom-check sketch after this list).
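A minimal configuration sketch of the kind of `spark-submit`-style settings described above, expressed through the `SparkSession` builder. The core and memory values and the package coordinates are placeholders; in practice these settings are usually supplied through the notebook's Spark connection rather than hard-coded.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative spark-submit style settings; cores, memory and package
// versions below are placeholders, not recommendations.
val spark = SparkSession.builder()
  .appName("data-profiler")
  .config("spark.executor.cores", "4")
  .config("spark.executor.memory", "8g")
  .config("spark.jars.packages",
    "org.apache.spark:spark-avro_2.12:3.3.0," +   // spark-avro dependency
    "com.datadoghq:java-dogstatsd-client:4.2.1")  // datadog statsd client
  .getOrCreate()
```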
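The read-and-report sketch below assumes the `spark` session from the previous sketch, the `dataFormat` and `inputDataLocation` values from the parameters cell, and DataDog's `java-dogstatsd-client`. The CSV header option, the `profiler` metric prefix and the metric names are illustrative choices, not part of the utility.

```scala
import com.timgroup.statsd.NonBlockingStatsDClient
import org.apache.spark.sql.{DataFrame, SparkSession}

// Pick reader options based on the configured data format.
def readInput(spark: SparkSession, dataFormat: String, inputDataLocation: String): DataFrame =
  dataFormat.toUpperCase match {
    case "JSON"    => spark.read.json(inputDataLocation)
    case "CSV"     => spark.read.option("header", "true").csv(inputDataLocation)
    case "PARQUET" => spark.read.parquet(inputDataLocation)
    case other     => throw new IllegalArgumentException(s"Unsupported data format: $other")
  }

// Send a few basic stats about the dataframe to the local statsd port.
val statsd = new NonBlockingStatsDClient("profiler", "localhost", 8125)
val df     = readInput(spark, dataFormat, inputDataLocation)
statsd.gauge("rows", df.count())
statsd.gauge("columns", df.columns.length.toLong)
```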
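The schema-comparison sketch below assumes the registered Avro schema has already been fetched from the schema repository as a JSON string (`registeredAvroSchemaJson`; the HTTP call depends on the repository's API and is omitted), and reuses `df` and `statsd` from the previous sketch. `SchemaConverters` here is the converter from Spark's built-in `spark-avro` module; the field-by-field comparison and the metric names are illustrative.

```scala
import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types.StructType

// Parse the Avro schema fetched from the schema repository and convert it
// to a Spark SQL StructType.
val avroSchema       = new Schema.Parser().parse(registeredAvroSchemaJson)
val registeredSchema = SchemaConverters.toSqlType(avroSchema).dataType.asInstanceOf[StructType]

// Compare the registered schema with the schema Spark inferred from the data,
// field by field (name and type).
val inferredFields   = df.schema.fields.map(f => f.name -> f.dataType).toMap
val registeredFields = registeredSchema.fields.map(f => f.name -> f.dataType).toMap

val matches    = registeredFields.count { case (name, dt) => inferredFields.get(name).contains(dt) }
val misMatches = registeredFields.size - matches

// Publish match / mismatch counts to datadog.
statsd.gauge("schema.matches", matches.toLong)
statsd.gauge("schema.mismatches", misMatches.toLong)
```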
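Finally, the custom-check sketch below shows one way the thresholds could be asserted, reusing `spark` and `df` from the sketches above. The `profiled_data` view name is made up for illustration, and the same pattern would be repeated for `customQ2` and `customQ3`.

```scala
// Evaluate one custom data-quality result against its threshold.
def checkPasses(result: Long, operator: String, threshold: Long): Boolean =
  operator.trim match {
    case "="   => result == threshold
    case ">"   => result > threshold
    case "<"   => result < threshold
    case ">="  => result >= threshold
    case "<="  => result <= threshold
    case other => throw new IllegalArgumentException(s"Unsupported operator: $other")
  }

// Expose the dataframe to SQL, run the custom query (which must return a
// single Long), and assert that it meets the configured threshold.
df.createOrReplaceTempView("profiled_data")
val result = spark.sql(customQ1).first().getLong(0)
assert(
  checkPasses(result, customQ1Operator, customQ1ResultThreshold),
  s"Custom check failed: $customQ1 returned $result, expected $customQ1Operator $customQ1ResultThreshold"
)
```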