Commit

Add great expectation option for the export command (datacontract#92)
* add great expectations support for export command
SimonAuger authored Mar 18, 2024
1 parent 0977430 commit a9e845d
Showing 8 changed files with 580 additions and 16 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -6,6 +6,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]
- Added export format **great-expectations**: `datacontract export --format great-expectations`



44 changes: 28 additions & 16 deletions README.md
@@ -440,22 +440,34 @@ datacontract export --format dbt

Available export options:

| Type | Description | Status |
|--------------------|---------------------------------------------------------|--------|
| `jsonschema` | Export to JSON Schema | ✅ |
| `odcs` | Export to Open Data Contract Standard (ODCS) | ✅ |
| `sodacl` | Export to SodaCL quality checks in YAML format | ✅ |
| `dbt` | Export to dbt models in YAML format | ✅ |
| `dbt-sources` | Export to dbt sources in YAML format | ✅ |
| `dbt-staging-sql` | Export to dbt staging SQL models | ✅ |
| `rdf` | Export data contract to RDF representation in N3 format | ✅ |
| `avro` | Export to AVRO models | ✅ |
| `protobuf` | Export to Protobuf | ✅ |
| `terraform` | Export to terraform resources | ✅ |
| `sql` | Export to SQL DDL | ✅ |
| `sql-query` | Export to SQL Query | ✅ |
| `pydantic` | Export to pydantic models | TBD |
| Missing something? | Please create an issue on GitHub | TBD |
| Type | Description | Status |
|----------------------|---------------------------------------------------------|--------|
| `jsonschema` | Export to JSON Schema | ✅ |
| `odcs` | Export to Open Data Contract Standard (ODCS) | ✅ |
| `sodacl` | Export to SodaCL quality checks in YAML format | ✅ |
| `dbt` | Export to dbt models in YAML format | ✅ |
| `dbt-sources` | Export to dbt sources in YAML format | ✅ |
| `dbt-staging-sql` | Export to dbt staging SQL models | ✅ |
| `rdf` | Export data contract to RDF representation in N3 format | ✅ |
| `avro` | Export to AVRO models | ✅ |
| `protobuf` | Export to Protobuf | ✅ |
| `terraform` | Export to terraform resources | ✅ |
| `sql` | Export to SQL DDL | ✅ |
| `sql-query` | Export to SQL Query | ✅ |
| `great-expectations` | Export to Great Expectations Suites in JSON Format | ✅ |
| `pydantic` | Export to pydantic models | TBD |
| Missing something? | Please create an issue on GitHub | TBD |

#### Great Expectations
The export function transforms a data contract into a Great Expectations suite in JSON format.
If the contract includes multiple models, you need to specify the name of the model you wish to export.
```shell
datacontract export datacontract.yaml --format great-expectations --model orders
```
The export creates a list of expectations from two sources:

- The model definition, using a fixed mapping from field attributes (such as type, uniqueness, length, and value range) to expectations
- The expectations provided in the quality field for each model (see the expectations gallery at https://greatexpectations.io/expectations/)
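
As a rough illustration of these two sources, the sketch below builds an expectation list from a model's fields and then appends user-supplied expectations. It is a simplified stand-in, not the converter's actual code: the helper names `field_to_expectations` and `build_expectations` and the plain-dict field layout are assumptions made for this example, and only the type and uniqueness mappings are shown. The expectation types are the ones used by the converter in this commit.

```python
import json
from typing import Any, Dict, List


def field_to_expectations(name: str, field: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Map one field definition (a plain dict in this sketch) to expectations."""
    expectations = []
    if field.get("type") is not None:
        expectations.append({
            "expectation_type": "expect_column_values_to_be_of_type",
            "kwargs": {"column": name, "type_": field["type"]},
            "meta": {},
        })
    # Sketch choice: only emit the uniqueness check when `unique` is truthy.
    if field.get("unique"):
        expectations.append({
            "expectation_type": "expect_column_values_to_be_unique",
            "kwargs": {"column": name},
            "meta": {},
        })
    return expectations


def build_expectations(fields: Dict[str, Dict[str, Any]],
                       quality_json: str = "[]") -> List[Dict[str, Any]]:
    """Column-order check first, then per-field checks, then quality checks."""
    expectations = [{
        "expectation_type": "expect_table_columns_to_match_ordered_list",
        "kwargs": {"column_list": list(fields.keys())},
        "meta": {},
    }]
    for name, field in fields.items():
        expectations.extend(field_to_expectations(name, field))
    # The quality section carries expectations as a JSON list in a string.
    expectations.extend(json.loads(quality_json))
    return expectations


# Hypothetical input, loosely modelled on an `orders` model.
fields = {
    "order_id": {"type": "string", "unique": True},
    "processed_timestamp": {"type": "timestamp"},
}
quality = ('[{"expectation_type": "expect_table_row_count_to_be_between", '
           '"kwargs": {"min_value": 10}, "meta": {}}]')

expectations = build_expectations(fields, quality)
print(json.dumps(expectations, indent=2))
```

The ordering (table-level column check, then field checks, then quality checks) mirrors the order in which the converter extends its list.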

#### RDF

1 change: 1 addition & 0 deletions datacontract/cli.py
@@ -133,6 +133,7 @@ class ExportFormat(str, Enum):
    rdf = "rdf"
    avro = "avro"
    protobuf = "protobuf"
    great_expectations = "great-expectations"
    terraform = "terraform"
    avro_idl = "avro-idl"
    sql = "sql"
25 changes: 25 additions & 0 deletions datacontract/data_contract.py
@@ -18,6 +18,8 @@
from datacontract.export.jsonschema_converter import to_jsonschema, to_jsonschema_json
from datacontract.export.odcs_converter import to_odcs_yaml
from datacontract.export.protobuf_converter import to_protobuf
from datacontract.export.great_expectations_converter import to_great_expectations
from datacontract.export.rdf_converter import to_rdf, to_rdf_n3
from datacontract.export.rdf_converter import to_rdf_n3
from datacontract.export.sodacl_converter import to_sodacl_yaml
from datacontract.imports.avro_importer import import_avro
@@ -389,6 +391,29 @@ def export(self, export_format, model: str = "all", rdf_base: str = None, sql_se
                    raise RuntimeError(f"Model {model_name} not found in the data contract. Available models: {model_names}")

            return to_sql_query(data_contract, model_name, model_value, server_type)

        if export_format == "great-expectations":
            if data_contract.models is None:
                raise RuntimeError(f"Export to {export_format} requires models in the data contract.")

            model_names = list(data_contract.models.keys())

            if model == "all":
                if len(data_contract.models.items()) != 1:
                    raise RuntimeError(f"Export to {export_format} is model specific. Specify the model via --model "
                                       f"$MODEL_NAME. Available models: {model_names}")

                model_name, model_value = next(iter(data_contract.models.items()))
                return to_great_expectations(data_contract, model_name)
            else:
                model_name = model
                model_value = data_contract.models.get(model_name)
                if model_value is None:
                    raise RuntimeError(f"Model {model_name} not found in the data contract. "
                                       f"Available models: {model_names}")

                return to_great_expectations(data_contract, model_name)

        else:
            print(f"Export format {export_format} not supported.")
            return ""
151 changes: 151 additions & 0 deletions datacontract/export/great_expectations_converter.py
@@ -0,0 +1,151 @@
import json
from typing import Dict, List, Any

import yaml

from datacontract.model.data_contract_specification import \
    DataContractSpecification, Field, Quality


def to_great_expectations(data_contract_spec: DataContractSpecification, model_key: str) -> str:
    """
    Convert one model of the contract to a Great Expectations suite
    @param data_contract_spec: data contract to export to great expectations
    @param model_key: name of the model to export
    @return: the expectation suite as a JSON string
    """
    expectations = []
    model_value = data_contract_spec.models.get(model_key)
    quality_checks = get_quality_checks(data_contract_spec.quality)
    expectations.extend(model_to_expectations(model_value.fields))
    expectations.extend(checks_to_expectations(quality_checks, model_key))
    model_expectation_suite = to_suite(model_key, data_contract_spec.info.version, expectations)

    return model_expectation_suite


def to_suite(model_key: str, contract_version: str, expectations: List[Dict[str, Any]], ) -> str:
    return json.dumps({
        "data_asset_type": "null",
        "expectation_suite_name": "user-defined.{model_key}.{contract_version}"
        .format(model_key=model_key,
                contract_version=contract_version),
        "expectations": expectations,
        "meta": {
        }
    }, indent=2)


def model_to_expectations(fields: Dict[str, Field]) -> List[Dict[str, Any]]:
    """
    Convert the model information to expectations
    @param fields: fields of the model, keyed by field name
    @return: list of expectations
    """
    expectations = []
    add_column_order_exp(fields, expectations)
    for field_name, field in fields.items():
        add_field_expectations(field_name, field, expectations)
    return expectations


def add_field_expectations(field_name, field: Field, expectations: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    if field.type is not None:
        expectations.append(to_column_types_exp(field_name, field.type))
    # Only expect uniqueness when the field is explicitly marked unique
    # (an `is not None` check would also match `unique: false`)
    if field.unique:
        expectations.append(to_column_unique_exp(field_name))
    if field.maxLength is not None or field.minLength is not None:
        expectations.append(to_column_length_exp(field_name, field.minLength, field.maxLength))
    if field.minimum is not None or field.maximum is not None:
        expectations.append(to_column_min_max_exp(field_name, field.minimum, field.maximum))

    # TODO: all constraints
    return expectations


def add_column_order_exp(fields: Dict[str, Field], expectations: List[Dict[str, Any]]):
    expectations.append({"expectation_type": "expect_table_columns_to_match_ordered_list",
                         "kwargs": {
                             "column_list": list(fields.keys())
                         },
                         "meta": {}
                         })


def to_column_types_exp(field_name, field_type) -> Dict[str, Any]:
    return {
        "expectation_type": "expect_column_values_to_be_of_type",
        "kwargs": {
            "column": field_name,
            "type_": field_type
        },
        "meta": {}
    }


def to_column_unique_exp(field_name) -> Dict[str, Any]:
    return {
        "expectation_type": "expect_column_values_to_be_unique",
        "kwargs": {
            "column": field_name
        },
        "meta": {}
    }


def to_column_length_exp(field_name, min_length, max_length) -> Dict[str, Any]:
    return {
        "expectation_type": "expect_column_value_lengths_to_be_between",
        "kwargs": {
            "column": field_name,
            "min_value": min_length,
            "max_value": max_length
        },
        "meta": {}
    }


def to_column_min_max_exp(field_name, minimum, maximum) -> Dict[str, Any]:
    return {
        "expectation_type": "expect_column_values_to_be_between",
        "kwargs": {
            "column": field_name,
            "min_value": minimum,
            "max_value": maximum
        },
        "meta": {}
    }


def get_quality_checks(quality: Quality) -> Dict[str, Any]:
    if quality is None:
        return {}
    if quality.type is None:
        return {}
    if quality.type.lower() != "great-expectations":
        return {}
    if isinstance(quality.specification, str):
        quality_specification = yaml.safe_load(quality.specification)
    else:
        quality_specification = quality.specification
    return quality_specification


def checks_to_expectations(quality_checks: Dict[str, Any], model_key: str) -> List[Dict[str, Any]]:
    """
    Get the expectations defined in the quality section for the given model
    @param quality_checks: dictionary of quality checks by model
    @param model_key: id of the model
    @return: the list of expectations for that model
    """
    if quality_checks is None or model_key not in quality_checks:
        return []

    model_quality_checks = quality_checks[model_key]

    if model_quality_checks is None:
        return []

    if isinstance(model_quality_checks, str):
        # Checks given as a JSON string are parsed into a list
        return json.loads(model_quality_checks)

    # Otherwise assume the checks are already a parsed list of expectations,
    # so the caller never receives None
    return model_quality_checks
8 changes: 8 additions & 0 deletions datacontract/lint/linters/quality_schema_linter.py
@@ -30,6 +30,11 @@ def lint_montecarlo(self, check, models: dict[str, Model]) ->\
        return LinterResult().with_warning(
            "Linting montecarlo checks is not currently implemented")

    def lint_great_expectations(self, check, models: dict[str, Model]) ->\
            LinterResult:
        return LinterResult().with_warning(
            "Linting great expectations checks is not currently implemented")

    def lint_implementation(self, contract: DataContractSpecification) ->\
            LinterResult:
        result = LinterResult()
@@ -50,6 +55,9 @@ def lint_implementation(self, contract: DataContractSpecification) ->\
            case "montecarlo":
                result = result.combine(
                    self.lint_montecarlo(check_specification, models))
            case "great-expectations":
                result = result.combine(
                    self.lint_great_expectations(check_specification, models))
            case _:
                result = result.with_warning("Can't lint quality check "
                                             f"with type '{check.type}'")
34 changes: 34 additions & 0 deletions tests/examples/great-expectations/datacontract.yaml
@@ -0,0 +1,34 @@
dataContractSpecification: 0.9.1
info:
  title: Orders Unit Test
  version: 1.0.0
  owner: checkout
  description: The orders data contract
  contact:
    email: [email protected]
    url: https://wiki.example.com/teams/checkout
models:
  orders:
    description: test
    fields:
      order_id:
        type: string
        required: true
      processed_timestamp:
        type: timestamp
        required: true
quality:
  type: great-expectations
  specification:
    orders: |-
      [
          {
              "expectation_type": "expect_table_row_count_to_be_between",
              "kwargs": {
                  "min_value": 10
              },
              "meta": {
              }
          }
      ]
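
For orientation, the sketch below assembles by hand the suite that the converter in this commit appears to produce for the `orders` model of this example contract. It is based on reading the converter code, not on running the CLI, so treat it as an approximation. The variable names are invented for this illustration, and the model-derived expectations are written out literally rather than computed.

```python
import json

# The `orders` entry of quality.specification, which is a JSON list in a string.
orders_checks = """
[
  {
    "expectation_type": "expect_table_row_count_to_be_between",
    "kwargs": {"min_value": 10},
    "meta": {}
  }
]
"""

# Expectations derived from the model itself: column order, then field types.
# `required` is not mapped to an expectation by the converter in this commit.
model_expectations = [
    {"expectation_type": "expect_table_columns_to_match_ordered_list",
     "kwargs": {"column_list": ["order_id", "processed_timestamp"]},
     "meta": {}},
    {"expectation_type": "expect_column_values_to_be_of_type",
     "kwargs": {"column": "order_id", "type_": "string"},
     "meta": {}},
    {"expectation_type": "expect_column_values_to_be_of_type",
     "kwargs": {"column": "processed_timestamp", "type_": "timestamp"},
     "meta": {}},
]

# Suite envelope: the name follows the pattern user-defined.<model>.<contract version>.
suite = {
    "data_asset_type": "null",
    "expectation_suite_name": "user-defined.orders.1.0.0",
    "expectations": model_expectations + json.loads(orders_checks),
    "meta": {},
}

suite_json = json.dumps(suite, indent=2)
print(suite_json)
```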