[196] Add support for MSSQL (datacontract#204)
* [196] Add support for MSSQL

* [196] Correct soda config serializer

* [196] Move requirements for MSSql into dev reqs

* [196] Correct usage of regex in datacontract to be pattern instead

* [196] Correct pyproject toml

* [196] Formalize support for SQLServer
- Add required packages
- Add required connection details
- Semi complete tests

* feat(export/jsonschema): supports array type (datacontract#200)

* Support logical Types in Avro Export (datacontract#199)

* Support logical Types in Avro Export

- Map Data Contract date-type fields to Avro logical types (see the sketch below)
- date: `int/date`
- timestamp, timestamp_tz: `long/timestamp-millis`
- timestamp_ntz: `long/local-timestamp-millis`
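
For illustration, a minimal sketch of what an exported Avro schema could look like once these mappings are applied — the record and field names here are made up; only the type/logicalType pairs come from the mapping above:

```json
{
  "type": "record",
  "name": "orders",
  "fields": [
    {"name": "order_date", "type": {"type": "int", "logicalType": "date"}},
    {"name": "processed_at", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "processed_at_local", "type": {"type": "long", "logicalType": "local-timestamp-millis"}}
  ]
}
```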

* Update CHANGELOG

* Add ability to export to go types (datacontract#195)

* Add ability to export to go types

* add test

* rename to types

* updated naming

* update docs

* Update boto3 requirement from <1.34.99,>=1.34.41 to >=1.34.41,<1.34.104 (datacontract#189)

Updates the requirements on [boto3](https://github.com/boto/boto3) to permit the latest version.
- [Release notes](https://github.com/boto/boto3/releases)
- [Changelog](https://github.com/boto/boto3/blob/develop/CHANGELOG.rst)
- [Commits](boto/boto3@1.34.41...1.34.103)

---
updated-dependencies:
- dependency-name: boto3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Update botocore requirement (datacontract#190)

Updates the requirements on [botocore](https://github.com/boto/botocore) to permit the latest version.
- [Changelog](https://github.com/boto/botocore/blob/develop/CHANGELOG.rst)
- [Commits](boto/botocore@1.34.41...1.34.103)

---
updated-dependencies:
- dependency-name: botocore
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: jochenchrist <[email protected]>

* Update snowflake-connector-python[pandas] requirement (datacontract#172)

Updates the requirements on [snowflake-connector-python[pandas]](https://github.com/snowflakedb/snowflake-connector-python) to permit the latest version.
- [Release notes](https://github.com/snowflakedb/snowflake-connector-python/releases)
- [Commits](snowflakedb/snowflake-connector-python@v3.6.0...v3.10.0)

---
updated-dependencies:
- dependency-name: snowflake-connector-python[pandas]
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* 91 JSON Schema (datacontract#201)

* add import for jsonschemas and extend export

* fix tests

* remove unused import

* Update changelog

* First support for server-specific types in a config map.

Resolves datacontract#150

* Issues/193 Fix all TODOs in HTML export (datacontract#203)

* Make Examples in Fields work

- we have to declare them in the model to make them show up at all :)

* Add Definitions in HTML Export

- add examples to the model, so we can render them
- create new tables akin to what we do for the models

* Add examples

* Handle nested fields in HTML Export

We only go one level deep, but add an additional set of rows for fields contained in a model's field.

* Update CHANGELOG

* Update Tests for breaking and changelog

Now that we include the `example` property in the field, more things are being pointed out, so adjust the tests accordingly

* Handle Model Fields and their nesting through partials

- added jinja partials as dependency
- extracted the model and the nesting handling out to its own partial

* Update definitions

- move them to their own partial
- move enum into the content column
- try to highlight the different optional aspects a tad

* Move some more blocks into partials

* Add partials to manifest

* Remove the nested headline

---------

Co-authored-by: jochen <[email protected]>

[196] Formalize support for SQLServer
- Add required packages
- Add required connection details
- Semi complete tests

[196] Add SQLServer type serializer

* [196] Add msodbcsql18 to docker file

* [196] Apply ruff formatting

* Apply @simonharrer's PR suggestion to make naming more consistent

* [196] Add changes to changelog

* [196] Update readme with new SQLServer information

* [196] Add CI/CD step to install msodbcsql driver

* [196] Skip test if outside of CI/CD environment

* [196] Add mssql package back

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Robert DeRienzo <[email protected]>
Co-authored-by: jochen <[email protected]>
Co-authored-by: JAEJIN LEE <[email protected]>
Co-authored-by: Joachim Praetorius <[email protected]>
Co-authored-by: Mark Olliver <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: jochenchrist <[email protected]>
Co-authored-by: Mark Olliver <[email protected]>
Co-authored-by: Simon Harrer <[email protected]>
10 people authored May 24, 2024
1 parent 479c72e commit 702ba50
Showing 17 changed files with 322 additions and 58 deletions.
4 changes: 4 additions & 0 deletions .github/workflows/ci.yaml
@@ -28,6 +28,10 @@ jobs:
        uses: actions/setup-python@v5
        with:
          python-version: ${{matrix.python-version}}
      - name: Install msodbcsql18
        run: |
          sudo apt-get update
          sudo apt-get install -y msodbcsql18
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
11 changes: 6 additions & 5 deletions CHANGELOG.md
@@ -8,6 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [Unreleased]

### Added
- Added support for `sqlserver` (#196)

- `datacontract export --format dbml`: Export to [Database Markup Language (DBML)](https://dbml.dbdiagram.io/home/) (#135)

@@ -43,7 +44,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Added support for `delta` tables on S3 (#24)
- Added new command `datacontract catalog` that generates a data contract catalog with an `index.html` file.
- Added field format information to HTML export

### Fixed
- RDF Export: Fix error if owner is not a URI/URN

@@ -70,13 +71,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

- Added export format **html** (#15)
- Added descriptions as comments to `datacontract export --format sql` for Databricks dialects
- Added import of arrays in Avro import

## [0.9.8] - 2024-04-01

### Added
- Added export format **great-expectations**: `datacontract export --format great-expectations`
- Added gRPC support to OpenTelemetry integration for publishing test results
- Added AVRO import support for namespace (#121)
- Added handling for optional fields in avro import (#112)
@@ -158,7 +159,7 @@ We start with JSON messages and avro, and Protobuf will follow.
## [0.9.0] - 2024-01-26 - BREAKING

This is a breaking change (we are still on a 0.x.x version).
The project migrated from Golang to Python.
The Golang version can be found at [cli-go](https://github.com/datacontract/cli-go)

### Added
2 changes: 1 addition & 1 deletion Dockerfile
@@ -22,7 +22,7 @@ RUN python -c "import duckdb; duckdb.connect().sql(\"INSTALL httpfs\");"

FROM ubuntu:22.04 AS runner-image

-RUN apt-get update && apt-get install --no-install-recommends -y python3.11 python3.11-venv && \
+RUN apt-get update && apt-get install --no-install-recommends -y python3.11 python3.11-venv msodbcsql18 && \
apt-get clean && rm -rf /var/lib/apt/lists/*

COPY --from=builder-image /opt/venv /opt/venv
118 changes: 78 additions & 40 deletions README.md
@@ -19,7 +19,7 @@ It uses data contract YAML files to lint the data contract, connect to data sour

## Getting started

Let's look at this data contract:
[https://datacontract.com/examples/orders-latest/datacontract.yaml](https://datacontract.com/examples/orders-latest/datacontract.yaml)

We have a _servers_ section with endpoint details for the S3 bucket, _models_ for the structure of the data, and _servicelevels_ and _quality_ attributes that describe the expected freshness and number of rows.
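
For orientation, here is a minimal sketch of how those sections fit together in a datacontract.yaml — abridged, with illustrative values rather than a verbatim copy of the linked example:

```yaml
dataContractSpecification: 0.9.3
id: urn:datacontract:checkout:orders-latest
info:
  title: Orders Latest
  version: 1.0.0
servers:
  production:
    type: s3
    location: s3://datacontract-example-orders-latest/data/{model}/*.json
    format: json
models:
  orders:
    type: table
    fields:
      order_id:
        type: varchar
        required: true
      order_timestamp:
        type: timestamp
servicelevels:
  freshness:
    threshold: 25h
    timestampField: orders.order_timestamp
quality:
  type: SodaCL
  specification:
    checks for orders:
      - row_count >= 5
```
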
@@ -191,11 +191,11 @@ Commands

### init

```
Usage: datacontract init [OPTIONS] [LOCATION]
Download a datacontract.yaml template and write it to file.
╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────╮
│ location [LOCATION] The location (url or path) of the data contract yaml to create. │
│ [default: datacontract.yaml] │
@@ -213,10 +213,10 @@ Commands
### lint

```
Usage: datacontract lint [OPTIONS] [LOCATION]
Validate that the datacontract.yaml is correctly formatted.
╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ location [LOCATION] The location (url or path) of the data contract yaml. [default: datacontract.yaml] │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
@@ -230,10 +230,10 @@ Commands
### test

```
Usage: datacontract test [OPTIONS] [LOCATION]
Run schema and quality tests on configured servers.
╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ location [LOCATION] The location (url or path) of the data contract yaml. [default: datacontract.yaml] │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
@@ -266,11 +266,11 @@ Data Contract CLI connects to a data source and runs schema and quality tests to
$ datacontract test --server production datacontract.yaml
```

To connect to the databases, the `server` block in the datacontract.yaml is used to set up the connection.
In addition, credentials, such as usernames and passwords, may be defined with environment variables.

The application uses different engines, based on the server `type`.
Internally, it connects with DuckDB, Spark, or a native connection and executes most tests with _soda-core_ and _fastjsonschema_.

Credentials are provided with environment variables.

@@ -456,7 +456,7 @@ dbutils.library.restartPython()
from datacontract.data_contract import DataContract
data_contract = DataContract(
    data_contract_file="/Volumes/acme_catalog_prod/orders_latest/datacontract/datacontract.yaml",
    spark=spark)
run = data_contract.test()
run.result
@@ -481,7 +481,7 @@ servers:
models:
  my_table_1: # corresponds to a table
    type: table
    fields:
      my_column_1: # corresponds to a column
        type: varchar
```
@@ -539,7 +539,7 @@ servers:
models:
  my_table_1: # corresponds to a table
    type: table
    fields:
      my_column_1: # corresponds to a column
        type: varchar
```
@@ -553,9 +553,47 @@ models:




### SQL Server

Data Contract CLI can test data in Microsoft SQL Server databases.

#### Example

datacontract.yaml
```yaml
servers:
  production:
    type: sqlserver
    host: localhost
    port: 1433
    database: tempdb
    schema: dbo
    driver: ODBC Driver 18 for SQL Server
models:
  my_table_1: # corresponds to a table
    type: table
    fields:
      my_column_1: # corresponds to a column
        type: varchar
```

#### Environment Variables

| Environment Variable | Example | Description |
|----------------------------------|--------------------|-------------|
| `DATACONTRACT_SQLSERVER_USERNAME` | `root` | Username |
| `DATACONTRACT_SQLSERVER_PASSWORD` | `toor` | Password |
| `DATACONTRACT_SQLSERVER_TRUSTED_CONNECTION` | `True` | Use Windows authentication instead of username/password login |
| `DATACONTRACT_SQLSERVER_TRUST_SERVER_CERTIFICATE` | `True` | Trust self-signed certificate |
| `DATACONTRACT_SQLSERVER_ENCRYPTED_CONNECTION` | `True` | Use SSL |
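
A minimal usage sketch (credential values are placeholders): set the variables from the table above in the shell, then run the tests against the `sqlserver` server defined in the datacontract.yaml above.

```bash
# Credentials for the SQL Server connection (placeholder values)
export DATACONTRACT_SQLSERVER_USERNAME=root
export DATACONTRACT_SQLSERVER_PASSWORD=toor
# Only needed when the server uses a self-signed certificate
export DATACONTRACT_SQLSERVER_TRUST_SERVER_CERTIFICATE=True

# Run schema and quality tests against the server named "production" above
datacontract test --server production datacontract.yaml
```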



### export

```

Usage: datacontract export [OPTIONS] [LOCATION]

Convert data contract to a specific format. Prints to stdout or to the specified output file.
@@ -599,9 +637,9 @@ Available export options:

| Type | Description | Status |
|----------------------|---------------------------------------------------------|--------|
| `html`                | Export to HTML                                            | ✅ |
| `jsonschema`          | Export to JSON Schema                                     | ✅ |
| `odcs`                | Export to Open Data Contract Standard (ODCS)              | ✅ |
| `sodacl`              | Export to SodaCL quality checks in YAML format            | ✅ |
| `dbt`                 | Export to dbt models in YAML format                       | ✅ |
| `dbt-sources`         | Export to dbt sources in YAML format                      | ✅ |
@@ -621,11 +659,11 @@ Available export options:

#### Great Expectations

The export function transforms a specified data contract into a comprehensive Great Expectations JSON suite.
If the contract includes multiple models, you need to specify the name of the model you wish to export.

```shell
datacontract export datacontract.yaml --format great-expectations --model orders
```

The export creates a list of expectations by utilizing:
@@ -635,7 +673,7 @@ The export creates a list of expectations by utilizing:

#### RDF

The export function converts a given data contract into an RDF representation. You have the option to
add a `base_url` which will be used as the default prefix to resolve relative IRIs inside the document.

```shell
@@ -688,7 +726,7 @@ In this case there's no need to specify `source` but instead `bt-project-id`, `b

For providing authentication to the Client, please see [the google documentation](https://cloud.google.com/docs/authentication/provide-credentials-adc#how-to) or the one [about authorizing client libraries](https://cloud.google.com/bigquery/docs/authentication#client-libs).

Example:
```bash
# Example import from SQL DDL
datacontract import --format sql --source my_ddl.sql
@@ -722,10 +760,10 @@ Available import options:
### breaking

```
Usage: datacontract breaking [OPTIONS] LOCATION_OLD LOCATION_NEW
Identifies breaking changes between data contracts. Prints to stdout.
╭─ Arguments ───────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * location_old TEXT The location (url or path) of the old data contract yaml. [default: None] [required] │
│ * location_new TEXT The location (url or path) of the new data contract yaml. [default: None] [required] │
@@ -738,10 +776,10 @@ Available import options:
### changelog

```
Usage: datacontract changelog [OPTIONS] LOCATION_OLD LOCATION_NEW
Generate a changelog between data contracts. Prints to stdout.
╭─ Arguments ───────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * location_old TEXT The location (url or path) of the old data contract yaml. [default: None] [required] │
│ * location_new TEXT The location (url or path) of the new data contract yaml. [default: None] [required] │
@@ -754,10 +792,10 @@ Available import options:
### diff

```
Usage: datacontract diff [OPTIONS] LOCATION_OLD LOCATION_NEW
PLACEHOLDER. Currently works as 'changelog' does.
╭─ Arguments ───────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * location_old TEXT The location (url or path) of the old data contract yaml. [default: None] [required] │
│ * location_new TEXT The location (url or path) of the new data contract yaml. [default: None] [required] │
@@ -889,14 +927,14 @@ Create a data contract based on the requirements from use cases.
```bash
$ datacontract init
```

2. Add examples to the `datacontract.yaml`. Do not start with the data model, although you are probably tempted to do that. Examples are the fastest way to get feedback from everybody and not lose someone in the discussion.

3. Create the model based on the examples. Test the model against the examples to double-check whether the model matches the examples.
```bash
$ datacontract test --examples
```

4. Add quality checks and additional type constraints one by one to the contract and make sure the examples and the actual data still adhere to the contract. Check against examples for a very fast feedback loop.
```bash
$ datacontract test --examples
4 changes: 2 additions & 2 deletions datacontract/data_contract.py
@@ -391,7 +391,7 @@ def _get_examples_server(self, data_contract, run, tmp_dir):
        )
        run.log_info(f"Using {server} for testing the examples")
        return server

    def _check_models_for_export(self, data_contract: DataContractSpecification, model: str, export_format: str) -> typing.Tuple[str, str]:
        if data_contract.models is None:
            raise RuntimeError(f"Export to {export_format} requires models in the data contract.")
@@ -412,7 +412,7 @@ def _check_models_for_export(self, data_contract: DataContractSpecification, mod
                raise RuntimeError(
                    f"Model {model_name} not found in the data contract. Available models: {model_names}"
                )

        return model_name, model_value

    def import_from_source(self, format: str, source: typing.Optional[str] = None, bigquery_tables: typing.Optional[typing.List[str]] = None, bigquery_project: typing.Optional[str] = None, bigquery_dataset: typing.Optional[str] = None) -> DataContractSpecification:
5 changes: 5 additions & 0 deletions datacontract/engines/soda/check_soda_execute.py
@@ -9,6 +9,7 @@
from datacontract.engines.soda.connections.kafka import create_spark_session, read_kafka_topic
from datacontract.engines.soda.connections.postgres import to_postgres_soda_configuration
from datacontract.engines.soda.connections.snowflake import to_snowflake_soda_configuration
from datacontract.engines.soda.connections.sqlserver import to_sqlserver_soda_configuration
from datacontract.export.sodacl_converter import to_sodacl_yaml
from datacontract.model.data_contract_specification import DataContractSpecification, Server
from datacontract.model.run import Run, Check, Log
@@ -69,6 +70,10 @@ def check_soda_execute(
        read_kafka_topic(spark, data_contract, server, tmp_dir)
        scan.add_spark_session(spark, data_source_name=server.type)
        scan.set_data_source_name(server.type)
    elif server.type == "sqlserver":
        soda_configuration_str = to_sqlserver_soda_configuration(server)
        scan.add_configuration_yaml_str(soda_configuration_str)
        scan.set_data_source_name(server.type)

    else:
        run.checks.append(
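
For orientation, this is the kind of Soda Core configuration YAML such a `to_sqlserver_soda_configuration(server)` serializer might emit. The keys shown are assumptions based on Soda Core's documented `sqlserver` data source options and on the server fields and environment variables above — not necessarily the exact output of this implementation:

```yaml
# Hypothetical serializer output — keys and values are illustrative
data_source production:
  type: sqlserver
  host: localhost
  port: 1433
  username: root        # resolved from DATACONTRACT_SQLSERVER_USERNAME
  password: toor        # resolved from DATACONTRACT_SQLSERVER_PASSWORD
  database: tempdb
  schema: dbo
  driver: ODBC Driver 18 for SQL Server
  trusted_connection: false
  encrypt: true
  trust_server_certificate: false
```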