Issues/122 support specifying tables for glue import (datacontract#230)
* Add a glue-table parameter to filter tables imported from AWS

- works the same way as it does for `bigquery`
- the user needs to supply the names, can give the parameter multiple times
- there's no validation of the names, so typos will lead to runtime errors

* Adapt tests
jpraetorius authored May 29, 2024
1 parent 0210fc9 commit e727468
Showing 7 changed files with 120 additions and 32 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -21,6 +21,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `datacontract export --format avro`: Now supports config map on field level for logicalTypes and default values [Custom Avro Properties](./README.md#custom-avro-properties)
- `datacontract import --format avro`: Now supports importing logicalType and default definition on avro files [Custom Avro Properties](./README.md#custom-avro-properties)
- Support `config.bigqueryType` for testing BigQuery types
- Added support for selecting specific tables in an AWS Glue `import` through the `glue-table` parameter (#122)

### Fixed

86 changes: 59 additions & 27 deletions README.md
@@ -745,41 +745,30 @@ models:
```
Usage: datacontract import [OPTIONS]

Create a data contract from the given source location. Prints to stdout.

╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * --format [sql|avro|glue|bigquery|jsonschema] The format of the source file. [default: None] [required]
│ --source TEXT The path to the file or Glue Database that should be imported. [default: None]
│ --bigquery-project TEXT The bigquery project id. [default: None]
│ --bigquery-dataset TEXT The bigquery dataset id. [default: None]
│ --bigquery-table TEXT List of table ids to import from the bigquery API (repeat for multiple table ids, leave empty for all │
│ tables in the dataset). │
[default: None]
│ --help Show this message and exit. │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Create a data contract from the given source location. Prints to stdout.

╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * --format [sql|avro|glue|bigquery|jsonschema] The format of the source file. [default: None] [required]
│ --source TEXT The path to the file or Glue Database that should be imported. │
[default: None]
│ --glue-table TEXT List of table ids to import from the Glue Database (repeat for │
│ multiple table ids, leave empty for all tables in the dataset). │
[default: None]
│ --bigquery-project TEXT The bigquery project id. [default: None]
│ --bigquery-dataset TEXT The bigquery dataset id. [default: None]
│ --bigquery-table TEXT List of table ids to import from the bigquery API (repeat for │
│ multiple table ids, leave empty for all tables in the dataset). │
[default: None]
│ --help Show this message and exit. │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```
As shown, some options are only relevant in certain conditions: for the `bigquery` format, data can be read directly from the BigQuery API.
In this case there is no need to specify `source`; instead, `bigquery-project` and `bigquery-dataset` must be specified (optionally with `bigquery-table` to select specific tables).
For authenticating the client, please see [the Google documentation](https://cloud.google.com/docs/authentication/provide-credentials-adc#how-to) or the one [about authorizing client libraries](https://cloud.google.com/bigquery/docs/authentication#client-libs).
Example:
```bash
# Example import from SQL DDL
datacontract import --format sql --source my_ddl.sql
```

```bash
# Example import from Bigquery JSON
datacontract import --format bigquery --source my_bigquery_table.json
```

```bash
# Example import from Bigquery API
datacontract import --format bigquery --bigquery-project <project_id> --bigquery-dataset <dataset_id> --bigquery-table <tableid_1> --bigquery-table <tableid_2> --bigquery-table <tableid_3>
```

Available import options:

| Type | Description | Status |
@@ -795,6 +784,49 @@ Available import options:
| Missing something? | Please create an issue on GitHub | TBD |


#### BigQuery

BigQuery data can be imported either from JSON files generated from the table descriptions or directly from the BigQuery API. If you want to use JSON files, set the `source` parameter to the path of the JSON file.

To import from the BigQuery API, you have to _omit_ `source` and instead provide `bigquery-project` and `bigquery-dataset`. Additionally, you may specify `bigquery-table` to enumerate the tables that should be imported. If no tables are given, _all_ available tables of the dataset will be imported.

For authenticating the client, please see [the Google documentation](https://cloud.google.com/docs/authentication/provide-credentials-adc#how-to) or the one [about authorizing client libraries](https://cloud.google.com/bigquery/docs/authentication#client-libs).

Examples:

```bash
# Example import from Bigquery JSON
datacontract import --format bigquery --source my_bigquery_table.json
```

```bash
# Example import from Bigquery API with specifying the tables to import
datacontract import --format bigquery --bigquery-project <project_id> --bigquery-dataset <dataset_id> --bigquery-table <tableid_1> --bigquery-table <tableid_2> --bigquery-table <tableid_3>
```

```bash
# Example import from Bigquery API importing all tables in the dataset
datacontract import --format bigquery --bigquery-project <project_id> --bigquery-dataset <dataset_id>
```

#### Glue

Importing from Glue reads the necessary data directly from the AWS API.
You may pass the `glue-table` parameter to enumerate the tables that should be imported. If no tables are given, _all_ available tables of the database will be imported.

Examples:

```bash
# Example import from AWS Glue with specifying the tables to import
datacontract import --format glue --source <database_name> --glue-table <table_name_1> --glue-table <table_name_2> --glue-table <table_name_3>
```

```bash
# Example import from AWS Glue importing all tables in the database
datacontract import --format glue --source <database_name>
```


### breaking

```
8 changes: 7 additions & 1 deletion datacontract/cli.py
@@ -231,6 +231,12 @@ def import_(
source: Annotated[
Optional[str], typer.Option(help="The path to the file or Glue Database that should be imported.")
] = None,
glue_table: Annotated[
Optional[List[str]],
typer.Option(
help="List of table ids to import from the Glue Database (repeat for multiple table ids, leave empty for all tables in the dataset)."
),
] = None,
bigquery_project: Annotated[Optional[str], typer.Option(help="The bigquery project id.")] = None,
bigquery_dataset: Annotated[Optional[str], typer.Option(help="The bigquery dataset id.")] = None,
bigquery_table: Annotated[
@@ -243,7 +249,7 @@ def import_(
"""
Create a data contract from the given source location. Prints to stdout.
"""
result = DataContract().import_from_source(format, source, bigquery_table, bigquery_project, bigquery_dataset)
result = DataContract().import_from_source(format, source, glue_table, bigquery_table, bigquery_project, bigquery_dataset)
console.print(result.to_yaml())


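For context on the option definition above: a Typer option typed as `Optional[List[str]]` can be repeated on the command line and is collected into a list, or stays `None` when omitted — which is the signal the importer uses to fall back to all tables. A minimal, hypothetical sketch (a standalone script, not part of this repository) illustrating the behaviour of a flag like `--glue-table`:

```python
# sketch.py - hypothetical standalone example of a repeatable Typer option
from typing import Annotated, List, Optional

import typer

app = typer.Typer()


@app.command()
def main(
    glue_table: Annotated[
        Optional[List[str]],
        typer.Option(help="Repeat for multiple table ids, leave empty for all tables."),
    ] = None,
):
    # Typer collects every occurrence of --glue-table into one list,
    # or passes None when the option is not given at all.
    print(glue_table)


if __name__ == "__main__":
    app()
```

Running `python sketch.py --glue-table table_1 --glue-table table_2` would print `['table_1', 'table_2']`; omitting the flag prints `None`.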
3 changes: 2 additions & 1 deletion datacontract/data_contract.py
@@ -422,6 +422,7 @@ def import_from_source(
self,
format: str,
source: typing.Optional[str] = None,
glue_tables: typing.Optional[typing.List[str]] = None,
bigquery_tables: typing.Optional[typing.List[str]] = None,
bigquery_project: typing.Optional[str] = None,
bigquery_dataset: typing.Optional[str] = None,
@@ -433,7 +434,7 @@ def import_from_source(
elif format == "avro":
data_contract_specification = import_avro(data_contract_specification, source)
elif format == "glue":
data_contract_specification = import_glue(data_contract_specification, source)
data_contract_specification = import_glue(data_contract_specification, source, glue_tables)
elif format == "jsonschema":
data_contract_specification = import_jsonschema(data_contract_specification, source)
elif format == "bigquery":
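Note that `glue_tables` is inserted before the BigQuery-related parameters, so positional callers (such as the CLI above) must pass arguments in the new order. When using the Python API directly, keyword arguments are less fragile; a small usage sketch (assuming valid AWS credentials and a hypothetical Glue database `my_database` with tables `orders` and `customers`):

```python
from datacontract.data_contract import DataContract

# Import only two specific tables from the Glue database "my_database".
# Passing glue_tables=None (the default) would import every table in the database.
data_contract_specification = DataContract().import_from_source(
    format="glue",
    source="my_database",
    glue_tables=["orders", "customers"],
)

print(data_contract_specification.to_yaml())
```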
7 changes: 4 additions & 3 deletions datacontract/imports/glue_importer.py
@@ -107,7 +107,7 @@ def get_glue_table_schema(database_name: str, table_name: str):
return table_schema


def import_glue(data_contract_specification: DataContractSpecification, source: str):
def import_glue(data_contract_specification: DataContractSpecification, source: str, table_names: List[str]):
"""Import the schema of a Glue database."""

catalogid, location_uri = get_glue_database(source)
@@ -116,13 +116,14 @@ def import_glue(data_contract_specification: DataContractSpecification, source:
if catalogid is None:
return data_contract_specification

tables = get_glue_tables(source)
if table_names is None:
table_names = get_glue_tables(source)

data_contract_specification.servers = {
"production": Server(type="glue", account=catalogid, database=source, location=location_uri),
}

for table_name in tables:
for table_name in table_names:
if data_contract_specification.models is None:
data_contract_specification.models = {}

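As the commit message notes, the supplied table names are not validated, so a typo only surfaces as a runtime error (or an empty model) once the Glue API is queried. If stricter behaviour were desired, one option would be to check the requested names against the tables that actually exist before importing. A sketch of such a check using boto3 — `validate_table_names` is a hypothetical helper, not part of this change:

```python
from typing import List

import boto3


def validate_table_names(database_name: str, table_names: List[str]) -> None:
    """Hypothetical helper: fail fast if a requested table does not exist in Glue."""
    glue = boto3.client("glue")

    # Collect all table names in the database, following pagination.
    existing = set()
    paginator = glue.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database_name):
        existing.update(table["Name"] for table in page["TableList"])

    unknown = [name for name in table_names if name not in existing]
    if unknown:
        raise ValueError(
            f"Tables not found in Glue database '{database_name}': {unknown}"
        )
```

The importer could call such a helper right after resolving `table_names`, turning a silent empty model into an explicit error.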
15 changes: 15 additions & 0 deletions tests/fixtures/glue/datacontract-empty-model.yaml
@@ -0,0 +1,15 @@
dataContractSpecification: 0.9.3
id: my-data-contract-id
info:
title: My Data Contract
version: 0.0.1
servers:
production:
account: '123456789012'
database: test_database
location: s3://test_bucket/testdb
type: glue
models:
table_1:
type: table

32 changes: 32 additions & 0 deletions tests/test_import_glue.py
@@ -84,6 +84,25 @@ def test_cli(setup_mock_glue):
)
assert result.exit_code == 0

@mock_aws
def test_cli_with_table_filters(setup_mock_glue):
runner = CliRunner()
result = runner.invoke(
app,
[
"import",
"--format",
"glue",
"--source",
"test_database",
"--glue-table",
"table_1",
"--glue-table",
"table_2",
],
)
assert result.exit_code == 0


@mock_aws
def test_import_glue_schema(setup_mock_glue):
@@ -96,3 +115,16 @@ def test_import_glue_schema(setup_mock_glue):
assert yaml.safe_load(result.to_yaml()) == yaml.safe_load(expected)
# Disable linters so we don't get "missing description" warnings
assert DataContract(data_contract_str=expected).lint(enabled_linters=set()).has_passed()

@mock_aws
def test_import_glue_schema_with_table_filters(setup_mock_glue):
result = DataContract().import_from_source("glue", "test_database", ["table_1"])

# we specify a table that the Mock doesn't have and thus expect an empty result
with open("fixtures/glue/datacontract-empty-model.yaml") as file:
expected = file.read()

print("Result", result.to_yaml())
assert yaml.safe_load(result.to_yaml()) == yaml.safe_load(expected)
# Disable linters so we don't get "missing description" warnings
assert DataContract(data_contract_str=expected).lint(enabled_linters=set()).has_passed()
