Added Threshold Query Builder #188

Merged: 15 commits from feature/threshold_query_builder into main, Mar 24, 2024
Conversation

ravit-db (Contributor)

No description provided.

@ravit-db requested review from nfx and sundarshankar89 March 21, 2024 09:28
@ravit-db requested a review from a team as a code owner March 21, 2024 09:28

codecov bot commented Mar 21, 2024

Codecov Report

Attention: Patch coverage is 93.53234% with 13 lines in your changes missing coverage. Please review.

Project coverage is 93.78%. Comparing base (45323b4) to head (b02970d).
Report is 2 commits behind head on main.

Files                                                     Patch %   Lines
...s/labs/remorph/reconcile/connectors/data_source.py    53.84%    6 Missing ⚠️
...bricks/labs/remorph/reconcile/connectors/oracle.py    40.00%    6 Missing ⚠️
...databricks/labs/remorph/reconcile/query_builder.py    98.91%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #188      +/-   ##
==========================================
- Coverage   95.95%   93.78%   -2.17%     
==========================================
  Files          19       28       +9     
  Lines        1236     1642     +406     
  Branches      200      244      +44     
==========================================
+ Hits         1186     1540     +354     
- Misses         25       77      +52     
  Partials       25       25              


def _get_custom_transformation(self, columns, transformation_dict, column_mapping):
    transformation_rule_mapping = []
    for column in columns:
        if column in transformation_dict.keys():

Collaborator:
Introduce a method that returns transformation rule mapping
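
Purely as an illustration of this suggestion, a minimal sketch of what an extracted helper might look like; the helper name, and the assumption that `transformation_dict` maps column names to rules, are hypothetical rather than the actual PR code:

```python
# Hypothetical helper sketch: build and return the transformation rule mapping,
# so that _get_custom_transformation only has to delegate to it.
def _get_transformation_rule_mapping(self, columns, transformation_dict):
    transformation_rule_mapping = []
    for column in columns:
        if column in transformation_dict:
            # Assumed shape: transformation_dict maps column name -> transformation rule.
            transformation_rule_mapping.append(transformation_dict[column])
    return transformation_rule_mapping
```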

Resolved (outdated) review threads:
* src/databricks/labs/remorph/reconcile/query_builder.py (4 threads)
* src/databricks/labs/remorph/reconcile/query_config.py (1 thread)
@@ -20,11 +20,16 @@ def __init__(self, source: str, spark: SparkSession, ws: WorkspaceClient, scope:
        self.scope = scope

    @abstractmethod
-   def read_data(self, schema_name: str, catalog_name: str, query: str, table_conf: Tables) -> DataFrame:
+   def read_data(self, catalog: str, schema: str, query: str, jdbc_reader_options: JdbcReaderOptions) -> DataFrame:

Collaborator:
Suggested change
-   def read_data(self, catalog: str, schema: str, query: str, jdbc_reader_options: JdbcReaderOptions) -> DataFrame:
+   def read_data(self, catalog: str, schema: str, query: str, options: JdbcReaderOptions) -> DataFrame:

nit: can you still rename all arguments to make them reasonably shorter? :)
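
For context, a hedged sketch of how a connector might implement the shortened signature; the `ExampleJdbcDataSource` class, the attribute names on `options`, and the JDBC URL handling are assumptions for illustration, not the remorph implementation:

```python
from pyspark.sql import DataFrame, SparkSession


class ExampleJdbcDataSource:
    """Illustrative only: a stand-in connector, not the remorph DataSource."""

    def __init__(self, spark: SparkSession, jdbc_url: str):
        self._spark = spark
        self._jdbc_url = jdbc_url

    def read_data(self, catalog: str, schema: str, query: str, options) -> DataFrame:
        # In a real connector, catalog/schema would qualify the table when no
        # full query is supplied (assumption); here the query is passed through.
        reader = (
            self._spark.read.format("jdbc")
            .option("url", self._jdbc_url)
            .option("query", query)
        )
        if options is not None:
            # Attribute names on JdbcReaderOptions are assumed for illustration.
            reader = reader.option("fetchsize", options.fetch_size)
            reader = reader.option("numPartitions", options.num_partitions)
        return reader.load()
```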

Contributor (Author):
Updated the arguments to be shorter.

@nfx (Collaborator) left a comment:

Every time you self-review this code, look at the comments from codecov and make sure there are as few of them as possible:
[screenshot: codecov missing-coverage annotations]

Resolved (outdated) review thread: src/databricks/labs/remorph/reconcile/connectors/oracle.py
        # Implement Snowflake-specific logic here
        return NotImplemented

-   def get_schema(self, table_name: str, schema_name: str, catalog_name: str) -> list[Schema]:
+   def get_schema(

Collaborator:
Does `make fmt` put it back to one line?...

Comment on lines 12 to 19
        self.table_conf = table_conf
        self.schema = schema
        self.layer = layer
        self.db_type = db_type
        self.schema_dict = {v.column_name: v for v in schema}
        self.tgt_column_mapping = table_conf.list_to_dict(ColumnMapping, "target_name")
        self.src_column_mapping = table_conf.list_to_dict(ColumnMapping, "source_name")
        self.transformations_dict = table_conf.list_to_dict(Transformation, "column_name")

Collaborator:
Suggested change
-       self.table_conf = table_conf
-       self.schema = schema
-       self.layer = layer
-       self.db_type = db_type
-       self.schema_dict = {v.column_name: v for v in schema}
-       self.tgt_column_mapping = table_conf.list_to_dict(ColumnMapping, "target_name")
-       self.src_column_mapping = table_conf.list_to_dict(ColumnMapping, "source_name")
-       self.transformations_dict = table_conf.list_to_dict(Transformation, "column_name")
+       self._table_conf = table_conf
+       self._schema = schema
+       self._layer = layer
+       self._db_type = db_type
+       self._schema_dict = {v.column_name: v for v in schema}
+       self._tgt_column_mapping = table_conf.list_to_dict(ColumnMapping, "target_name")
+       self._src_column_mapping = table_conf.list_to_dict(ColumnMapping, "source_name")
+       self._transformations_dict = table_conf.list_to_dict(Transformation, "column_name")

can we make all fields private and turn all the usage to methods? this way it's more robust and would allow for field renames.
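
A minimal sketch of the requested pattern using a simplified stand-in class; the names and accessors are illustrative assumptions, not the actual QueryConfig:

```python
class QueryConfigSketch:
    """Illustrative only: private fields exposed through methods/properties."""

    def __init__(self, table_conf, schema, layer, db_type):
        self._table_conf = table_conf
        self._schema = schema
        self._layer = layer
        self._db_type = db_type
        # Derived lookups stay private as well; callers never touch the dict directly.
        self._schema_dict = {v.column_name: v for v in schema}

    @property
    def layer(self) -> str:
        return self._layer

    def schema_for(self, column_name: str):
        # The internal _schema_dict can be renamed or replaced later without
        # breaking callers of this method.
        return self._schema_dict.get(column_name)
```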

Contributor (Author):
updated.

        self.src_column_mapping = table_conf.list_to_dict(ColumnMapping, "source_name")
        self.transformations_dict = table_conf.list_to_dict(Transformation, "column_name")

    def get_threshold_columns(self):

Collaborator:
can we add typing information to public members? this way mypy would behave better at finding bugs.

Suggested change
-   def get_threshold_columns(self):
+   def get_threshold_columns(self) -> set[str]:
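
A small, self-contained example of the payoff (hypothetical class and caller, not remorph code): once the return type is declared, mypy can flag a caller that assumes the wrong container type.

```python
class ThresholdConfigSketch:
    """Illustrative only: hypothetical config, not the remorph QueryConfig."""

    def __init__(self, threshold_columns: list[str]):
        self._threshold_columns = threshold_columns

    def get_threshold_columns(self) -> set[str]:
        # With the annotation, mypy knows callers receive a set, not a list.
        return set(self._threshold_columns)


def build_select_list(columns: list[str]) -> str:
    return ", ".join(columns)


cfg = ThresholdConfigSketch(["sales_amount", "discount"])
# build_select_list(cfg.get_threshold_columns())
#   mypy would reject this call: set[str] is not list[str]. Without the
#   return annotation, the mismatch would go unnoticed.
build_select_list(sorted(cfg.get_threshold_columns()))
```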

Contributor (Author):
resolved

Comment on lines 47 to 53
        if self.db_type == SourceType.ORACLE.value:
            return "{{schema_name}}.{table_name}".format(  # pylint: disable=consider-using-f-string
                table_name=table_name
            )
        return "{{catalog_name}}.{{schema_name}}.{table_name}".format(  # pylint: disable=consider-using-f-string
            table_name=table_name
        )

Collaborator:
Suggested change
-       if self.db_type == SourceType.ORACLE.value:
-           return "{{schema_name}}.{table_name}".format(  # pylint: disable=consider-using-f-string
-               table_name=table_name
-           )
-       return "{{catalog_name}}.{{schema_name}}.{table_name}".format(  # pylint: disable=consider-using-f-string
-           table_name=table_name
-       )
+       if self.db_type == SourceType.ORACLE.value:
+           return f"{{schema_name}}.{table_name}"
+       return f"{{catalog_name}}.{{schema_name}}.{table_name}"

DO NOT disable pylint messages.

automating in #191
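
As a side note on why the suggested f-strings are equivalent to the `.format` calls they replace: doubled braces are emitted literally, so `{catalog_name}` and `{schema_name}` survive as placeholders for a later substitution step (the trailing `.format` call below is only an assumed example of how the template might be consumed).

```python
table_name = "product_sales"

# Only {table_name} is interpolated now; doubled braces stay as literal braces.
template = f"{{catalog_name}}.{{schema_name}}.{table_name}"
print(template)  # {catalog_name}.{schema_name}.product_sales

# The remaining placeholders can be filled in later, e.g. via str.format.
print(template.format(catalog_name="hive_metastore", schema_name="reconcile"))
# hive_metastore.reconcile.product_sales
```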

Contributor (Author):
Fixed a few; two pylint errors remain unresolved: one is a built-in function override and the other is an import error.

@ravit-db changed the title from "Feature/threshold query builder" to "Added Threshold Query Builder" on Mar 22, 2024

@nfx (Collaborator) left a comment:
Lgtm

@nfx merged commit 3d8f0a4 into main on Mar 24, 2024
4 of 7 checks passed
@nfx deleted the feature/threshold_query_builder branch March 24, 2024 08:23
ganeshdogiparthi-db added a commit that referenced this pull request Mar 24, 2024
sundarshankar89 added a commit that referenced this pull request Apr 4, 2024
* Added Databricks Source Adapter ([#185](#185)). In this release, the project has been enhanced with several new features for the Databricks Source Adapter. A new `engine` parameter has been added to the `DataSource` class, replacing the original `source` parameter. The `_get_secrets` and `_get_table_or_query` methods have been updated to use the `engine` parameter for key naming and handling queries with a `select` statement differently, respectively. A Databricks Source Adapter for Oracle databases has been introduced, which includes a new `OracleDataSource` class that provides functionality to connect to an Oracle database using JDBC. A Databricks Source Adapter for Snowflake has also been added, featuring the `SnowflakeDataSource` class that handles data reading and schema retrieval from Snowflake. The `DatabricksDataSource` class has been updated to handle data reading and schema retrieval from Databricks, including a new `get_schema_query` method that generates the query to fetch the schema based on the provided catalog and table name. Exception handling for reading data and fetching schema has been implemented for all new classes. These changes provide increased flexibility for working with various data sources, improved code maintainability, and better support for different use cases.
* Added Issue Templates for bugs, feature and config ([#194](#194)). Two new issue templates have been added to the project's GitHub repository to improve issue creation and management. The first template, located in `.github/ISSUE_TEMPLATE/bug.yml`, is for reporting bugs and prompts users to provide detailed information about the issue, including the current and expected behavior, steps to reproduce, relevant log output, and sample query. The second template, added under the path `.github/ISSUE_TEMPLATE/config.yml`, is for configuration-related issues and includes support contact links for general Databricks questions and Remorph documentation, as well as fields for specifying the operating system and software version. A new issue template for feature requests, named "Feature Request", has also been added, providing a structured format for users to submit requests for new functionality for the Remorph project. These templates will help streamline the issue creation process, improve the quality of information provided, and make it easier for the development team to quickly identify and address bugs and feature requests.
* Added Threshold Query Builder ([#188](#188)). In this release, the open-source library has added a Threshold Query Builder feature, which includes several changes to the existing functionality in the data source connector. A new import statement adds the `re` module for regular expressions, and new parameters have been added to the `read_data` and `get_schema` abstract methods. The `_get_jdbc_reader_options` method has been updated to accept an `options` parameter of type "JdbcReaderOptions", and a new static method, "_get_table_or_query", has been added to construct the table or query string based on provided parameters. Additionally, a new class, "QueryConfig", has been introduced in the "databricks.labs.remorph.reconcile" package to configure queries for data reconciliation tasks. A new abstract base class QueryBuilder has been added to the query_builder.py file, along with HashQueryBuilder and ThresholdQueryBuilder classes to construct SQL queries for generating hash values and selecting columns based on threshold values, transformation rules, and filtering conditions. These changes aim to enhance the functionality of the data source connector, add modularity, customizability, and reusability to the query builder, and improve data reconciliation tasks.
* Added serverless validation using lsql library ([#176](#176)). The `WorkspaceClient` object is used with `product` name and `product_version`, along with the corresponding `cluster_id` or `warehouse_id`, as `sdk_config` in the `MorphConfig` object.
* Added snowflake connector code ([#177](#177)). In this release, the open-source library has been updated to add a Snowflake connector for data extraction and schema manipulation. The changes include the addition of the SnowflakeDataSource class, which is used to read data from Snowflake using PySpark, and has methods for getting the JDBC URL, reading data with and without JDBC reader options, getting the schema, and handling exceptions. A new constant, SNOWFLAKE, has been added to the SourceDriver enum in constants.py, which represents the Snowflake JDBC driver class. The code modifications include updating the constructor of the DataSource abstract base class to include a new parameter 'scope', and updating the `_get_secrets` method to accept a `key_name` parameter instead of 'key'. Additionally, a test file 'test_snowflake.py' has been added to test the functionality of the SnowflakeDataSource class. This release also updates the pyproject.toml file to version lock the dependencies like black, ruff, and isort, and modifies the coverage report configuration to exclude certain files and lines from coverage checks. These changes were completed by Ravikumar Thangaraj and SundarShankar89.
* Enhanced install script to enforce usage of a warehouse or cluster when `skip-validation` is set to `False` ([#213](#213)). In this release, the installation process has been enhanced to mandate the use of a warehouse or cluster when the `skip-validation` parameter is set to `False`. This change has been implemented across various components, including the install script, `transpile` function, and `get_sql_backend` function. Additionally, new pytest fixtures and methods have been added to improve test configuration and resource management during testing. Unit tests have been updated to enforce usage of a warehouse or cluster when the `skip-validation` flag is set to `False`, ensuring proper resource allocation and validation process improvement. This development focuses on promoting a proper setup and usage of the system, guiding new users towards a correct configuration and improving the overall reliability of the tool.
* Patch subquery with json column access ([#190](#190)). The open-source library has been updated with new functionality to modify how subqueries with JSON column access are handled in the `snowflake.py` file. This change includes the addition of a check for an opening parenthesis after the `FROM` keyword to detect and break loops when a subquery is found, as opposed to a table name. This improvement enhances the handling of complex subqueries and JSON column access, making the code more robust and adaptable to different query structures. Additionally, a new test method, `test_nested_query_with_json`, has been introduced to the `tests/unit/snow/test_databricks.py` file to test the behavior of nested queries involving JSON column access when using a Snowflake dialect. This new method validates the expected output of a specific nested query when it is transpiled to Snowflake's SQL dialect, allowing for more comprehensive testing of JSON column access and type casting in Snowflake dialects. The existing `test_delete_from_keyword` method remains unchanged.
* Prevent adding `# pylint: disable` comments without explicit approval ([#191](#191)). A new job, "no-lint-disabled", has been added to the GitHub Actions workflow defined in the "push.yml" file to enforce the use of the pylint linter. This job checks for the addition of "# pylint: disable" comments in new code without explicit approval, preventing the linter from being bypassed without permission. It runs on the latest version of Ubuntu, checks out the repository with a full history, extracts the new code using the `git diff` command, and searches for any instances of "# pylint: disable" using "grep". If any are found, the script outputs an error message and exits with a non-zero status, causing the workflow to fail. This new job helps maintain code quality and consistency across the project by ensuring that the pylint linter is used appropriately in new code.
* Snowflake `UPDATE FROM` to Databricks `MERGE INTO` implementation ([#198](#198)).
* Use Runtime SQL backend in Notebooks ([#211](#211)). In this update, the `db_sql.py` file in the `databricks/labs/remorph/helpers` directory has been modified to support the use of the Runtime SQL backend in Notebooks. This change includes the addition of a new `RuntimeBackend` class in the `backends` module and an import statement for `os`. The `get_sql_backend` function now returns a `RuntimeBackend` instance when the `DATABRICKS_RUNTIME_VERSION` environment variable is present, allowing for more efficient and secure SQL statement execution in Databricks notebooks. Additionally, a new test case for the `get_sql_backend` function has been added to ensure the correct behavior of the function in various runtime environments. These enhancements improve SQL execution performance and security in Databricks notebooks and increase the project's versatility for different use cases.
* `remorph reconcile` baseline for Query Builder and Source Adapter for oracle as source ([#150](#150)).

Dependency updates:

 * Bump sqlglot from 22.4.0 to 22.5.0 ([#175](#175)).
 * Updated databricks-sdk requirement from <0.22,>=0.18 to >=0.18,<0.23 ([#178](#178)).
 * Updated databricks-sdk requirement from <0.23,>=0.18 to >=0.18,<0.24 ([#189](#189)).
 * Bump actions/checkout from 3 to 4 ([#203](#203)).
 * Bump actions/setup-python from 4 to 5 ([#201](#201)).
 * Bump codecov/codecov-action from 1 to 4 ([#202](#202)).
 * Bump softprops/action-gh-release from 1 to 2 ([#204](#204)).
@sundarshankar89 mentioned this pull request Apr 4, 2024
github-merge-queue bot pushed a commit that referenced this pull request Apr 4, 2024
ravit-db pushed a commit that referenced this pull request Apr 18, 2024