feat(ingest/oracle): support changing data dictionary (ALL_ or DBA_) #8873

sleeperdeep · 2023-09-21T10:38:24Z

actual discussion in community

Change the oracle plugin config: add an additional field 'data_dictionary_mode'.
Change OracleInspectorObjectWrapper: add methods to extract data from the dba_ views.

Fixes #6711

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Links to related issues (if applicable)
Tests for the changes have been added/updated (if applicable)
Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

mayurinehate

Thank you for the PR. I've left a few comments for changes.

Also, the oracle tests are failing in CI. Can you please fix the tests?

mayurinehate · 2023-09-27T06:28:58Z

metadata-ingestion/src/datahub/ingestion/source/sql/oracle.py

+    # custom
+    data_dictionary_mode: Optional[str] = Field(
+        default='ALL',
+        description="The data dictionary views mode, to extract information about schema objects ('All' and 'DBA' views are supported). (https://docs.oracle.com/cd/E11882_01/nav/catalog_views.htm)"


Suggested change

-        description="The data dictionary views mode, to extract information about schema objects ('All' and 'DBA' views are supported). (https://docs.oracle.com/cd/E11882_01/nav/catalog_views.htm)"
+        description="The data dictionary views mode, to extract information about schema objects ('ALL' and 'DBA' views are supported). (https://docs.oracle.com/cd/E11882_01/nav/catalog_views.htm)"
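
As an illustration of how the two supported values could be enforced, here is a minimal sketch. It uses a plain dataclass rather than the pydantic `Field` shown in the diff, and `OracleConfigSketch` and `SUPPORTED_MODES` are hypothetical names, not part of DataHub. Whether the merged code validates the value this way is not shown in this thread.

```python
from dataclasses import dataclass

# The two view families named in the field description above.
SUPPORTED_MODES = ("ALL", "DBA")


@dataclass
class OracleConfigSketch:
    """Hypothetical stand-in for the oracle source config (not DataHub's class)."""

    data_dictionary_mode: str = "ALL"

    def __post_init__(self) -> None:
        # Normalising case is an assumption of this sketch, so that "dba" == "DBA".
        mode = self.data_dictionary_mode.upper()
        if mode not in SUPPORTED_MODES:
            raise ValueError(
                f"data_dictionary_mode must be one of {SUPPORTED_MODES}, "
                f"got {self.data_dictionary_mode!r}"
            )
        self.data_dictionary_mode = mode
```

Rejecting unknown values early keeps a typo in the recipe from silently falling through to one of the two code paths.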

mayurinehate · 2023-09-27T06:48:24Z

metadata-ingestion/src/datahub/ingestion/source/sql/oracle.py

@@ -186,13 +1002,16 @@ def get_inspectors(self) -> Iterable[Inspector]:
            event.listen(
                inspector.engine, "before_cursor_execute", before_cursor_execute
            )
+            logger.info(f'Data dictionary mode is: "{self.config.data_dictionary_mode}".')
+            if self.config.data_dictionary_mode != OracleConfig.__fields__.get("data_dictionary_mode").default:


Suggested change

-            if self.config.data_dictionary_mode != OracleConfig.__fields__.get("data_dictionary_mode").default:
+            # Sqlalchemy inspector uses ALL_* tables as per oracle dialect implementation.
+            # OracleInspectorObjectWrapper provides alternate implementation using DBA_* tables.
+            if self.config.data_dictionary_mode != "ALL":
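
The selection logic in this suggestion can be sketched in isolation as follows. Both classes are hypothetical stand-ins (the real code wraps a SQLAlchemy `Inspector` in `OracleInspectorObjectWrapper`), and the `else` branch is an assumption, since only the `!= "ALL"` branch appears in the diff.

```python
from typing import Iterator, Union


class BaseInspectorSketch:
    """Hypothetical stand-in for the SQLAlchemy inspector, which reads ALL_* views."""

    source_views = "ALL"


class DbaInspectorWrapperSketch:
    """Hypothetical stand-in for OracleInspectorObjectWrapper, which reads DBA_* views."""

    source_views = "DBA"

    def __init__(self, inspector: BaseInspectorSketch) -> None:
        # Keep the wrapped inspector so its dialect and connection can be reused.
        self._inspector_instance = inspector


def get_inspectors(
    data_dictionary_mode: str,
) -> Iterator[Union[BaseInspectorSketch, DbaInspectorWrapperSketch]]:
    inspector = BaseInspectorSketch()
    if data_dictionary_mode != "ALL":
        # Non-default mode: hand out the wrapper with the alternate implementation.
        yield DbaInspectorWrapperSketch(inspector)
    else:
        # Default mode: the plain inspector already queries ALL_* views.
        yield inspector
```

Comparing against the literal `"ALL"` keeps the condition readable next to the comments that explain which implementation reads which views.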

mayurinehate · 2023-09-27T07:03:00Z

metadata-ingestion/src/datahub/ingestion/source/sql/oracle.py

@@ -97,6 +132,7 @@ def get_identifier(self, schema: str, table: str) -> str:
            return regular


+@inspector_wraper_usage_notificcation(class_usage_notification)


I'm guessing this decorator was added for debugging/reporting inspector methods as they are used. I think we should be able to remove this now.

If required, debug logs can be added to the methods directly.

@mayurinehate Fixed

mayurinehate · 2023-09-27T08:21:06Z

metadata-ingestion/src/datahub/ingestion/source/sql/oracle.py

@@ -108,26 +144,163 @@ def __init__(self, inspector_instance: Inspector):
        # tables that we don't want to ingest into the DataHub
        self.exclude_tablespaces: Tuple[str, str] = ("SYSTEM", "SYSAUX")

+    def has_table(self, table_name, schema=None):


DataHub uses only a select few methods from the inspector interface. You can look them up here. They are also listed below for quick reference. I believe it should suffice to implement only these, as required.

get_schema_names, get_table_names, get_view_names, get_pk_constraint, get_table_comment, get_view_definition, get_columns, get_foreign_keys

I doubt whether the other methods implemented in the wrapper here are needed.
These methods are not used by DataHub and should therefore be removed.

has_table, has_sequence, _resolve_synonym, _prepare_reflection_args, get_temp_table_names, get_table_options, get_indexes, get_check_constraints

Unused methods were deleted.
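
One way to keep a wrapper aligned with the eight methods listed in the review comment is a small check like the one below. This is only a sketch: `REQUIRED_INSPECTOR_METHODS`, `missing_inspector_methods`, and `PartialWrapperSketch` are hypothetical names introduced for illustration and do not exist in DataHub.

```python
from typing import List

# The inspector methods the reviewer lists as the ones DataHub uses.
REQUIRED_INSPECTOR_METHODS = (
    "get_schema_names",
    "get_table_names",
    "get_view_names",
    "get_pk_constraint",
    "get_table_comment",
    "get_view_definition",
    "get_columns",
    "get_foreign_keys",
)


def missing_inspector_methods(wrapper_cls: type) -> List[str]:
    """Return the required method names that wrapper_cls does not define."""
    return [
        name
        for name in REQUIRED_INSPECTOR_METHODS
        if not callable(getattr(wrapper_cls, name, None))
    ]


class PartialWrapperSketch:
    """Hypothetical wrapper that implements only two of the required methods."""

    def get_schema_names(self):
        return []

    def get_table_names(self, schema=None):
        return []
```

A check of this kind could run in a unit test, so that trimming unused methods never removes one that ingestion relies on.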

mayurinehate

Hey @sleeperdeep, can you please fix the oracle tests for your changes? It looks like the default behavior earlier was data_dictionary_mode=DBA. I'm fine with the current default as well, but you may need to regenerate the golden files for the oracle tests. This should help there - https://datahubproject.io/docs/metadata-ingestion/developing#updating-golden-test-files

sleeperdeep · 2023-10-18T11:30:28Z

> Hey @sleeperdeep, can you please fix the oracle tests for your changes? It looks like the default behavior earlier was data_dictionary_mode=DBA. I'm fine with the current default as well, but you may need to regenerate the golden files for the oracle tests. This should help there - https://datahubproject.io/docs/metadata-ingestion/developing#updating-golden-test-files

Thank you for the link! It will help me. I'll try to fix it in the near future.

mayurinehate · 2023-11-22T08:45:56Z

metadata-ingestion/src/datahub/ingestion/source/sql/sql_config.py

@@ -90,6 +90,8 @@ class SQLCommonConfig(
    profiling: GEProfilingConfig = GEProfilingConfig()
    # Custom Stateful Ingestion settings
    stateful_ingestion: Optional[StatefulStaleMetadataRemovalConfig] = None
+    # Custom data_dictionary_mode
+    data_dictionary_mode: Optional[str] = None


This should not be part of SQLCommonConfig. Can you remove this?

@mayurinehate I've deleted it from SQLCommonConfig, but mypy now returns an attr-defined error.

mayurinehate · 2023-11-22T08:47:47Z

metadata-ingestion/src/datahub/ingestion/source/sql/oracle.py

@@ -299,7 +336,7 @@ def get_columns(
                computed = None

            if identity_options is not None:
-                identity = self._inspector_instance.dialect._parse_identity_options(identity_options, default_on_nul)
+                identity = self.parse_identity_options(identity_options, default_on_nul)


The motivation behind this change is not clear. Can you help me understand why this change is needed?

mayurinehate · 2023-11-22T08:48:00Z

metadata-ingestion/src/datahub/ingestion/source/sql/oracle.py

@@ -323,7 +360,7 @@ def get_columns(
        return columns

    def get_table_comment(
-            self, table_name: str, schema: str = None
+            self, table_name: Optional[str], schema: Optional[str] = None


Same for this.

mayurinehate · 2023-12-01T09:40:30Z

Hey @sleeperdeep, did you get a chance to go through the comments and update the golden files? Let me know if you have any queries.

It would be great if you could verify that the oracle tests pass locally by running them in a development environment activated using this.
pytest tests/unit/test_oracle_source.py
pytest tests/integration/oracle/test_oracle.py

sleeperdeep · 2024-01-20T07:20:33Z

Hi all! @mayurinehate, @hsheth2, @maggiehays, @mhw
Could you please review the new commit?

mayurinehate · 2024-01-23T05:16:50Z

> Hi all! @mayurinehate, @hsheth2, @maggiehays, @mhw Could you please review the new commit?

Thanks for the update, @sleeperdeep. I can take a look early next week. Meanwhile, there seem to be lint and test failures; it would help if you could take a look and fix those.

If you have already set up and activated a venv as described in the developing guide, you should be able to run the command below to find and fix lint issues

black src/ tests/ && isort src/ tests/ && flake8 src/ tests/ && mypy src/ tests/

and the command below to find and fix test issues
pytest tests/unit/test_oracle_source.py tests/integration/oracle/test_oracle.py

Feel free to reach out to me on the DataHub Slack if you have any questions.

mayurinehate

Left a few minor comments. Everything else looks good to me.
Thanks for the amazing work.

mayurinehate · 2024-01-30T04:41:48Z

metadata-ingestion/src/datahub/ingestion/source/sql/oracle.py

@@ -52,13 +57,20 @@ def before_cursor_execute(conn, cursor, statement, parameters, context, executem
    cursor.outputtypehandler = output_type_handler


+def class_usage_notification(cls, func):


This can be removed, right? It's not used anywhere.

Yes, it can. The main idea behind this method was to log usage of the wrapper class.

mayurinehate · 2024-01-30T04:46:12Z

metadata-ingestion/src/datahub/ingestion/source/sql/oracle.py

+            # Sqlalchemy inspector uses ALL_* tables as per oracle dialect implementation.
+            # OracleInspectorObjectWrapper provides alternate implementation using DBA_* tables.
+            if self.config.data_dictionary_mode != "ALL":
+                yield cast(Inspector, OracleInspectorObjectWrapper(inspector))


I'm wondering if we can rename OracleInspectorObjectWrapper to something like OracleDBAInspectorObjectWrapper, to make this more explicit in the class name itself.

Oracle allows three levels of access (based on privilege level): ALL, DBA, and USER. Right now, ALL mode is used as the default, and DBA mode is implemented by this PR. Potentially, USER mode could be implemented as a future contribution, in which case renaming this class would not gain anything. I can extend the docstring if necessary.

mayurinehate · 2024-01-30T04:49:07Z

metadata-ingestion/src/datahub/ingestion/source/sql/oracle.py

+        return columns
+
+    def get_table_comment(
+        self, table_name: Optional[str], schema: Optional[str] = None


table_name should not be optional, right?

@mayurinehate
Yes, from the perspective of the data extraction logic, table_name should not be optional. From the perspective of the class structure and its types, however, the denormalize_name method in the original sqlalchemy source and the denormalize_name method in sqlalchemy-stubs differ. That leaves us to choose which is better: use Optional[str] for the main methods (get_schema_names, get_table_names, get_view_names, get_pk_constraint, get_table_comment, get_view_definition, get_columns, get_foreign_keys) or propagate "# type: ignore" through all usages of the denormalize_name method.

<->

I don't know if one is better than the other. @hsheth2 - any thoughts? Also, could you please take a look and get this PR merged? It all looks good to me.

For all the main methods (e.g. get_table_names, get_view_names, get_pk_constraint, get_table_comment, get_view_definition, get_columns, get_foreign_keys) - these do actually require table_name: str, so declaring it to be optional feels misleading. I'd prefer something like below, or using type: ignore if the below doesn't work for some reason

    def get_table_comment(self, table_name: str, schema: Optional[str] = None):
        denormalized_table_name: Optional[str] = self._inspector_instance.dialect.denormalize_name(table_name)
        assert denormalized_table_name is not None
        table_name = denormalized_table_name  # or maybe just use `denormalized_table_name` in the remainder of the method

@hsheth2
I fixed it. Agreed, this approach is better.
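
For readers unfamiliar with the pattern agreed on here, the following is a self-contained sketch of it. `DialectSketch` is a hypothetical stand-in whose `denormalize_name` is typed as returning `Optional[str]`, and its upper-casing rule is a simplification, not SQLAlchemy's actual implementation. The `assert` narrows the type for mypy, so the public method can keep `table_name: str`.

```python
from typing import Optional


class DialectSketch:
    """Hypothetical dialect; the real one lives in SQLAlchemy's oracle dialect."""

    def denormalize_name(self, name: Optional[str]) -> Optional[str]:
        if name is None:
            return None
        # Simplified rule for this sketch: all-lowercase names become upper case.
        return name.upper() if name == name.lower() else name


def denormalized_table_name(dialect: DialectSketch, table_name: str) -> str:
    """Accept a required str and return a str, despite the Optional in between."""
    denormalized: Optional[str] = dialect.denormalize_name(table_name)
    # Narrow Optional[str] to str; a None here would indicate a programming error.
    assert denormalized is not None
    return denormalized
```

The signature stays honest about what callers must pass, and the single assertion replaces a `# type: ignore` at each call site.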

hsheth2

mostly looks good, the main thing is tweaking those Optional[str] annotations

hsheth2 · 2024-02-02T03:56:30Z

metadata-ingestion/src/datahub/ingestion/source/sql/oracle.py

+
+        rp = self._inspector_instance.bind.execute(sql.text(text), params).scalar()
+        if rp:
+            if py2k:


given that we're on python 3 only, is this necessary?

Agreed. It was removed from the module.

hsheth2 · 2024-02-02T03:59:11Z

metadata-ingestion/src/datahub/ingestion/source/sql/oracle.py

+
+        return {"text": c.scalar()}
+
+    def _get_constraint_data(


just for my own understanding, most of this code is pretty similar to the code from the default oracle sqlalchemy dialect right now, right?

Yes. The main idea is to access a different set of views (DBA instead of ALL) without global changes to the original modules.

hsheth2 · 2024-02-02T04:00:04Z

metadata-ingestion/src/datahub/ingestion/source/sql/oracle.py

+        COMMENT_SQL = """
+            SELECT comments
+            FROM dba_tab_comments
+            WHERE table_name = CAST(:table_name AS VARCHAR(128))


eventually we'll want to replace this with a single bulk fetch for the full schema, instead of fetching each table one at a time

I agree with you. That approach (a single query to fetch all the info in one go) is preferable. The idea of this PR was different. I can test it and contribute it in a follow-up PR.
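
The bulk-fetch idea raised above could look roughly like this. The sketch is not taken from the PR: it uses an in-memory sqlite3 table named `dba_tab_comments` as a stand-in for the Oracle view, and `fetch_table_comments` is a hypothetical helper. One query loads every comment for a schema, and later lookups are plain dictionary reads.

```python
import sqlite3
from typing import Dict, Optional

# One query per schema instead of one query per table.
BULK_COMMENT_SQL = """
    SELECT table_name, comments
    FROM dba_tab_comments
    WHERE owner = :owner
"""


def fetch_table_comments(
    conn: sqlite3.Connection, owner: str
) -> Dict[str, Optional[str]]:
    """Return a mapping of table name to comment for all tables of one owner."""
    rows = conn.execute(BULK_COMMENT_SQL, {"owner": owner}).fetchall()
    return {table_name: comments for table_name, comments in rows}


# sqlite3 stands in for Oracle here; the table mimics the view's relevant columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dba_tab_comments (owner TEXT, table_name TEXT, comments TEXT)"
)
conn.executemany(
    "INSERT INTO dba_tab_comments VALUES (?, ?, ?)",
    [
        ("HR", "EMPLOYEES", "Staff records"),
        ("HR", "JOBS", None),
        ("SALES", "ORDERS", "Order headers"),
    ],
)
comments_by_table = fetch_table_comments(conn, "HR")
```

With the mapping cached per schema, a per-table `get_table_comment` call would become a lookup rather than a round trip to the database.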

hsheth2 · 2024-02-09T23:56:16Z

The test_create_list_get_ingestion_execution_request failure is a known issue, so merging this in.

Thanks @sleeperdeep for the incredible contribution 🎉

feat(ingestion): add ability to specify data dictionary (ALL_ or DBA_…

8515315

…) mode for oracle module

github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Sep 21, 2023

vercel bot deployed to Preview September 21, 2023 11:06 View deployment

hsheth2 requested a review from mayurinehate September 26, 2023 03:41

mayurinehate requested changes Sep 27, 2023

hsheth2 added the community-contribution PR or Issue raised by member(s) of DataHub Community label Sep 29, 2023

sleeperdeep and others added 2 commits October 4, 2023 13:31

feat(ingestion): add ability to specify data dictionary mode for orac…

7fb7e68

…le module (1. fixed syntax mistakes 2. simplified data_dictionary_mode value check 3. usage_decorator_wrapped was deleted 4. unused methods were deleted)

Merge branch 'master' into feature/oracle_data_dictionary_mode

b605809

vercel bot deployed to Preview October 4, 2023 11:39 View deployment

sleeperdeep requested a review from mayurinehate October 6, 2023 08:43

mayurinehate requested changes Oct 18, 2023

hsheth2 added the pending-submitter-response Issue/request has been reviewed but requires a response from the submitter label Oct 27, 2023

sleeperdeep and others added 4 commits November 2, 2023 10:31

Merge branch 'datahub-project:master' into feature/oracle_data_dictio…

6f13e30

…nary_mode

Merge branch 'master' into feature/oracle_data_dictionary_mode

5770ea3

feat(ingestion): add ability to specify data dictionary (ALL_ or DBA_…

284bb31

Revert " feat(ingestion): add ability to specify data dictionary (ALL…

29c9085

…_ or DBA_…" This reverts commit 284bb31.

vercel bot deployed to Preview November 2, 2023 09:26 View deployment

sleeperdeep and others added 4 commits November 2, 2023 11:27

feat(ingestion): add ability to specify data dictionary (ALL_ or DBA_…

141e5f9

…) mode for oracle module, fix tests.

Merge branch 'feature/oracle_data_dictionary_mode' of https://github.…

0f6b579

…com/sleeperdeep/datahub into feature/oracle_data_dictionary_mode

feat(ingestion): add ability to specify data dictionary (ALL_ or DBA_…

59e5683

…) mode for oracle module, fix tests.

Merge branch 'datahub-project:master' into feature/oracle_data_dictio…

f202943

…nary_mode

vercel bot deployed to Preview November 2, 2023 10:49 View deployment

maggiehays removed the pending-submitter-response Issue/request has been reviewed but requires a response from the submitter label Nov 13, 2023

sleeperdeep added 2 commits November 20, 2023 11:13

Merge branch 'master' into feature/oracle_data_dictionary_mode

24ee000

restore golden files

6ab6474

vercel bot deployed to Preview November 20, 2023 10:58 View deployment

mayurinehate self-requested a review November 21, 2023 06:05

mayurinehate reviewed Nov 22, 2023

Merge branch 'datahub-project:master' into feature/oracle_data_dictio…

54385be

…nary_mode

vercel bot deployed to Preview December 5, 2023 13:43 View deployment

Merge branch 'datahub-project:master' into feature/oracle_data_dictio…

293f3b2

…nary_mode

vercel bot deployed to Preview January 15, 2024 10:34 View deployment

feat(ingestion): add ability to specify data dictionary mode for orac…

d74d5dd

…le module (1. fix integration tests 2. update golden-files)

vercel bot deployed to Preview January 16, 2024 18:44 View deployment

feat(ingestion): add ability to specify data dictionary mode for orac…

1d5d39d

…le module (1. fix integration tests 2. update golden-files 3. delete debug rows)

vercel bot deployed to Preview January 18, 2024 12:43 View deployment

feat(ingestion): add ability to specify data dictionary mode for ora…

f52aae4

…cle module (1.fix integration tests 2.update golden-files 3.delete debug rows 4.fix mypy tests)

vercel bot deployed to Preview January 25, 2024 09:45 View deployment

feat(ingestion): add ability to specify data dictionary mode for ora…

ba98c4f

…cle module (1.fix integration tests 2.update golden-files 3.delete debug rows 4.fix mypy tests)

vercel bot deployed to Preview January 25, 2024 13:07 View deployment

Merge branch 'master' into feature/oracle_data_dictionary_mode

ae532a0

vercel bot deployed to Preview January 26, 2024 07:56 View deployment

mayurinehate approved these changes Jan 30, 2024

sleeperdeep and others added 2 commits January 30, 2024 14:45

Merge branch 'datahub-project:master' into feature/oracle_data_dictio…

498c442

…nary_mode

feat(ingestion): add ability to specify data dictionary mode for ora…

bc1bb95

…cle module (1.fix integration tests 2.update golden-files 3.delete debug rows 4.fix mypy tests 5.delete class_usage_notification method)

vercel bot deployed to Preview January 30, 2024 13:22 View deployment

hsheth2 reviewed Feb 2, 2024

feat(ingestion): add ability to specify data dictionary mode for orac…

099affb

…le module (1.fix integration tests 2.update golden-files 3.delete debug rows 4.fix mypy tests 5.delete class_usage_notification method 6.fix argument type in main methods 7. replace sqlalchemy imported method)

vercel bot deployed to Preview February 8, 2024 12:55 View deployment

Merge branch 'master' into feature/oracle_data_dictionary_mode

f34d7a1

hsheth2 approved these changes Feb 9, 2024

hsheth2 added the merge-pending-ci A PR that has passed review and should be merged once CI is green. label Feb 9, 2024

vercel bot deployed to Preview February 9, 2024 21:50 View deployment

hsheth2 changed the title ~~feat(ingestion): add ability to specify data dictionary (ALL_ or DBA_…~~ feat(ingest/oracle): support changing data dictionary (ALL_ or DBA_) Feb 9, 2024

hsheth2 merged commit 7d73c41 into datahub-project:master Feb 9, 2024
53 of 54 checks passed
