
feat(ingest/oracle): support changing data dictionary (ALL_ or DBA_) #8873

Merged

Conversation

sleeperdeep
Contributor

@sleeperdeep sleeperdeep commented Sep 21, 2023

actual discussion in community

  • change oracle plugin config: add additional field 'data_dictionary_mode'
  • change OracleInspectorObjectWrapper: add methods to extract data from dba_ views

Fixes #6711

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Sep 21, 2023
Collaborator

@mayurinehate mayurinehate left a comment

Thank you for the PR. I've left a few comments for changes.

Also, oracle tests are failing in CI. Can you please fix the tests ?

# custom
data_dictionary_mode: Optional[str] = Field(
default='ALL',
description="The data dictionary views mode, to extract information about schema objects ('All' and 'DBA' views are supported). (https://docs.oracle.com/cd/E11882_01/nav/catalog_views.htm)"
Collaborator

Suggested change
- description="The data dictionary views mode, to extract information about schema objects ('All' and 'DBA' views are supported). (https://docs.oracle.com/cd/E11882_01/nav/catalog_views.htm)"
+ description="The data dictionary views mode, to extract information about schema objects ('ALL' and 'DBA' views are supported). (https://docs.oracle.com/cd/E11882_01/nav/catalog_views.htm)"

Contributor Author

Fixed
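
Since only 'ALL' and 'DBA' are supported, the value could also be checked when the config is parsed. The following is a minimal sketch using only the standard library; the real config is a pydantic model, and the names `OracleConfigSketch` and `SUPPORTED_MODES` are hypothetical, introduced here for illustration only.

```python
from dataclasses import dataclass

# Hypothetical constant: the two modes named in the field description.
SUPPORTED_MODES = ("ALL", "DBA")


@dataclass
class OracleConfigSketch:
    """Stand-in for the plugin config; not the actual DataHub class."""

    data_dictionary_mode: str = "ALL"

    def __post_init__(self) -> None:
        # Normalize case so that 'dba' and 'DBA' are treated the same.
        mode = self.data_dictionary_mode.upper()
        if mode not in SUPPORTED_MODES:
            raise ValueError(
                f"data_dictionary_mode must be one of {SUPPORTED_MODES}, got {mode!r}"
            )
        self.data_dictionary_mode = mode
```

With this shape, an unsupported value fails early at config time instead of silently falling through to one of the code paths.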

@@ -186,13 +1002,16 @@ def get_inspectors(self) -> Iterable[Inspector]:
event.listen(
inspector.engine, "before_cursor_execute", before_cursor_execute
)
logger.info(f'Data dictionary mode is: "{self.config.data_dictionary_mode}".')
if self.config.data_dictionary_mode != OracleConfig.__fields__.get("data_dictionary_mode").default:
Collaborator

Suggested change
- if self.config.data_dictionary_mode != OracleConfig.__fields__.get("data_dictionary_mode").default:
+ # Sqlalchemy inspector uses ALL_* tables as per oracle dialect implementation.
+ # OracleInspectorObjectWrapper provides alternate implementation using DBA_* tables.
+ if self.config.data_dictionary_mode != "ALL":

Contributor Author

Fixed
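
The selection logic in the suggestion above can be sketched in isolation. This is an illustration under assumptions, not the merged code: `DefaultInspector`, `DbaInspectorWrapper` and `select_inspectors` are hypothetical stand-ins for the SQLAlchemy inspector, the wrapper class and `get_inspectors`.

```python
from typing import Iterable, Iterator, Union


class DefaultInspector:
    """Stand-in for the stock inspector, which reads ALL_* views."""

    source = "ALL"


class DbaInspectorWrapper:
    """Stand-in for a wrapper that reads DBA_* views instead."""

    source = "DBA"

    def __init__(self, inner: DefaultInspector) -> None:
        # Keep the wrapped inspector so its connection can be reused.
        self.inner = inner


def select_inspectors(
    inspectors: Iterable[DefaultInspector], mode: str
) -> Iterator[Union[DefaultInspector, DbaInspectorWrapper]]:
    for inspector in inspectors:
        if mode != "ALL":
            # Non-default mode: swap in the alternate implementation.
            yield DbaInspectorWrapper(inspector)
        else:
            yield inspector
```

Comparing against the literal "ALL" keeps the check readable, at the cost of repeating the default value in two places.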

@@ -97,6 +132,7 @@ def get_identifier(self, schema: str, table: str) -> str:
return regular


@inspector_wraper_usage_notificcation(class_usage_notification)
Collaborator

I'm guessing this decorator was added for debugging/reporting inspector methods as they are used. I think, we should be able to remove this now.

Collaborator

If required, debug logs can be added in methods directly.
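
As a rough illustration of that suggestion, a method can emit its own debug log line instead of relying on a class-level decorator. `WrapperSketch` is a hypothetical class, and its method body is a placeholder rather than the real query.

```python
import logging
from typing import List, Optional

logger = logging.getLogger(__name__)


class WrapperSketch:
    """Hypothetical wrapper showing a per-method debug log."""

    def get_table_names(self, schema: Optional[str] = None) -> List[str]:
        # Lazy %-formatting: the message is only built if DEBUG is enabled.
        logger.debug("get_table_names called with schema=%s", schema)
        return []  # placeholder; the real method would query the database
```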

Contributor Author

@sleeperdeep sleeperdeep Oct 4, 2023

@@ -108,26 +144,163 @@ def __init__(self, inspector_instance: Inspector):
# tables that we don't want to ingest into the DataHub
self.exclude_tablespaces: Tuple[str, str] = ("SYSTEM", "SYSAUX")

def has_table(self, table_name, schema=None):
Collaborator

DataHub uses a few selective methods from inspector interface. You can look them up here . Also listed below for quick reference. I believe it should suffice to implement only these, as required.

get_schema_names
get_table_names
get_view_names
get_pk_constraint
get_table_comment
get_view_definition
get_columns
get_foreign_keys

I doubt whether other methods implemented in wrapper here are needed.
These methods are not used by Datahub and therefore should be removed.

has_table
has_sequence
_resolve_synonym
_prepare_reflection_args
get_temp_table_names
get_table_options
get_indexes
get_check_constraints

Contributor Author

Unused methods were deleted.
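
One way to keep a wrapper aligned with the list of required methods is a small check like the one below. This is a sketch under assumptions: `REQUIRED_METHODS` simply copies the eight names from the review comment, while `missing_methods` and `PartialWrapper` are hypothetical helpers, not part of DataHub.

```python
from typing import List

# The eight inspector methods listed in the review comment.
REQUIRED_METHODS = (
    "get_schema_names",
    "get_table_names",
    "get_view_names",
    "get_pk_constraint",
    "get_table_comment",
    "get_view_definition",
    "get_columns",
    "get_foreign_keys",
)


def missing_methods(cls: type) -> List[str]:
    """Return the required method names that cls does not define."""
    return [
        name for name in REQUIRED_METHODS if not callable(getattr(cls, name, None))
    ]


class PartialWrapper:
    """Hypothetical wrapper that implements only two of the methods."""

    def get_schema_names(self) -> List[str]:
        return []

    def get_table_names(self, schema: str = "") -> List[str]:
        return []
```

A unit test could assert that `missing_methods` returns an empty list for the real wrapper class.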

@hsheth2 hsheth2 added the community-contribution PR or Issue raised by member(s) of DataHub Community label Sep 29, 2023
sleeperdeep and others added 2 commits October 4, 2023 13:31
…le module (1. fixed syntax mistakes 2. simplified data_dictionary_mode value check 3. usage_decorator_wrapped was deleted 4. unused methods were deleted)
Collaborator

@mayurinehate mayurinehate left a comment

Hey @sleeperdeep, can you please fix the oracle tests for your changes? It looks like the default behavior earlier corresponded to data_dictionary_mode=DBA. I'm fine with the current default as well, but you may need to regenerate the golden files for the oracle tests. This should help: https://datahubproject.io/docs/metadata-ingestion/developing#updating-golden-test-files

@sleeperdeep
Contributor Author

Thank you for the link, it will help. I will try to fix this in the near future.

@hsheth2 hsheth2 added the pending-submitter-response Issue/request has been reviewed but requires a response from the submitter label Oct 27, 2023
@maggiehays maggiehays removed the pending-submitter-response Issue/request has been reviewed but requires a response from the submitter label Nov 13, 2023
@@ -90,6 +90,8 @@ class SQLCommonConfig(
profiling: GEProfilingConfig = GEProfilingConfig()
# Custom Stateful Ingestion settings
stateful_ingestion: Optional[StatefulStaleMetadataRemovalConfig] = None
# Custom data_dictionary_mode
data_dictionary_mode: Optional[str] = None
Collaborator

This should not be part of SQLCommonConfig. Can you remove this ?

Contributor Author

@mayurinehate I've deleted it from SQLCommonConfig, but mypy now reports an attr-defined error.

@@ -299,7 +336,7 @@ def get_columns(
computed = None

if identity_options is not None:
- identity = self._inspector_instance.dialect._parse_identity_options(identity_options, default_on_nul)
+ identity = self.parse_identity_options(identity_options, default_on_nul)
Collaborator

The motivation behind this change is not clear. Can you help me understand why it is needed?

@@ -323,7 +360,7 @@ def get_columns(
return columns

def get_table_comment(
- self, table_name: str, schema: str = None
+ self, table_name: Optional[str], schema: Optional[str] = None
Collaborator

same for this

@mayurinehate
Collaborator

Hey @sleeperdeep, did you get a chance to go through the comments and update the golden files? Let me know if you have any questions.

It would be great if you can verify oracle tests are passing locally by running them in development environment activated using this.
pytest tests/unit/test_oracle_source.py
pytest tests/integration/oracle/test_oracle.py

…le module (1. fix integration tests 2. update golden-files)
…le module (1. fix integration tests 2. update golden-files 3. delete debug rows)
@sleeperdeep
Contributor Author

Hi all! @mayurinehate, @hsheth2, @maggiehays, @mhw
Could you please review the new commit?

@mayurinehate
Collaborator

Thanks for the update @sleeperdeep . I can take a look early next week. Meanwhile - there seem to be lint and test failures, if you can take a look and fix those.

If you have already set up and activated a venv as described in the developing guide, you should be able to run the command below to find and fix lint issues:

black src/ tests/ && isort src/ tests/ && flake8 src/ tests/ && mypy src/ tests/

and the command below to find and fix test issues:
pytest tests/unit/test_oracle_source.py tests/integration/oracle/test_oracle.py

Feel free to reach out to me on datahub slack if any questions.

…cle module (1.fix integration tests 2.update golden-files 3.delete debug rows 4.fix mypy tests)
…cle module (1.fix integration tests 2.update golden-files 3.delete debug rows 4.fix mypy tests)
Collaborator

@mayurinehate mayurinehate left a comment

Left a few minor comments. Everything else looks good to me.
Thanks for the amazing work.

@@ -52,13 +57,20 @@ def before_cursor_execute(conn, cursor, statement, parameters, context, executem
cursor.outputtypehandler = output_type_handler


def class_usage_notification(cls, func):
Collaborator

This can be removed, right? It's not used anywhere.

Contributor Author

Yes, it can. The main idea behind this method was to log usage of the wrapper class.

# Sqlalchemy inspector uses ALL_* tables as per oracle dialect implementation.
# OracleInspectorObjectWrapper provides alternate implementation using DBA_* tables.
if self.config.data_dictionary_mode != "ALL":
yield cast(Inspector, OracleInspectorObjectWrapper(inspector))
Collaborator

Wondering, if we can rename OracleInspectorObjectWrapper to something like OracleDBAInspectorObjectWrapper, to make this more explicit in class name itself.

Contributor Author

Oracle allows three levels of access (based on levels of privilege): ALL, DBA, and USER. Right now ALL mode is used as the default, and DBA mode is implemented by this PR. Potentially, USER mode could be implemented in a future contribution, in which case renaming this class would not help. I can extend the docstring if necessary.

Collaborator

I see.
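
To make the three-prefix idea concrete, here is a small sketch of deriving a dictionary view name from the configured mode. The helper `dictionary_view` and the constant `DICTIONARY_PREFIXES` are hypothetical; the PR itself hard-codes the DBA_ view names in its SQL. Note that USER_ views generally omit the OWNER column, so supporting that mode would likely need query changes beyond the prefix.

```python
# The three prefixes of Oracle's static data dictionary views.
DICTIONARY_PREFIXES = ("ALL", "DBA", "USER")


def dictionary_view(mode: str, suffix: str) -> str:
    """Build a view name such as DBA_TAB_COMMENTS from a mode and a suffix."""
    prefix = mode.upper()
    if prefix not in DICTIONARY_PREFIXES:
        raise ValueError(f"unknown data dictionary mode: {mode!r}")
    return f"{prefix}_{suffix.upper()}"
```

For example, `dictionary_view("dba", "tab_comments")` yields `DBA_TAB_COMMENTS`.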

return columns

def get_table_comment(
self, table_name: Optional[str], schema: Optional[str] = None
Collaborator

table_name should not be optional, right ?

Contributor Author

@mayurinehate
Yes, it should not be optional from the perspective of the logic that extracts data from the database. But from the perspective of the class structure and its types, the denormalize_name method in the original sqlalchemy source and the denormalize_name method in sqlalchemy-stubs differ. This leaves us choosing between two options: use Optional[str] for the main methods (get_schema_names, get_table_names, get_view_names, get_pk_constraint, get_table_comment, get_view_definition, get_columns, get_foreign_keys), or propagate "# type: ignore" through all usages of the denormalize_name method.

[image]

<->

[image]

Collaborator

I don't know whether one is better than the other. @hsheth2, any thoughts? Also, could you please take a look and get this PR merged? It all looks good to me.

Collaborator

For all the main methods (e.g. get_table_names, get_view_names, get_pk_constraint, get_table_comment, get_view_definition, get_columns, get_foreign_keys) - these do actually require table_name: str, so declaring it to be optional feels misleading. I'd prefer something like below, or using type: ignore if the below doesn't work for some reason

def get_table_comment(self, table_name: str, schema: Optional[str] = None):
  denormalized_table_name: Optional[str] = self._inspector_instance.dialect.denormalize_name(table_name)
  assert denormalized_table_name is not None
  table_name = denormalized_table_name  # or maybe just use `denormalized_table_name` in the remainder of the method

Contributor Author

@hsheth2
I fixed it. Agree, this approach is better.
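
The narrowing pattern agreed on here can be shown as a standalone sketch. The function `denormalize_name` below is a simplified stand-in that only mirrors the Optional signature discussed above (it just upper-cases the name and is not SQLAlchemy's implementation), and `get_table_comment_key` is a hypothetical helper.

```python
from typing import Optional


def denormalize_name(name: Optional[str]) -> Optional[str]:
    """Simplified stand-in: Optional in, Optional out, upper-casing the name."""
    if name is None:
        return None
    return name.upper()


def get_table_comment_key(table_name: str, schema: Optional[str] = None) -> str:
    # The public signature requires a str, so callers cannot pass None.
    denormalized: Optional[str] = denormalize_name(table_name)
    # Narrow Optional[str] to str for the type checker.
    assert denormalized is not None
    owner = denormalize_name(schema)
    return f"{owner}.{denormalized}" if owner else denormalized
```

The assert costs nothing meaningful at runtime and avoids both a misleading Optional parameter and scattered `# type: ignore` comments.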

sleeperdeep and others added 2 commits January 30, 2024 14:45
…cle module (1.fix integration tests 2.update golden-files 3.delete debug rows 4.fix mypy tests 5.delete class_usage_notification method)
Collaborator

@hsheth2 hsheth2 left a comment

mostly looks good, the main thing is tweaking those Optional[str] annotations


rp = self._inspector_instance.bind.execute(sql.text(text), params).scalar()
if rp:
if py2k:
Collaborator

given that we're on python 3 only, is this necessary?

Contributor Author

Agreed. It was removed from the module.


return {"text": c.scalar()}

def _get_constraint_data(
Collaborator

just for my own understanding, most of this code is pretty similar to the code from the default oracle sqlalchemy dialect right now, right?

Contributor Author

Yes. The main idea is to read from the other data dictionary views (DBA instead of ALL) without making global changes to the original modules.

COMMENT_SQL = """
SELECT comments
FROM dba_tab_comments
WHERE table_name = CAST(:table_name AS VARCHAR(128))
Collaborator

eventually we'll want to replace this with a single bulk fetch for the full schema, instead of fetching each table one at a time

Contributor Author

I agree with you. A single query that fetches all the information in one go is preferable, but that was outside the scope of this PR. I can test it and contribute it in a follow-up PR.
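
The bulk-fetch idea can be sketched with an in-memory SQLite database standing in for Oracle. Everything here is an assumption for illustration: the `tab_comments` table only imitates the shape of `dba_tab_comments` (owner, table_name, comments), and `fetch_comments_bulk` and `demo` are hypothetical helpers.

```python
import sqlite3
from typing import Dict, Optional


def fetch_comments_bulk(
    conn: sqlite3.Connection, owner: str
) -> Dict[str, Optional[str]]:
    """Fetch comments for every table of one owner with a single query."""
    rows = conn.execute(
        "SELECT table_name, comments FROM tab_comments WHERE owner = ?",
        (owner,),
    )
    # Map table name to comment so later lookups need no further queries.
    return {table_name: comments for table_name, comments in rows}


def demo() -> Dict[str, Optional[str]]:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tab_comments (owner TEXT, table_name TEXT, comments TEXT)"
    )
    conn.executemany(
        "INSERT INTO tab_comments VALUES (?, ?, ?)",
        [
            ("HR", "EMPLOYEES", "staff records"),
            ("HR", "JOBS", None),
            ("SALES", "ORDERS", "customer orders"),
        ],
    )
    return fetch_comments_bulk(conn, "HR")
```

Compared with one query per table, this trades a slightly larger result set for far fewer round trips when a schema has many tables.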

…le module (1.fix integration tests 2.update golden-files 3.delete debug rows 4.fix mypy tests 5.delete class_usage_notification method 6.fix argument type in main methods 7. replace sqlalchemy imported method)
@hsheth2 hsheth2 added the merge-pending-ci A PR that has passed review and should be merged once CI is green. label Feb 9, 2024
@hsheth2
Collaborator

hsheth2 commented Feb 9, 2024

The test_create_list_get_ingestion_execution_request is a known issue, so merging this in.

Thanks @sleeperdeep for the incredible contribution 🎉

@hsheth2 hsheth2 changed the title feat(ingestion): add ability to specify data dictionary (ALL_ or DBA_… feat(ingest/oracle): support changing data dictionary (ALL_ or DBA_) Feb 9, 2024
@hsheth2 hsheth2 merged commit 7d73c41 into datahub-project:master Feb 9, 2024
53 of 54 checks passed
Labels
community-contribution PR or Issue raised by member(s) of DataHub Community ingestion PR or Issue related to the ingestion of metadata merge-pending-ci A PR that has passed review and should be merged once CI is green.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature request : optional import of system tables in Oracle source
4 participants