-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(ingest/oracle): support changing data dictionary (ALL_ or DBA_) #8873
feat(ingest/oracle): support changing data dictionary (ALL_ or DBA_) #8873
Conversation
…) mode for oracle module
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR. I've left a few comments for changes.
Also, oracle tests are failing in CI. Can you please fix the tests ?
# custom | ||
data_dictionary_mode: Optional[str] = Field( | ||
default='ALL', | ||
description="The data dictionary views mode, to extract information about schema objects ('All' and 'DBA' views are supported). (https://docs.oracle.com/cd/E11882_01/nav/catalog_views.htm)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
description="The data dictionary views mode, to extract information about schema objects ('All' and 'DBA' views are supported). (https://docs.oracle.com/cd/E11882_01/nav/catalog_views.htm)" | |
description="The data dictionary views mode, to extract information about schema objects ('ALL' and 'DBA' views are supported). (https://docs.oracle.com/cd/E11882_01/nav/catalog_views.htm)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
@@ -186,13 +1002,16 @@ def get_inspectors(self) -> Iterable[Inspector]: | |||
event.listen( | |||
inspector.engine, "before_cursor_execute", before_cursor_execute | |||
) | |||
logger.info(f'Data dictionary mode is: "{self.config.data_dictionary_mode}".') | |||
if self.config.data_dictionary_mode != OracleConfig.__fields__.get("data_dictionary_mode").default: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if self.config.data_dictionary_mode != OracleConfig.__fields__.get("data_dictionary_mode").default: | |
# Sqlalchemy inspector uses ALL_* tables as per oracle dialect implementation. | |
# OracleInspectorObjectWrapper provides alternate implementation using DBA_* tables. | |
if self.config.data_dictionary_mode != "ALL": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
@@ -97,6 +132,7 @@ def get_identifier(self, schema: str, table: str) -> str: | |||
return regular | |||
|
|||
|
|||
@inspector_wraper_usage_notificcation(class_usage_notification) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing this decorator was added for debugging/reporting inspector methods as they are used. I think, we should be able to remove this now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If required, debug logs can be added in methods directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mayurinehate Fixed
@@ -108,26 +144,163 @@ def __init__(self, inspector_instance: Inspector): | |||
# tables that we don't want to ingest into the DataHub | |||
self.exclude_tablespaces: Tuple[str, str] = ("SYSTEM", "SYSAUX") | |||
|
|||
def has_table(self, table_name, schema=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DataHub uses a few selective methods from inspector interface. You can look them up here . Also listed below for quick reference. I believe it should suffice to implement only these, as required.
get_schema_names
get_table_names
get_view_names
get_pk_constraint
get_table_comment
get_view_definition
get_columns
get_foreign_keys
I doubt whether other methods implemented in wrapper here are needed.
These methods are not used by Datahub and therefore should be removed.
has_table
has_sequence
_resolve_synonym
_prepare_reflection_args
get_temp_table_names
get_table_options
get_indexes
get_check_constraints
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused methods were deleted.
…le module (1. fixed syntax mistakes 2. simplified data_dictionary_mode value check 3. usage_decorator_wrapped was deleted 4. unused methods were deleted)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @sleeperdeep , can you please fix oracle tests for your changes. It looks like the default behavior earlier was for data_dictionay_mode=DBA
. I'm fine with current default as well but you may need to regenerate golden files for oracle tests. This should help there - https://datahubproject.io/docs/metadata-ingestion/developing#updating-golden-test-files
Thank you for the link! It will help me. Try to fix it in near future. |
…_ or DBA_…" This reverts commit 284bb31.
…) mode for oracle module, fix tests.
…com/sleeperdeep/datahub into feature/oracle_data_dictionary_mode
…) mode for oracle module, fix tests.
@@ -90,6 +90,8 @@ class SQLCommonConfig( | |||
profiling: GEProfilingConfig = GEProfilingConfig() | |||
# Custom Stateful Ingestion settings | |||
stateful_ingestion: Optional[StatefulStaleMetadataRemovalConfig] = None | |||
# Custom data_dictionary_mode | |||
data_dictionary_mode: Optional[str] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not be part of SQLCommonConfig. Can you remove this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mayurinehate I`ve deleted it from SQLCommonConfig, but mypy returns attr-defined error.
@@ -299,7 +336,7 @@ def get_columns( | |||
computed = None | |||
|
|||
if identity_options is not None: | |||
identity = self._inspector_instance.dialect._parse_identity_options(identity_options, default_on_nul) | |||
identity = self.parse_identity_options(identity_options, default_on_nul) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Motivation behind this change is not clear . Can you help me understand why this change is needed ?
@@ -323,7 +360,7 @@ def get_columns( | |||
return columns | |||
|
|||
def get_table_comment( | |||
self, table_name: str, schema: str = None | |||
self, table_name: Optional[str], schema: Optional[str] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same for this
Hey @sleeperdeep did you get chance to go through the comments and update golden files ? Let me know if you have any queries. It would be great if you can verify oracle tests are passing locally by running them in development environment activated using this. |
…le module (1. fix integration tests 2. update golden-files)
…le module (1. fix integration tests 2. update golden-files 3. delete debug rows)
Hi all! @mayurinehate, @hsheth2, @maggiehays, @mhw |
Thanks for the update @sleeperdeep . I can take a look early next week. Meanwhile - there seem to be lint and test failures, if you can take a look and fix those. If you have already setup and activated
and below commands to find and fix test issues Feel free to reach out to me on datahub slack if any questions. |
…cle module (1.fix integration tests 2.update golden-files 3.delete debug rows 4.fix mypy tests)
…cle module (1.fix integration tests 2.update golden-files 3.delete debug rows 4.fix mypy tests)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left few minor comments. Everything else looks good to me.
Thanks for the amazing work.
@@ -52,13 +57,20 @@ def before_cursor_execute(conn, cursor, statement, parameters, context, executem | |||
cursor.outputtypehandler = output_type_handler | |||
|
|||
|
|||
def class_usage_notification(cls, func): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be removed, right ? Its not used anywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it can. Main idea behind this method consist in logging usage of wrapper class.
# Sqlalchemy inspector uses ALL_* tables as per oracle dialect implementation. | ||
# OracleInspectorObjectWrapper provides alternate implementation using DBA_* tables. | ||
if self.config.data_dictionary_mode != "ALL": | ||
yield cast(Inspector, OracleInspectorObjectWrapper(inspector)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering, if we can rename OracleInspectorObjectWrapper
to something like OracleDBAInspectorObjectWrapper
, to make this more explicit in class name itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oracle allows three levels of access (based on levels of privilege): ALL, DBA, USER. Right now ALL mode is using as default. DBA mode is implemented by this PR. But pottentially, mode USER can be implemented as future contribution. In this case renaming of this class will not give nothing. I can extend docstring, if it is necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see.
return columns | ||
|
||
def get_table_comment( | ||
self, table_name: Optional[str], schema: Optional[str] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
table_name
should not be optional, right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mayurinehate
Yes, it should. But from perspective of logic of exctraction data from db. From perspective of structure of classes and it`s types denormalize_name method in original sqlalchemy source and denormalize_name method in sqlalchemy-stup are differ. It leads to situation, where we chould choose what is better: use Optional[str] for main methods (get_schema_names, get_table_names, get_view_names, get_pk_constraint, get_table_comment, get_view_definition, get_columns, get_foreign_keys) or propagate "# type: ignore" throught all usages of denormalize_name method.
<->
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if one is better than other. @hsheth2 - any thoughts ? Also - could you please take a look and get this PR merged ? Its looks all good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For all the main methods (e.g. get_table_names, get_view_names, get_pk_constraint, get_table_comment, get_view_definition, get_columns, get_foreign_keys) - these do actually require table_name: str
, so declaring it to be optional feels misleading. I'd prefer something like below, or using type: ignore
if the below doesn't work for some reason
def get_table_comment(self, table_name: str, schema: Optional[str] = None)
denormalized_table_name: Optional[str] = self._inspector_instance.dialect.denormalize_name(table_name)
assert denormalized_table_name is not None
denormalized_table_name = table_name # or maybe just use `denormalized_table_name` in the remainder of the method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hsheth2
I fixed it. Agree, this approach is better.
…cle module (1.fix integration tests 2.update golden-files 3.delete debug rows 4.fix mypy tests 5.delete class_usage_notification method)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mostly looks good, the main thing is tweaking those Optional[str]
annotations
|
||
rp = self._inspector_instance.bind.execute(sql.text(text), params).scalar() | ||
if rp: | ||
if py2k: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given that we're on python 3 only, is this necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. Was removed from module.
|
||
return {"text": c.scalar()} | ||
|
||
def _get_constraint_data( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just for my own understanding, most of this code is pretty similar to the code from the default oracle sqlalchemy dialect right now, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Main idea consist in accessing other schema (DBA instead of ALL) without global changes of original modules.
COMMENT_SQL = """ | ||
SELECT comments | ||
FROM dba_tab_comments | ||
WHERE table_name = CAST(:table_name AS VARCHAR(128)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
eventually we'll want to replace this with a single bulk fetch for the full schema, instead of fetching each table one at a time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you. This approach (single query to fetch all info in one go) is more preferable. Idea of this PR was different. I can test it and contribute in next PR.
…le module (1.fix integration tests 2.update golden-files 3.delete debug rows 4.fix mypy tests 5.delete class_usage_notification method 6.fix argument type in main methods 7. replace sqlalchemy imported method)
The Thanks @sleeperdeep for incredible contribution 🎉 |
actual discussion in community
Fixes #6711
Checklist