Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generic populated field detection #196

Merged
merged 8 commits into from
Mar 22, 2024
Merged

Generic populated field detection #196

merged 8 commits into from
Mar 22, 2024

Conversation

dogversioning
Copy link
Contributor

This PR makes the following changes:

  • Detecting a complex nested field is now a generic util function
  • Added (temporarily) a denormalized creation template for Coding objects
  • some testing/linting incremental updates
  • added a test for discovery

Checklist

  • Consider if documentation (like in docs/) needs to be updated
  • Consider if tests should be added
  • Update template repo if there are changes to study configuration

pyproject.toml Show resolved Hide resolved
tests/test_data/core/core__encounter.txt Outdated Show resolved Hide resolved
tests/test_duckdb.py Show resolved Hide resolved
cumulus_library/studies/discovery/code_detection.py Outdated Show resolved Hide resolved
cumulus_library/template_sql/sql_utils.py Outdated Show resolved Hide resolved
Comment on lines 233 to 234
target_field: str,
expected=None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some part of these two params is still lightly confusing to me. Maybe it just needs some explanation in the docstring?

target_field is required, but only used if expected is not provided. Is it better to require a full expected list all the time? Or always add target_field to the expected list regardless of whether it was provided or not? The current approach just feels a little surprise-inducing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is back to 'the default is to treat the field as if it's a codeableconcept, unless you specify otherwise', which the more i think about it, the worse it feels.

i could've sworn i put a docstring on this, but might've lost it switching branches - i'll get something in here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, added a docstring for this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yeah that helps, but it's a little odd now.

    :param target_field: The name of the nested value to check for inside the
        schema of source_col
    :param expected: a list of elements that should be present in target_field.
        If none, we assume it is a CodeableConcept.

target_field actually isn't used at all in the current code and should likely be removed from the args list. And expected's docstring should probably say "present in source_col" - cause it's frequently looking for fields anywhere up and down the hierarchy (like codeable concepts looking for both coding and system)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, updated this docstring again - too many moving targets in this thing.

cumulus_library/template_sql/sql_utils.py Show resolved Hide resolved
cumulus_library/template_sql/sql_utils.py Show resolved Hide resolved
cumulus_library/template_sql/sql_utils.py Show resolved Hide resolved
cumulus_library/template_sql/sql_utils.py Outdated Show resolved Hide resolved
source_table="documentreference",
source_id="id",
column_name="content.format",
is_array=False,
column_hierarchy=[("format", dict), ("content", dict)],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is flipped, right? If so, you should probably add a unit test that would catch that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well a test would have caught it if the test data had been correct, so that should be addressed now.

column_name="code",
is_array=False,
column_hierarchy=[("code", dict)],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah this consolidation of arguments makes sense, yeah.

nit: maybe now that there's less overload on the word, you could just use columns instead of column_hierarchy which is easy to say or type wrong. 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't like columns because ordering is important and not implicit in that name. open to other ideas.

tests/test_templates.py Outdated Show resolved Hide resolved
parent_field=None
if len(config.column_hierarchy) == 1
else config.column_hierarchy[0][0],
is_array=True if config.column_hierarchy[0][1] == list else False,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True if bool_expression else False is a verbose way to say bool_expression. I'm guessing it looked odd to have a bare equality check in the middle of the kwargs. Maybe throw a parens around it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

personally for something inline i find this verbose form more readable, but i can do this.

Comment on lines +118 to +119
column_name=config.column_hierarchy[1][0],
parent_field=config.column_hierarchy[0][0],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is uncomfortably hardcoded - you always have exactly depth=2 calls into this method? I get that it's a temporary function, but still.

How about throwing an assert at the top of this method for that assumption, to safeguard against accidents during the temporary period.

(And likewise, in get_codeable_concept_denormalize_query above, maybe assert len(config.column_hierarchy) <= 2)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently it's used in one place. but i can block with assert for now.


@dataclass(kw_only=True)
class FhirObjectDNConfig(abc.ABC):
"""ABC for handling table detection/denormalization of objects in SQL"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

micro nit: ABC is a bit jargony for a docstring, to my eyes. Especially because it can be displayed in contexts removed from the abc.ABC inheritance a line above.

Speaking of jargon, I'm lightly leery of encoding DN in this class name.

I don't find "denormalization" to be super evocative. And definitely not on the tip of my tongue enough to know what DN means without seeing the longer word right next to it. Like, I only know what you mean by denormalization from context. If I look it up online, I get stuff about injecting duplicate data into databases (which I guess the unnesting process does do, in the columns you don't care about, but that's not like the main thrust of it in my mind). "Flattening" feels more natural of a description to me. Or simply call it by its keyword of "unnesting." But maybe there's a yet better word.

But already in this docstring you have two use cases mentioned - schema detection and denormalization, but DN gets the honor of being in the class name. Anyway, not important - and you can keep using denormalization for it. Just speaking as a filthy casual over here, that it's not an intuitive word to my ears.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed this to say 'base class' and named the thing a BaseConfig.

Comment on lines 233 to 234
target_field: str,
expected=None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yeah that helps, but it's a little odd now.

    :param target_field: The name of the nested value to check for inside the
        schema of source_col
    :param expected: a list of elements that should be present in target_field.
        If none, we assume it is a CodeableConcept.

target_field actually isn't used at all in the current code and should likely be removed from the args list. And expected's docstring should probably say "present in source_col" - cause it's frequently looking for fields anywhere up and down the hierarchy (like codeable concepts looking for both coding and system)

@dogversioning dogversioning merged commit 528b72c into main Mar 22, 2024
3 checks passed
@dogversioning dogversioning deleted the mg/generic_dn branch March 22, 2024 13:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants