Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updates for i2b2 export support #126

Merged
merged 5 commits into from
Oct 12, 2023
Merged

Updates for i2b2 export support #126

merged 5 commits into from
Oct 12, 2023

Conversation

dogversioning
Copy link
Contributor

@dogversioning dogversioning commented Sep 29, 2023

This PR makes the following changes:

  • Updates to the clean arg group (really only used by us, and infrequently) to adapt to prefixes in manifests, and cleaning up dangling old tables/user created tables.
  • Updates supporting i2b2 injestion
    • Moved encounter coding output to link to denormalized tables
    • Added static table for encounter class display fields
    • Fixed namespace class vs display distinction

Checklist

  • Consider if documentation (like in docs/) needs to be updated
  • Consider if tests should be added
  • Run pylint if you're making changes beyond adding studies
  • Update template repo if there are changes to study configuration

@dogversioning dogversioning force-pushed the mg/i2b2_update branch 2 times, most recently from 72bc8eb to 78ed1b2 Compare October 11, 2023 15:38
@dogversioning dogversioning marked this pull request as ready for review October 11, 2023 18:41
cumulus_library/study_parser.py Outdated Show resolved Hide resolved
cumulus_library/study_parser.py Outdated Show resolved Hide resolved
@@ -89,7 +89,7 @@ def reset_data_path(self, study: PosixPath) -> None:

### Creating studies

def clean_study(self, targets: List[str]) -> None:
def clean_study(self, targets: List[str], explicit_prefix=False) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you expand a bit on what the explicit prefix stuff is all about? It makes me lightly nervous to allow cleaning on any old prefix - just seems very "shoot yourself in the foot" territory.

If it's to clean up for oddball historical tables... Maybe instead of an explicit prefix, we could do explicit full-name matching. I.e. Don't allow cleaning on ^m* but rather just ^mistake$ -- or is that not a real issue / not a good idea?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we're doing explicit full name matching, we may as well just manually drop the tables in athena directly. if you mean word matching, then maybe?

There are three use cases:

  • as you say, historical data created by the library without the dunder syntax
  • User created tables - things like study ideation, etc etc
  • Removing a study when you don't have the manifest on hand

I'm more comfortable with this being an option here as opposed to the etl letting you do wonky stuff like this, because by definition this is not a gold source, and should be treated as being ephemeral by its nature.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern was just avoiding accidental deletion.

  • Yeah you could manually drop the tables in Athena, but that's more technical - I had assumed this was exactly that, but for casuals who aren't 100% how to do that.
    • Whether this "full name" idea makes sense depends on how many tables we're talking about in those use cases. Like 10 might be annoying to do individually. But 3? 🤷
  • It might be nice to have like, a guard against destructive actions, since this can affect tables the user might not be thinking about (like, they forget that covid_symptoms__ exists when they're trying to delete some old covid_junk)
    • Something like This command will delete these tables: [...] Continue? [y/N]
    • Doesn't have to be this PR, but it feels like a nice guard, now that we're moving past single study actions.

Regardless, this is fine - I just get real nervous about wildcard deletes with no ability to undo it 😄

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I know that's ironic coming from someone who recently had the NdjsonFormat class uncritically delete a dir. But that did get fixed for a reason.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i like the list & confirm idea - i'll either squeeze it in now or make a ticket for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cumulus_library/cli_parser.py Outdated Show resolved Hide resolved
Comment on lines +448 to +450
dataframe = dataframe.sort_values(
by=list(dataframe.columns), ascending=False, na_position="first"
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After seeing the churn in the reference data files, I threw this in, which gets us back to the original format for exports before we removed the ORDER BY clauses for performance reasons.

@dogversioning dogversioning merged commit eae81c1 into main Oct 12, 2023
3 checks passed
@dogversioning dogversioning deleted the mg/i2b2_update branch October 12, 2023 17:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants