-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updates for i2b2 export support #126
Conversation
72bc8eb
to
78ed1b2
Compare
78ed1b2
to
e84967e
Compare
cumulus_library/cli.py
Outdated
@@ -89,7 +89,7 @@ def reset_data_path(self, study: PosixPath) -> None: | |||
|
|||
### Creating studies | |||
|
|||
def clean_study(self, targets: List[str]) -> None: | |||
def clean_study(self, targets: List[str], explicit_prefix=False) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you expand a bit on what the explicit prefix stuff is all about? It makes me lightly nervous to allow cleaning on any old prefix - just seems very "shoot yourself in the foot" territory.
If it's to clean up for oddball historical tables... Maybe instead of an explicit prefix, we could do explicit full-name matching. I.e. Don't allow cleaning on ^m*
but rather just ^mistake$
-- or is that not a real issue / not a good idea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we're doing explicit full name matching, we may as well just manually drop the tables in athena directly. if you mean word matching, then maybe?
There are three use cases:
- as you say, historical data created by the library without the dunder syntax
- User created tables - things like study ideation, etc etc
- Removing a study when you don't have the manifest on hand
I'm more comfortable with this being an option here as opposed to the etl letting you do wonky stuff like this, because by definition this is not a gold source, and should be treated as being ephemeral by its nature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concern was just avoiding accidental deletion.
- Yeah you could manually drop the tables in Athena, but that's more technical - I had assumed this was exactly that, but for casuals who aren't 100% how to do that.
- Whether this "full name" idea makes sense depends on how many tables we're talking about in those use cases. Like 10 might be annoying to do individually. But 3? 🤷
- It might be nice to have like, a guard against destructive actions, since this can affect tables the user might not be thinking about (like, they forget that covid_symptoms__ exists when they're trying to delete some old covid_junk)
- Something like
This command will delete these tables: [...] Continue? [y/N]
- Doesn't have to be this PR, but it feels like a nice guard, now that we're moving past single study actions.
- Something like
Regardless, this is fine - I just get real nervous about wildcard deletes with no ability to undo it 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I know that's ironic coming from someone who recently had the NdjsonFormat class uncritically delete a dir. But that did get fixed for a reason.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i like the list & confirm idea - i'll either squeeze it in now or make a ticket for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dataframe = dataframe.sort_values( | ||
by=list(dataframe.columns), ascending=False, na_position="first" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After seeing the churn in the reference data files, I threw this in, which gets us back to the original format for exports before we removed the ORDER BY clauses for performance reasons.
This PR makes the following changes:
clean
arg group (really only used by us, and infrequently) to adapt to prefixes in manifests, and cleaning up dangling old tables/user created tables.Checklist
docs/
) needs to be updated