Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added archival mode #183

Merged
merged 4 commits into from
Feb 13, 2024
Merged

Added archival mode #183

merged 4 commits into from
Feb 13, 2024

Conversation

dogversioning
Copy link
Contributor

This PR makes the following changes:

  • Adds an --archive flag to the export mode, with a warning about how this is potentially sensitive:
    image
  • Removes the create argument per the docs being a 'good enough' source Tech Debt: Study creation cleanup #122
  • Removes a nested field from encounter we only caught because dictionaries are complained about when sorting dataframes
  • Regenned reference SQL after the yesterday's commits

Checklist

  • Consider if documentation (like in docs/) needs to be updated - They do, will do in one go
  • Consider if tests should be added
  • Update template repo if there are changes to study configuration

Comment on lines 349 to +350
elif args["action"] == "export":
if args["archive"]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially had a thought that maybe this should be a toplevel action instead of a flag on export but after thinking about it more, decided you had the right of it. Just talking out loud - in case you had thoughts on that too. But I think this makes sense.

Copy link
Contributor Author

@dogversioning dogversioning Feb 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i started off that way too, but in digging in, since it's 90% the same, it was A) faster to get it in here, B) a little DRYer, and C) i kind of like keeping the number of base level cli verbs low.

u.codeable_concept.system = 'http://hl7.org/fhir/sid/icd-10-cm'
u.codeable_concept.system LIKE 'http://hl7.org/fhir/sid/icd-10-cm'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm lightly curious about the performance cost of this. Correctness is more important obvi, but if this is noticeably slower, I imagine the jinja could switch to like if it detects a wildcard.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a half baked idea in my head of moving the system extraction logic into here in some way, or in some other way doing db introspection to help build these queries out in a more nuanced fashion.

but - only doing LIKE when required seems like low hanging fruit in the interim.

cumulus_library/study_parser.py Outdated Show resolved Hide resolved
dataframe.to_parquet(f"{path}/{table}.parquet", index=False)
queries.append(query)
if not archive:
dataframe.to_parquet(f"{path}/{table}.parquet", index=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not parquet too in this case? Just space reasons?

This method is starting to feel a little awkward. You start with forked logic (which tables) and end with forked logic (how to handle the results). Only the middle is shared. Should the middle be moved to a helper method and you have two different methods here, like:

if archive:
    do_archive()
else:
    do_export()

def do_archive():
   get tables
   do_inner_export()
   zip results

def do_export():
    get tables
    do_inner_export
    parquet results

Or maybe instead move some of the "gross" specialized code (like "getting all tables" or "zipping up all tables") into helpers, so that this method can focus just on the if/else-ing.

I dunno. No action needed per se, just started feeling like a lot of very different code based on some if conditions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: parquet, my original thought was 'this is for parking data ahead of a paper', since that where this request originated, and since that's all csvs in our current use case, i deferred to that. but there's no reason :not: to parquet, and i guess that enables you to upload an older set of data to the aggregator, so maybe it's worth doing - and that would decrease some of the branching logic. so lemme do that and then i'll see how that feels w.r.t. breaking things out.

I have a nagging doubt about some of the infrastructure in this file - specifically, since its starting to get large, so I think that biases me towards trying to get everything in one place rather than making more functions.

This might be a bad idea. I feel like at :some: point this might need to get broken out into a set of functions per arg inheriting from a base, but i'm reluctant to go that far at the moment, which might be driving some other less sustainable architecture choices.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved zipping out to utils - that and the parquet cleanup help.

@dogversioning dogversioning merged commit a36f705 into main Feb 13, 2024
3 checks passed
@dogversioning dogversioning deleted the mg/export-archive branch February 13, 2024 14:41
dogversioning added a commit that referenced this pull request Feb 27, 2024
* Added archival mode

* test tweaks

* PR feedback

* moved zip to utils
dogversioning added a commit that referenced this pull request Feb 27, 2024
* Added archival mode

* test tweaks

* PR feedback

* moved zip to utils
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants