Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PSM CLI, table persistence #160

Merged
merged 13 commits into from
Dec 26, 2023
Merged

PSM CLI, table persistence #160

merged 13 commits into from
Dec 26, 2023

Conversation

dogversioning
Copy link
Contributor

@dogversioning dogversioning commented Dec 18, 2023

This PR makes the following changes:

  • Two types of tables are available, persisted outside of study lifecycle - a basic transactions log, and a list of all stats objects in a given study
  • Stats specific:
    • stats tables have a timestamp of creation
    • a view will be created that points to the latest table
    • cli flags added for force recreate/removal
    • Cleaned up PSM flags allowing for access to small data sets
  • Test data generator added for stats table sampling
  • Added data_path as a class var to StudyBuilder/StudyParser to get a write dir to PSM
  • StudyParser tests cut over to DuckDB for unit testing

Checklist

  • Consider if documentation (like in docs/) needs to be updated
  • Consider if tests should be added
  • Run pylint if you're making changes beyond adding studies
  • Update template repo if there are changes to study configuration <- will do this after review in case of any comments

@@ -21,15 +22,14 @@ def __init__(self):
self.queries = []

@abstractmethod
def prepare_queries(self, cursor: object, schema: str):
def prepare_queries(self, cursor: object, schema: str, *args, **kwargs):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I mentioned this in passing, but I wanted to call attention to 'you can now include arbitrary args when extending a tablebuilder' as maybe relevant for metrics work.

cumulus_library/cli.py Show resolved Hide resolved
Comment on lines 268 to 269
warnings.warn(
"Loading an ndjson dir is not supported with --db-type=athena."
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this change for my convenience, since I'm toggling between athena/duckdb fairly frequently. I can be talked out of it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally liked how it worked before (big surprise) but I'm not passionate about it.

My reasoning stemmed from the fact that athena is the default. So a lazy user might accidentally be in athena mode (forgetting to pass --db-type) and if we don't stop and let them know (and how to correct their mistake -- hence my (try duckdb), because maybe they've never used that flag before), their clear user intent will be just dropped on the floor and they might think things are working when they are really not. A warning helps, but a straight error ensures they will notice that we didn't do what they asked.

But 🤷 maybe that user flow is not so important.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm changing this back for now for this PR, but i reserve the right to add a workaround of some kind in the future if it becomes a pain, perhaps some kind of dev only flag or something like that.

cumulus_library/study_parser.py Show resolved Hide resolved
@dogversioning dogversioning changed the title WIP: PSM CLI, table persistence PSM CLI, table persistence Dec 21, 2023
@dogversioning dogversioning marked this pull request as ready for review December 21, 2023 18:13
@dogversioning
Copy link
Contributor Author

@mikix Known issues I will address that are not blocking review:

  • I need to update unit tests to inspect records of the transaction logs/stats tables
  • I have several inline SQL queries I should cut over to templates
  • I'll deal with regression at the end

Copy link
Contributor Author

@dogversioning dogversioning Dec 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@comorbidity If you could - can you look at this file, and the linked PSM markdown file, on the branch just to check how parsable the overall documentation of this would be for a user?

Copy link
Contributor

@mikix mikix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am going to mostly ignore the test file changes - my brain can't take it right now, and I trust they are good duckdb nonsense.

Approving - this is good! Feel free to land while I'm away

cumulus_library/cli.py Show resolved Hide resolved
cumulus_library/cli.py Outdated Show resolved Hide resolved
cumulus_library/cli.py Show resolved Hide resolved
cumulus_library/cli.py Show resolved Hide resolved
cumulus_library/cli.py Outdated Show resolved Hide resolved
cumulus_library/study_parser.py Outdated Show resolved Hide resolved
cumulus_library/study_parser.py Outdated Show resolved Hide resolved
cumulus_library/study_parser.py Show resolved Hide resolved
Comment on lines +407 to +408
if not stats_build:
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this is an odd method construction. I understand why you might end up here reasonably, I'm just going to prod you to see if there's a better way to get at the flow you want. If not, this is fine.

It especially triggered me because stats_build=False is the default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's what i believe stats generation lifecycle to look like:

  • stats, if they exist in a study, are always run if they haven't been run before
  • after this, stats should :not: be run
  • a researcher may say 'ok, i've changed [x,y,z] and now i'd like to run a new sampling experiment, let me explicitly invoke it'
    So some sort of 'usually false' workflow belong someplace. It might be worth nattering a bit about the the right layer of seperation between the Builder and the Parser (and should the Builder be named something different)? but it could move to the builder.

cumulus_library/study_parser.py Outdated Show resolved Hide resolved
@dogversioning dogversioning merged commit 4c66973 into main Dec 26, 2023
3 checks passed
@dogversioning dogversioning deleted the mg/psm-cli branch December 26, 2023 22:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants