PSM CLI, table persistence #160

dogversioning · 2023-12-18T18:33:41Z

This PR makes the following changes:

Two types of tables are available, persisted outside of the study lifecycle: a basic transaction log, and a list of all stats objects in a given study
Stats specific:
- stats tables have a timestamp of creation
- a view will be created that points to the latest table
- CLI flags added for force recreate/removal
- Cleaned up PSM flags allowing for access to small data sets
Test data generator added for stats table sampling
Added data_path as a class var to StudyBuilder/StudyParser to get a write dir to PSM
StudyParser tests cut over to DuckDB for unit testing
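
For readers who want a concrete picture of the "timestamped stats table plus a view pointing at the latest one" behavior, here is a minimal sketch. It is not the code from this PR: it uses the stdlib `sqlite3` module instead of Athena/DuckDB so it runs anywhere, and the function name `create_stats_table`, the `study__name_timestamp` naming scheme, and the column names are all hypothetical.

```python
import sqlite3
from datetime import datetime, timezone


def create_stats_table(conn, study, name, rows, now=None):
    """Create a timestamped stats table and repoint a 'latest' view at it.

    Hypothetical helper: the naming scheme and columns are illustrative only.
    """
    now = now or datetime.now(timezone.utc)
    suffix = now.strftime("%Y%m%dT%H%M%S")
    table = f"{study}__{name}_{suffix}"
    view = f"{study}__{name}"
    conn.execute(f'CREATE TABLE "{table}" (subject_id TEXT, score REAL)')
    conn.executemany(f'INSERT INTO "{table}" VALUES (?, ?)', rows)
    # Older timestamped tables are left in place; only the view moves.
    conn.execute(f'DROP VIEW IF EXISTS "{view}"')
    conn.execute(f'CREATE VIEW "{view}" AS SELECT * FROM "{table}"')
    return table


conn = sqlite3.connect(":memory:")
first = create_stats_table(
    conn, "study", "psm", [("a", 0.1)],
    now=datetime(2023, 12, 18, tzinfo=timezone.utc),
)
second = create_stats_table(
    conn, "study", "psm", [("b", 0.2), ("c", 0.3)],
    now=datetime(2023, 12, 21, tzinfo=timezone.utc),
)
# Querying the view returns rows from the most recently created table.
latest_rows = conn.execute(
    'SELECT subject_id FROM "study__psm" ORDER BY subject_id'
).fetchall()
table_count = conn.execute(
    "SELECT count(*) FROM sqlite_master WHERE type = 'table'"
).fetchone()[0]
```

Both timestamped tables survive the second run, which is the property that lets stats persist outside the normal study build/clean cycle in this sketch.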

Checklist

Consider if documentation (like in docs/) needs to be updated
Consider if tests should be added
Run pylint if you're making changes beyond adding studies
Update template repo if there are changes to study configuration <- will do this after review in case of any comments

dogversioning · 2023-12-21T15:34:01Z

cumulus_library/base_table_builder.py

@@ -21,15 +22,14 @@ def __init__(self):
        self.queries = []

    @abstractmethod
-    def prepare_queries(self, cursor: object, schema: str):
+    def prepare_queries(self, cursor: object, schema: str, *args, **kwargs):


I think I mentioned this in passing, but I wanted to call attention to 'you can now include arbitrary args when extending a tablebuilder' as maybe relevant for metrics work.
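
A rough sketch of what that enables, assuming a stand-in base class shaped like the one in the diff above. `SampledCountsBuilder`, the `sample_size` keyword, and the table names are hypothetical and only there for illustration:

```python
from abc import ABC, abstractmethod


class BaseTableBuilder(ABC):
    """Stand-in for the abstract builder shown in the diff."""

    def __init__(self):
        self.queries = []

    @abstractmethod
    def prepare_queries(self, cursor: object, schema: str, *args, **kwargs):
        """Populate self.queries; subclasses may accept extra arguments."""


class SampledCountsBuilder(BaseTableBuilder):
    # Hypothetical subclass: `sample_size` is an illustrative extra keyword.
    def prepare_queries(
        self, cursor: object, schema: str, *args, sample_size: int = 100, **kwargs
    ):
        self.queries.append(
            f"CREATE TABLE {schema}.sampled_counts AS "
            f"SELECT * FROM {schema}.patient LIMIT {sample_size}"
        )


builder = SampledCountsBuilder()
# No real cursor is needed for this sketch, so None is passed.
builder.prepare_queries(None, "main", sample_size=10)
print(builder.queries[0])
```

Because the base signature accepts `*args, **kwargs`, callers that only pass `cursor` and `schema` keep working, while a builder that wants extra inputs can declare them as keywords.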

cumulus_library/cli.py

dogversioning · 2023-12-21T15:46:32Z

cumulus_library/databases.py

+            warnings.warn(
+                "Loading an ndjson dir is not supported with --db-type=athena."


I made this change for my convenience, since I'm toggling between athena/duckdb fairly frequently. I can be talked out of it.

I personally liked how it worked before (big surprise) but I'm not passionate about it.

My reasoning stemmed from the fact that athena is the default. So a lazy user might accidentally be in athena mode (forgetting to pass --db-type) and if we don't stop and let them know (and how to correct their mistake -- hence my (try duckdb), because maybe they've never used that flag before), their clear user intent will be just dropped on the floor and they might think things are working when they are really not. A warning helps, but a straight error ensures they will notice that we didn't do what they asked.

But 🤷 maybe that user flow is not so important.

I'm changing this back for now for this PR, but I reserve the right to add a workaround of some kind in the future if it becomes a pain, perhaps some kind of dev-only flag or something like that.
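
To make the two options in this thread concrete, here is a small sketch of the warn-versus-error trade-off. It is not the code in `databases.py`: `check_ndjson_load`, `UnsupportedOptionError`, and the `strict` switch are hypothetical names, and only the message text comes from the diff.

```python
import warnings


class UnsupportedOptionError(Exception):
    """Hypothetical error type for the hard-failure variant."""


def check_ndjson_load(db_type: str, ndjson_dir, strict: bool = True) -> bool:
    """Return True if an ndjson dir was given and can be loaded."""
    if ndjson_dir is None:
        return False
    if db_type != "athena":
        return True
    message = "Loading an ndjson dir is not supported with --db-type=athena."
    if strict:
        # Hard failure: the user cannot miss that their request was dropped.
        raise UnsupportedOptionError(message)
    # Soft failure: execution continues, and the user may overlook the warning.
    warnings.warn(message)
    return False
```

With `strict=True` a user who forgot `--db-type` gets stopped immediately; with `strict=False` someone toggling between backends can keep the same arguments and only see a warning.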

cumulus_library/study_parser.py

dogversioning · 2023-12-21T18:20:05Z

@mikix Known issues I will address that are not blocking review:

I need to update unit tests to inspect records of the transaction logs/stats tables
I have several inline SQL queries I should cut over to templates
I'll deal with regression at the end

dogversioning · 2023-12-21T18:23:52Z

docs/statistics.md

@comorbidity If you could - can you look at this file, and the linked PSM markdown file, on the branch just to check how parsable the overall documentation of this would be for a user?

mikix

I am going to mostly ignore the test file changes - my brain can't take it right now, and I trust they are good duckdb nonsense.

Approving - this is good! Feel free to land while I'm away

cumulus_library/cli.py

cumulus_library/study_parser.py

mikix · 2023-12-21T20:03:34Z

cumulus_library/study_parser.py

+        if not stats_build:
+            return


nit: this is an odd method construction. I understand why you might end up here reasonably, I'm just going to prod you to see if there's a better way to get at the flow you want. If not, this is fine.

It especially triggered me because stats_build=False is the default.

Here's what I believe the stats generation lifecycle looks like:

stats, if they exist in a study, are always run if they haven't been run before

after this, stats should :not: be run

a researcher may say 'ok, i've changed [x,y,z] and now i'd like to run a new sampling experiment, let me explicitly invoke it'
So some sort of 'usually false' workflow belongs someplace. It might be worth nattering a bit about the right layer of separation between the Builder and the Parser (and whether the Builder should be named something different), but it could move to the builder.
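
The lifecycle described above can be written down as a small decision function. This is only a sketch of the stated rules, not the logic in `study_parser.py`; `should_run_stats` and its parameter names other than `stats_build` are hypothetical.

```python
def should_run_stats(
    has_stats: bool, previously_built: bool, stats_build: bool = False
) -> bool:
    """Decide whether stats builders should run for a study.

    Rules, as described in the comment above:
    - a study without stats never runs them
    - stats run automatically the first time
    - afterwards they run only when explicitly requested
    """
    if not has_stats:
        return False
    if not previously_built:
        return True
    return stats_build
```

Keeping the decision in one place like this would let the caller skip an early `return` guard inside the build method, wherever that method ends up living.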

cumulus_library/study_parser.py

dogversioning added 2 commits December 18, 2023 13:33

PSM CLI, table persistence

8ee5515

existing unit test rework, data generator

ff92be6

dogversioning force-pushed the mg/psm-cli branch from b799f08 to ff92be6 Compare December 20, 2023 18:45

dogversioning added 5 commits December 20, 2023 16:18

Stats test coverage

770729c

Docs update, linting

efdee2f

pylint

1f58bdf

self review pass

93afe62

docs tweak

78eddb9

dogversioning commented Dec 21, 2023

dogversioning changed the title ~~WIP: PSM CLI, table persistence~~ PSM CLI, table persistence Dec 21, 2023

dogversioning marked this pull request as ready for review December 21, 2023 18:13

dogversioning commented Dec 21, 2023

mikix approved these changes Dec 21, 2023

dogversioning added 6 commits December 26, 2023 12:08

PR feedback

92d4c96

typecasts for athena

8d79e7c

sqlfluff vars

befd3b6

test transaction log

63605ab

replaced inline PSM queries

28e8a80

sqlfluff cleanup

d6345d2

dogversioning force-pushed the mg/psm-cli branch from 335be15 to d6345d2 Compare December 26, 2023 21:58

dogversioning merged commit 4c66973 into main Dec 26, 2023
3 checks passed

dogversioning deleted the mg/psm-cli branch December 26, 2023 22:09
