Update docs for table builders
dogversioning committed Sep 7, 2023
1 parent 36323e4 commit 2e745d9
Showing 6 changed files with 258 additions and 17 deletions.
9 changes: 9 additions & 0 deletions cumulus_library/studies/template/manifest.toml
@@ -28,6 +28,15 @@ export_list = [
"template__count_influenza_test_month",
]

# For generating counts table in a more standardized manner, we have a class in the
# main library you can extend that will handle most of the logic of assembling
# queries for you. We use this pattern for generating the core tables, as well
# other studies authored inside BCH. These will always be run after any other
# SQL queries have been generated
# [counts_builder_config]
# file_names = [
# "count.py"
# ]

# For most use cases, this should not be required, but if you need to programmatically
# build tables, you can provide a list of files implementing BaseTableBuilder.
2 changes: 1 addition & 1 deletion docs/core-study-details.md
@@ -1,7 +1,7 @@
---
title: Core Study Details
parent: Library
nav_order: 5
# audience: clinical researchers, IRB reviewers
# type: reference
---
201 changes: 201 additions & 0 deletions docs/creating-sql-with-python.md
@@ -0,0 +1,201 @@
---
title: Creating SQL with Python
parent: Library
nav_order: 4
# audience: clinical researcher or engineer familiar with project
# type: tutorial
---

# Creating SQL with Python

Before jumping into this doc, take a look at
[Creating Studies](creating-studies.md).
If you're just working on `core` tables related to the US Core FHIR profiles, you
may not be interested in this, or only need to look at the
[Working with TableBuilders](#working-with-tablebuilders)
and the
[Generating count tables](#generating-counts-tables)
sections.

## Why would I even need to think about this?

There are three main reasons why you would need to use Python to generate SQL:
- You would like to make use of the
[helper class we've built](#generating-counts-tables)
for ease of creating count tables in a structured manner.
- You have a dataset you'd like to
[load into a table from a static file](#adding-a-static-dataset),
separate from the ETL tables.
- The gnarly one: you are working against the raw FHIR resource tables, and are
trying to access
[nested data](#querying-nested-data) in Athena.
- We infer datatypes in the ETL based on the presence of data once we get past
the top level elements, and so the structure may vary depending on the
implementation, either at the EHR level or at the FHIR interface level.


We've got examples of all three of these cases in this repo, and we'll reference
them as we go.

## Utilities

There are two main bits of infrastructure we use for programmatic tables:
the TableBuilder class and the collection of template SQL.

### Working with TableBuilders

We have a base
[TableBuilder class](../cumulus_library/base_table_builder.py)
that
all the above use cases leverage. At a high level, here's what it provides:

- A `prepare_queries` function, which is where you put your custom logic. It
should create an array of queries in `self.queries`. The CLI will pass in a cursor
object and database/schema name, so if you need to interrogate the dataset to decide
how to structure your queries, you can.
- An `execute_queries` function, which will run `prepare_queries` and then apply
those queries to the database. Generally, you shouldn't need to touch this function -
just be aware this is how your queries actually get run.
- A `write_queries` function, which will write the queries from `prepare_queries`
to disk. If you are creating multiple queries in one go, calling `comment_queries`
before `write_queries` will insert some spacing elements for readability.
- A `display_text` string, which is what will be shown with a progress bar when your
queries are being executed.

You can either extend this class directly (like `builder_*.py` files in
`cumulus_library/studies/core`) or create a specific class to add reusable functions
for a repeated use case (like in `cumulus_library/schema/counts.py`).

TableBuilder SQL should generally go through a template SQL generator, so that
your SQL is validated.
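
To make the pattern concrete, here is a minimal sketch of extending the
TableBuilder class. The base class below is a simplified stand-in (so the
example is self-contained), not the library's actual implementation - in a real
study you would extend `BaseTableBuilder` from
`cumulus_library/base_table_builder.py` instead:

```python
class BaseTableBuilder:
    """Simplified stand-in for cumulus_library's BaseTableBuilder."""

    display_text = "Building tables..."

    def __init__(self):
        # Subclasses fill this with finished SQL strings.
        self.queries: list[str] = []

    def prepare_queries(self, cursor, schema: str):
        raise NotImplementedError

    def execute_queries(self, cursor, schema: str):
        # Build the queries, then run each one against the database.
        self.prepare_queries(cursor, schema)
        for query in self.queries:
            cursor.execute(query)


class MyStudyBuilder(BaseTableBuilder):
    display_text = "Creating my_study tables..."

    def prepare_queries(self, cursor, schema: str):
        # Custom logic goes here; you can interrogate the dataset via
        # `cursor` if needed before deciding what SQL to emit.
        self.queries.append(
            f"CREATE TABLE my_study__example AS SELECT id FROM {schema}.patient"
        )
```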

### Working with template SQL

If you are only worried about building counts tables, skip this section -
we've got enough wrappers that you shouldn't need to worry about this
level of detail.

For validating SQL, we are using
[Jinja templates](https://jinja.palletsprojects.com/en/3.1.x/)
to create validated SQL in a repeatable manner. We don't expect you to write these
templates - instead, the
[template function library](../cumulus_library/template_sql/templates.py)
lets you provide a series of arguments to these templates to generate standard
types of SQL tables, and it also offers templates targeted at bespoke
operations.

When you're thinking about a query that you'd need to create, first check the
template function library to see if something already exists. Basic CRUD
should be covered, as well as unnestings for some common FHIR objects.
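
As a rough illustration of the pattern (this is a hypothetical template for
this doc, not one of the library's actual templates), here is how a Jinja
template turns a few arguments into repeatable SQL:

```python
from jinja2 import Template

# Hypothetical template in the style the library uses; the real templates
# live in cumulus_library/template_sql/.
CTAS_TEMPLATE = Template(
    """CREATE TABLE {{ table_name }} AS
SELECT {{ columns | join(', ') }}
FROM {{ source_table }};"""
)


def render_ctas(table_name: str, source_table: str, columns: list[str]) -> str:
    """Render a simple CREATE TABLE AS statement from template arguments."""
    return CTAS_TEMPLATE.render(
        table_name=table_name, source_table=source_table, columns=columns
    )
```

Because the SQL shape lives in one template, every caller gets the same
validated structure and only supplies the names and columns.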

## Use cases

### Generating counts tables
A thing we do over and over as part of studies is generating powerset counts tables
against a filtered resource to get data about a certain kind of clinical population.
Since this is so common, we created a class just for it, and we're using it in all
studies the Cumulus team is directly authoring.

The [CountsBuilder class](../cumulus_library/schema/counts.py)
provides a number of convenience methods (this class covers the mechanics of
query generation). You can see examples of its usage in the
[Core counts builder](../cumulus_library/studies/core/count_core.py)
(which is where the business logic of that study lives).

- `get_table_name` will scan the study's `manifest.toml` and auto prepend a table
name with whatever the study prefix is.
- `get_where_clauses` will format a string, or an array, of where clauses in a
manner that the table constructors will expect.
- `count_[condition,document,encounter,observation,patient]` will take a target table
name, a source table, and an array of columns, and produce the appropriate powerset
table to count that resource. You can optionally provide a list of where statements
for filtering, or change the minimum bin size used to include data.
- The `count_*` functions pass through to `get_count_query` - if you have a use
case we're not covering, you can use this interface directly. We'd love to hear
about it - we'd consider covering it and/or accepting PRs for new features.
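
As a sketch of the underlying idea (the library's real `get_count_query` may
differ in detail - this is an illustration, not its implementation), a powerset
counts query can be built with `GROUP BY CUBE`, which counts every combination
of the given columns, with small bins filtered out:

```python
def build_powerset_count_query(
    table_name: str,
    source_table: str,
    columns: list[str],
    min_subject: int = 10,
) -> str:
    """Build a powerset count query: GROUP BY CUBE computes counts for every
    combination of the given columns, and bins below min_subject are dropped
    to reduce exposure of limited datasets."""
    cols = ", ".join(columns)
    return (
        f"CREATE TABLE {table_name} AS\n"
        f"SELECT COUNT(DISTINCT subject_ref) AS cnt, {cols}\n"
        f"FROM {source_table}\n"
        f"GROUP BY CUBE({cols})\n"
        f"HAVING COUNT(DISTINCT subject_ref) >= {min_subject};"
    )


if __name__ == "__main__":
    # Running this file directly prints the example SQL, as described below.
    print(build_powerset_count_query(
        "my_study__count_encounter_month",
        "my_study__encounter",
        ["status", "start_month"],
    ))
```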

As a convenience, if you include an `if __name__ == "__main__":` clause like you
see in `count_core.py`, you can generate the builder's output by running the file
directly with Python, which is a nice way to get example SQL output for inclusion
in GitHub.
This is where the
[count core sql output](../cumulus_library/studies/core/count_core.sql)
originated from.

Add your count generator file to the `counts_builder_config` section of your
`manifest.toml` to include it in your build invocations.

### Adding a static dataset

*NOTE* - we have an
[open issue](https://github.com/smart-on-fhir/cumulus-library/issues/58)
to develop a faster methodology for adding new datasets.

Occasionally you will have a dataset from a third party that is useful for working
with your dataset. In the vocab study (requiring a license to use), we
[add coding system data](../cumulus_library/studies/vocab/vocab_icd_builder.py)
from flat files to Athena. If you need to do this, you should extend the base
TableBuilder class, and your `prepare_queries` function should do the following,
leveraging the
[template function library](../cumulus_library/template_sql/templates.py):
- Use the `get_ctas_query` function to get a CREATE TABLE AS statement to
instantiate your table in Athena
- Since Athena SQL queries are limited in size to 262144 bytes, if you have
a large dataset, break it up into smaller chunks
- Use the `get_insert_into` function to add the data from each chunk to
the table you just created.
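
The chunking step can be sketched as follows - a hypothetical helper, not part
of the library, that splits rows so that each eventual INSERT statement stays
under Athena's query size limit:

```python
ATHENA_QUERY_LIMIT = 262144  # Athena's maximum query string size, in bytes


def chunk_rows(rows: list[tuple], max_bytes: int = ATHENA_QUERY_LIMIT) -> list[list[tuple]]:
    """Split rows into chunks whose rendered sizes stay under max_bytes,
    so each chunk can be handed to a single INSERT INTO statement."""
    chunks: list[list[tuple]] = []
    current: list[tuple] = []
    size = 0
    for row in rows:
        rendered = len(str(row))  # rough size of this row once rendered as SQL
        if current and size + rendered > max_bytes:
            chunks.append(current)
            current, size = [], 0
        current.append(row)
        size += rendered
    if current:
        chunks.append(current)
    return chunks
```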

Add the dataset uploader to the `table_builder_config` section of your
`manifest.toml` to include it in your build - this will make this data
available for downstream queries.

### Querying nested data

Are you trying to access data from deep within raw FHIR tables? I'm so sorry.
Here's an example of how this can get fussy with code systems:

A FHIR coding element may be an array, or it may be a singleton, or it may
be a singleton wrapped in an array. It may be fully populated, or partially populated,
or completely absent. There may be one code per record, or multiple codes per record,
and you may only be interested in a subset of these codes.

This means you may have differing schemas in Athena from one site's data to another
(especially if they come from different EHR systems, where implementation details
may differ). In order to handle this, you need to create a standard output
representation that accounts for all the different permutations you have, and
conform data to match that. The
[encounter coding](../cumulus_library/studies/core/builder_encounter_coding.py)
and
[condition codeableConcept](../cumulus_library/studies/core/builder_condition_codeableconcept.py)
builders both jump through hoops to try and get this data into flat tables for
downstream use.

This is a pretty open-ended design problem, but based on our experience, your
`prepare_queries` implementation should attempt the following steps:
- Check if your table has any data at all
- If it does, inspect the table schema to see if the field you're interested in
is populated with the schema elements you're expecting
- If yes, it's safe to grab them
- If no, you will need to manually initialize them to an appropriate null value
- If you are dealing with deeply nested objects, you may need to repeat the above
more than once
- Write a jinja template that handles the conditionally present data, and a
template function to invoke that template
- Test this against as many different sample databases as you can
- Be prepared to need to update this when you hit a condition you didn't expect
- Create a distinct table that has an ID for joining back to the original
- Perform this join as appropriate to create a table with unnested data
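
A minimal sketch of the schema-conforming step (hypothetical helper names and
field list, illustrating the approach rather than the actual builders): for
each field you expect, emit a direct reference when the site's schema actually
populates it, and a typed NULL otherwise, so every site ends up with the same
flat table shape.

```python
EXPECTED_FIELDS = ["code", "system", "display"]


def coding_select_exprs(populated_fields: set[str]) -> list[str]:
    """Emit one SELECT expression per expected coding field: a direct
    reference when populated, or a typed NULL placeholder when absent."""
    exprs = []
    for field in EXPECTED_FIELDS:
        if field in populated_fields:
            exprs.append(f"coding.{field} AS {field}")
        else:
            exprs.append(f"CAST(NULL AS VARCHAR) AS {field}")
    return exprs


def build_flatten_query(table: str, populated_fields: set[str]) -> str:
    """Unnest the coding array into a flat table with a stable set of columns,
    keeping the id for joining back to the original table."""
    select_list = ",\n    ".join(coding_select_exprs(populated_fields))
    return (
        f"CREATE TABLE {table}_flat AS\n"
        f"SELECT\n    id,\n    {select_list}\n"
        f"FROM {table}\n"
        f"CROSS JOIN UNNEST(code.coding) AS t (coding);"
    )
```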

You may find it useful to use the `--builder [filename]` argument of the CLI
build command to run just your builder while iterating. The
[Sample bulk FHIR datasets](https://github.com/smart-on-fhir/sample-bulk-fhir-datasets)
can provide an additional testbed database above and beyond whatever you produce
in house.

Add this builder to the `table_builder_config` section of your
`manifest.toml` - this will make this data available for downstream queries.

Good luck! If you think you're dealing with a pretty common case, you can reach
out to us on the
[discussion forum](https://github.com/smart-on-fhir/cumulus/discussions)
and we may be able to provide an implementation for you, or provide assistance
if you're dealing with a particular edge case.
59 changes: 45 additions & 14 deletions docs/creating-studies.md
@@ -13,8 +13,8 @@ aggregations in support of ongoing projects.

## Setup

If you are going to be creating new studies, we recommend, but do not require, adding
an environment variable, `CUMULUS_LIBRARY_PATH`, pointing to the folder in which
you'll be working on study development. `cumulus-library` will look in each
subdirectory of that folder for manifest files, so you can run several studies
at once.
@@ -24,15 +24,21 @@ to any build/export call to tell it where to look for your work.

## Creating a new study

There are two ways to get started with a new study:

1. Use `cumulus-library` to create a manifest for you. You can do this by running:
```bash
cumulus-library create ./path/to/your/study/dir
```
We'll create that folder if it doesn't already exist.

2. Fork the
[Cumulus library template repo](https://github.com/smart-on-fhir/cumulus-library-template),
rename your fork, and clone it directly from GitHub.

We recommend you use a name relevant to your study (we'll use `my_study` for this
document). The folder name is the same thing you will use as a target with
`cumulus_library` to run your study's queries.

Once you've made a new study, the `manifest.toml` file is the place you let cumulus
library know how you want your study to be run against the remote database. The
@@ -68,14 +74,25 @@ Talking about what these three sections do:
counts to reduce exposure of limited datasets, and so we recommend only exporting
count tables.

There are other hooks you can use in the manifest for more advanced control over
how you can generate sql. See [Creating SQL with python](creating-sql-with-python.md)
for more information.

We recommend creating a git repo per study, to help version your study data, which
you can do in the same directory as the manifest file. If you've forked your study from
the template, you've already checked this step off.

### Writing SQL queries

Most users have a workflow that looks like this:
- Write queries in the [AWS Athena console](https://aws.amazon.com/athena/) while
you are exploring the data
- We recommend trying to keep your studies pointed at the `core` tables. The
base FHIR resource named tables contain a lot of nested data that can be tricky
to write cross-EHR queries against, and so you'll save yourself some headaches
if everything you need is available via those resources. If it isn't, make sure
you look at the [Creating SQL with python](creating-sql-with-python.md) guide
for information about safely extracting datasets from those tables.
- Move queries to a file as you finalize them
- Build your study with the CLI to make sure your queries load correctly.

@@ -115,9 +132,12 @@ styling.
they have a small number of members**, i.e. less than 10.

**Recommended**
- You may want to select a SQL style guide as a reference. Mozilla provides a
[SQL style guide](https://docs.telemetry.mozilla.org/concepts/sql_style.html),
which our sqlfluff config enforces.
[Gitlab's data team](https://about.gitlab.com/handbook/business-technology/data-team/platform/sql-style-guide/)
has a style guide that is more centered around DBT, but also has some practices
you may want to consider adopting.
- Don't implicitly reference columns from tables. Either use the full table name,
or give the table an alias, and use that any time you are referencing a column.
- Don't use the * wildcard in your final tables. Explicitly list the columns
@@ -127,16 +147,27 @@ styling.
to find other problems if you lightly adhere to this from the start.
- Aggregate count tables should have the first word after the study prefix be
`count`, and otherwise the word `count` should not be used.

**Metadata tables**
- Creating a table called `my_study__meta_date` with two columns, `min date` and
`max date`, and populating it with the start and end date of your study, will
allow other Cumulus tools to detect study date ranges, and otherwise bakes the
study date range into your SQL for future reference.
- Creating a `my_study__meta_version` table with one column, `data_package_version`, and
giving it an integer value as shown in this snippet:
```sql
CREATE TABLE my_study__meta_version AS
SELECT 1 AS data_package_version;
```
allows you to signal versions for use in segregating data upstream, like in the
Cumulus aggregator - just increment it when you want third parties to start running
a new data model. If this is not set, the version will implicitly be set to zero.
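
For reference, the `meta_date` table described above could be created with a
snippet like the following (the dates are placeholders for your study's actual
range, and the quoted identifiers assume your engine accepts column names with
spaces):

```sql
CREATE TABLE my_study__meta_date AS
SELECT
    date('2020-01-01') AS "min date",
    date('2023-01-01') AS "max date";
```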

## Sharing studies

If you want to share your study as an official Cumulus study, please let us know
via the [discussion forum](https://github.com/smart-on-fhir/cumulus/discussions) -
we can talk more about what makes sense for your use case.

If you write a paper using the Cumulus library, please
[cite the project](https://smarthealthit.org/cumulus-a-universal-sidecar-for-a-smart-learning-healthcare-system/)
2 changes: 1 addition & 1 deletion docs/sharing-data.md
@@ -1,7 +1,7 @@
---
title: Data Sharing
parent: Library
nav_order: 6
# audience: IT security or clinical researcher with low to medium familiarity with project
# type: explanation
---
2 changes: 1 addition & 1 deletion docs/study-list.md
@@ -1,7 +1,7 @@
---
title: Cumulus studies
parent: Library
nav_order: 7
# audience: Clinical researchers interested in publications
# type: reference
---
