docs: some wordsmithing and clarification

smart-on-fhir · Feb 27, 2024 · 62d16f4 · 62d16f4
1 parent 78045a3
commit 62d16f4
Show file tree

Hide file tree

Showing 10 changed files with 171 additions and 152 deletions.
diff --git a/cumulus_library/cli_parser.py b/cumulus_library/cli_parser.py
@@ -22,7 +22,7 @@ def add_table_builder_argument(parser: argparse.ArgumentParser) -> None:
 
 
 def add_study_dir_argument(parser: argparse.ArgumentParser) -> None:
-    """Adds --study_dir arg to a subparser"""
+    """Adds --study-dir arg to a subparser"""
     parser.add_argument(
         "-s",
         "--study-dir",

diff --git a/cumulus_library/study_parser.py b/cumulus_library/study_parser.py
@@ -248,10 +248,7 @@ def clean_study(
         # study builder, and remove them from the list.
         for view_table in view_table_list.copy():
             if any(
-                (
-                    (f"_{word.value}_") in view_table[0]
-                    or view_table[0].endswith(word.value)
-                )
+                f"__{word.value}_" in view_table[0]
                 for word in enums.ProtectedTableKeywords
             ):
                 view_table_list.remove(view_table)

diff --git a/docs/aws-setup.md b/docs/aws-setup.md
@@ -14,7 +14,7 @@ Cumulus library executes queries against an
 for creating such a datastore is available for testing purposes if you don't
 already have one.
 
-The cloudformation template in the sample database's Cloudformation template should
+The sample database's CloudFormation template should
 have the appropriate permissions set for all the services. If you need to configure
 an IAM policy manually, you will need to ensure the AWS profile you are using has
 the following permissions:
@@ -52,4 +52,8 @@ to specify where your database information lives:
 - `CUMULUS_LIBRARY_DATABASE` : The name of the database Athena will use (`cumulus_library_sample_db` if using the sample DB)
 - `CUMULUS_LIBRARY_WORKGROUP` : the Athena workgroup to execute queries in (`cumulus_library_sample_db` if using the sample DB)
 
-Configuring environment variables on your system is out of scope of this document, but several guides are available elsewhere. [This guide](https://www.twilio.com/blog/2017/01/how-to-set-environment-variables.html), for example, covers Mac, Windows, and Linux. And, as a plus, it has a picture of an adorable puppy at the top of it.
+Configuring environment variables on your system is out of scope of this document,
+but several guides are available elsewhere.
+[This guide](https://www.twilio.com/blog/how-to-set-environment-variables-html),
+for example, covers Mac, Windows, and Linux.
+And, as a plus, it has a picture of an adorable puppy at the top of it.
diff --git a/docs/core-study-details.md b/docs/core-study-details.md
@@ -42,58 +42,64 @@ Examples:
 ## Core study exportable counts tables
 
 ### count_core_condition_icd10_month
-| Variable  |   Description |
-| --------  |   --------    |
-| cnt   |   Count   |
-| cond_month    |   Month condition recorded    |
-| cond_code_display |   Condition code  |
-| enc_class_code    |   Encounter Code (Healthcare Setting) |
+
+| Variable          | Description                         |
+|:------------------|:------------------------------------|
+| cnt               | Count                               |
+| cond_month        | Month condition recorded            |
+| cond_code_display | Condition code                      |
+| enc_class_code    | Encounter Code (Healthcare Setting) |
 
 
 ### count_core_documentreference_month
-| Variable  |   Description |
-| --------  |   --------    |
-| cnt   |   Count   |
-| author_month  |   Month document was authored |
-| enc_class_code    |   Encounter Code (Healthcare Setting) |
-| doc_type_display  |   Type of Document (display)  |
+
+| Variable         | Description                         |
+|:-----------------|:------------------------------------|
+| cnt              | Count                               |
+| author_month     | Month document was authored         |
+| enc_class_code   | Encounter Code (Healthcare Setting) |
+| doc_type_display | Type of Document (display)          |
 
 
 ### count_core_encounter_day
-| Variable  |   Description |
-| --------  |   --------    |
-| cnt   |   Count   |
-| enc_class_code    |   Encounter Code (Healthcare Setting) |
-| start_date    |   Day patient encounter started   |
+
+| Variable       | Description                         |
+|:---------------|:------------------------------------|
+| cnt            | Count                               |
+| enc_class_code | Encounter Code (Healthcare Setting) |
+| start_date     | Day patient encounter started       |
 
 
 ### count_core_encounter_month
-| Variable  |   Description |
-| --------  |   --------    |
-| cnt   |   Count   |
-| enc_class_code    |   Encounter Code (Healthcare Setting) |
-| start_month   |   Month patient encounter started |
-| age_at_visit  |   Patient Age at Encounter    |
-| gender    |   Biological sex at birth |
-| race_display  |   Patient reported race   |
-| postalcode3   |   Patient 3 digit zip |
+
+| Variable       | Description                         |
+|:---------------|:------------------------------------|
+| cnt            | Count                               |
+| enc_class_code | Encounter Code (Healthcare Setting) |
+| start_month    | Month patient encounter started     |
+| age_at_visit   | Patient Age at Encounter            |
+| gender         | Biological sex at birth             |
+| race_display   | Patient reported race               |
+| postalcode3    | Patient 3 digit zip                 |
 
 
 ### count_core_observation_lab_month
-| Variable  |   Description |
-| --------  |   --------    |
-| cnt   |   Count   |
-| lab_month |   Month of lab result |
-| lab_code  |   Laboratory Code |
-| lab_result_display    |   Laboratory result   |
-| enc_class_code    |   Encounter Code (Healthcare Setting) |
+
+| Variable           | Description                         |
+|:-------------------|:------------------------------------|
+| cnt                | Count                               |
+| lab_month          | Month of lab result                 |
+| lab_code           | Laboratory Code                     |
+| lab_result_display | Laboratory result                   |
+| enc_class_code     | Encounter Code (Healthcare Setting) |
 
 
 ### count_core_patient
-| Variable  |   Description |
-| --------  |   --------    |
-| cnt   |   Count   |
-| gender    |   Biological sex at birth |
-| age   |   Age in years calculated since DOB   |
-| race_display  |   Patient reported race   |
-| postalcode3   |   Patient 3 digit zip |
+
+| Variable     | Description                       |
+|:-------------|:----------------------------------|
+| cnt          | Count                             |
+| gender       | Biological sex at birth           |
+| age          | Age in years calculated since DOB |
+| race_display | Patient reported race             |
+| postalcode3  | Patient 3 digit zip               |
diff --git a/docs/creating-sql-with-python.md b/docs/creating-sql-with-python.md
@@ -19,28 +19,30 @@ sections.
 
 ## Why would I even need to think about this?
 
-There are three main reasons why you would need to use python to generate sql:
+There are three main reasons why you would need to use Python to generate SQL:
 - You would like to make use of the 
 [helper class we've built](#generating-counts-tables)
-for ease of creating count tables in a structured manual.
+for ease of creating count tables in a structured manner.
 - You have a dataset you'd like to 
 [load into a table from a static file](#adding-a-static-dataset),
 separate from the ETL tables.
 - The gnarly one: you are working against the raw FHIR resource tables, and are 
 trying to access 
 [nested data](#querying-nested-data) in Athena. 
-  - We infer datatypes in the ETL based on the presence of data once we get past 
-  the top level elements, and so the structure may vary depending on the
-  implementation, either at the EHR level or at the FHIR interface level.
+  - This is gnarly because while the ETL provides a full SQL schema for your own data,
+  it does not guarantee a schema for data that you don't have at your site.
+  And if you want your study to run at multiple sites with different EHRs,
+  you need to be careful when accessing deep FHIR fields.
+  For example, your EHR might populate `Condition.evidence.code` and you can safely
+  write SQL that uses it. But a different site's EHR may not provide that field at all,
+  and thus that column may not be defined in the SQL table schema at that other site.
 
-
-We've got examples of all three of these cases in this repo, and we'll reference
-those as examples as we go.
+You'll see examples of all three cases in this guide.
 
 ## Utilities
 
 There are two main bits of infrastructure we use for programmatic tables:
-The TableBuilder class, and the collection of template SQL.
+The `TableBuilder` class, and the collection of template SQL.
 
 ### Working with TableBuilders
 
@@ -63,7 +65,7 @@ queries are being executed.
 
 You can either extend this class directly (like `builder_*.py` files in 
 `cumulus_library/studies/core`) or create a specific class to add reusable functions
-for a repeated use case (like in `cumulus_library/schema/counts.py`).
+for a repeated use case (like in `cumulus_library/statistics/counts.py`).
 
 TableBuilder SQL generally should go through a template SQL generator, so that
 your SQL has been validated. If you're just working on counts, you don't need
@@ -77,27 +79,27 @@ we've got enough wrappers that you shouldn't need to worry about this
 level of detail.
 
 For validating SQL, we are using 
-[Jinja templates](https://jinja.palletsprojects.com/en/3.1.x/)
+[Jinja templates](https://jinja.palletsprojects.com/)
 to create validated SQL in a repeatable manner. We don't expect you to write these
 templates - instead, using the 
-[template function library](../cumulus_library/template_sql/templates.py)
-you can provide a series of arguments to these templates that will allow you to
+[template function library](https://github.com/smart-on-fhir/cumulus-library/blob/main/cumulus_library/template_sql/base_templates.py)
+you can provide arguments to these templates that will allow you to
 generate standard types of SQL tables, as well as using templates targeted for
 bespoke operations. 
 
 When you're thinking about a query that you'd need to create, first check the
-template function library to see if something already exists. Basic CRUD
-should be covered, as well as unnestings for some common FHIR objects.
+template function library to see if something already exists. Basic creation and inspection
+queries should be covered, as well as unnestings for some common FHIR objects.
 
 ## Use cases
 
 ### Generating counts tables
 A thing we do over and over as part of studies is generate powerset counts tables
 against a filtered resource to get data about a certain kind of clinical population.
-Since this is so common we created a class just for this, and we're using it in all
+Since this is so common, we created a class just for this, and we're using it in all
 studies the Cumulus team is directly authoring.
 
-The [CountsBuilder class](https://github.com/smart-on-fhir/cumulus-library/blob/main/cumulus_library/schema/counts.py) 
+The [CountsBuilder class](https://github.com/smart-on-fhir/cumulus-library/blob/main/cumulus_library/statistics/counts.py)
 provides a number of convenience methods that are available for use (this covers
 mechanics of generation). You can see examples of usage in the 
 [Core counts builder](https://github.com/smart-on-fhir/cumulus-library/blob/main//cumulus_library/studies/core/count_core.py)
@@ -119,28 +121,24 @@ As a convenience, if you include a `if __name__ == "__main__":` clause like you
 see in `count_core.py`, you can invoke the builder's output by invoking it with
 python, which is a nice way to get example SQL output for inclusion in github.
 This is where the 
-[count core sql output](https://github.com/smart-on-fhir/cumulus-library/blob/main//cumulus_library/studies/core/count_core.sql)
+[count core sql output](https://github.com/smart-on-fhir/cumulus-library/blob/main/cumulus_library/studies/core/reference_sql/count_core.sql)
 originated from.
 
 Add your count generator file to the `counts_builder_config` section of your
 `manifest.toml` to include it in your build invocations.
 
 ### Adding a static dataset
 
-*NOTE* - we have an
-[open issue](https://github.com/smart-on-fhir/cumulus-library/issues/58)
-to develop a faster methodology for adding new datasets.
-
 Occasionally you will have a dataset from a third party that is useful for working
 with your dataset. In the vocab study (requiring a license to use), we 
 [add coding system data](https://github.com/smart-on-fhir/cumulus-library/blob/main//cumulus_library/studies/vocab/vocab_icd_builder.py)
-from flat files to athena. If you need to do this, you should extend the base
-TableBuilder class, and your `prepare_queries` function should do the following,
+from flat files to Athena. If you need to do this, you should extend the base
+`TableBuilder` class, and your `prepare_queries` function should do the following,
 leveraging the
-[template function library](https://github.com/smart-on-fhir/cumulus-library/blob/main//cumulus_library/template_sql/templates.py):
-- Use the `get_ctas_query` function to get a CREATE TABLE AS statement to 
-instantiate your table in athena
-- Since athena SQL queries are limited in size to 262144 bytes, if you have
+[template function library](https://github.com/smart-on-fhir/cumulus-library/blob/main/cumulus_library/template_sql/base_templates.py):
+- Use the `get_ctas_query` function to get a `CREATE TABLE AS` statement to 
+instantiate your table in Athena
+- Since Athena SQL queries are limited in size to 262144 bytes, if you have
 a large dataset, break it up into smaller chunks
 - Use the `get_insert_into` function to add the data from each table to
 the chunk you just created.
@@ -149,6 +147,11 @@ Add the dataset uploader to the `table_builder_config` section of your
 `manifest.toml` to include it in your build - this will make this data
 available for downstream queries
 
+{: .note }
+We have an
+[open issue](https://github.com/smart-on-fhir/cumulus-library/issues/58)
+to develop an easier methodology for adding new datasets.
+
 ### Querying nested data
 
 Are you trying to access data from deep within raw FHIR tables? I'm so sorry.
@@ -164,9 +167,9 @@ This means you may have differing schemas in Athena from one site's data to anot
 may differ). In order to handle this, you need to create a standard output
 representation that accounts for all the different permutations you have, and
 conform data to match that. The 
-[encounter coding](https://github.com/smart-on-fhir/cumulus-library/blob/main//cumulus_library/studies/core/builder_encounter_coding.py)
+[encounter](https://github.com/smart-on-fhir/cumulus-library/blob/main/cumulus_library/studies/core/builder_encounter.py)
 and
-[condition codeableConcept](https://github.com/smart-on-fhir/cumulus-library/blob/main//cumulus_library/studies/core/builder_condition_codeableconcept.py)
+[condition](https://github.com/smart-on-fhir/cumulus-library/blob/main/cumulus_library/studies/core/builder_condition.py)
 builders both jump through hoops to try and get this data into flat tables for
 downstream use.
 
@@ -186,8 +189,8 @@ template function to invoke that template
 - Create a distinct table that has an ID for joining back to the original
 - Perform this join as appropriate to create a table with unnested data
 
-You may find it useful to use the `--builder [filename]` sub argument of the cli
-build command to run just your builder for iteration. The
+You may find it useful to use the `--builder [filename]` sub argument of the CLI
+`build` command to run just your builder for iteration. The
 [Sample bulk FHIR datasets](https://github.com/smart-on-fhir/sample-bulk-fhir-datasets)
 can provide an additional testbed database above and beyond whatever you produce
 in house.