Commit

docs: some wordsmithing and clarification
mikix committed Feb 27, 2024
1 parent 78045a3 commit 62d16f4
Showing 10 changed files with 171 additions and 152 deletions.
2 changes: 1 addition & 1 deletion cumulus_library/cli_parser.py
@@ -22,7 +22,7 @@ def add_table_builder_argument(parser: argparse.ArgumentParser) -> None:


def add_study_dir_argument(parser: argparse.ArgumentParser) -> None:
"""Adds --study_dir arg to a subparser"""
"""Adds --study-dir arg to a subparser"""
parser.add_argument(
"-s",
"--study-dir",
5 changes: 1 addition & 4 deletions cumulus_library/study_parser.py
@@ -248,10 +248,7 @@ def clean_study(
# study builder, and remove them from the list.
for view_table in view_table_list.copy():
if any(
(
(f"_{word.value}_") in view_table[0]
or view_table[0].endswith(word.value)
)
f"__{word.value}_" in view_table[0]
for word in enums.ProtectedTableKeywords
):
view_table_list.remove(view_table)
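
The effect of the narrower match can be illustrated with a small sketch. The enum below is a hypothetical stand-in for `enums.ProtectedTableKeywords` (its member values are assumed, not taken from this diff), the helper names are made up for illustration, and the table names are invented.

```python
import enum


class ProtectedTableKeywords(enum.Enum):
    """Hypothetical stand-in; the real member values are an assumption."""

    ETL = "etl"
    LIB = "lib"
    NLP = "nlp"


def is_protected_old(table_name: str) -> bool:
    """The check removed by this commit: keyword anywhere, or as a suffix."""
    return any(
        f"_{word.value}_" in table_name or table_name.endswith(word.value)
        for word in ProtectedTableKeywords
    )


def is_protected_new(table_name: str) -> bool:
    """The check added by this commit: keyword directly after the `__` separator."""
    return any(f"__{word.value}_" in table_name for word in ProtectedTableKeywords)


if __name__ == "__main__":
    for name in ("study__lib_transactions", "study__count_glib", "study__meta_etl_version"):
        print(name, is_protected_old(name), is_protected_new(name))
```

Under these assumptions, a table such as `study__count_glib` was treated as protected by the old check only because its name happens to end in `lib`; the new check skips it.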
8 changes: 6 additions & 2 deletions docs/aws-setup.md
@@ -14,7 +14,7 @@ Cumulus library executes queries against an
for creating such a datastore is available for testing purposes if you don't
already have one.

The cloudformation template in the sample database's Cloudformation template should
The sample database's CloudFormation template should
have the appropriate permissions set for all the services. If you need to configure
an IAM policy manually, you will need to ensure the AWS profile you are using has
the following permissions:
@@ -52,4 +52,8 @@ to specify where your database information lives:
- `CUMULUS_LIBRARY_DATABASE` : The name of the database Athena will use (`cumulus_library_sample_db` if using the sample DB)
- `CUMULUS_LIBRARY_WORKGROUP` : the Athena workgroup to execute queries in (`cumulus_library_sample_db` if using the sample DB)

Configuring environment variables on your system is out of scope of this document, but several guides are available elsewhere. [This guide](https://www.twilio.com/blog/2017/01/how-to-set-environment-variables.html), for example, covers Mac, Windows, and Linux. And, as a plus, it has a picture of an adorable puppy at the top of it.
Configuring environment variables on your system is out of scope of this document,
but several guides are available elsewhere.
[This guide](https://www.twilio.com/blog/how-to-set-environment-variables-html),
for example, covers Mac, Windows, and Linux.
And, as a plus, it has a picture of an adorable puppy at the top of it.
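
Once the variables are set, a quick way to confirm they are visible to Python is a check like the following. This is only a sketch: `athena_settings` is a hypothetical helper, not part of the library, and it looks only at the two variables shown in this hunk.

```python
import os

# Variable names taken from the documentation above.
REQUIRED_VARS = ("CUMULUS_LIBRARY_DATABASE", "CUMULUS_LIBRARY_WORKGROUP")


def athena_settings(env=None):
    """Return the required settings, or raise if any are unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_VARS}


if __name__ == "__main__":
    try:
        print(athena_settings())
    except RuntimeError as err:
        print(err)
```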
86 changes: 46 additions & 40 deletions docs/core-study-details.md
@@ -42,58 +42,64 @@ Examples:
## Core study exportable counts tables

### count_core_condition_icd10_month
| Variable | Description |
| -------- | -------- |
| cnt | Count |
| cond_month | Month condition recorded |
| cond_code_display | Condition code |
| enc_class_code | Encounter Code (Healthcare Setting) |

| Variable | Description |
|:------------------|:------------------------------------|
| cnt | Count |
| cond_month | Month condition recorded |
| cond_code_display | Condition code |
| enc_class_code | Encounter Code (Healthcare Setting) |


### count_core_documentreference_month
| Variable | Description |
| -------- | -------- |
| cnt | Count |
| author_month | Month document was authored |
| enc_class_code | Encounter Code (Healthcare Setting) |
| doc_type_display | Type of Document (display) |

| Variable | Description |
|:-----------------|:------------------------------------|
| cnt | Count |
| author_month | Month document was authored |
| enc_class_code | Encounter Code (Healthcare Setting) |
| doc_type_display | Type of Document (display) |


### count_core_encounter_day
| Variable | Description |
| -------- | -------- |
| cnt | Count |
| enc_class_code | Encounter Code (Healthcare Setting) |
| start_date | Day patient encounter started |

| Variable | Description |
|:---------------|:------------------------------------|
| cnt | Count |
| enc_class_code | Encounter Code (Healthcare Setting) |
| start_date | Day patient encounter started |


### count_core_encounter_month
| Variable | Description |
| -------- | -------- |
| cnt | Count |
| enc_class_code | Encounter Code (Healthcare Setting) |
| start_month | Month patient encounter started |
| age_at_visit | Patient Age at Encounter |
| gender | Biological sex at birth |
| race_display | Patient reported race |
| postalcode3 | Patient 3 digit zip |

| Variable | Description |
|:---------------|:------------------------------------|
| cnt | Count |
| enc_class_code | Encounter Code (Healthcare Setting) |
| start_month | Month patient encounter started |
| age_at_visit | Patient Age at Encounter |
| gender | Biological sex at birth |
| race_display | Patient reported race |
| postalcode3 | Patient 3 digit zip |


### count_core_observation_lab_month
| Variable | Description |
| -------- | -------- |
| cnt | Count |
| lab_month | Month of lab result |
| lab_code | Laboratory Code |
| lab_result_display | Laboratory result |
| enc_class_code | Encounter Code (Healthcare Setting) |

| Variable | Description |
|:-------------------|:------------------------------------|
| cnt | Count |
| lab_month | Month of lab result |
| lab_code | Laboratory Code |
| lab_result_display | Laboratory result |
| enc_class_code | Encounter Code (Healthcare Setting) |


### count_core_patient
| Variable | Description |
| -------- | -------- |
| cnt | Count |
| gender | Biological sex at birth |
| age | Age in years calculated since DOB |
| race_display | Patient reported race |
| postalcode3 | Patient 3 digit zip |

| Variable | Description |
|:-------------|:----------------------------------|
| cnt | Count |
| gender | Biological sex at birth |
| age | Age in years calculated since DOB |
| race_display | Patient reported race |
| postalcode3 | Patient 3 digit zip |
67 changes: 35 additions & 32 deletions docs/creating-sql-with-python.md
@@ -19,28 +19,30 @@ sections.

## Why would I even need to think about this?

There are three main reasons why you would need to use python to generate sql:
There are three main reasons why you would need to use Python to generate SQL:
- You would like to make use of the
[helper class we've built](#generating-counts-tables)
for ease of creating count tables in a structured manual.
for ease of creating count tables in a structured manner.
- You have a dataset you'd like to
[load into a table from a static file](#adding-a-static-dataset),
separate from the ETL tables.
- The gnarly one: you are working against the raw FHIR resource tables, and are
trying to access
[nested data](#querying-nested-data) in Athena.
- We infer datatypes in the ETL based on the presence of data once we get past
the top level elements, and so the structure may vary depending on the
implementation, either at the EHR level or at the FHIR interface level.
- This is gnarly because while the ETL provides a full SQL schema for your own data,
it does not guarantee a schema for data that you don't have at your site.
And if you want your study to run at multiple sites with different EHRs,
you need to be careful when accessing deep FHIR fields.
For example, your EHR might populate `Condition.evidence.code` and you can safely
write SQL that uses it. But a different site's EHR may not provide that field at all,
and thus that column may not be defined in the SQL table schema at that other site.


We've got examples of all three of these cases in this repo, and we'll reference
those as examples as we go.
You'll see examples of all three cases in this guide.

## Utilities

There are two main bits of infrastructure we use for programmatic tables:
The TableBuilder class, and the collection of template SQL.
The `TableBuilder` class, and the collection of template SQL.

### Working with TableBuilders

@@ -63,7 +65,7 @@ queries are being executed.

You can either extend this class directly (like `builder_*.py` files in
`cumulus_library/studies/core`) or create a specific class to add reusable functions
for a repeated use case (like in `cumulus_library/schema/counts.py`).
for a repeated use case (like in `cumulus_library/statistics/counts.py`).

TableBuilder SQL generally should go through a template SQL generator, so that
your SQL has been validated. If you're just working on counts, you don't need
@@ -77,27 +79,27 @@ we've got enough wrappers that you shouldn't need to worry about this
level of detail.

For validating SQL, we are using
[Jinja templates](https://jinja.palletsprojects.com/en/3.1.x/)
[Jinja templates](https://jinja.palletsprojects.com/)
to create validated SQL in a repeatable manner. We don't expect you to write these
templates - instead, using the
[template function library](../cumulus_library/template_sql/templates.py)
you can provide a series of arguments to these templates that will allow you to
[template function library](https://github.com/smart-on-fhir/cumulus-library/blob/main/cumulus_library/template_sql/base_templates.py)
you can provide arguments to these templates that will allow you to
generate standard types of SQL tables, as well as using templates targeted for
bespoke operations.

When you're thinking about a query that you'd need to create, first check the
template function library to see if something already exists. Basic CRUD
should be covered, as well as unnestings for some common FHIR objects.
template function library to see if something already exists. Basic creation and inspection
queries should be covered, as well as unnestings for some common FHIR objects.

## Use cases

### Generating counts tables
A thing we do over and over as part of studies is generate powerset counts tables
against a filtered resource to get data about a certain kind of clinical population.
Since this is so common we created a class just for this, and we're using it in all
Since this is so common, we created a class just for this, and we're using it in all
studies the Cumulus team is directly authoring.

The [CountsBuilder class](https://github.com/smart-on-fhir/cumulus-library/blob/main/cumulus_library/schema/counts.py)
The [CountsBuilder class](https://github.com/smart-on-fhir/cumulus-library/blob/main/cumulus_library/statistics/counts.py)
provides a number of convenience methods that are available for use (this covers
mechanics of generation). You can see examples of usage in the
[Core counts builder](https://github.com/smart-on-fhir/cumulus-library/blob/main//cumulus_library/studies/core/count_core.py)
@@ -119,28 +121,24 @@ As a convenience, if you include a `if __name__ == "__main__":` clause like you
see in `count_core.py`, you can invoke the builder's output by invoking it with
python, which is a nice way to get example SQL output for inclusion in github.
This is where the
[count core sql output](https://github.com/smart-on-fhir/cumulus-library/blob/main//cumulus_library/studies/core/count_core.sql)
[count core sql output](https://github.com/smart-on-fhir/cumulus-library/blob/main/cumulus_library/studies/core/reference_sql/count_core.sql)
originated from.

Add your count generator file to the `counts_builder_config` section of your
`manifest.toml` to include it in your build invocations.

### Adding a static dataset

*NOTE* - we have an
[open issue](https://github.com/smart-on-fhir/cumulus-library/issues/58)
to develop a faster methodology for adding new datasets.

Occasionally you will have a dataset from a third party that is useful for working
with your dataset. In the vocab study (requiring a license to use), we
[add coding system data](https://github.com/smart-on-fhir/cumulus-library/blob/main//cumulus_library/studies/vocab/vocab_icd_builder.py)
from flat files to athena. If you need to do this, you should extend the base
TableBuilder class, and your `prepare_queries` function should do the following,
from flat files to Athena. If you need to do this, you should extend the base
`TableBuilder` class, and your `prepare_queries` function should do the following,
leveraging the
[template function library](https://github.com/smart-on-fhir/cumulus-library/blob/main//cumulus_library/template_sql/templates.py):
- Use the `get_ctas_query` function to get a CREATE TABLE AS statement to
instantiate your table in athena
- Since athena SQL queries are limited in size to 262144 bytes, if you have
[template function library](https://github.com/smart-on-fhir/cumulus-library/blob/main/cumulus_library/template_sql/base_templates.py):
- Use the `get_ctas_query` function to get a `CREATE TABLE AS` statement to
instantiate your table in Athena
- Since Athena SQL queries are limited in size to 262144 bytes, if you have
a large dataset, break it up into smaller chunks
- Use the `get_insert_into` function to add the data from each table to
the chunk you just created.
@@ -149,6 +147,11 @@ Add the dataset uploader to the `table_builder_config` section of your
`manifest.toml` to include it in your build - this will make this data
available for downstream queries
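
The chunking step described above could be sketched as follows. This is a generic illustration, not the library's implementation: `chunk_rows` and the `overhead` allowance are hypothetical, and each resulting chunk would still need to be passed to the template functions to produce the actual SQL.

```python
ATHENA_MAX_QUERY_BYTES = 262144  # query size limit quoted in the docs above


def chunk_rows(rows, max_bytes=ATHENA_MAX_QUERY_BYTES, overhead=1024):
    """Group rows so each group's rendered VALUES text fits within the limit.

    `overhead` is an assumed allowance for the rest of the statement
    (table name, column list, and so on).
    """
    budget = max_bytes - overhead
    chunks, current, size = [], [], 0
    for row in rows:
        # Render the row roughly as it would appear in a VALUES clause.
        rendered = "(" + ",".join("'" + str(v).replace("'", "''") + "'" for v in row) + ")"
        row_bytes = len(rendered.encode("utf-8")) + 1  # +1 for the separating comma
        if row_bytes > budget:
            raise ValueError("A single row exceeds the query size budget")
        if current and size + row_bytes > budget:
            chunks.append(current)
            current, size = [], 0
        current.append(row)
        size += row_bytes
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    rows = [("A00", "Cholera")] * 10
    print([len(chunk) for chunk in chunk_rows(rows, max_bytes=100, overhead=0)])
```

Measuring the encoded byte length rather than the character count matters when the dataset contains non-ASCII text.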

{: .note }
We have an
[open issue](https://github.com/smart-on-fhir/cumulus-library/issues/58)
to develop an easier methodology for adding new datasets.

### Querying nested data

Are you trying to access data from deep within raw FHIR tables? I'm so sorry.
@@ -164,9 +167,9 @@ This means you may have differing schemas in Athena from one site's data to anot
may differ). In order to handle this, you need to create a standard output
representation that accounts for all the different permutations you have, and
conform data to match that. The
[encounter coding](https://github.com/smart-on-fhir/cumulus-library/blob/main//cumulus_library/studies/core/builder_encounter_coding.py)
[encounter](https://github.com/smart-on-fhir/cumulus-library/blob/main/cumulus_library/studies/core/builder_encounter.py)
and
[condition codeableConcept](https://github.com/smart-on-fhir/cumulus-library/blob/main//cumulus_library/studies/core/builder_condition_codeableconcept.py)
[condition](https://github.com/smart-on-fhir/cumulus-library/blob/main/cumulus_library/studies/core/builder_condition.py)
builders both jump through hoops to try and get this data into flat tables for
downstream use.

@@ -186,8 +189,8 @@ template function to invoke that template
- Create a distinct table that has an ID for joining back to the original
- Perform this join as appropriate to create a table with unnested data

You may find it useful to use the `--builder [filename]` sub argument of the cli
build command to run just your builder for iteration. The
You may find it useful to use the `--builder [filename]` sub argument of the CLI
`build` command to run just your builder for iteration. The
[Sample bulk FHIR datasets](https://github.com/smart-on-fhir/sample-bulk-fhir-datasets)
can provide an additional testbed database above and beyond whatever you produce
in house.