StudyConfig infrastructure #233

Merged
merged 6 commits on May 9, 2024
Changes from 3 commits
24 changes: 17 additions & 7 deletions cumulus_library/apis/umls.py
@@ -94,10 +94,22 @@ def get_vsac_valuesets(
all_responses.append(valueset[0])
return all_responses

def get_latest_umls_file_release(self, target: str):
if target not in VALID_UMLS_DOWNLOADS:
raise errors.ApiError(
f"'{target}' is not a valid umls download type.\n\n"
f"Expected values: {','.join(VALID_UMLS_DOWNLOADS)}"
)
release_payload = {"releaseType": target, "current": "true"}
return self.session.get(
"https://uts-ws.nlm.nih.gov/releases", params=release_payload
).json()[0]
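The extracted helper can be sketched standalone to show its validate-then-fetch shape. This is a hedged sketch, not the module's code: `VALID_UMLS_DOWNLOADS` is trimmed to the two values visible in this diff (the real list may be longer), and `ValueError` stands in for the module's `errors.ApiError`.

```python
VALID_UMLS_DOWNLOADS = [
    "umls-metathesaurus-full-subset",
    "umls-metathesaurus-mrconso-file",
]

def get_latest_umls_file_release(session, target: str) -> dict:
    """Return metadata for the newest release of a UMLS download target."""
    if target not in VALID_UMLS_DOWNLOADS:
        # Fail fast on bad input before any network traffic happens
        raise ValueError(
            f"'{target}' is not a valid umls download type.\n\n"
            f"Expected values: {','.join(VALID_UMLS_DOWNLOADS)}"
        )
    payload = {"releaseType": target, "current": "true"}
    # The releases endpoint returns a JSON array with the newest release first
    return session.get(
        "https://uts-ws.nlm.nih.gov/releases", params=payload
    ).json()[0]
```

Factoring this out of `download_umls_files` lets callers inspect release metadata without committing to a download.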

def download_umls_files(
self,
target: str = "umls-metathesaurus-mrconso-file",
target: str = "umls-metathesaurus-full-subset",
Contributor Author:

this is the new target of the UMLS study.

Contributor:

What did you do with Mr. Conso?!

Contributor Author:

Mr. Conso brought all his friends along! He's kind of an inconsiderate guest, I didn't get enough snacks.

path: pathlib.Path | None = None,
unzip: bool = True,
):
"""Downloads an available file from the UMLS Download API and unzips it
target: the UMLS resource to download (default: the MRCONSO.RRF file)
@@ -113,10 +125,7 @@ def download_umls_files(
)
if path is None:
path = pathlib.Path.cwd()
release_payload = {"releaseType": target, "current": "true"}
file_meta = self.session.get(
"https://uts-ws.nlm.nih.gov/releases", params=release_payload
).json()[0]
file_meta = self.get_latest_umls_file_release(target)

# This particular endpoint requires the API key as a param rather than a
# basic auth header ¯\_(ツ)_/¯.
@@ -144,5 +153,6 @@ def download_umls_files(
f"{chunks_read/1000} MB"
),
)
base_utils.unzip_file(path / file_meta["fileName"], path)
(path / file_meta["fileName"]).unlink()
if unzip:
base_utils.unzip_file(path / file_meta["fileName"], path)
(path / file_meta["fileName"]).unlink()
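The new `unzip` guard pairs extraction with deleting the archive. Here is a minimal sketch of that extract-then-unlink behavior using only the standard library; `base_utils.unzip_file`'s real implementation may differ from this `zipfile` stand-in.

```python
import pathlib
import tempfile
import zipfile

def unzip_and_remove(archive: pathlib.Path, dest: pathlib.Path) -> None:
    # Extract everything, then delete the archive to reclaim disk space
    with zipfile.ZipFile(archive) as z:
        z.extractall(dest)
    archive.unlink()

# Demo against a throwaway zip in a temp directory
tmp = pathlib.Path(tempfile.mkdtemp())
archive_path = tmp / "release.zip"
with zipfile.ZipFile(archive_path, "w") as z:
    z.writestr("MRCONSO.RRF", "row|data|")
unzip_and_remove(archive_path, tmp)
print((tmp / "MRCONSO.RRF").exists(), archive_path.exists())
```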
42 changes: 28 additions & 14 deletions cumulus_library/base_utils.py
@@ -1,18 +1,43 @@
""" Collection of small commonly used utility functions """

import dataclasses
import datetime
import json
import os
import pathlib
import shutil
import zipfile
from contextlib import contextmanager

from rich import progress

from cumulus_library import databases

def filepath(filename: str) -> str:
return os.path.join(os.path.dirname(__file__), filename)

@dataclasses.dataclass
class StudyConfig:
"""Class for containing study-centric parameters

The intent of this class is that if you want something passed through to the
prepare_queries section of a study, this is the place it should go. If you're
doing something above that level, consider explicit arguments instead. This should
be an interface aimed at a study author.

:param db_backend: a databaseBackend object for a specific target database
:keyword db_type: the argument passed in from the CLI indicating the requested DB
(this is easier to use in things like jinja templates than db_backend, if they
need to run DB technology aware queries)
:keyword replace_existing: If the study downloads data from an external resource,
force it to skip any cached data when running
:keyword stats_build: If the study runs a stats builder, force regeneration of
any sampling or other stochastic techniques
:keyword umls: A UMLS API key
"""

db_backend: databases.DatabaseBackend
db_type: str | None = None
replace_existing: bool = False
stats_build: bool = False
umls: str | None = None


def load_text(path: str) -> str:
@@ -41,17 +66,6 @@ def filter_strip(commands) -> list[str]:
return list(filter(None, [c.strip() for c in commands]))


def list_coding(code_display: dict, system=None) -> list[dict]:
as_list = []
for code, display in code_display.items():
if system:
item = {"code": code, "display": display, "system": system}
else:
item = {"code": code, "display": display}
as_list.append(item)
return as_list


@contextmanager
def query_console_output(
verbose: bool, query: str, progress_bar: progress.Progress, task: progress.Task
85 changes: 55 additions & 30 deletions cumulus_library/cli.py
@@ -103,20 +103,20 @@ def clean_and_build_study(
self,
target: pathlib.Path,
*,
stats_build: bool,
config: base_utils.StudyConfig,
continue_from: str | None = None,
) -> None:
"""Recreates study views/tables

:param target: A path to the study directory
:param stats_build: if True, forces creation of new stats tables
:param config: A StudyConfig object containing optional params
:keyword continue_from: Restart a run from a specific sql file (for dev only)
"""
studyparser = study_parser.StudyManifestParser(target, self.data_path)
try:
if not continue_from:
studyparser.run_protected_table_builder(
self.cursor, self.schema_name, verbose=self.verbose
self.cursor, self.schema_name, verbose=self.verbose, config=config
Contributor:

This is an early comment before I've read the rest of the PR - but looks like you went light-touch with this first cut. Like, you could have removed self.cursor and self.schema_name args here and had run_protected_table_builder grab them from config.db_backend, yeah? And that sort of refactor might be in play in the future?

Not a criticism, just trying to get a handle on what the current PR does and what the future vision might be, as I see a lot of similar args that could be removed around the place.

Contributor Author:

a future refactor to remove those is the plan, since removing them now would be a breaking change to the interface that studies use. Right now, table builders in other studies can just take the new config object and ignore it.
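The backward-compatibility point above can be illustrated with a hypothetical legacy builder signature (not from this codebase): it accepts the new `config` keyword and simply ignores it, so call sites can pass `config` uniformly without breaking older builders.

```python
def legacy_table_builder(cursor, schema, *, verbose=False, config=None):
    # config is accepted for interface compatibility but unused by this builder
    return f"built tables in {schema}"

# New-style call sites pass config everywhere; old builders keep working
result = legacy_table_builder(None, "main", verbose=True, config={"stats_build": True})
print(result)
```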

)
self.update_transactions(studyparser.get_study_prefix(), "started")
cleaned_tables = studyparser.clean_study(
@@ -127,25 +127,31 @@ def clean_and_build_study(
)
# If the study hasn't been created before, force stats table generation
if len(cleaned_tables) == 0:
stats_build = True
config.stats_build = True
studyparser.run_table_builder(
self.cursor,
self.schema_name,
verbose=self.verbose,
parser=self.db.parser(),
config=config,
)
else:
self.update_transactions(studyparser.get_study_prefix(), "resumed")

studyparser.build_study(self.cursor, self.verbose, continue_from)
studyparser.build_study(
self.cursor,
verbose=self.verbose,
continue_from=continue_from,
config=config,
)
studyparser.run_counts_builders(
self.cursor, self.schema_name, verbose=self.verbose
self.cursor, self.schema_name, verbose=self.verbose, config=config
)
studyparser.run_statistics_builders(
self.cursor,
self.schema_name,
verbose=self.verbose,
stats_build=stats_build,
config=config,
)
self.update_transactions(studyparser.get_study_prefix(), "finished")

@@ -158,12 +164,16 @@ def clean_and_build_study(
raise e

def run_matching_table_builder(
self, target: pathlib.Path, table_builder_name: str
self,
target: pathlib.Path,
table_builder_name: str,
config: base_utils.StudyConfig,
) -> None:
"""Runs a single table builder

:param target: A path to the study directory
:param table_builder_name: a builder file referenced in the study's manifest
:param config: A StudyConfig object containing optional params
"""
studyparser = study_parser.StudyManifestParser(target)
studyparser.run_matching_table_builder(
@@ -172,26 +182,27 @@ def run_matching_table_builder(
table_builder_name,
self.verbose,
parser=self.db.parser(),
config=config,
)

def clean_and_build_all(self, study_dict: dict, stats_build: bool) -> None:
"""Builds views for all studies.
def clean_and_build_all(
self, study_dict: dict, config: base_utils.StudyConfig
) -> None:
"""Builds tables for all studies.

NOTE: By design, this method will always exclude the `template` study dir,
since 99% of the time you don't need a live copy in the database.

:param study_dict: A dict of paths
:param stats_build: if True, regen stats tables
:param config: A StudyConfig object containing optional params
"""
study_dict = dict(study_dict)
study_dict.pop("template")
for precursor_study in ["vocab", "core"]:
self.clean_and_build_study(
study_dict[precursor_study], stats_build=stats_build
)
self.clean_and_build_study(study_dict[precursor_study], config=config)
study_dict.pop(precursor_study)
for key in study_dict:
self.clean_and_build_study(study_dict[key], stats_build=stats_build)
self.clean_and_build_study(study_dict[key], config=config)

### Data exporters
def export_study(
@@ -212,11 +223,16 @@ def export_all(self, study_dict: dict, data_path: pathlib.Path, archive: bool):
self.export_study(study_dict[key], data_path, archive)

def generate_study_sql(
self, target: pathlib.Path, builder: str | None = None
self,
target: pathlib.Path,
*,
config: base_utils.StudyConfig,
builder: str | None = None,
) -> None:
"""Materializes study sql from templates

:param target: A path to the study directory
:param config: A StudyConfig object containing optional params
:param builder: Specify a single builder to generate sql from
"""
studyparser = study_parser.StudyManifestParser(target)
@@ -226,6 +242,7 @@ def generate_study_sql(
builder=builder,
verbose=self.verbose,
parser=self.db.parser(),
config=config,
)

def generate_study_markdown(
@@ -327,9 +344,15 @@ def run_cli(args: dict):

# all other actions require connecting to the database
else:
db_backend = databases.create_db_backend(args)
config = base_utils.StudyConfig(
db_type=args.get("db_type"),
db_backend=databases.create_db_backend(args),
replace_existing=args.get("replace_existing", False),
stats_build=args.get("stats_build", False),
umls=args.get("umls"),
)
try:
runner = StudyRunner(db_backend, data_path=args.get("data_path"))
runner = StudyRunner(config.db_backend, data_path=args.get("data_path"))
if args.get("verbose"):
runner.verbose = True
console.print("[italic] Connecting to database...")
@@ -354,18 +377,17 @@ def run_cli(args: dict):
)
elif args["action"] == "build":
if "all" in args["target"]:
runner.clean_and_build_all(study_dict, args["stats_build"])
runner.clean_and_build_all(study_dict, config=config)
else:
for target in args["target"]:
if args["builder"]:
runner.run_matching_table_builder(
study_dict[target], args["builder"]
study_dict[target], config=config
)
else:
runner.clean_and_build_study(
study_dict[target],
stats_build=args["stats_build"],
continue_from=args["continue_from"],
config=config,
)

elif args["action"] == "export":
@@ -394,13 +416,15 @@

elif args["action"] == "generate-sql":
for target in args["target"]:
runner.generate_study_sql(study_dict[target], args["builder"])
runner.generate_study_sql(
study_dict[target], builder=args["builder"], config=config
)

elif args["action"] == "generate-md":
for target in args["target"]:
runner.generate_study_markdown(study_dict[target])
finally:
db_backend.close()
config.db_backend.close()


def main(cli_args=None):
@@ -421,17 +445,18 @@ def main(cli_args=None):
break

arg_env_pairs = (
("data_path", "CUMULUS_LIBRARY_DATA_PATH"),
("db_type", "CUMULUS_LIBRARY_DB_TYPE"),
("id", "CUMULUS_AGGREGATOR_ID"),
("load_ndjson_dir", "CUMULUS_LIBRARY_LOAD_NDJSON_DIR"),
("profile", "CUMULUS_LIBRARY_PROFILE"),
("schema_name", "CUMULUS_LIBRARY_DATABASE"),
("workgroup", "CUMULUS_LIBRARY_WORKGROUP"),
("region", "CUMULUS_LIBRARY_REGION"),
("schema_name", "CUMULUS_LIBRARY_DATABASE"),
("study_dir", "CUMULUS_LIBRARY_STUDY_DIR"),
("data_path", "CUMULUS_LIBRARY_DATA_PATH"),
("load_ndjson_dir", "CUMULUS_LIBRARY_LOAD_NDJSON_DIR"),
("user", "CUMULUS_AGGREGATOR_USER"),
("id", "CUMULUS_AGGREGATOR_ID"),
("umls", "UMLS_API_KEY"),
("url", "CUMULUS_AGGREGATOR_URL"),
("user", "CUMULUS_AGGREGATOR_USER"),
("workgroup", "CUMULUS_LIBRARY_WORKGROUP"),
)
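The pairs above feed a CLI-arg-falls-back-to-environment lookup. Here is a hypothetical `resolve_args` sketch of that pattern; the real wiring inside `main()` may differ in detail.

```python
import os

def resolve_args(args: dict, pairs) -> dict:
    # Fill any missing CLI argument from its paired environment variable
    resolved = dict(args)
    for arg_name, env_name in pairs:
        if resolved.get(arg_name) is None and env_name in os.environ:
            resolved[arg_name] = os.environ[env_name]
    return resolved

os.environ["UMLS_API_KEY"] = "demo-key"
pairs = (("umls", "UMLS_API_KEY"), ("profile", "CUMULUS_LIBRARY_PROFILE"))
resolved = resolve_args({"umls": None, "profile": "cli-profile"}, pairs)
print(resolved)
```

Note that an explicitly supplied CLI value always wins; the environment is only consulted when the argument is missing.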
read_env_vars = []
for pair in arg_env_pairs:
9 changes: 9 additions & 0 deletions cumulus_library/cli_parser.py
@@ -182,6 +182,15 @@ def get_parser() -> argparse.ArgumentParser:
),
dest="stats_build",
)
build.add_argument(
"--umls",
help="An API Key for the UMLS API",
)
build.add_argument(
"--replace-existing",
Contributor:

This naming is also lightly tricky - as a user not steeped in the file uploading feature, I might reasonably think this arg does what clean does - i.e. it refers to whether or not to delete existing tables. I know that's not what you had in mind when naming this arg, but if you could somehow scope this arg down to just file uploads? Maybe something more elegant than --replace-existing-uploads?

Or perversely, maybe going less verbose but in a way that doesn't create confusion with clean - like --force/-f, which is often used for this sort of thing when dealing with file-oriented commands like cp or rm.

Contributor:

(Though --force may get overloaded once we have two kinds of things we want to force through, so it's not necessarily a good idea, just an idea.)

Contributor Author:

this is so weird because it's only affecting a slice of the total state. lemme sleep on it, i'll try to get something a bit more nuanced, but it may just be --replace-existing-uploads - :hopefully: this is developer only, though there are edge cases for users

Contributor Author:

--force-uploads splits the difference

action="store_true",
help="Forces file downloads/uploads to occur, even if they already exist",
)
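Both new flags follow standard argparse conventions; in particular, `--replace-existing` is a `store_true` flag whose dashes become underscores on the parsed namespace. A self-contained sketch mirroring the two `add_argument` calls from the diff:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--umls", help="An API Key for the UMLS API")
parser.add_argument(
    "--replace-existing",
    action="store_true",
    help="Forces file downloads/uploads to occur, even if they already exist",
)

# Dashes in the flag name become underscores on the parsed namespace
args = parser.parse_args(["--umls", "abc123", "--replace-existing"])
print(args.umls, args.replace_existing)
```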
build.add_argument(
"--continue",
dest="continue_from",
Expand Down