Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add rvdss indicator/data source #1542

Draft
wants to merge 33 commits into
base: dev
Choose a base branch
from
Draft

Add rvdss indicator/data source #1542

wants to merge 33 commits into from

Conversation

cchuong
Copy link
Collaborator

@cchuong cchuong commented Sep 13, 2024

Summary:

Code to create a new datasource for the epidata AP. The data is respiratory virus detections in Canada, reported by the Public Health Agency of Canada (PHAC).

Prerequisites:

  • Unless it is a documentation hotfix it should be merged against the dev branch
  • Branch is up-to-date with the branch to be merged with, i.e. dev
  • Build is successful
  • Code is cleaned up and formatted

@nmdefries nmdefries self-requested a review September 13, 2024 19:44
@nmdefries
Copy link
Contributor

We also want to have some unit tests here. This can be complicated for functions that normally interact with external components (websites, databases, etc), but we want to at least test functions that don't require that. (External components can be faked with mocking packages, but it's not essential.)

Copy link
Contributor

@nmdefries nmdefries left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed some things, moved some variables to constants.py, and added some comments. I will be looking at this more, but wanted to get you started with some feedback.

Overall, I'm having trouble understanding the flow of the code. I think this could be improved by adding more docstrings and comments in general, mid-level functions, and clear function naming.

Docstrings: the shortest form of this is one line describing what the function does, the core goal. We don't necessarily need to describe all the args, like

    Keyword arguments:
    real -- the real part (default 0.0)
    imag -- the imaginary part (default 0.0)

although it can be helpful for more complex and user-facing fns.

Comments: Comments give context to the code. They can say why we're doing this. They can describe broadly what we're doing in a chunk of code (not details of implementation). They can discuss cons or gotchas of the current approach. They can present alternatives to the current approach (generally when the alternatives were considered but were too complex for the current need).

Take this fn, for example. Why are we processing table captions? What is a table caption? What do the "table_identifiers" represent? What is the sum([all... chunk doing (broadly)?

You don't need to answer all these questions in the code, but it's good to insert comments where you think future readers will be confused.

Mid-level functions (optional but something to think about): The helper functions you've defined so far are mostly low level (get_report_date), and the higher level functions in this package are pretty long. It makes me think that that they can/should be broken down into more functions that call several of the low-level fns at once.

Function naming (recommended but optional): Try to make fn names very clearly say what they are doing. For example, get_report_date sounds to me like it is fetching the report date from the website ("get") . However, it's actually interpreting a given start_year-week combo. A better name might be "parse_report_date". Actually, I'm confused by the term "report date", because that makes me think of the date the entire report was posted, but I think this is handling time values (weeks within a given report). So maybe even "parse_time_value"? Or (this is too long, but brainstorming here...) "wrap_early_weeknums_to_next_year".

Comment on lines 53 to 55
# Construct dashboard and data report URLS.
DASHBOARD_BASE_URL_2023 = "https://health-infobase.canada.ca/src/data/respiratory-virus-detections/archive/{date}/"
DASHBOARD_BASE_URLS_2023 = (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: Please describe a bit more about what these URLs are and if we will need to update them (add new dates in) ever.

issue: also, are these actually for 2023? All the dates are in 2024

return(geo_type)

def check_date_format(date_string):
if not re.search("[0-9]{4}-[0-9]{2}-[0-9]{2}",date_string):
Copy link
Contributor

@nmdefries nmdefries Sep 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (optional): Since datetime.strptime errors if the desired format doesn't match the actual format of the date, we could do this block as a try-catch instead.

suggestion (optional): or we could look for a function that auto-detects common date formats, like R's as.Date, so we don't need to manually do all this checking. dateutil might be a good place to start.


def check_duplicate_rows(table):
if table['week'].duplicated().any():
table.columns = [re.sub("canada","can",t) for t in table.columns]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: Is this a permanent replacement of the "canada" name? I thought we were doing this elsewhere. If not, we should be doing it elsewhere so the logic flows better (someone reading the code won't think that this kind of geo replacement is happening in this particular function).

Comment on lines 136 to 137
for name, group in grouped:
duplicates_drop.append(group['can tests'].idxmin())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: please add a comment here adding more detail about why we're doing this (what we've seen in existing data that necessitates this) and what our specific deduplication approach is.

pat8= r"^ah1n1pdm09"
combined_pat3 = '|'.join((pat5, pat6,pat7,pat8))

table.columns=[re.sub(combined_pat, "positive_tests",col) for col in table.columns] #making naming consistent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought (optional): we do a LOT of changing names here. I think this chunk would be better as a separate function that is called here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, it looks like this function mostly does renaming, so maybe we should change the name of the function to better reflect that, but leave the contents as-is.

Ditto for the other create....table functions

Comment on lines 369 to 370
# Rename columns
table.columns = [re.sub("\xa0"," ", col) for col in table.columns] # \xa0 to space
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: hmm, a lot of renaming happens here too. If this happens for all tables, maybe factor this out to a new rename... fn, and call it within each create...table fn.

More generally, this fn is pretty long. It would be more readable broken up into more separate functions.

@nmdefries
Copy link
Contributor

Plan to do code walkthrough next week!

Copy link

…asily fetched (#1551)

* add basic sql tables -- needs update with real col names

* rename files

* add main fn with CLI; remove date range params in package frontend fn stubs

* start filling out historical fn stubs

* rest of new fn layout. adds CLI

* dashboard results can be stored directly in list in fetch_historical_dashboard_data

* Add in archived dashboards, and calculate start year from data

* address todos and fix historical fetching

* Change misspelled CB to BC

* Update imports

---------

Co-authored-by: cchuong <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants