-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add rvdss indicator/data source #1542
base: dev
Are you sure you want to change the base?
Conversation
We also want to have some unit tests here. This can be complicated for functions that normally interact with external components (websites, databases, etc), but we want to at least test functions that don't require that. (External components can be faked with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed some things, moved some variables to constants.py
, and added some comments. I will be looking at this more, but wanted to get you started with some feedback.
Overall, I'm having trouble understanding the flow of the code. I think this could be improved by adding more docstrings and comments in general, mid-level functions, and clear function naming.
Docstrings: the shortest form of this is one line describing what the function does, the core goal. We don't necessarily need to describe all the args, like
Keyword arguments:
real -- the real part (default 0.0)
imag -- the imaginary part (default 0.0)
although it can be helpful for more complex and user-facing fns.
Comments: Comments give context to the code. They can say why we're doing this. They can describe broadly what we're doing in a chunk of code (not details of implementation). They can discuss cons or gotchas of the current approach. They can present alternatives to the current approach (generally when the alternatives were considered but were too complex for the current need).
Take this fn, for example. Why are we processing table captions? What is a table caption? What do the "table_identifiers" represent? What is the sum([all...
chunk doing (broadly)?
You don't need to answer all these questions in the code, but it's good to insert comments where you think future readers will be confused.
Mid-level functions (optional but something to think about): The helper functions you've defined so far are mostly low level (get_report_date
), and the higher level functions in this package are pretty long. It makes me think that that they can/should be broken down into more functions that call several of the low-level fns at once.
Function naming (recommended but optional): Try to make fn names very clearly say what they are doing. For example, get_report_date
sounds to me like it is fetching the report date from the website ("get") . However, it's actually interpreting a given start_year-week combo. A better name might be "parse_report_date". Actually, I'm confused by the term "report date", because that makes me think of the date the entire report was posted, but I think this is handling time values (weeks within a given report). So maybe even "parse_time_value"? Or (this is too long, but brainstorming here...) "wrap_early_weeknums_to_next_year".
src/acquisition/rvdss/constants.py
Outdated
# Construct dashboard and data report URLS. | ||
DASHBOARD_BASE_URL_2023 = "https://health-infobase.canada.ca/src/data/respiratory-virus-detections/archive/{date}/" | ||
DASHBOARD_BASE_URLS_2023 = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo: Please describe a bit more about what these URLs are and if we will need to update them (add new dates in) ever.
issue: also, are these actually for 2023? All the dates are in 2024
return(geo_type) | ||
|
||
def check_date_format(date_string): | ||
if not re.search("[0-9]{4}-[0-9]{2}-[0-9]{2}",date_string): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (optional): Since datetime.strptime
errors if the desired format doesn't match the actual format of the date, we could do this block as a try-catch instead.
suggestion (optional): or we could look for a function that auto-detects common date formats, like R's as.Date
, so we don't need to manually do all this checking. dateutil
might be a good place to start.
|
||
def check_duplicate_rows(table): | ||
if table['week'].duplicated().any(): | ||
table.columns = [re.sub("canada","can",t) for t in table.columns] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: Is this a permanent replacement of the "canada" name? I thought we were doing this elsewhere. If not, we should be doing it elsewhere so the logic flows better (someone reading the code won't think that this kind of geo replacement is happening in this particular function).
for name, group in grouped: | ||
duplicates_drop.append(group['can tests'].idxmin()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo: please add a comment here adding more detail about why we're doing this (what we've seen in existing data that necessitates this) and what our specific deduplication approach is.
pat8= r"^ah1n1pdm09" | ||
combined_pat3 = '|'.join((pat5, pat6,pat7,pat8)) | ||
|
||
table.columns=[re.sub(combined_pat, "positive_tests",col) for col in table.columns] #making naming consistent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thought (optional): we do a LOT of changing names here. I think this chunk would be better as a separate function that is called here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, it looks like this function mostly does renaming, so maybe we should change the name of the function to better reflect that, but leave the contents as-is.
Ditto for the other create....table
functions
# Rename columns | ||
table.columns = [re.sub("\xa0"," ", col) for col in table.columns] # \xa0 to space |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thought: hmm, a lot of renaming happens here too. If this happens for all tables, maybe factor this out to a new rename...
fn, and call it within each create...table
fn.
More generally, this fn is pretty long. It would be more readable broken up into more separate functions.
Plan to do code walkthrough next week! |
… into seperate function
Quality Gate passedIssues Measures |
…asily fetched (#1551) * add basic sql tables -- needs update with real col names * rename files * add main fn with CLI; remove date range params in package frontend fn stubs * start filling out historical fn stubs * rest of new fn layout. adds CLI * dashboard results can be stored directly in list in fetch_historical_dashboard_data * Add in archived dashboards, and calculate start year from data * address todos and fix historical fetching * Change misspelled CB to BC * Update imports --------- Co-authored-by: cchuong <[email protected]>
Quality Gate passedIssues Measures |
Summary:
Code to create a new datasource for the epidata AP. The data is respiratory virus detections in Canada, reported by the Public Health Agency of Canada (PHAC).
Prerequisites:
dev
branchdev