refactor hospital admission to use delphi_utils create_export_csv #2032
Conversation
BREAKING CHANGE: update_indicator now outputs a dataframe instead of a dictionary
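For context, a toy sketch of this interface change (all data hypothetical; column names inferred from the review discussion below):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for real indicator output (values hypothetical).
dates = pd.to_datetime(["2020-02-01", "2020-02-02"])
geo_ids = ["01001", "01003"]
rates = np.array([[1.2, 3.4], [2.1, 0.9]])       # shape: dates x geos
std_errs = np.array([[0.1, 0.2], [0.2, 0.1]])

# Old interface: a dict of parallel arrays keyed by field name.
output_dict = {"rates": rates, "se": std_errs, "dates": dates,
               "geo_ids": geo_ids, "geo_level": "county",
               "include": np.ones_like(rates, dtype=bool)}

# New interface: one tidy row per (timestamp, geo_id), the shape that
# create_export_csv consumes.
output_df = pd.DataFrame(
    [(d, g, rates[i, j], std_errs[i, j], True)
     for i, d in enumerate(dates) for j, g in enumerate(geo_ids)],
    columns=["timestamp", "geo_id", "val", "se", "incl"],
)
```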
thanks for doing this, made a few suggestions and asked a few questions
```
@@ -289,3 +295,46 @@ def test_write_to_csv_wrong_results(self):
        updater.write_to_csv(res3, td.name)
        td.cleanup()

    def test_prefilter_results(self):
```
praise: nice test, thanks!
question: how big is the dataset we're comparing here? do you think it's representative and gets a lot of code coverage?
suggestion: this seems like another migration test we can remove once this PR is ready for merge.
suggestion: if we want to be especially careful, we could run this same kind of test but compare staging and prod output CSVs.
A1: I need to double-check, but I believe I got an actual file from a one-off run; it should be about a gig. Do you think I should add another file that's more recent?
Response to S1: that's the idea.
Response to S2: that seems like a good idea; I would need to poke around staging and see what happens.
> A1: I need to double-check, but I believe I got an actual file from a one-off run; it should be about a gig. Do you think I should add another file that's more recent?

I'm not familiar with the source files for hospital admission, but the answer here really depends on whether the source file is one of many signals, one of many geos, etc. If this single drop contains every signal as a column and it's the source geo that we aggregate up, then that's good coverage. But if not, a prod/staging comparison will get that coverage instead.

side-note: it's very important that we squash-merge this PR, so the gig-sized file doesn't make it into the commit history.

> Response to S2: that seems like a good idea; I would need to poke around staging and see what happens.

I think it would be worthwhile, so let's do that at some point. I also think that your experience with doing prod/staging comparisons will help us streamline this process in the future and make something that does branch comparisons at the press of a button.
In staging, I ran the older version and saved the output in /common/text_hosptial_admission_test_export_20240903, then scp'd it to local and compared against the new version's output. Sample script below:
```python
import glob

import pandas as pd

def test_compare_run(self):
    expected_path = "../from_staging/test_export"
    actual_path = "../receiving"
    # glob returns full paths, so open the entries directly
    # (no need to re-prefix with the directory)
    expected_files = sorted(glob.glob(f"{expected_path}/*.csv"))
    actual_files = sorted(glob.glob(f"{actual_path}/*.csv"))
    for expected, actual in zip(expected_files, actual_files):
        expected_df = pd.read_csv(expected)
        actual_df = pd.read_csv(actual)
        pd.testing.assert_frame_equal(expected_df, actual_df)
```
passed.
how many export csvs are produced by the staging run in /common/text_hosptial_admission_test_export_20240903?
20076 or so. hospital-admission creates all geos from 2020-02-01 through 2024-08-31 (there's some lag).
```python
                assert np.all(group.val > 0) and np.all(group.se > 0), "p=0, std_err=0 invalid"
            else:
                group["se"] = np.NaN
            group.drop("incl", inplace=True, axis="columns")
```
question: is this necessary here? create_export_csv will drop it anyway.

broader question: tracing the code above, I actually don't know what columns are in output_df at this step. In the previous code, we at least knew that we were dealing with

```python
output_dict = {
    "rates": rates,
    "se": std_errs,
    "dates": self.output_dates,
    "geo_ids": unique_geo_ids,
    "geo_level": self.geo,
    "include": valid_inds,
}
```

suggestion: I suppose that depends on what res = ClaimsHospIndicator.fit(sub_data, self.burnindate, geo_id) outputs in update_indicator, but I haven't tracked that down. What do you think about adding an assert at the end of update_indicator that makes sure output_df has all the right columns we expect?
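A minimal sketch of that suggestion (the expected column set is my assumption, inferred from this thread):

```python
# At the end of update_indicator, before returning output_df:
EXPECTED_COLUMNS = {"geo_id", "timestamp", "val", "se", "sample_size", "incl"}
assert set(output_df.columns) == EXPECTED_COLUMNS, \
    f"unexpected columns: {set(output_df.columns) ^ EXPECTED_COLUMNS}"
```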
Could you take a look at the comment above again? I updated it.
That's a good point! I would actually have missed sample_size if it weren't for your comment. Hopefully this fixes the issue.

> question: is this necessary here? create_export_csv will drop it anyway.

Yes, we do need incl at least until preprocess_output, which filters rows on the incl column being true.
> That's a good point! I would actually have missed sample_size if it weren't for your comment. Hopefully this fixes the issue.

Glad that helped!

> Yes, we do need incl at least until preprocess_output, which filters rows on the incl column being true.

I meant: is it even necessary to drop it in line 230, since create_export_csv will ignore it when writing the CSV? But it's a minor thing, not a big deal.
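For reference, a hedged sketch of the create_export_csv call this PR migrates to; the export_dir and sensor values are placeholders, and the keyword names reflect my reading of delphi_utils rather than this PR's diff:

```python
from delphi_utils import create_export_csv

# Writes one <date>_<geo>_<sensor>.csv per day under export_dir, keeping only
# the standard columns (val, se, sample_size); extras like incl are ignored.
create_export_csv(
    df=output_df,            # expects geo_id, timestamp, val, se, sample_size
    export_dir="./receiving",                # placeholder path
    geo_res="county",                        # placeholder geo level
    sensor="smoothed_covid19_from_claims",   # placeholder sensor name
)
```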
I wish there weren't so many formatting changes here; it makes it hard to see the real meat of the functionality changes. :(

Is this all worth it? It looks like there is a lot of stuff happening here just to shoehorn in the usage of a fairly simple ~45-line file utility method... Does this give us an efficiency boost? Do you have timing comparisons of runs of the old vs. the new code? Did you consider creating a smaller diff by just replacing calls to (or within) write_to_csv() with calls to create_export_csv()?

There are some logging lines that are being removed -- can we replace them with similar stuff in create_export_csv()?
Almost all of the formatting changes are auto-generated with darker, so that's out of my hands.

The current create_export_csv is slower than writing in a for loop, but it's also not the main bottleneck making this indicator slower. The difference between the for loop and create_export_csv as-is is about 60 seconds, but there are quick changes (grouping by date rather than filtering per date) that cut the difference to 30 seconds: cprofile_main.txt. There are low-hanging fruits that would negate the 30-second difference and more; similar to doctor-visits, there are multiple reads and writes of the CSV that can be removed relatively easily. I think refactoring over to create_export_csv is easier in terms of maintainability and readability, and it also gives us some extras that are already built in, like missingness columns and null removal.
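To make the "grouping by date rather than filtering per date" change concrete, a toy sketch (not the actual delphi_utils code):

```python
import pandas as pd

# Toy frame standing in for the indicator output (values hypothetical).
df = pd.DataFrame({"timestamp": ["20200201", "20200201", "20200202"],
                   "geo_id": ["01001", "01003", "01001"],
                   "val": [1.0, 2.0, 3.0]})

# Filtering once per export date rescans every row for each date...
for date in df["timestamp"].unique():
    day_df = df[df["timestamp"] == date]   # O(n) scan per date
    day_df.to_csv(f"{date}_county_toy.csv", index=False)

# ...whereas a single groupby partitions the frame once.
for date, day_df in df.groupby("timestamp"):
    day_df.to_csv(f"{date}_county_toy.csv", index=False)
```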
Honestly my bad for not documenting the darker workflow better -- make your code changes, commit them, then run darker.
I've been running the linter more frequently, so some of the commits also have formatting changes, since the CI run checks for lints before the tests and I like to see that all the checks pass with my commits. I can work on isolating the commits for functional changes from linting commits in the future. But the files-changed page by default shows all the changes that were made, unless you manually filter out the lint commits; at least that's how I understand the files-changed page works... Is there a workaround that doesn't involve manually filtering?
Not that I know of, but oftentimes looking through the individual commits is good enough for me.
```python
            else:
                group["se"] = np.NaN
                group["sample_size"] = np.NaN
            df_list.append(group)
```
Rereading this bit, I'm thinking: df_list seems unnecessary. These are all just asserts that terminate the pipeline if any of the values don't pass validation, so we should just run the asserts, but not rebuild the df. The only differences between filtered_df and output_df are the group["se"] = np.NaN and group["sample_size"] = np.NaN transformations, but those are independent of group, so they can be handled outside the for-loop. It might even make sense to handle

```python
filtered_df = df[df["incl"]]
filtered_df = filtered_df.reset_index()
filtered_df.rename(columns={"rate": "val"}, inplace=True)
filtered_df["timestamp"] = filtered_df["timestamp"].astype(str)
```

and the NA-adding at the end of update_indicator, call this function validate_dataframe, and not have it return anything.
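A rough sketch of that shape (names and signature hypothetical):

```python
import numpy as np

def validate_dataframe(filtered_df, write_se):
    # Run the per-geo sanity checks without rebuilding the frame; terminate
    # the pipeline on any violation. NaN handling for se/sample_size moves
    # to the end of update_indicator.
    for geo_id, group in filtered_df.groupby("geo_id"):
        if write_se:
            assert np.all(group.val > 0) and np.all(group.se > 0), \
                "p=0, std_err=0 invalid"
```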
Initially, I'd just try removing the df_list and output_df, though, and do something like this after the for loop with all the assert statements:

```python
if not self.write_se:
    filtered_df["se"] = np.NaN
    filtered_df["sample_size"] = np.NaN
filtered_df.drop(columns=["incl"], inplace=True)
assert sorted(list(filtered_df.columns)) == ["geo_id", "sample_size", "se", "timestamp", "val"]
return filtered_df
```
Curious (a) if that works, (b) how much that speeds things up. I'd guess that this for loop is really expensive because it runs over all counties (update_indicator does too, but at least we parallelize that). There may be a way to avoid the for loop using DataFrame methods, but we can get there later.
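A possible DataFrame-level replacement for the loop (sketch only, assuming df, self.write_se, and the incl/val/se columns from the surrounding code):

```python
# Same assertions as the per-geo loop, applied to all included rows at once.
filtered_df = df[df["incl"]]
if self.write_se:
    bad = filtered_df[(filtered_df["val"] <= 0) | (filtered_df["se"] <= 0)]
    assert bad.empty, "p=0, std_err=0 invalid"
```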
Actually, create_export_csv is what's slowing down the runs: the preprocessing takes about half the time of the previous write_to_csv; it's the delphi_utils create_export_csv that's slower than the previous version.
Still moved the NaN columns outside of the loop, though.
Gotcha, I see what you mean. Really surprised create_export_csv is that much slower, that's a bit unfortunate.
optimization not really panning out; closing
Changelog
Kept the existing method for testing/review purposes.
Associated Issue(s)