refactor hospital admission to use delphi_utils create_export_csv #2032
Conversation
BREAKING CHANGE: update_indicator now outputs a dataframe instead of a dictionary
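For context, a toy sketch of this interface change (all data hypothetical; column names inferred from the review discussion below):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for real indicator output (values hypothetical).
dates = pd.to_datetime(["2020-02-01", "2020-02-02"])
geo_ids = ["01001", "01003"]
rates = np.array([[1.2, 3.4], [2.1, 0.9]])       # shape: dates x geos
std_errs = np.array([[0.1, 0.2], [0.2, 0.1]])

# Old interface: a dict of parallel arrays keyed by field name.
output_dict = {"rates": rates, "se": std_errs, "dates": dates,
               "geo_ids": geo_ids, "geo_level": "county",
               "include": np.ones_like(rates, dtype=bool)}

# New interface: one tidy row per (timestamp, geo_id), the shape that
# create_export_csv consumes.
output_df = pd.DataFrame(
    [(d, g, rates[i, j], std_errs[i, j], True)
     for i, d in enumerate(dates) for j, g in enumerate(geo_ids)],
    columns=["timestamp", "geo_id", "val", "se", "incl"],
)
```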
thanks for doing this, made a few suggestions and asked a few questions
```
@@ -289,3 +295,46 @@ def test_write_to_csv_wrong_results(self):
        updater.write_to_csv(res3, td.name)
        td.cleanup()

    def test_prefilter_results(self):
```
praise: nice test, thanks!
question: how big is the dataset we're comparing here? do you think it's representative and gets a lot of code coverage?
suggestion: this seems like another migration test we can remove once this PR is ready for merge.
suggestion: if we want to be especially careful, we could run this same kind of test but compare staging and prod output CSVs.
A1: I need to double-check, but I believe I got an actual file from a one-off run; it should be about a gig. Do you think I should add another file that's more recent?
Response to S1: that's the idea.
Response to S2: that seems like a good idea; I would need to poke around staging and see what happens.
> A1: I need to double-check, but I believe I got an actual file from a one-off run; it should be about a gig. Do you think I should add another file that's more recent?

I'm not familiar with the source files for hospital admission, but the answer here really depends on whether the source file is one of many signals, one of many geos, etc. If this single drop contains every signal as a column and it's the source geo that we aggregate up, then that's good coverage. But if not, a prod/staging comparison will get that coverage instead.

side-note: it's very important that we squash-merge this PR, so the gig-sized file doesn't make it into the commit history.

> Response to S2: that seems like a good idea; I would need to poke around staging and see what happens.

I think it would be worthwhile, so let's do that at some point. I also think that your experience with doing prod/staging comparisons will help us streamline this process in the future and make something that does branch comparisons at the press of a button.
In staging, I ran the older version and saved the output in /common/text_hosptial_admission_test_export_20240903, then scp'd it to local and compared against the new version's output. Sample script below:
```python
import glob

import pandas as pd

def test_compare_run(self):
    expected_path = "../from_staging/test_export"
    actual_path = "../receiving"
    # glob returns full paths, so open the entries directly
    # (no need to re-prefix with the directory)
    expected_files = sorted(glob.glob(f"{expected_path}/*.csv"))
    actual_files = sorted(glob.glob(f"{actual_path}/*.csv"))
    for expected, actual in zip(expected_files, actual_files):
        expected_df = pd.read_csv(expected)
        actual_df = pd.read_csv(actual)
        pd.testing.assert_frame_equal(expected_df, actual_df)
```
passed.
how many export csvs are produced by the staging run in /common/text_hosptial_admission_test_export_20240903?
20076 or so. hospital-admission creates all geos from 2020-02-01 through 2024-08-31 (there's some lag).
```python
                assert np.all(group.val > 0) and np.all(group.se > 0), "p=0, std_err=0 invalid"
            else:
                group["se"] = np.NaN
            group.drop("incl", inplace=True, axis="columns")
```
question: is this necessary here? create_export_csv will drop it anyway.

broader question: tracing the code above, I actually don't know what columns are in output_df at this step. In the previous code, we at least knew that we were dealing with

```python
output_dict = {
    "rates": rates,
    "se": std_errs,
    "dates": self.output_dates,
    "geo_ids": unique_geo_ids,
    "geo_level": self.geo,
    "include": valid_inds,
}
```

suggestion: I suppose that depends on what res = ClaimsHospIndicator.fit(sub_data, self.burnindate, geo_id) outputs in update_indicator, but I haven't tracked that down. What do you think about adding an assert at the end of update_indicator that makes sure output_df has all the right columns we expect?
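A minimal sketch of that suggestion (the expected column set is my assumption, inferred from this thread):

```python
# At the end of update_indicator, before returning output_df:
EXPECTED_COLUMNS = {"geo_id", "timestamp", "val", "se", "sample_size", "incl"}
assert set(output_df.columns) == EXPECTED_COLUMNS, \
    f"unexpected columns: {set(output_df.columns) ^ EXPECTED_COLUMNS}"
```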
Could you take a look at the comment above again? I updated it.
That's a good point! I would actually have missed sample_size if it weren't for your comment. Hopefully this fixes the issue.

> question: is this necessary here? create_export_csv will drop it anyway.

Yes, we do need incl at least until preprocess_output, which filters rows on the incl column being true.
> That's a good point! I would actually have missed sample_size if it weren't for your comment. Hopefully this fixes the issue.

Glad that helped!

> Yes, we do need incl at least until preprocess_output, which filters rows on the incl column being true.

I meant: is it even necessary to drop it in line 230, since create_export_csv will ignore it when writing the CSV? But it's a minor thing, not a big deal.
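For reference, a hedged sketch of the create_export_csv call this PR migrates to; the export_dir and sensor values are placeholders, and the keyword names reflect my reading of delphi_utils rather than this PR's diff:

```python
from delphi_utils import create_export_csv

# Writes one <date>_<geo>_<sensor>.csv per day under export_dir, keeping only
# the standard columns (val, se, sample_size); extras like incl are ignored.
create_export_csv(
    df=output_df,            # expects geo_id, timestamp, val, se, sample_size
    export_dir="./receiving",                # placeholder path
    geo_res="county",                        # placeholder geo level
    sensor="smoothed_covid19_from_claims",   # placeholder sensor name
)
```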
I wish there weren't so many formatting changes here; it makes it hard to see the real meat of the functionality changes. :(

Is this all worth it? It looks like there is a lot of stuff happening here just to shoehorn in the usage of a fairly simple ~45-line file utility method... Does this give us an efficiency boost? Do you have timing comparisons of runs of the old vs. the new code? Did you consider creating a smaller diff by just replacing calls to (or within) write_to_csv() with calls to create_export_csv()?

There are some logging lines that are being removed -- can we replace them with similar stuff in create_export_csv()?
Almost all of the formatting changes are auto-generated with darker, so that's out of my hands.

The current create_export_csv is slower than writing in a for loop, but it's also not the main bottleneck making this indicator slower. The difference between the for loop and create_export_csv as-is is about 60 seconds, but there are quick changes (grouping by date rather than filtering per date) that cut the difference to 30 seconds: cprofile_main.txt. There are low-hanging fruits that would negate the 30-second difference and more; similar to doctor-visits, there are multiple reads and writes of the CSV that can be removed relatively easily. I think refactoring over to create_export_csv is easier in terms of maintainability and readability, and it also gives us some extras that are already built in, like missingness columns and null removal.
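To make the "grouping by date rather than filtering per date" change concrete, a toy sketch (not the actual delphi_utils code):

```python
import pandas as pd

# Toy frame standing in for the indicator output (values hypothetical).
df = pd.DataFrame({"timestamp": ["20200201", "20200201", "20200202"],
                   "geo_id": ["01001", "01003", "01001"],
                   "val": [1.0, 2.0, 3.0]})

# Filtering once per export date rescans every row for each date...
for date in df["timestamp"].unique():
    day_df = df[df["timestamp"] == date]   # O(n) scan per date
    day_df.to_csv(f"{date}_county_toy.csv", index=False)

# ...whereas a single groupby partitions the frame once.
for date, day_df in df.groupby("timestamp"):
    day_df.to_csv(f"{date}_county_toy.csv", index=False)
```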
Honestly my bad for not documenting the darker workflow better -- make your code changes, commit them, then run darker.
I've been running the linter more frequently, so some of the commits also have formatting changes, since the CI run checks for lints before the tests and I like to see that all the checks pass with my commits. I can work on isolating the commits for functional changes from linting commits in the future. But the files-changed page by default shows all the changes that were made, unless you manually filter out the lint commits; at least that's how I understand the files-changed page works... Is there a workaround that doesn't involve manually filtering?
Not that I know of, but oftentimes looking through the individual commits is good enough for me.
```python
            else:
                group["se"] = np.NaN
                group["sample_size"] = np.NaN
            df_list.append(group)
```
Rereading this bit, I'm thinking: df_list seems unnecessary. These are all just asserts that terminate the pipeline if any of the values don't pass validation, so we should just run the asserts, but not rebuild the df. The only differences between filtered_df and output_df are the group["se"] = np.NaN and group["sample_size"] = np.NaN transformations, but those are independent of group, so they can be handled outside the for-loop. It might even make sense to handle

```python
filtered_df = df[df["incl"]]
filtered_df = filtered_df.reset_index()
filtered_df.rename(columns={"rate": "val"}, inplace=True)
filtered_df["timestamp"] = filtered_df["timestamp"].astype(str)
```

and the NA-adding at the end of update_indicator, call this function validate_dataframe, and not have it return anything.
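A rough sketch of that shape (names and signature hypothetical):

```python
import numpy as np

def validate_dataframe(filtered_df, write_se):
    # Run the per-geo sanity checks without rebuilding the frame; terminate
    # the pipeline on any violation. NaN handling for se/sample_size moves
    # to the end of update_indicator.
    for geo_id, group in filtered_df.groupby("geo_id"):
        if write_se:
            assert np.all(group.val > 0) and np.all(group.se > 0), \
                "p=0, std_err=0 invalid"
```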
Initially, I'd just try removing the df_list and output_df, though, and do something like this after the for loop with all the assert statements:

```python
if not self.write_se:
    filtered_df["se"] = np.NaN
    filtered_df["sample_size"] = np.NaN
filtered_df.drop(columns=["incl"], inplace=True)
assert sorted(list(filtered_df.columns)) == ["geo_id", "sample_size", "se", "timestamp", "val"]
return filtered_df
```
Curious (a) if that works, (b) how much that speeds things up. I'd guess that this for loop is really expensive because it runs over all counties (update_indicator does too, but at least we parallelize that). There may be a way to avoid the for loop using DataFrame methods, but we can get there later.
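A possible DataFrame-level replacement for the loop (sketch only, assuming df, self.write_se, and the incl/val/se columns from the surrounding code):

```python
# Same assertions as the per-geo loop, applied to all included rows at once.
filtered_df = df[df["incl"]]
if self.write_se:
    bad = filtered_df[(filtered_df["val"] <= 0) | (filtered_df["se"] <= 0)]
    assert bad.empty, "p=0, std_err=0 invalid"
```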
Actually, create_export_csv is what's slowing down the runs: the preprocessing takes about half the time of the previous write_to_csv; it's the delphi_utils create_export_csv that's slower than the previous version.
Still moved the NaN columns outside of the loop, though.
Gotcha, I see what you mean. Really surprised create_export_csv is that much slower, that's a bit unfortunate.
optimization not really panning out; closing
Changelog
Kept the existing method for testing/review purposes.
Associated Issue(s)