
refactor hospital admission to use delphi_utils create_export_csv #2032

Closed. Wanted to merge 25 commits.

Conversation

@aysim319 aysim319 (Contributor) commented Aug 26, 2024

Changelog

kept the existing method for testing/review purposes

  • update_indicator now returns a dataframe
  • new function filters the output to preprocess it before feeding it into create_export_csv
  • run.py now uses create_export_csv

Associated Issue(s)

BREAKING CHANGE: update_indicator now outputs a dataframe instead of a dictionary
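The described flow (filter the indicator output, then hand it to create_export_csv) might look roughly like this minimal sketch. The function name preprocess_output and the column names (incl, rate, se, sample_size) are assumptions drawn from the review discussion below, not from the merged code:

```python
import numpy as np
import pandas as pd


def preprocess_output(df: pd.DataFrame, write_se: bool = False) -> pd.DataFrame:
    """Filter indicator output into the shape create_export_csv expects.

    Hypothetical sketch: the incl/rate/se/sample_size columns are
    taken from the review discussion, not from the merged code.
    """
    out = df[df["incl"]].copy()
    out = out.rename(columns={"rate": "val"})
    if not write_se:
        out["se"] = np.nan  # standard errors are suppressed unless write_se is set
    out["sample_size"] = np.nan  # this indicator does not publish sample sizes
    return out.drop(columns=["incl"])
```

The resulting frame would then be passed to delphi_utils' create_export_csv for writing, in place of the old per-date write_to_csv loop.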
@aysim319 aysim319 requested review from dshemetov and minhkhul August 26, 2024 17:34
@dshemetov dshemetov (Contributor) left a comment:

Thanks for doing this! I made a few suggestions and have a few questions.

Review threads (outdated, resolved):

  • claims_hosp/delphi_claims_hosp/run.py
  • claims_hosp/delphi_claims_hosp/update_indicator.py
  • claims_hosp/tests/test_update_indicator.py
@@ -289,3 +295,46 @@ def test_write_to_csv_wrong_results(self):
updater.write_to_csv(res3, td.name)

td.cleanup()

def test_prefilter_results(self):
Contributor:

praise: nice test, thanks!

question: how big is the dataset we're comparing here? do you think it's representative and gets a lot of code coverage?

suggestion: this seems like another migration test we can remove once this PR is ready for merge.

suggestion: if we want to be especially careful, we could run this same kind of test but compare staging and prod output CSVs.

Contributor Author:

A1: I need to double-check, but I believe I got an actual file from a one-off run, which should be about a gig. Do you think I should add another, more recent file?

Response to S1: that's the idea.

Response to S2: that seems like a good idea; I would need to poke around staging and see what happens.

@dshemetov dshemetov (Contributor), Aug 29, 2024:

> A1: I need to double-check, but I believe I got an actual file from a one-off run, which should be about a gig. Do you think I should add another, more recent file?

I'm not familiar with the source files for hospital admission, but the answer here really depends on whether the source file is one of many signals, one of many geos, etc. If this single drop contains every signal as a column and it's the source geo that we aggregate up, then that's good coverage. But if it's not, then doing a prod/staging comparison will get that coverage instead.

side-note: it's very important that we squash-merge this PR, so the gig-sized file doesn't make it into the commit history.

> Response to S2: that seems like a good idea; I would need to poke around staging and see what happens.

I think it would be worthwhile, so let's do that at some point. I also think your experience doing prod/staging comparisons will help us streamline this process in the future and build something that does branch comparisons at the press of a button.

@aysim319 aysim319 (Contributor Author), Sep 4, 2024:

In staging, I ran the older version and saved the output in /common/text_hosptial_admission_test_export_20240903.

After scp-ing it locally and comparing against the new version's output (sample script below):

    def test_compare_run(self):
        expected_path = "../from_staging/test_export"
        actual_path = "../receiving"
        expected_files = sorted(glob.glob(f"{expected_path}/*.csv"))
        actual_files = sorted(glob.glob(f"{actual_path}/*.csv"))
        assert len(expected_files) == len(actual_files)
        for expected, actual in zip(expected_files, actual_files):
            # glob.glob already returns paths including the directory,
            # so the files can be opened directly
            expected_df = pd.read_csv(expected)
            actual_df = pd.read_csv(actual)
            pd.testing.assert_frame_equal(expected_df, actual_df)

passed.

Contributor:

How many export CSVs are produced by the staging run in /common/text_hosptial_admission_test_export_20240903?

Contributor Author:

About 20,076. hospital-admission creates all geos starting from 2020-02-01 through 2024-08-31 (there's some lag).

Review threads (outdated, resolved): claims_hosp/delphi_claims_hosp/update_indicator.py (two threads)
assert np.all(group.val > 0) and np.all(group.se > 0), "p=0, std_err=0 invalid"
else:
group["se"] = np.NaN
group.drop("incl", inplace=True, axis="columns")
@dshemetov dshemetov (Contributor), Sep 6, 2024:

question: is this necessary here? create_export_csv will drop it anyway.

broader question: tracing the code above, I actually don't know what columns are in output_df at this step. In the previous code, we at least knew that we were dealing with

        output_dict = {
            "rates": rates,
            "se": std_errs,
            "dates": self.output_dates,
            "geo_ids": unique_geo_ids,
            "geo_level": self.geo,
            "include": valid_inds,
        }

suggestion: I suppose that depends on what res = ClaimsHospIndicator.fit(sub_data, self.burnindate, geo_id) outputs in update_indicator, but I haven't tracked that down. What do you think about adding an assert at the end of update_indicator that makes sure output_df has all the columns we expect?

Contributor:

Could you take a look at the comment above again? I updated it.

@aysim319 aysim319 (Contributor Author), Sep 9, 2024:

That's a good point! I would actually have missed sample_size if it weren't for your comment. Hopefully this fixes the issue.

> question: is this necessary here? create_export_csv will drop it anyway.

Yes, we do need incl, at least until preprocess_output, which filters to rows where the incl column is true.

@dshemetov dshemetov (Contributor), Sep 10, 2024:

> That's a good point! I would actually have missed sample_size if it weren't for your comment. Hopefully this fixes the issue.

Glad that helped!

> Yes, we do need incl, at least until preprocess_output, which filters to rows where the incl column is true.

I meant: is it necessary to even drop it on line 230, since create_export_csv will ignore it when writing the CSV? But it's a minor thing, not a big deal.

@melange396 melange396 (Contributor) left a comment:

I wish there weren't so many formatting changes here; it makes it hard to see the real meat of the functionality changes. :(

Is this all worth it? It looks like there's a lot happening here just to shoehorn in the usage of a fairly simple ~45-line file-utility method... Does this give us an efficiency boost? Do you have timing comparisons of runs of the old vs. the new code? Did you consider creating a smaller diff by just replacing calls to (or within) write_to_csv() with calls to create_export_csv()?

There are some logging lines that are being removed -- can we replace them with similar stuff in create_export_csv()?

Review thread (outdated, resolved): _delphi_utils_python/delphi_utils/export.py
@aysim319 aysim319 (Contributor Author) commented Sep 9, 2024:

> I wish there weren't so many formatting changes here; it makes it hard to see the real meat of the functionality changes. :(

Almost all the formatting changes were auto-generated by darker, so that's out of my hands.

> Is this all worth it? It looks like there's a lot happening here just to shoehorn in the usage of a fairly simple ~45-line file-utility method... Does this give us an efficiency boost? Do you have timing comparisons of runs of the old vs. the new code? Did you consider creating a smaller diff by just replacing calls to (or within) write_to_csv() with calls to create_export_csv()?

The current create_export_csv is slower than writing in a for loop, but it's also not the main bottleneck slowing this indicator down. The difference is about 60 seconds between the for loop and create_export_csv as is, but there are quick changes (grouping by date rather than filtering per date) that decrease the difference to 30 seconds.

cprofile_main.txt
cprofile_create_export_csv.txt
cprofile_create_export_csv_updated.txt

There is low-hanging fruit that would more than make up for the 30-second slowdown. As with doctor visits, there are multiple reads and writes of the CSV that can be removed relatively easily. I think refactoring over to create_export_csv is better for maintainability and readability, and it also gives us extra functionality that's already in create_export_csv (missingness columns and null removal).
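For reference, timing comparisons like the attached cprofile_*.txt files can be produced with a small helper along these lines. This is a hypothetical sketch (profile_to_file and the callable passed to it are not from the PR), using only the standard-library cProfile/pstats modules:

```python
import cProfile
import io
import pstats


def profile_to_file(func, out_path: str, top_n: int = 20) -> None:
    """Profile a callable and dump its cumulative-time stats to a text file.

    Hypothetical helper, similar in spirit to how the attached
    cprofile_*.txt comparison files could be produced.
    """
    pr = cProfile.Profile()
    pr.enable()
    func()
    pr.disable()
    buf = io.StringIO()
    # sort by cumulative time so the slowest call chains appear first
    pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(top_n)
    with open(out_path, "w") as f:
        f.write(buf.getvalue())
```

Running this once against the old write_to_csv path and once against the create_export_csv path gives directly comparable stat dumps.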

> There are some logging lines that are being removed -- can we replace them with similar stuff in create_export_csv()?

There was one, the warning for se, which I just re-added, along with logging the number of rows in the export.

@dshemetov dshemetov (Contributor):

Honestly my bad for not documenting the darker workflow better: make your code changes, commit them, then run make format in the indicator you're changing, let it make its formatting changes, and commit those separately. I didn't think this would be necessary, since darker's whole premise is that it only formats the lines you touch, but it turned out to be more aggressive about re-sorting our imports than I expected.

@aysim319 aysim319 (Contributor Author) commented Sep 9, 2024:

> Honestly my bad for not documenting the darker workflow better: make your code changes, commit them, then run make format in the indicator you're changing, let it make its formatting changes, and commit those separately. I didn't think this would be necessary, since darker's whole premise is that it only formats the lines you touch, but it turned out to be more aggressive about re-sorting our imports than I expected.

I've been running the linter more frequently, so some of the commits also have formatting changes; the CI run checks for lints before the tests, and I like to see all the checks pass on my commits. But I can work on isolating the functional changes from the linting changes in future commits.

That said, the "Files changed" page by default shows all the changes that were made, unless you manually filter out the lint commits. At least that's how I understand the "Files changed" page works... Is there a workaround that doesn't involve manual filtering?

@dshemetov dshemetov (Contributor):

> Is there a workaround that doesn't involve manual filtering?

Not that I know of, but oftentimes looking through the individual commits is good enough for me.

else:
group["se"] = np.NaN
group["sample_size"] = np.NaN
df_list.append(group)
@dshemetov dshemetov (Contributor), Sep 17, 2024:

Rereading this bit, I'm thinking df_list seems unnecessary. These are all just asserts that terminate the pipeline if any values fail validation, so we should just run the asserts without rebuilding the df. The only differences between filtered_df and output_df are the group["se"] = np.NaN and group["sample_size"] = np.NaN transformations, but those are independent of group, so they can be handled outside the for loop. It might even make sense to handle

        filtered_df = df[df["incl"]]
        filtered_df = filtered_df.reset_index()
        filtered_df.rename(columns={"rate": "val"}, inplace=True)
        filtered_df["timestamp"] = filtered_df["timestamp"].astype(str)

and the NA additions at the end of update_indicator, call this function validate_dataframe, and have it not return anything.

@dshemetov dshemetov (Contributor), Sep 17, 2024:

Initially, though, I'd just try removing df_list and output_df and doing something like this after the for loop with all the assert statements:

if not self.write_se:
    filtered_df["se"] = np.NaN
filtered_df["sample_size"] = np.NaN
filtered_df.drop(columns=["incl"], inplace=True)
assert sorted(filtered_df.columns) == ["geo_id", "sample_size", "se", "timestamp", "val"]
return filtered_df

@dshemetov dshemetov (Contributor), Sep 17, 2024:

Curious (a) whether that works, and (b) how much it speeds things up. I'd guess this for loop is really expensive because it runs over all counties (update_indicator does too, but at least we parallelize that). There may be a way to avoid the for loop using DataFrame methods, but we can get there later.
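A vectorized version of this idea might look like the sketch below: run the validation assert on the whole frame at once and apply the se/sample_size transformations outside any loop. The function name validate_output and the exact columns are assumptions based on the snippets above, not the merged code:

```python
import numpy as np
import pandas as pd


def validate_output(df: pd.DataFrame, write_se: bool = False) -> pd.DataFrame:
    """Hypothetical vectorized replacement for the per-geo for loop."""
    out = df[df["incl"]].copy()
    if write_se:
        # values flagged for inclusion must have positive rates and errors
        assert (out["val"] > 0).all() and (out["se"] > 0).all(), "p=0, std_err=0 invalid"
    else:
        out["se"] = np.nan
    out["sample_size"] = np.nan
    return out.drop(columns=["incl"])
```

Since the checks and column assignments are column-wise, pandas can do them in one pass over the frame instead of once per geo group.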

Contributor Author:

Actually, create_export_csv is what's slowing down the runs. The preprocessing takes about half the time of the previous write_to_csv; it's the delphi_utils create_export_csv itself that's slower than the previous version.
Still, I moved the NaN columns outside of the loop.

Contributor:

Gotcha, I see what you mean. I'm really surprised create_export_csv is that much slower; that's a bit unfortunate.

@aysim319 aysim319 (Contributor Author):

Optimization not really panning out; closing.

@aysim319 aysim319 closed this Nov 19, 2024