Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling suppressed cells vs missing values #4

Open
digicosmos86 opened this issue Oct 1, 2021 · 5 comments
Open

Handling suppressed cells vs missing values #4

digicosmos86 opened this issue Oct 1, 2021 · 5 comments

Comments

@digicosmos86
Copy link
Contributor

@edwardhuh and @khwilson

I saw the output of this helper function was changed:

def _combine_n_pct(
    df_n: pd.DataFrame,
    df_pct: pd.DataFrame,
    digits: int = 1,
    suppress_symbol: str = "*",
) -> pd.DataFrame:
    """
    Helper function that formats and combines n and pct values
    """
    df_n_str = df_n.astype(str)
    df_pct_str = df_pct.applymap(lambda x: f" ({x:.{digits}%})", na_action="ignore")
    return df_n_str.add(df_pct_str).fillna(f"{suppress_symbol}")

Maybe I am missing something, but it seems that the last fillna() step will also apply the cell suppression symbol to the cells that are actually missing? I generally leave missing as-is, which allows something like df.zen.pretty().format(na_rep="N/A") for the user. Wondering if using df.loc with the suppression symbol might be a good idea?

@khwilson
Copy link
Contributor

khwilson commented Oct 1, 2021

I was ok with this change originally b/c there was a strict check in the code to make sure there were NO NA values anywhere in the original dataframe, though I may have mistaken the control flow.

I'm not sure exactly what it would mean to do suppression with NA values. Perhaps the compromise is:

  • If suppress == True in the pretty call, require no NA values. Do suppression as usual.
  • If suppress == False allow NA values
  • Pass the suppression mask around to all these helper functions? Kind of annoying, but there's no other good way to create a "special symbol." Unless the new pd.NA supports something like Stata's different missing value markers?

@edwardhuh
Copy link
Contributor

So, after discussing this problem with Paul in person, I have realized that there is a case where an individual may want to calculate the mean and sd of a row/column that has no entry.
These values will be NA. Just like how we want to retain 0s in our frequency counts, Paul convinced me that the NA mean and standard deviation for sparse columns/rows should be maintained, instead of being suppressed or erroneously converted to 0.

The code solving this will have a separate search for columns with n=0 when calculating sd and mean, and ensure that these NaN values are retained.

@khwilson
Copy link
Contributor

khwilson commented Oct 1, 2021

I would greatly appreciate a test that explains this situation :-)

@digicosmos86
Copy link
Contributor Author

Let me clarify this a little bit to the best of my understanding. @edwardhuh please chime in if this is not accurate.

In the case of this helper function combine_n_pct, Ed explained that in freq_table, the NAs are reported as 0s, which is equivalent, so there will not be NAs produced before fillna with the suppression symbol. I totally agree with this solution (if suppress == True then allow NA values to ensure compatibility with pandas.Styler.

In the case of mean_sd_table, on the other hand, 0s are not equivalent to NAs, so I think a proper way of distinguishing NAs from 0s should be established. An example could be the corner case of a table with one column that only has one value, and that value is suppressed (imagine, an EL cohort with only one program in which there are fewer than 5 students and no students in the rest of them). One representation of this column could be one cell with a suppression symbol and the rest being NAs. From what Ed described to me, it seems all values in this column are going to be a suppress symbol, which in some cases might be more desirable in a privacy-protection perspective, but the option to allow NAs to be shown should also be available to the users.

@khwilson
Copy link
Contributor

khwilson commented Oct 4, 2021

Ah, this does make sense. I think that we should schedule some time in an upcoming sprint where these concerns are first written out as tests. 😄

This is one of the cases where I feel like test-driven development does make the most sense!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants