Handling suppressed cells vs missing values #4

digicosmos86 · 2021-10-01T13:42:27Z

I saw the output of this helper function was changed:

def _combine_n_pct(
    df_n: pd.DataFrame,
    df_pct: pd.DataFrame,
    digits: int = 1,
    suppress_symbol: str = "*",
) -> pd.DataFrame:
    """
    Helper function that formats and combines n and pct values
    """
    df_n_str = df_n.astype(str)
    df_pct_str = df_pct.applymap(lambda x: f" ({x:.{digits}%})", na_action="ignore")
    return df_n_str.add(df_pct_str).fillna(f"{suppress_symbol}")

Maybe I am missing something, but it seems that the last fillna() step will also apply the cell suppression symbol to the cells that are actually missing? I generally leave missing as-is, which allows something like df.zen.pretty().format(na_rep="N/A") for the user. Wondering if using df.loc with the suppression symbol might be a good idea?

The text was updated successfully, but these errors were encountered:

khwilson · 2021-10-01T16:13:45Z

I was ok with this change originally b/c there was a strict check in the code to make sure there were NO NA values anywhere in the original dataframe, though I may have mistaken the control flow.

I'm not sure exactly what it would mean to do suppression with NA values. Perhaps the compromise is:

If suppress == True in the pretty call, require no NA values. Do suppression as usual.
If suppress == False allow NA values
Pass the suppression mask around to all these helper functions? Kind of annoying, but there's no other good way to create a "special symbol." Unless the new pd.NA supports something like Stata's different missing value markers?

edwardhuh · 2021-10-01T16:25:29Z

So, after discussing this problem with Paul in person, I have realized that there is a case where an individual may want to calculate the mean and sd of a row/column that has no entry.
These values will be NA. Just like how we want to retain 0s in our frequency counts, Paul convinced me that the NA mean and standard deviation for sparse columns/rows should be maintained, instead of being suppressed or erroneously converted to 0.

The code solving this will have a separate search for columns with n=0 when calculating sd and mean, and ensure that these NaN values are retained.

khwilson · 2021-10-01T20:52:51Z

I would greatly appreciate a test that explains this situation :-)

digicosmos86 · 2021-10-04T13:53:18Z

Let me clarify this a little bit to the best of my understanding. @edwardhuh please chime in if this is not accurate.

In the case of this helper function combine_n_pct, Ed explained that in freq_table, the NAs are reported as 0s, which is equivalent, so there will not be NAs produced before fillna with the suppression symbol. I totally agree with this solution (if suppress == True then allow NA values to ensure compatibility with pandas.Styler.

In the case of mean_sd_table, on the other hand, 0s are not equivalent to NAs, so I think a proper way of distinguishing NAs from 0s should be established. An example could be the corner case of a table with one column that only has one value, and that value is suppressed (imagine, an EL cohort with only one program in which there are fewer than 5 students and no students in the rest of them). One representation of this column could be one cell with a suppression symbol and the rest being NAs. From what Ed described to me, it seems all values in this column are going to be a suppress symbol, which in some cases might be more desirable in a privacy-protection perspective, but the option to allow NAs to be shown should also be available to the users.

khwilson · 2021-10-04T16:03:09Z

Ah, this does make sense. I think that we should schedule some time in an upcoming sprint where these concerns are first written out as tests. 😄

This is one of the cases where I feel like test-driven development does make the most sense!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handling suppressed cells vs missing values #4

Handling suppressed cells vs missing values #4

digicosmos86 commented Oct 1, 2021

khwilson commented Oct 1, 2021

edwardhuh commented Oct 1, 2021

khwilson commented Oct 1, 2021

digicosmos86 commented Oct 4, 2021

khwilson commented Oct 4, 2021

Handling suppressed cells vs missing values #4

Handling suppressed cells vs missing values #4

Comments

digicosmos86 commented Oct 1, 2021

khwilson commented Oct 1, 2021

edwardhuh commented Oct 1, 2021

khwilson commented Oct 1, 2021

digicosmos86 commented Oct 4, 2021

khwilson commented Oct 4, 2021