Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes pd.concat error halting processing #972

Merged
merged 5 commits into from
Oct 18, 2024

Conversation

MaxGamill-Sheffield
Copy link
Collaborator

@MaxGamill-Sheffield MaxGamill-Sheffield commented Oct 17, 2024

closes #969

TLDR:
PR adds the following:

  • Rejigging tracing stats outputs so base grainstats / molstats / None is returned.
  • Enables folder stats to be output properly.
  • drops empty stats columns causing concat warning.

Think I got it. The cause according to the docs is that pandas doesn't like concatenating objects with whole columns (or rows) as NaN's or None's (we had a mix which is why it was hard to find).

Not only that, but for some reason the "warning" also halts processing.

The initial fix I tried was to stop the warning using the below code which seemed to work, so maybe a 🔥hotfix🔥 idea for something in the future:

import warnings
...
warnings.filterwarnings("ignore")

Then I thought... WWND (what would @ns-rse do) 👼, and so I solved the root issue and stopped the message from popping up in the first place by dropping all columns with all values as None or NaN's, and applied this across:

  • grainstats
  • disordered_tracing_stats
  • molstats
  • image_stats
    so this shouldn't happen again and we will still be told of depreciation warnings.

Now I've tested this slightly to ensure that if a column is present in df1 but not df2, it will be added, but filled with NaNs for the values in the resultant_df for df2.

@SylviaWhittle
Copy link
Collaborator

Hey, I've read the PR and it looks great, just two things:

1 - Trying to replicate the issue in a python file out of curiosity, pandas seems to be allowing concatenation of data frames with columns / rows with np.nans in - what am I doing wrong here? 😅

image image

2 - Removing the columns that contain all np.nan values - could this lead to incomplete grainstats.csvs? Would this cause an issue from a user perspective? (ie trying to plot a column present in some grainstats.csvs but not in others) Should we have a final check before saving it to ensure all the columns are populated and if not, fill with np.nan?

@MaxGamill-Sheffield
Copy link
Collaborator Author

1 - Trying to replicate the issue in a python file out of curiosity, pandas seems to be allowing concatenation of data frames with columns / rows with np.nans in - what am I doing wrong here? 😅

Yep! I tried the same and it was driving me mad. I still have no idea of the difference 🤷‍♂️

2 - Removing the columns that contain all np.nan values - could this lead to incomplete grainstats.csvs? Would this cause an issue from a user perspective? (ie trying to plot a column present in some grainstats.csvs but not in others) Should we have a final check before saving it to ensure all the columns are populated and if not, fill with np.nan?

And also yes, if somehow the same column in all images returned None / NaN 's then that column would be missing. This would cause a problem when searching for that column to plot, but then again so would trying to plot values for a column of all NaN's.
So I guess it's up to us wether we want to add it. Which side of the fence do you lie @SylviaWhittle?

@SylviaWhittle
Copy link
Collaborator

1 - Trying to replicate the issue in a python file out of curiosity, pandas seems to be allowing concatenation of data frames with columns / rows with np.nans in - what am I doing wrong here? 😅

Yep! I tried the same and it was driving me mad. I still have no idea of the difference 🤷‍♂️

2 - Removing the columns that contain all np.nan values - could this lead to incomplete grainstats.csvs? Would this cause an issue from a user perspective? (ie trying to plot a column present in some grainstats.csvs but not in others) Should we have a final check before saving it to ensure all the columns are populated and if not, fill with np.nan?

And also yes, if somehow the same column in all images returned None / NaN 's then that column would be missing. This would cause a problem when searching for that column to plot, but then again so would trying to plot values for a column of all NaN's. So I guess it's up to us wether we want to add it. Which side of the fence do you lie @SylviaWhittle?

I say that 1 - this is unlikely and 2 - this would cause an error anyways so from a user perspective then it isn't really solving anything so until someone complains and it becomes a legit issue, let's not try to preempt something that might not be an issue?

Copy link
Collaborator

@SylviaWhittle SylviaWhittle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems good to me, let's just keep an eye out for missing column errors if users experience that.

Also I'm very in favour of the WWND

@llwiggins
Copy link
Collaborator

Thanks for your nice detective work on this @MaxGamill-Sheffield! 🕵️ This seems a sensible fix to me and enables the data set I mentioned in issue #969 to run through to completion and successfully produces the all_statistics.csv.

Happy for this to be merged!

@MaxGamill-Sheffield
Copy link
Collaborator Author

I've added to the usage docs in the the tracing outputs which were missing, and the warning about the columns so it is documented somewhere not in our brains :)

Copy link
Collaborator

@SylviaWhittle SylviaWhittle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't have remembered to add documentation 👍

@MaxGamill-Sheffield MaxGamill-Sheffield added this pull request to the merge queue Oct 18, 2024
Merged via the queue into main with commit ee0fb0f Oct 18, 2024
11 checks passed
@MaxGamill-Sheffield MaxGamill-Sheffield deleted the maxgamill-sheffield/969-concat-issue branch October 18, 2024 10:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: Concatenation warning that stops all_statistics.csv from being produced
3 participants