-
Notifications
You must be signed in to change notification settings - Fork 194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handling NaNs from ElementProperty #898
Comments
Hey @gbrunin let me address some of these. For your first bullet point, I think allowing the users to filter features as they see fit using pandas operations is probably best. For the second bullet, this strikes me as potentially nefarious because it can result in very strange values and the user would have to dig deep to figure out where it came from. The postprocessing might just be more easily handled by the users as they see fit. I do like your third idea:
I believe the technical term for this is "imputation". Note you can already impute features with pandas when doing ML experiments with matminer. That being said, your idea looks like a better version of imputation because we would be retaining as much information as possible about elementproperty attributes while avoiding nans. Why this seems usefulSo, normally, lets say you were using elementproperty to get mean
Although
I understand you want to impute the elemental attribute before statistics are computed over the individual composition. This would, for example, use
Closing thoughtsit might be nice to see some examples of this third working in action... i.e., to see if this kind of imputation would actually be more useful than just pandas.impute.... To save you some time, if you are still interested in making a PR, I'd focus on your third idea |
Thanks for your comments @ardunn. I agree that the third option is best, I just wanted to lay out the possible ways to handle these NaNs that I could see. I'll start working on a PR for this imputation, and we'll see what it leads to. Note that in any case, if we have let's say Ga2O3 and Ga3O5, those would have the same value for the feature if the average over the column is used, while they would have different feature values if Ga is first sorted out. So it should make more sense. |
Hi, I have (finally) made some tests with the 3rd option that has been discussed here. Tagging @ml-evs since he was also part of the process and developments. Here are the MAE on three of the Matbench datasets, using MODNet models:
For the steels, I think most compounds have pretty basic elements for which most features are known, so imputing does not change the featurized dataset much. I have not checked this though. I believe that imputing in this way overall does lead to better results with features that make more sense. The implementation is already present, it has been merged in #892. However in this PR we decided to allow for the possibility to impute but to keep the default behavior ( Any thoughts? |
When using an
ElementProperty
featurizer, some elemental data may not be present, e.g., the bulk modulus of Ga is not in the GaElement
of pymatgen. In such a case, the featurizer will return a NaN.There are different ways to handle such a case, and for now it is left to the user to handle it in whatever way they prefer. Basically, the possible approaches there would be:
I believe that these two possibilities could be implemented in matminer as some kind of post-processing step, that could be used by the user or not. This is arguable because it could be left to the user to handle these.
I see another possibility that could be implemented in matminer and that the user has no easy way to do. The
ElementProperty
could, when a value for an element is not found in the data, replace it by the mean of the values for all other elements. This is different from using the mean of the feature over the total dataset, in the sense that it is not biased by the dataset (the user could want one or the other), and that nothing is to be done for new predictions since this treatment is internal to the featurizer. TheElementProperty
featurizer would not return NaNs for missing-data reasons. This could be triggered with an optional argument, e.g.,ElementProperty(missing_is_mean=True)
.If you think this is a good addition to matminer, I would be happy to submit a PR with whichever solution you think is best. I would actually be in favor of implementing all of them to leave the choice to the user, but make the users life easier.
The text was updated successfully, but these errors were encountered: