Handling NaNs from ElementProperty #898

gbrunin · 2023-04-13T06:44:05Z

When using an ElementProperty featurizer, some elemental data may not be present, e.g., the bulk modulus of Ga is not in the Ga Element of pymatgen. In such a case, the featurizer will return a NaN.
There are different ways to handle such a case, and for now it is left to the user to handle it in whatever way they prefer. Basically, the possible approaches there would be:

ignore such a feature entirely, which can be a pity if only a small fraction of the dataset presents a NaN for this feature
replace the feature by a constant value, either one that is completely outside the range of possibilities, or the mean of the feature over the dataset. The former has the advantage that it is simple to set in place, while for the latter the mean of the feature should be stored somewhere to be re-used in the case the feature is NaN for a new prediction (that may not have access to the original dataset).
I believe that these two possibilities could be implemented in matminer as some kind of post-processing step, that could be used by the user or not. This is arguable because it could be left to the user to handle these.

I see another possibility that could be implemented in matminer and that the user has no easy way to do. The ElementProperty could, when a value for an element is not found in the data, replace it by the mean of the values for all other elements. This is different from using the mean of the feature over the total dataset, in the sense that it is not biased by the dataset (the user could want one or the other), and that nothing is to be done for new predictions since this treatment is internal to the featurizer. The ElementProperty featurizer would not return NaNs for missing-data reasons. This could be triggered with an optional argument, e.g., ElementProperty(missing_is_mean=True).

If you think this is a good addition to matminer, I would be happy to submit a PR with whichever solution you think is best. I would actually be in favor of implementing all of them to leave the choice to the user, but make the users life easier.

The text was updated successfully, but these errors were encountered:

ardunn · 2023-06-08T06:33:52Z

Hey @gbrunin let me address some of these. For your first bullet point, I think allowing the users to filter features as they see fit using pandas operations is probably best. For the second bullet, this strikes me as potentially nefarious because it can result in very strange values and the user would have to dig deep to figure out where it came from. The postprocessing might just be more easily handled by the users as they see fit.

I do like your third idea:

I see another possibility that could be implemented in matminer and that the user has no easy way to do. The ElementProperty could, when a value for an element is not found in the data, replace it by the mean of the values for all other elements. This is different from using the mean of the feature over the total dataset, in the sense that it is not biased by the dataset (the user could want one or the other), and that nothing is to be done for new predictions since this treatment is internal to the featurizer. The ElementProperty featurizer would not return NaNs for missing-data reasons. This could be triggered with an optional argument, e.g., ElementProperty(missing_is_mean=True).

I believe the technical term for this is "imputation". Note you can already impute features with pandas when doing ML experiments with matminer. That being said, your idea looks like a better version of imputation because we would be retaining as much information as possible about elementproperty attributes while avoiding nans.

Why this seems useful

So, normally, lets say you were using elementproperty to get mean <feature1> of a composition, and is not defined for Ga (I'm using made up numbers here). You'd get:

composition	ElementProperty mean feature 1
NbCoSn	12
Ga2O3	nan
SiO2	16

Although <feature1> was defined for oxygen, the information about oxygen was lost for Ga2O3. Then if you were to do "normal" imputation (from the defined samples, let's say by mean):

composition	ElementProperty mean feature 1
NbCoSn	12
Ga2O3	14 (does not use `<feature1>` for O)
SiO2	16

I understand you want to impute the elemental attribute before statistics are computed over the individual composition. This would, for example, use <feature1> for oxygen and would use the mean of <feature1> as <feature1> for Ga. Then, instead, you might get something more accurate like:

composition	ElementProperty mean feature 1
NbCoSn	12
Ga2O3	19 (does use `<feature1>` for O)
SiO2	16

Closing thoughts

it might be nice to see some examples of this third working in action... i.e., to see if this kind of imputation would actually be more useful than just pandas.impute....

To save you some time, if you are still interested in making a PR, I'd focus on your third idea

gbrunin · 2023-06-08T06:53:30Z

Thanks for your comments @ardunn. I agree that the third option is best, I just wanted to lay out the possible ways to handle these NaNs that I could see. I'll start working on a PR for this imputation, and we'll see what it leads to.

Note that in any case, if we have let's say Ga2O3 and Ga3O5, those would have the same value for the feature if the average over the column is used, while they would have different feature values if Ga is first sorted out. So it should make more sense.

gbrunin · 2024-05-16T15:02:37Z

Hi,

I have (finally) made some tests with the 3rd option that has been discussed here. Tagging @ml-evs since he was also part of the process and developments. Here are the MAE on three of the Matbench datasets, using MODNet models:

	`impute_nan=False`	`impute_nan=True`
Steels	92.59	93.69 (+1.2%)
Expt Eg	0.3399	0.3238 (-4.7%)
Dielectric	0.2793	0.2662 (-4.2%)

For the steels, I think most compounds have pretty basic elements for which most features are known, so imputing does not change the featurized dataset much. I have not checked this though. I believe that imputing in this way overall does lead to better results with features that make more sense.

The implementation is already present, it has been merged in #892. However in this PR we decided to allow for the possibility to impute but to keep the default behavior (impute_nan=False + warning message) because we had not tested it yet. Now, I would be in favor to change the default to True, though that will change the value of multiple features for some compounds.

Any thoughts?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handling NaNs from ElementProperty #898

Handling NaNs from ElementProperty #898

gbrunin commented Apr 13, 2023

ardunn commented Jun 8, 2023

gbrunin commented Jun 8, 2023 •

edited

Loading

gbrunin commented May 16, 2024

Handling NaNs from ElementProperty #898

Handling NaNs from ElementProperty #898

Comments

gbrunin commented Apr 13, 2023

ardunn commented Jun 8, 2023

Why this seems useful

Closing thoughts

gbrunin commented Jun 8, 2023 • edited Loading

gbrunin commented May 16, 2024

gbrunin commented Jun 8, 2023 •

edited

Loading