Non-stratified splitting and overwriting of loss function in classification tasks #228

Open
kaueltzen opened this issue Oct 20, 2024 · 6 comments · May be fixed by #234
@kaueltzen (Contributor)

Hello,

I'd like to report two issues regarding classification tasks in modnet:

First, the loss function passed to ModnetModel().fit() is overwritten with "categorical_crossentropy" whenever val_data is not None and self.multi_label is False:

if self.num_classes[prop[0]] >= 2:  # Classification
    targ = prop[0]
    if self.multi_label:
        y_inner = np.stack(val_data.df_targets[targ].values)
        if loss is None:
            loss = "binary_crossentropy"
    else:
        y_inner = tf.keras.utils.to_categorical(
            val_data.df_targets[targ].values,
            num_classes=self.num_classes[targ],
        )
        loss = "categorical_crossentropy"

As the loss=None case is already handled earlier, in L352-L360 of the training-data preprocessing, maybe this overwrite could simply be removed from the validation-data preprocessing?
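For illustration, a minimal sketch of the behaviour the fix aims for (the helper name here is hypothetical, not part of modnet): a user-supplied loss always wins, and the string defaults are only applied when no loss was given.

from typing import Callable, Optional, Union

def resolve_classification_loss(
    loss: Optional[Union[str, Callable]], multi_label: bool
) -> Union[str, Callable]:
    """Return the user-supplied loss unchanged, or a default for classification."""
    if loss is not None:
        return loss  # never overwrite an explicitly chosen loss
    return "binary_crossentropy" if multi_label else "categorical_crossentropy"

assert resolve_classification_loss("poisson", multi_label=False) == "poisson"
assert resolve_classification_loss(None, multi_label=False) == "categorical_crossentropy"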

Second, if nested=False, both FitGenetic and ModnetModel.fit_preset() perform a train/test split that is not stratified:

train_test_split(range(len(data.df_featurized)), test_size=val_fraction)

splits = [
    train_test_split(
        range(len(self.train_data.df_featurized)), test_size=val_fraction
    )
]

This is an issue for imbalanced datasets, and it would be helpful if the split were stratified for classification tasks.
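For illustration, a self-contained sketch (toy labels, not modnet code) of how passing the class labels via stratify keeps the class balance in a hold-out split:

import numpy as np
from sklearn.model_selection import train_test_split

y = np.zeros(100, dtype=int)
y[:10] = 1  # imbalanced toy labels: 10% positives

train_idx, val_idx = train_test_split(
    np.arange(len(y)),
    test_size=0.1,
    stratify=y,        # preserve class proportions in both splits
    random_state=42,
)
print(np.bincount(y[val_idx]))  # 9 negatives and 1 positive in the hold-out set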

If you are interested, I'm happy to raise a PR with fixes.

@kaueltzen (Contributor, Author)

By the way, I just noticed that the validation split in Keras is always taken from the last samples provided to Model.fit().
https://www.tensorflow.org/versions/r2.11/api_docs/python/tf/keras/Model
This could be an issue when passing training_data to ModnetModel.fit() that is sorted by label (regardless of whether it is a classification or regression task).
So, if val_data=None, a shuffling of the training data before calling

history = self.model.fit(**fit_params)

would make sense in the regression case, while for classification, a stratified split based on val_fraction should maybe already happen inside ModnetModel.fit(). Thoughts?
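For illustration, a self-contained sketch (toy data and names, not the modnet implementation) of the two options described above, given that Keras takes the validation split from the last samples:

import numpy as np
from sklearn.model_selection import train_test_split

x = np.random.rand(100, 5)
y = np.concatenate([np.zeros(90), np.ones(10)])  # label-sorted, imbalanced toy data
val_fraction = 0.1

# Regression-style option: shuffle so the tail used by Keras' validation_split
# is not a label-sorted block of samples.
perm = np.random.default_rng(0).permutation(len(x))
x_shuffled, y_shuffled = x[perm], y[perm]

# Classification-style option: explicit stratified hold-out, passed to Keras via
# validation_data instead of validation_split.
train_idx, val_idx = train_test_split(
    np.arange(len(x)), test_size=val_fraction, stratify=y, random_state=0
)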

@ppdebreuck (Owner) commented Oct 21, 2024

You are correct on all three points. For the second, I agree that adding stratification by default makes sense, as it follows what we do for k-fold (i.e. StratifiedKFold). The last one indeed only applies when a float split is used as input. I would either add a warning to the docs saying the split is taken from the last part, or, better, do as you suggest to closely mimic what is done for the k-folds and hold-out.

You could perhaps combine points 2/3 by defining a hold-out split function taking a fraction as input and handling shuffling or stratification, similar to the kfold-split here:

def matbench_kfold_splits(data: MODData, n_splits=5, classification=False):
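For illustration, one possible shape for such a hold-out helper (the name, signature and defaults are hypothetical, deliberately simplified to take a sample count rather than a MODData):

import numpy as np
from sklearn.model_selection import train_test_split

def holdout_split(n_samples, val_fraction=0.1, classification=False, targets=None, random_state=42):
    """Return (train_idx, val_idx); stratified on `targets` for classification tasks."""
    return train_test_split(
        np.arange(n_samples),
        test_size=val_fraction,
        shuffle=True,
        stratify=targets if classification else None,
        random_state=random_state,
    )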

Happy to have a PR on this, thanks !

@kaueltzen (Contributor, Author)

Thanks for your answer! One more point regarding the overwriting of loss functions that would also be good to have in the PR: in evaluate of ModnetModel, for classification, the passed loss function is ignored and the negative ROC AUC is always returned.

def evaluate(
    self,
    test_data: MODData,
    loss: Union[str, Callable] = "mae",
) -> pd.DataFrame:
    """Evaluates predictions on the passed MODData by returning the corresponding score:
    - for regression: loss function provided in loss argument. Defaults to mae.
    - for classification: negative ROC AUC.
    averaged over the targets when multi-target.

    Parameters:
        test_data: A featurized and feature-selected `MODData`
            object containing the descriptors used in training.

    Returns:
        Score defined hereabove.
    """

In the evaluation of individuals in FitGenetic, this method is also used,
self.val_loss = model.evaluate(
    val_data,
    loss=self.genes["loss"],
)

so fitting and validation may be done with different metrics.

If you agree, I would also like to implement support for passing a loss function to evaluate for classification tasks.
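For illustration, a rough sketch (hypothetical helper, binary case only, not the modnet API) of how a passed loss could be honoured for classification, falling back to the negative ROC AUC only when no loss is given:

from sklearn.metrics import log_loss, roc_auc_score

def classification_score(y_true, y_prob, loss=None):
    """Score binary class probabilities with the requested loss, or -ROC AUC by default."""
    if loss is None:
        return -roc_auc_score(y_true, y_prob)  # current default behaviour
    if callable(loss):
        return loss(y_true, y_prob)
    if loss == "binary_crossentropy":
        return log_loss(y_true, y_prob)
    raise ValueError(f"Unsupported loss: {loss!r}")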

@kaueltzen (Contributor, Author)

Hi @ppdebreuck, one more thing: while looking into stratifying the bootstrapping in fit of EnsembleMODNetModel, I noticed that the same bootstrapped sample is drawn from the training data self.n_models times, because the random_state is always the same:

if self.bootstrap:
    LOG.info("Generating bootstrap data...")
    train_datas = [
        training_data.split(
            (
                resample(
                    np.arange(len(training_data.df_targets)),
                    replace=True,
                    random_state=2943,
                ),
                [],
            )
        )[0]
        for _ in range(self.n_models)
    ]

Is this behaviour intended? If not, I'd replace it with something that is still reproducible but creates self.n_models different samples.
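For illustration, one reproducible way to draw distinct samples per model (the base_seed and loop names are illustrative, not the modnet code) is to derive a different seed per model from a single base seed:

import numpy as np
from sklearn.utils import resample

n_samples, n_models, base_seed = 100, 5, 2943
bootstrap_indices = [
    resample(np.arange(n_samples), replace=True, random_state=base_seed + i)
    for i in range(n_models)
]
# Re-running the script reproduces the same draws, but each model now gets its own sample.
assert not all(np.array_equal(bootstrap_indices[0], b) for b in bootstrap_indices[1:])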

@ml-evs (Collaborator) commented Oct 26, 2024

Thanks for noticing this -- it looks like this random state has long overstayed its welcome and has perhaps been reducing model performance for a few years (!). I'll raise a separate issue and open a PR fixing this.

@ppdebreuck (Owner)

@ml-evs It's indeed not the intended behaviour, but note that the GA never really uses bootstrapping; it only comes into play when fitting an EnsembleModel from scratch, so we never ran into this issue.

I would turn off bootstrapping altogether by default; just having different initial weights is usually enough, or even better.
