Non-stratified splitting and overwriting of loss function in classification tasks #228

Open
kaueltzen opened this issue Oct 20, 2024 · 6 comments · May be fixed by #234
@kaueltzen (Contributor)

Hello,

I'd like to report two issues regarding classification tasks in modnet:

First, the loss function passed to ModnetModel().fit() is overwritten with "categorical_crossentropy" whenever val_data is not None and self.multi_label is False:

if self.num_classes[prop[0]] >= 2:  # Classification
    targ = prop[0]
    if self.multi_label:
        y_inner = np.stack(val_data.df_targets[targ].values)
        if loss is None:
            loss = "binary_crossentropy"
    else:
        y_inner = tf.keras.utils.to_categorical(
            val_data.df_targets[targ].values,
            num_classes=self.num_classes[targ],
        )
        loss = "categorical_crossentropy"

As the loss=None case is already handled earlier, in L352-L360 of the training-data preprocessing, maybe this overwrite could simply be removed from the validation-data preprocessing?
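For illustration, a minimal sketch of the behaviour the fix aims for (the helper name here is hypothetical, not part of modnet): a user-supplied loss always wins, and the string defaults are only applied when no loss was given.

from typing import Callable, Optional, Union

def resolve_classification_loss(
    loss: Optional[Union[str, Callable]], multi_label: bool
) -> Union[str, Callable]:
    """Return the user-supplied loss unchanged, or a default for classification."""
    if loss is not None:
        return loss  # never overwrite an explicitly chosen loss
    return "binary_crossentropy" if multi_label else "categorical_crossentropy"

assert resolve_classification_loss("poisson", multi_label=False) == "poisson"
assert resolve_classification_loss(None, multi_label=False) == "categorical_crossentropy"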

Second, if nested=False, both FitGenetic and ModnetModel.fit_preset() perform a train/test split that is not stratified:

train_test_split(range(len(data.df_featurized)), test_size=val_fraction)

splits = [
    train_test_split(
        range(len(self.train_data.df_featurized)), test_size=val_fraction
    )
]

This is an issue for imbalanced datasets, and it would be helpful if the split were stratified for classification tasks.
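For illustration, a self-contained sketch (toy labels, not modnet code) of how passing the class labels via stratify keeps the class balance in a hold-out split:

import numpy as np
from sklearn.model_selection import train_test_split

y = np.zeros(100, dtype=int)
y[:10] = 1  # imbalanced toy labels: 10% positives

train_idx, val_idx = train_test_split(
    np.arange(len(y)),
    test_size=0.1,
    stratify=y,        # preserve class proportions in both splits
    random_state=42,
)
print(np.bincount(y[val_idx]))  # 9 negatives and 1 positive in the hold-out set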

If you are interested, I'm happy to raise a PR with fixes.

@kaueltzen (Contributor, Author)

By the way, I just noticed that the validation split in Keras is always taken from the last samples provided to Model.fit().
https://www.tensorflow.org/versions/r2.11/api_docs/python/tf/keras/Model
This could be an issue when passing training_data to ModnetModel.fit() that is sorted by label (regardless of whether it is a classification or regression task).
So, if val_data=None, a shuffling of the training data before calling

history = self.model.fit(**fit_params)

would make sense in the regression case, while for classification, a stratified split based on val_fraction should maybe already happen inside ModnetModel.fit(). Thoughts?
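For illustration, a self-contained sketch (toy data and names, not the modnet implementation) of the two options described above, given that Keras takes the validation split from the last samples:

import numpy as np
from sklearn.model_selection import train_test_split

x = np.random.rand(100, 5)
y = np.concatenate([np.zeros(90), np.ones(10)])  # label-sorted, imbalanced toy data
val_fraction = 0.1

# Regression-style option: shuffle so the tail used by Keras' validation_split
# is not a label-sorted block of samples.
perm = np.random.default_rng(0).permutation(len(x))
x_shuffled, y_shuffled = x[perm], y[perm]

# Classification-style option: explicit stratified hold-out, passed to Keras via
# validation_data instead of validation_split.
train_idx, val_idx = train_test_split(
    np.arange(len(x)), test_size=val_fraction, stratify=y, random_state=0
)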

@ppdebreuck (Owner) commented Oct 21, 2024

You are correct on all three points. For the second, I agree that adding stratification by default makes sense, as it follows what we do for k-fold (i.e. StratifiedKFold). The last one indeed only applies when a float split is used as input. I would either add a warning to the docs saying the split is taken from the last part, or, better, do as you suggest to closely mimic what is done for the k-folds and hold-out.

You could perhaps combine points 2/3 by defining a hold-out split function taking a fraction as input and handling shuffling or stratification, similar to the kfold-split here:

def matbench_kfold_splits(data: MODData, n_splits=5, classification=False):
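For illustration, one possible shape for such a hold-out helper (the name, signature and defaults are hypothetical, deliberately simplified to take a sample count rather than a MODData):

import numpy as np
from sklearn.model_selection import train_test_split

def holdout_split(n_samples, val_fraction=0.1, classification=False, targets=None, random_state=42):
    """Return (train_idx, val_idx); stratified on `targets` for classification tasks."""
    return train_test_split(
        np.arange(n_samples),
        test_size=val_fraction,
        shuffle=True,
        stratify=targets if classification else None,
        random_state=random_state,
    )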

Happy to have a PR on this, thanks !

@kaueltzen (Contributor, Author)

Thanks for your answer! One more point regarding the overwriting of loss functions that would also be good to have in the PR: in evaluate of ModnetModel, for classification, the passed loss function is ignored and the negative ROC AUC is always returned.

def evaluate(
    self,
    test_data: MODData,
    loss: Union[str, Callable] = "mae",
) -> pd.DataFrame:
    """Evaluates predictions on the passed MODData by returning the corresponding score:
    - for regression: loss function provided in loss argument. Defaults to mae.
    - for classification: negative ROC AUC.
    averaged over the targets when multi-target.

    Parameters:
        test_data: A featurized and feature-selected `MODData`
            object containing the descriptors used in training.

    Returns:
        Score defined hereabove.
    """

In the evaluation of individuals in FitGenetic, this method is also used,
self.val_loss = model.evaluate(
    val_data,
    loss=self.genes["loss"],
)

so fitting and validation may be done with different metrics.

If you agree, I would also like to implement support for passing a loss function to evaluate for classification tasks.
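For illustration, a rough sketch (hypothetical helper, binary case only, not the modnet API) of how a passed loss could be honoured for classification, falling back to the negative ROC AUC only when no loss is given:

from sklearn.metrics import log_loss, roc_auc_score

def classification_score(y_true, y_prob, loss=None):
    """Score binary class probabilities with the requested loss, or -ROC AUC by default."""
    if loss is None:
        return -roc_auc_score(y_true, y_prob)  # current default behaviour
    if callable(loss):
        return loss(y_true, y_prob)
    if loss == "binary_crossentropy":
        return log_loss(y_true, y_prob)
    raise ValueError(f"Unsupported loss: {loss!r}")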

@kaueltzen (Contributor, Author)

Hi @ppdebreuck, one more thing: while looking into stratifying the bootstrapping in fit of EnsembleMODNetModel, I noticed that the same bootstrapped sample is drawn from the training data self.n_models times, because the random_state is always the same:

if self.bootstrap:
    LOG.info("Generating bootstrap data...")
    train_datas = [
        training_data.split(
            (
                resample(
                    np.arange(len(training_data.df_targets)),
                    replace=True,
                    random_state=2943,
                ),
                [],
            )
        )[0]
        for _ in range(self.n_models)
    ]

Is this behaviour intended? If not, I'd replace it with something that is still reproducible but creates self.n_models different samples.
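For illustration, one reproducible way to draw distinct samples per model (the base_seed and loop names are illustrative, not the modnet code) is to derive a different seed per model from a single base seed:

import numpy as np
from sklearn.utils import resample

n_samples, n_models, base_seed = 100, 5, 2943
bootstrap_indices = [
    resample(np.arange(n_samples), replace=True, random_state=base_seed + i)
    for i in range(n_models)
]
# Re-running the script reproduces the same draws, but each model now gets its own sample.
assert not all(np.array_equal(bootstrap_indices[0], b) for b in bootstrap_indices[1:])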

@ml-evs (Collaborator) commented Oct 26, 2024

Thanks for noticing this -- it looks like this random state has long overstayed its welcome and has perhaps been reducing model performance for a few years (!). I'll raise a separate issue and open a PR fixing this.

@ppdebreuck (Owner)

@ml-evs It's indeed not the intended behaviour, but note that the GA never really uses bootstrapping; it only comes into play when fitting an EnsembleModel from scratch, so we never ran into this issue.

I would turn off bootstrapping altogether by default; just having different initial weights is usually enough, or even better.
