Non-stratified splitting and overwriting of loss function in classification tasks #228
Comments
Btw, I just noticed that the validation split in Keras is always taken from the last samples provided (modnet/modnet/models/vanilla.py, line 475 in e14188d). This would make sense in the regression case, while for classification a stratified split based on `val_fraction` should maybe already happen inside `MODNetModel.fit()`. Thoughts?
You are correct on all three points. For the second, I agree that adding stratification by default makes sense, as it follows what we do for k-fold (i.e. `StratifiedKFold`). The last one indeed only applies when a float split is used as input. I would either add a warning in the docs saying it splits on the last part, or better, do as you suggest and mimic closely what is done for the k-folds and hold-out. You could perhaps combine points 2 and 3 by defining a hold-out split function that takes a fraction as input and handles shuffling and stratification, similar to the k-fold split here: modnet/modnet/matbench/benchmark.py, line 16 in e14188d. A sketch of such a helper is below.

Happy to have a PR on this, thanks!
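A minimal sketch of such a helper, assuming scikit-learn is available; the function name and signature are hypothetical, not part of modnet:

```python
from sklearn.model_selection import train_test_split

def holdout_split(X, y, val_fraction=0.1, classification=False, random_state=42):
    """Split data into train/validation sets, stratifying on the labels
    for classification tasks (hypothetical helper, not modnet API)."""
    return train_test_split(
        X,
        y,
        test_size=val_fraction,
        shuffle=True,
        stratify=y if classification else None,
        random_state=random_state,
    )
```

For classification this mirrors what `StratifiedKFold` does in the k-fold case: class proportions are preserved in both partitions, so the validation loss is computed on a representative sample.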
Thanks for your answer! One more point regarding the overwriting of loss functions that would also be good to have in the PR: modnet/modnet/models/vanilla.py, lines 768 to 784 in e14188d.

This method is also used in the evaluation of individuals in `FitGenetic` (modnet/modnet/hyper_opt/fit_genetic.py, lines 237 to 240 in e14188d), so fitting and validation may be done with different metrics. If you agree, I would also like to implement the passing of a loss function in `FitGenetic`, roughly as in the sketch below.
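A minimal sketch of the idea, assuming a `loss` keyword can be threaded through; the `evaluate` signature shown here is an assumption for illustration, not the actual modnet API:

```python
# Hypothetical sketch: forward one user-supplied loss through the GA so
# that fitting and evaluation of an individual use the same metric.
def evaluate_individual(model, train_data, val_data, loss="mae"):
    model.fit(train_data, loss=loss)            # fit with the user's loss
    return model.evaluate(val_data, loss=loss)  # score with the same loss
```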
Hi @ppdebreuck, one more thing: when looking into stratifying the bootstrapping in modnet/modnet/models/ensemble.py, lines 90 to 104 in e14188d, I noticed that the bootstrap samples are all drawn with the same fixed random state. Is this behaviour anticipated? If not, I'd replace it with something that is reproducible but creates `self.n_models` different samples, along the lines of the sketch below.
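A minimal sketch of such a resampling scheme, assuming NumPy; `n_models` stands in for `self.n_models` and the helper itself is hypothetical:

```python
import numpy as np

def bootstrap_indices(n_samples, n_models, random_state=42):
    """Draw one bootstrap sample of row indices per model: reproducible
    given random_state, yet different for every model."""
    rng = np.random.default_rng(random_state)
    # Each call advances the generator, so every model gets its own
    # resample while the whole set stays deterministic under the seed.
    return [rng.integers(0, n_samples, size=n_samples) for _ in range(n_models)]
```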
Thanks for noticing this -- it looks like this random state has much overstayed its welcome, and has perhaps been reducing model performance for a few years (!). I'll raise a separate issue and open a PR fixing this.
@ml-evs It's indeed not the intended behaviour, but note that the GA never really uses bootstrapping; it is only used when fitting an `EnsembleModel` from scratch, so we never ran into this issue. I would turn off bootstrapping altogether by default: just having different initial weights is usually enough, or even better.
Hello,
I'd like to report two issues regarding classification tasks in modnet:
First, the loss function passed to `MODNetModel().fit()` is overwritten with `"categorical_crossentropy"` if `val_data` is not None and `self.multi_label=False`: modnet/modnet/models/vanilla.py, lines 400 to 411 in e14188d. As the `loss=None` case is already handled earlier, in L352-L360, during the preprocessing of the training data, maybe this overwrite could be removed when preprocessing the validation data? A sketch of the intended guard is below.
Second, if `nested=False`, both `FitGenetic` and `MODNetModel.fit_preset()` perform a train/test split that is not stratified:

modnet/modnet/models/vanilla.py, line 580 in e14188d
modnet/modnet/hyper_opt/fit_genetic.py, lines 458 to 462 in e14188d

This is an issue for imbalanced datasets, and it would be helpful if the splitting were stratified for classification tasks (see the hold-out split sketch above).
If you are interested, I'm happy to raise a PR with fixes.