
Don't pass y=None to the train_split function #646

Merged (5 commits) on Jun 10, 2020

Conversation

@kqf (Contributor) commented on Jun 3, 2020

By default, skorch requires the train_split function to have two positional arguments, X and y. This leads to unexpected behavior when working on unsupervised tasks:

def unsupervised_split(dataset, split_ratio=0.7):
    return # some splits

...

def build_model():
    model = skorch.NeuralNetClassifier(
        module=SimpleModule,
        train_split=unsupervised_split,
    )
    ...
    return model

# This causes an error because `y=None` is passed as a positional
# argument to the `unsupervised_split` function.
model = build_model().fit(df)

As suggested by @BenjaminBossan, the problem can be worked around by defining a wrapper function that accepts a dummy positional argument, but the same can be achieved with minimal changes on the skorch side. This update also eases integration with the torchtext iterator/split methods (see the discussion here).
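
For illustration, such a wrapper might look like this (a minimal sketch; `unsupervised_split` and `split_ratio` come from the example above, the wrapper itself is hypothetical):

def unsupervised_split_with_dummy_y(dataset, y=None, split_ratio=0.7):
    # `y` is accepted only so that skorch can pass `y=None` positionally;
    # the actual unsupervised split ignores it.
    return unsupervised_split(dataset, split_ratio=split_ratio)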

@BenjaminBossan (Collaborator):

I had a quick look at this and it looks good so far. However, I'd like to think about it a bit longer, since the change might break some existing code; I want to make sure that nothing severe breaks because of it.

Meanwhile, could you please add an entry to CHANGES.md, in the Changed section?

@kqf force-pushed the bugfix-dont-split-by-y-if-none branch from 5655dbe to d1cdeae on June 4, 2020, 17:28
@kqf (Contributor, Author) commented on Jun 4, 2020

Thanks for reviewing the PR, sure, take your time.

@BenjaminBossan (Collaborator):

Thanks for updating the CHANGES.md.

I would like to see an explicit test for the new behavior. As of now, it's only tested coincidentally.

I'm a tiny bit nervous since this change could break existing code (as witnessed by the tests that needed changing). It's probably not necessary to raise a warning, but I could see adding a comment at line 1215 (before the `if y is None`) along these lines:

# After a change in #646, `y` is no longer passed to `self.train_split` if it is `None`. To revert to the previous behavior, remove the following two lines:

(of course with the proper indentations and line breaks)

The rest looks good to me.
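
For context, here is roughly what the guarded call looks like, sketched as a free function (in skorch it lives inside a NeuralNet method; the name and surroundings below are assumed, not copied from the diff):

def split_dataset(train_split, dataset, y=None, **fit_params):
    # After a change in #646, `y` is no longer passed to `train_split`
    # if it is `None`. To revert to the previous behavior, remove the
    # following two lines:
    if y is None:
        return train_split(dataset, **fit_params)
    return train_split(dataset, y, **fit_params)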

@BenjaminBossan (Collaborator) left a comment:

Nice test, thanks. I just have two questions; it would be nice if you could answer them (and, if appropriate, add comments to the test for posterity).

        (True, lambda x, y: (x, x), ExitStack()),
        (True, lambda x: (x, x), pytest.raises(TypeError)),  # Raises an error
    ])
    def test_train_split_with_nans(self, needs_y, train_split, raises):
@BenjaminBossan (Collaborator):

I'm a bit confused about the name test_train_split_with_nans: where do NaNs come into play? Is the test not about y being passed to train_split only when it is not None?

@kqf (Contributor, Author):

General remark: indeed, I didn't think the name of the test case through.

> where do nans come into play?

The NaNs here come into play as y=None and train_split=None.

> Is the test not about y being passed to train_split only when it is not None?

I am writing a more verbose answer to this question as a PR comment, so we can move the discussion there.

@kqf (Contributor, Author):

So, can we call this function test_passes_y_to_train_split_when_not_nan? Frankly, I can't come up with a better name 😅

@BenjaminBossan (Collaborator):

> The NaNs here come into play as y=None and train_split=None

Can you elaborate on that? Do you mean NaN in the sense of "not a number"? I don't see how that can happen.

> So, can we call this function test_passes_y_to_train_split_when_not_nan? Frankly, I can't invent a better name

With tests, a lengthy, descriptive name is always better than a short, less descriptive one. Nobody is going to import your test anyway. If in doubt, tests should also rather contain too many explanatory comments than too few (some skorch tests are hard to understand because we didn't do that right from the start).

Regarding the name itself, I would prefer test_passes_y_to_train_split_when_not_none (last word changed), since that is what y actually is.

@kqf (Contributor, Author):

> Can you elaborate on that?

I think this is covered in the PR comments. Yes, I just confused None and NaN in the conversation, sorry.

I agree with the changes; see the new code.

        )
        with raises:
            net.fit(X, y)
            assert net.predict(X) is not None
@BenjaminBossan (Collaborator):

Could you explain why this assertion is necessary?

@kqf (Contributor, Author):

It's not, I think. I wanted to be more explicit: even though y=None, the predictions still make sense. I can change this 🤷‍♂️

@BenjaminBossan (Collaborator):

Yes, I think it's better to remove it, since it detracts from what is really tested. I actually believe it's impossible for net.predict to ever return None, so this assertion will never fail.

@kqf (Contributor, Author):

Agreed, I think this can be resolved.

@kqf (Contributor, Author) commented on Jun 7, 2020

Thanks again for the suggestions.

Some explanations about the test case. The tests are applied to NeuralNet, as it's the only class that works with y=None. This means that neither the classifier nor the regressor should pass through the `if y is not None` line. Then, for each test scenario, we check whether the model can fit and make predictions. A dummy criterion that ignores the target is defined to keep things easy to read, and random data are generated.

As for the test cases, these lines reproduce the old behaviour:

        (False, None, ExitStack()),  # ExitStack = does not raise
        (True, None, ExitStack()),
        (True, lambda x, y: (x, x), ExitStack()),

Where the first two lines correspond to the default train_split method.

Then we need to check if it works when train_split does not require a positional y:

        (False, lambda x: (x, x), ExitStack()),
        (True, lambda x: (x, x), pytest.raises(TypeError)),  # Raises an error

These cases forced me to change the test data in test_scoring. One can see this as a safety check that says: "you are trying to apply an unsupervised split to supervised data".
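
Putting those pieces together, a condensed sketch of such a test could look like the following (the fixture wiring, the dummy criterion, and the exact parametrization are assumed here, not copied from the actual diff):

from contextlib import ExitStack

import numpy as np
import pytest
import torch

from skorch import NeuralNet
from skorch.toy import MLPModule


class IgnoreTargetCriterion(torch.nn.Module):
    # Dummy loss that ignores the target, so y=None is harmless.
    def forward(self, y_pred, y_true=None):
        return y_pred.mean()


@pytest.mark.parametrize('needs_y, train_split, raises', [
    (False, None, ExitStack()),                          # ExitStack = does not raise
    (True, None, ExitStack()),
    (True, lambda x, y: (x, x), ExitStack()),            # old two-argument signature
    (False, lambda x: (x, x), ExitStack()),              # unsupervised split, y=None
    (True, lambda x: (x, x), pytest.raises(TypeError)),  # y is passed but not accepted
])
def test_passes_y_to_train_split_when_not_none(needs_y, train_split, raises):
    X = np.random.rand(128, 10).astype(np.float32)
    y = np.random.binomial(n=1, p=0.5, size=128) if needs_y else None
    net = NeuralNet(
        MLPModule,
        module__input_units=10,        # match the 10 features generated above
        criterion=IgnoreTargetCriterion,
        max_epochs=1,
        train_split=train_split,
    )
    with raises:
        net.fit(X, y)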

Thanks again for the review

@BenjaminBossan (Collaborator):

Thanks for adding the explanation. I actually believe that a condensed form of it could be added to the test (as I mentioned above, better to add too many than too few comments in tests).

> Where the first two lines correspond to the default train_split method.

This is not quite true, since the default train_split is CVSplit(5), not None. Maybe you could actually even add the former?

> Thanks again for the review

Thanks for the PR ;)

@kqf (Contributor, Author) commented on Jun 7, 2020

> This is not quite true, since the default train_split is CVSplit(5)

😮 I overlooked that. For some reason I thought it went like `self.train_split = train_split or CVSplit(5)`. I'm not sure whether CVSplit invoked with y=None should fail or not. I have to think.

I agree with the comments; I'll try to fix the issues tomorrow (right now I'm away from my workstation).

from skorch.net import NeuralNet
from skorch.toy import MLPModule

# By default, `train_split=CVSplit(5)` in the `NeuralNet` definition
if train_split == "default":
@kqf (Contributor, Author):

This looks a bit ugly, but I didn't want to introduce a new function/import to this namespace.

# By default, `train_split=CVSplit(5)` in the `NeuralNet` definition
if train_split == "default":
    from skorch.dataset import CVSplit
    train_split = CVSplit(5)
@kqf (Contributor, Author):

This one is an explicit statement; however, it can be replaced by an implicit one for the sake of extensibility:

train_split = NeuralNet(None, None).train_split

This way, you don't have to modify the test when the default value of the parameter changes.
@BenjaminBossan, what do you think?

@BenjaminBossan (Collaborator):

I think it would be even better to solve it like this:

kwargs = {} if train_split == 'default' else {'train_split': train_split}

and then below:

net = NeuralNet(
   ...
   **kwargs)
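
Filled in, the suggested pattern might look like this (it reuses names from the test sketch above, such as the hypothetical IgnoreTargetCriterion, and the constructor arguments other than train_split are assumed):

# Fall back to the class default (CVSplit(5)) by simply not passing
# `train_split` at all when the test case asks for the default.
kwargs = {} if train_split == 'default' else {'train_split': train_split}
net = NeuralNet(
    MLPModule,
    module__input_units=10,
    criterion=IgnoreTargetCriterion,  # hypothetical dummy criterion from the sketch above
    max_epochs=1,
    **kwargs,
)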

@BenjaminBossan (Collaborator) left a comment:
I suggested two minor changes, the rest looks good.

n_samples, n_features = 128, 10
X = np.random.rand(n_samples, n_features).astype(np.float32)
y = np.random.binomial(n=1, p=0.5, size=n_samples) if needs_y else None

# The `NeuralNetClassifier` or `NeuralNetRegressor` always require `y`
# Only `NeuralNet`can transfer `y=None` to `tran_split` method.
@BenjaminBossan (Collaborator):

Suggested change
# Only `NeuralNet`can transfer `y=None` to `tran_split` method.
# Only `NeuralNet` can transfer `y=None` to `train_split` method.

@BenjaminBossan (Collaborator):

Great work, thanks for being patient.
