
Don't pass y=None to the train_split function #646

Merged (5 commits) on Jun 10, 2020

Conversation

@kqf (Contributor) commented on Jun 3, 2020

By default, skorch requires the train_split function to have two positional arguments, X and y. This leads to unexpected behavior when working on unsupervised tasks:

def unsupervised_split(dataset, split_ratio=0.7):
    return # some splits

...

def build_model():
    model = skorch.NeuralNetClassifier(
        module=SimpleModule,
        train_split=unsupervised_split,
    )
    ...
    return model

# This causes an error because `y=None` is passed as a positional
# argument to the `unsupervised_split` function.
model = build_model().fit(df)

As suggested by @BenjaminBossan, the problem can be worked around by defining a wrapper function that accepts a dummy positional argument, but the same can be achieved with minimal changes on the skorch side. This update also eases integration with the torchtext iterator/split methods (see the discussion here).
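
For illustration, such a wrapper might look like this (a minimal sketch; `unsupervised_split` and `split_ratio` come from the example above, the wrapper itself is hypothetical):

def unsupervised_split_with_dummy_y(dataset, y=None, split_ratio=0.7):
    # `y` is accepted only so that skorch can pass `y=None` positionally;
    # the actual unsupervised split ignores it.
    return unsupervised_split(dataset, split_ratio=split_ratio)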

@BenjaminBossan (Collaborator):

I had a quick look at this and it looks good so far. However, I'd like to think about it a bit longer, since the change might break some existing code; I want to make sure that nothing severe breaks because of it.

Meanwhile, could you please add an entry to CHANGES.md, in the Changed section?

@kqf force-pushed the bugfix-dont-split-by-y-if-none branch from 5655dbe to d1cdeae on June 4, 2020, 17:28
@kqf (Contributor, Author) commented on Jun 4, 2020

Thanks for reviewing the PR, sure, take your time.

@BenjaminBossan (Collaborator):

Thanks for updating the CHANGES.md.

I would like to see an explicit test for the new behavior. As of now, it's only tested coincidentally.

I'm a tiny bit nervous since this change could break existing code (as witnessed by the tests that needed changing). It's probably not necessary to raise a warning, but I could see adding a comment at line 1215 (before the `if y is None`) along these lines:

# After a change in #646, `y` is no longer passed to `self.train_split` if it is `None`. To revert to the previous behavior, remove the following two lines:

(of course with the proper indentations and line breaks)

The rest looks good to me.
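
For context, here is roughly what the guarded call looks like, sketched as a free function (in skorch it lives inside a NeuralNet method; the name and surroundings below are assumed, not copied from the diff):

def split_dataset(train_split, dataset, y=None, **fit_params):
    # After a change in #646, `y` is no longer passed to `train_split`
    # if it is `None`. To revert to the previous behavior, remove the
    # following two lines:
    if y is None:
        return train_split(dataset, **fit_params)
    return train_split(dataset, y, **fit_params)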

@BenjaminBossan (Collaborator) left a comment:

Nice test, thanks. I just have two questions; it would be nice if you could answer them (and, if appropriate, add comments to the test for posterity).

        (True, lambda x, y: (x, x), ExitStack()),
        (True, lambda x: (x, x), pytest.raises(TypeError)),  # Raises an error
    ])
    def test_train_split_with_nans(self, needs_y, train_split, raises):
@BenjaminBossan (Collaborator):

I'm a bit confused about the name test_train_split_with_nans: where do NaNs come into play? Is the test not about y being passed to train_split only when it is not None?

@kqf (Contributor, Author):

General remark: indeed, I didn't think the name of the test case through.

> where do nans come into play?

The NaNs here come into play as y=None and train_split=None.

> Is the test not about y being passed to train_split only when it is not None?

I am writing a more verbose answer to this question as a PR comment, so we can move the discussion there.

@kqf (Contributor, Author):

So, can we call this function test_passes_y_to_train_split_when_not_nan? Frankly, I can't come up with a better name 😅

@BenjaminBossan (Collaborator):

> The NaNs here come into play as y=None and train_split=None

Can you elaborate on that? Do you mean NaN in the sense of "not a number"? I don't see how that can happen.

> So, can we call this function test_passes_y_to_train_split_when_not_nan? Frankly, I can't invent a better name

With tests, a lengthy, descriptive name is always better than a short, less descriptive one. Nobody is going to import your test anyway. If in doubt, tests should also rather contain too many explanatory comments than too few (some skorch tests are hard to understand because we didn't do that right from the start).

Regarding the name itself, I would prefer test_passes_y_to_train_split_when_not_none (last word changed), since that is what y actually is.

@kqf (Contributor, Author):

> Can you elaborate on that?

I think this is covered in the PR comments. Yes, I just confused None and NaN in the conversation, sorry.

I agree with the changes; see the new code.

        )
        with raises:
            net.fit(X, y)
            assert net.predict(X) is not None
@BenjaminBossan (Collaborator):

Could you explain why this assertion is necessary?

@kqf (Contributor, Author):

It's not, I think. I wanted to be more explicit: even though y=None, the predictions still make sense. I can change this 🤷‍♂️

@BenjaminBossan (Collaborator):

Yes, I think it's better to remove it, since it detracts from what is really tested. I actually believe it's impossible for net.predict to ever return None, so this assertion will never fail.

@kqf (Contributor, Author):

Agreed, I think this can be resolved.

@kqf (Contributor, Author) commented on Jun 7, 2020

Thanks again for the suggestions.

Some explanations about the test case. The tests are applied to NeuralNet, as it's the only class that works with y=None. This means that neither the classifier nor the regressor should pass through the `if y is not None` line. Then, for each test scenario, we check whether the model can fit and make predictions. A dummy criterion that ignores the target is defined to keep things easy to read, and random data are generated.

As for the test cases, these lines reproduce the old behaviour:

        (False, None, ExitStack()),  # ExitStack = does not raise
        (True, None, ExitStack()),
        (True, lambda x, y: (x, x), ExitStack()),

Where the first two lines correspond to the default train_split method.

Then we need to check if it works when train_split does not require a positional y:

        (False, lambda x: (x, x), ExitStack()),
        (True, lambda x: (x, x), pytest.raises(TypeError)),  # Raises an error

These cases forced me to change the test data in test_scoring. One can see this as a safety check that says: "you are trying to apply an unsupervised split to supervised data".
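
Putting those pieces together, a condensed sketch of such a test could look like the following (the fixture wiring, the dummy criterion, and the exact parametrization are assumed here, not copied from the actual diff):

from contextlib import ExitStack

import numpy as np
import pytest
import torch

from skorch import NeuralNet
from skorch.toy import MLPModule


class IgnoreTargetCriterion(torch.nn.Module):
    # Dummy loss that ignores the target, so y=None is harmless.
    def forward(self, y_pred, y_true=None):
        return y_pred.mean()


@pytest.mark.parametrize('needs_y, train_split, raises', [
    (False, None, ExitStack()),                          # ExitStack = does not raise
    (True, None, ExitStack()),
    (True, lambda x, y: (x, x), ExitStack()),            # old two-argument signature
    (False, lambda x: (x, x), ExitStack()),              # unsupervised split, y=None
    (True, lambda x: (x, x), pytest.raises(TypeError)),  # y is passed but not accepted
])
def test_passes_y_to_train_split_when_not_none(needs_y, train_split, raises):
    X = np.random.rand(128, 10).astype(np.float32)
    y = np.random.binomial(n=1, p=0.5, size=128) if needs_y else None
    net = NeuralNet(
        MLPModule,
        module__input_units=10,        # match the 10 features generated above
        criterion=IgnoreTargetCriterion,
        max_epochs=1,
        train_split=train_split,
    )
    with raises:
        net.fit(X, y)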

Thanks again for the review

@BenjaminBossan (Collaborator):

Thanks for adding the explanation. I actually believe that a condensed form of it could be added to the test (as I mentioned above, better to add too many than too few comments in tests).

> Where the first two lines correspond to the default train_split method.

This is not quite true, since the default train_split is CVSplit(5), not None. Maybe you could actually even add the former?

> Thanks again for the review

Thanks for the PR ;)

@kqf (Contributor, Author) commented on Jun 7, 2020

> This is not quite true, since the default train_split is CVSplit(5)

😮 I overlooked that. For some reason I thought it went like `self.train_split = train_split or CVSplit(5)`. I'm not sure whether CVSplit invoked with y=None should fail or not. I have to think.

I agree with the comments; I'll try to fix the issues tomorrow (right now I'm away from my workstation).

from skorch.net import NeuralNet
from skorch.toy import MLPModule

# By default, `train_split=CVSplit(5)` in the `NeuralNet` definition
if train_split == "default":
@kqf (Contributor, Author):

This looks a bit ugly, but I didn't want to introduce a new function/import to this namespace.

# By default, `train_split=CVSplit(5)` in the `NeuralNet` definition
if train_split == "default":
    from skorch.dataset import CVSplit
    train_split = CVSplit(5)
@kqf (Contributor, Author):

This one is an explicit statement; however, it can be replaced by an implicit one for the sake of extensibility:

train_split = NeuralNet(None, None).train_split

This way, you don't have to modify the test when the default value of the parameter changes.
@BenjaminBossan, what do you think?

@BenjaminBossan (Collaborator):

I think it would be even better to solve it like this:

kwargs = {} if train_split == 'default' else {'train_split': train_split}

and then below:

net = NeuralNet(
   ...
   **kwargs)
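
Filled in, the suggested pattern might look like this (it reuses names from the test sketch above, such as the hypothetical IgnoreTargetCriterion, and the constructor arguments other than train_split are assumed):

# Fall back to the class default (CVSplit(5)) by simply not passing
# `train_split` at all when the test case asks for the default.
kwargs = {} if train_split == 'default' else {'train_split': train_split}
net = NeuralNet(
    MLPModule,
    module__input_units=10,
    criterion=IgnoreTargetCriterion,  # hypothetical dummy criterion from the sketch above
    max_epochs=1,
    **kwargs,
)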

@BenjaminBossan (Collaborator) left a comment:
I suggested two minor changes, the rest looks good.

n_samples, n_features = 128, 10
X = np.random.rand(n_samples, n_features).astype(np.float32)
y = np.random.binomial(n=1, p=0.5, size=n_samples) if needs_y else None

# The `NeuralNetClassifier` or `NeuralNetRegressor` always require `y`
# Only `NeuralNet`can transfer `y=None` to `tran_split` method.
@BenjaminBossan (Collaborator):

Suggested change
# Only `NeuralNet`can transfer `y=None` to `tran_split` method.
# Only `NeuralNet` can transfer `y=None` to `train_split` method.

@BenjaminBossan (Collaborator):

Great work, thanks for being patient.
