Use an already trained Torch model to predict on lots of data #111
cc @muammar In addition to this example, I'd also link to integration with skorch (a scikit-learn wrapper for PyTorch) and Dask-ML's ParallelPostFit.
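A rough sketch of how those pieces could fit together — `sk_net` stands in for an already-fitted skorch estimator (an assumption, not from this thread):

```python
import dask.array as da
from dask_ml.wrappers import ParallelPostFit

# Wrap an already-fitted, scikit-learn-compatible estimator (for example a
# skorch NeuralNetClassifier) so predict() runs block-wise over Dask arrays.
clf = ParallelPostFit(estimator=sk_net)

X = da.random.random((10_000, 100), chunks=(1_000, 100))
predictions = clf.predict(X)   # lazy Dask array, one task per chunk
result = predictions.compute()
```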
Should have an example ready tomorrow.
I think we'll want to go into some detail about data loading, because it's not 100% straightforward how to get the data loaded onto workers. We can write a `Dataset` like

```python
import glob
import os

import torch
from PIL import Image


def default_loader(path, fs=__builtins__):
    # `fs` is anything with an `open` method: the builtin `open` by default,
    # or an fsspec-style filesystem (e.g. s3fs) for remote data.
    with fs.open(path, 'rb') as f:
        img = Image.open(f).convert("RGB")
    return img


class FileDataset(torch.utils.data.Dataset):
    def __init__(self, files, transform=None, target_transform=None,
                 classes=None, loader=default_loader):
        self.files = files
        self.transform = transform
        self.target_transform = target_transform
        self.loader = loader
        if classes is None:
            # Infer the class names from each file's parent directory.
            classes = list(sorted(set(x.split(os.path.sep)[-2] for x in files)))
        else:
            classes = list(classes)
        self.classes = classes

    def __len__(self):
        return len(self.files)

    def __getitem__(self, index):
        filename = self.files[index]
        img = self.loader(filename)
        target = self.classes.index(filename.split(os.path.sep)[-2])
        if self.transform is not None:
            img = self.transform(img)
        if self.target_transform is not None:
            target = self.target_transform(target)
        return img, target
```

and use it as

```python
files = glob.glob("hymenoptera_data/val/*/*.jpg")
dataset = FileDataset(files, transform=data_transforms['val'])
```
For s3, the usage would be much the same, passing an s3fs filesystem into the loader (sketched below). Things seem to be working out well after that. PyTorch models seem to (de)serialize much better than TensorFlow's did last time I tried.
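A minimal sketch of that s3 usage, assuming s3fs; the bucket path and the `functools.partial` binding are illustrative, not from the original comment:

```python
import functools

import s3fs

fs = s3fs.S3FileSystem()  # credentials are picked up from the environment

# List the images directly on s3 and bind the filesystem into the loader.
files = fs.glob("my-bucket/hymenoptera_data/val/*/*.jpg")
loader = functools.partial(default_loader, fs=fs)
dataset = FileDataset(files, transform=data_transforms['val'], loader=loader)
```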
Do we need to use the Torch Dataset API here?

> because it's not 100% straightforward how to get the data loaded onto workers

I guess my hope is that, for image data at least, we could just pass around NumPy arrays. So we might create dask.delayed objects using skimage.io.imread or something similar (maybe like https://blog.dask.org/2019/06/20/load-image-data, but before the dask array bit).
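A sketch of that delayed-imread approach; the tiny `model` here is a stand-in for a real trained network, and the file layout is assumed:

```python
import glob

import dask
import torch
from skimage.io import imread

# Stand-in for an already-trained torch.nn.Module.
model = torch.nn.Sequential(
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(), torch.nn.Linear(3, 2)
)

filenames = glob.glob("hymenoptera_data/val/*/*.jpg")

# One delayed NumPy array per image; nothing is read until compute time.
lazy_images = [dask.delayed(imread)(f) for f in filenames]

@dask.delayed
def predict(image):
    # HWC uint8 -> NCHW float, then score a single image.
    tensor = torch.from_numpy(image).permute(2, 0, 1).float().unsqueeze(0)
    with torch.no_grad():
        return model(tensor).numpy()

predictions = dask.compute(*[predict(img) for img in lazy_images])
```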
Also, if you haven't seen it, this video is nice: https://developer.download.nvidia.com/video/gputechconf/gtc/2019/video/S9198/s9198-dask-and-v100s-for-fast-distributed-batch-scoring-of-computer-vision-workloads.mp4
Ahh, yes if we’re doing prediction only we can probably do that. A Dataset would only be necessary for training.
Distributed training would also be interesting of course, but my guess is that that's more of an open problem. It's not clear to me which way is the right way.
I'm curious about inputs of Dask Arrays and outputs of model predictions too. I think PyTorch … It's also mentioned in dask/distributed#2581.
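For the Dask Array path, prediction can be mapped over blocks — a sketch, where the small `model` again stands in for a real trained network:

```python
import dask.array as da
import numpy as np
import torch

# Stand-in for a real trained model: collapses each image to one score.
model = torch.nn.Sequential(
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(), torch.nn.Linear(3, 1)
)

def predict_block(block):
    # Score one NumPy block of images and return a 1-D NumPy array.
    tensor = torch.from_numpy(block).float()
    with torch.no_grad():
        scores = model(tensor)       # shape (batch, 1)
    return scores.numpy()[:, 0]      # shape (batch,)

X = da.random.random((10_000, 3, 64, 64), chunks=(500, 3, 64, 64))
predictions = X.map_blocks(predict_block, drop_axis=[1, 2, 3], dtype=np.float32)
result = predictions.compute()
```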
Skorch looks interesting to me. Can the wrapper be used after loading the model from disk, where the wrapper was not used for training? I've tried applying the Dask-ML ParallelPostFit wrapper to a pre-trained model, and I remember having to do a few manual steps before running predictions. I need to dig up that code.
Yup. The underlying model is an attribute (`module_`):

```python
import torch
from skorch import NeuralNetClassifier


class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        ...  # define layers here


model = Net()
# Train model
# Save the trained model with PyTorch
torch.save(model.state_dict(), "trained_model.pt")

# Use skorch later (not necessarily in the training session)
sk_net = NeuralNetClassifier(Net)
sk_net.initialize()
# Load the parameters saved with PyTorch
sk_net.module_.load_state_dict(torch.load("trained_model.pt"))
```
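Once restored this way, the skorch net exposes the usual scikit-learn interface, so batch prediction is just a method call — a sketch, with the input shape assumed:

```python
import numpy as np

X_test = np.random.rand(16, 10).astype("float32")  # illustrative input batch
y_pred = sk_net.predict(X_test)
proba = sk_net.predict_proba(X_test)
```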
Extending on #35, it would be nice to have an example using Dask and Torch together to parallelize prediction. This should be a simple, embarrassingly parallel use case, but I suspect it would be practical for lots of folks.

The challenge, I think, is constructing a simple example that hopefully doesn't get too deep into Torch or a particular dataset. In my ideal world this would be something like the sketch below.
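A rough sketch of that ideal, with every name illustrative (a stand-in model and random input batches):

```python
import dask
import torch

# Stand-in for a real pre-trained model loaded from disk.
model = torch.nn.Conv2d(3, 2, kernel_size=3)

@dask.delayed
def predict(batch):
    # One task per batch; Dask ships `model` to the workers with the task.
    with torch.no_grad():
        return model(batch)

batches = [torch.randn(8, 3, 32, 32) for _ in range(10)]  # stand-in input data
results = dask.compute(*[predict(b) for b in batches])
```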
Does anyone have good pointers to such a simple case?
cc @stsievert @TomAugspurger @AlbertDeFusco