Matminer's multiprocessing problem comes from featurizing a few expensive entries in large iterables of entries (generally structures). When a multiprocessing job hangs for a long time, it is usually because the memory needed for a small number of chunks ends up getting thrashed around or stagnating, so the final result never finishes computing.
Proposed solution: replace multiprocessing with Dask bag
Assumptions:
The order in which samples are featurized does not matter, and the resulting df can be re-sorted trivially
The computation is embarrassingly parallel with respect to samples
Keeping the overall df in memory is not the issue; computing the features is
General overview:
In `featurize_many`, create a `dask.bag.Bag` from chunks (AKA partitions in dask) of the dataframe input samples. Alternatively, this could be done with a `delayed` call. (See the sketch after this list.)
Compute the bag lazily
Convert the bag back into a dataframe (this should happen automatically via the generator in `featurize_many`)
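A minimal sketch of the above, assuming the standard matminer featurizer interface (`featurize` / `feature_labels`); `featurize_many_dask` and `npartitions` are illustrative names, not existing matminer API:

```python
import dask.bag as db
import pandas as pd

def featurize_many_dask(featurizer, samples, npartitions=8):
    # Partition the input samples; each partition is featurized
    # independently, so one expensive entry only slows its own partition.
    bag = db.from_sequence(samples, npartitions=npartitions)

    # Lazy: this only builds the task graph; nothing is computed yet.
    features = bag.map(featurizer.featurize)

    # Trigger the computation and rebuild the dataframe. Bags preserve
    # element order, so rows line up with the input samples.
    rows = features.compute()
    return pd.DataFrame(rows, columns=featurizer.feature_labels())
```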
By default, dask uses multiprocessing as the scheduler if no `distributed` client is defined. So to actually take advantage of this, you need to define a `dask.distributed` `Client` (local via `LocalCluster`, or pointed at a remote cluster) with a different scheduler. Then you can compute according to whatever resources are available (including constraints on memory).
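For example (the worker counts and `memory_limit` are illustrative; `featurize_many_dask` is the hypothetical helper from the sketch above):

```python
from dask.distributed import Client, LocalCluster

# Explicit worker and memory limits; distributed workers spill to disk
# instead of thrashing when they approach their memory limit.
cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit="4GB")
client = Client(cluster)

# With the client registered as the default scheduler, the bag's
# .compute() above runs on these workers, and the dashboard
# (client.dashboard_link) shows per-worker memory use.
df = featurize_many_dask(featurizer, structures)
```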
I don't think the memory-locking problem from multiprocessing will happen if parallelization is done this way, but we can't be sure until we try it. If it still happens, it might be worth looking into further. At the very least, using dask will allow you to use multiple machines to compute features.