
pyprind with joblib #21

Open
ajkl opened this issue Feb 22, 2016 · 5 comments

Comments

@ajkl

ajkl commented Feb 22, 2016

Is it possible to use the callback from joblib's Parallel to make it work with pyprind for parallel processing tasks?

@rasbt

rasbt commented Feb 23, 2016

Hi,
thanks for the suggestion/request; supporting joblib sounds like a useful feature. Personally, I haven't experimented with this combination yet.

So, I can think of two possible scenarios here:

  1. Having an outer for loop that runs multiple joblib calls iteratively and updates the bar like so

pbar = ProgBar(len(x))
for _ in x:
    # do something w. joblib in parallel
    pbar.update()

which I guess would already work.

  2. Tracking the progress inside joblib. Here, you are running multiple processes via joblib, each with its own for loop. The goal is to have all of them update a single progress bar:

def some_func():
    for _ in x:
        # do something

pbar = ProgBar(n)
# run multiple instances of some_func in parallel
# let all processes update the pbar
Is this what you have in mind? In theory, I think this should be easily possible; all the processes would have to do is call the update method. It would be nice if you had some example code that we could use to experiment a bit.

@ajkl

ajkl commented Feb 23, 2016

The second option is what I was looking for, but it doesn't seem to work with your suggestion of letting all processes update pbar:

from joblib import Parallel, delayed
import time
import pyprind

timesleep = 0.05
n = 1000
bar = pyprind.ProgBar(n)

def foo(x):
    time.sleep(timesleep)
    bar.update()
    return x

Parallel(n_jobs=4, verbose=0)(delayed(foo)(i) for i in range(n))

@rasbt

rasbt commented Feb 24, 2016

Hm, I think the problem is that the standard output is blocked during the computation, which is why the progress bar appears only after everything has finished. I think this is something to investigate further after the "double progress bar" support has been added (see #18).

In any case, another problem is that multiprocessing creates copies of the objects that are sent to the different processes (in contrast to threading). So with 4 processes, there are effectively 4 progress bars, each running from 0% to 25%.
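The copy problem can be seen with a tiny sketch that replaces the ProgBar with a plain counter: increments made in the worker processes never reach the parent's object. (`bump` is a hypothetical stand-in for a worker that calls pbar.update().)

```python
import multiprocessing as mp

counter = {'n': 0}  # stands in for the ProgBar's internal state

def bump(_):
    counter['n'] += 1  # updates the *child's* copy of counter
    return counter['n']

if __name__ == '__main__':
    with mp.Pool(processes=4) as pool:
        pool.map(bump, range(100))
    print(counter['n'])  # still 0 in the parent
```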

Honest question: what's the advantage of joblib over multiprocessing? I've seen it in certain libraries (e.g., scikit-learn) but never really understood why joblib is used instead of multiprocessing. E.g.,

from joblib import Parallel, delayed
import time
import pyprind

timesleep = 0.05
n = 1000
n_jobs = 4

bar = pyprind.ProgBar(n, stream=1)
def foo(x):
    time.sleep(timesleep)
    bar.update()
    return x

results = Parallel(n_jobs=n_jobs, 
                   verbose=0, 
                   backend="multiprocessing")(delayed(foo)(i) for i in range(n))

vs.

import multiprocessing as mp

pool = mp.Pool(processes=n_jobs)
# note: pool.apply blocks until each call returns, so apply_async
# is needed for the calls to actually run in parallel
results = [pool.apply_async(foo, args=(x,)) for x in range(n)]
results = [r.get() for r in results]

@rasbt rasbt closed this as completed Feb 24, 2016
@rasbt rasbt reopened this Feb 24, 2016
@ajkl

ajkl commented Feb 24, 2016

Well, I am kind of new to the Python ecosystem and recently came across joblib. I noticed scikit-learn is using it, so I assumed it must solve some issues that multiprocessing has. Honestly, I haven't evaluated the two yet.
I understand that multiprocessing creates different objects, hence you always see 25% in the example above. I'm not sure if there is an easy solution around it. I don't want to waste your time since it is not that critical. Thanks for this great package!

@rasbt

rasbt commented Feb 24, 2016

> I understand that multiprocessing creates different objects, hence you always see 25% in the example above. I'm not sure if there is an easy solution around it.

I think there could be a way around that, but it'll require some tweaks. Btw., if you use the "threading" backend, it should give you the 100% correctly; the problem is still how to print to stdout while the processes are still running...
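The sharing behavior behind the threading backend can be sketched with a shared counter standing in for the ProgBar. This uses concurrent.futures threads for illustration; joblib's backend="threading" shares objects the same way, so a single bar would reach 100%:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

counter = {'n': 0}  # stands in for the single shared ProgBar
lock = threading.Lock()

def tick(x):
    # every thread sees the *same* counter object; the lock guards
    # against racing increments (in real code: pbar.update())
    with lock:
        counter['n'] += 1
    return x

with ThreadPoolExecutor(max_workers=4) as ex:
    results = list(ex.map(tick, range(100)))

print(counter['n'])  # 100 -- all threads updated the shared object
```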

> I'm not sure if there is an easy solution around it. I don't want to waste your time since it is not that critical. Thanks for this great package!

Unfortunately, there are too many things on my to-do list currently. But I will leave this issue open; maybe someone has a good idea for how to implement it, or maybe there will be a boring weekend for me some day ... ;)
