Caffe integration attempt #23

Open · wants to merge 46 commits into master

Conversation


@immars immars commented Apr 27, 2015

Caffe integration attempt with parameter_server.
This pull request is not opened because the code is ready to merge, but to draw attention and ask questions so the implementation can be improved. Thanks.

about algorithm

  • follows Google's NIPS 2012 Downpour SGD approach
  • workers do Forward/Backward only; the server updates the parameters (a minimal sketch of this split follows this list)
  • AdaDelta with momentum SGD was tested and gives good convergence
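
As referenced above, here is a minimal sketch of the worker/server split, assuming plain momentum SGD as the server-side update rule; Server, OnPush, OnPull and WorkerIteration are placeholder names for illustration, not classes from this PR.

```cpp
// Minimal sketch of the worker/server split described above (placeholder
// names, not the PR's actual classes): workers only run Forward/Backward and
// push gradients; the server owns the weights and applies the update rule.
#include <cstddef>
#include <vector>

struct Server {
  std::vector<float> weights, velocity;
  explicit Server(std::size_t n) : weights(n, 0.f), velocity(n, 0.f) {}

  // Invoked when a worker pushes a diff: the weight update happens only here.
  void OnPush(const std::vector<float>& grad, float lr, float momentum) {
    for (std::size_t i = 0; i < weights.size(); ++i) {
      velocity[i] = momentum * velocity[i] - lr * grad[i];
      weights[i] += velocity[i];
    }
  }
  // Invoked when a worker pulls: hand back the current (possibly newer) weights.
  const std::vector<float>& OnPull() const { return weights; }
};

// Worker side: run Forward/Backward on a minibatch, push the diff, refresh
// the local copy of the weights. No weight update happens on the worker.
void WorkerIteration(Server& server, std::vector<float>& local_weights) {
  std::vector<float> grad(local_weights.size(), 0.f);  // Forward/Backward would fill this
  server.OnPush(grad, /*lr=*/0.01f, /*momentum=*/0.9f);
  local_weights = server.OnPull();
}
```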

about implementation

  • uses push/pull as the communication framework
  • src/app/caffe/caffe_main.cc: 1 process per GPU device; computation driven by workers; puller/pusher/solver run in different threads (see the sketch after this list)
  • src/app/caffe/caffe_synced.cc: 1 process per GPU device; computation driven by the server; all workers compute a batch and those batches are accumulated into a larger one, effectively an N-times-larger batch size
  • src/app/caffe/caffe_async_share.cc: 1 process per node, 1 thread per GPU device; weights pulled from the server are shared by the threads; computation driven by workers
  • tested with 2 nodes with 4 GPU devices each; caffe_async_share gets the best overall performance w.r.t. network usage and convergence speed
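
The per-GPU thread layout of caffe_main.cc might look roughly like the following; the names and the mutex-protected buffers are illustrative assumptions, and the real code talks to the parameter server rather than to these stubs.

```cpp
// Schematic of the per-GPU process layout described above (placeholder names):
// the solver thread runs Forward/Backward while separate pusher and puller
// threads exchange diffs/weights with the parameter server.
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

std::mutex mu;
std::vector<float> shared_weights;  // latest weights pulled from the server
std::vector<float> pending_diff;    // gradients accumulated for the next push
std::atomic<bool> running{true};

void SolverThread() {   // Forward/Backward on this process's GPU device
  while (running) {
    std::lock_guard<std::mutex> lk(mu);
    // copy shared_weights into the net, run one iteration,
    // then add the resulting gradient into pending_diff
  }
}
void PusherThread() {   // ships the accumulated diff to the server
  while (running) {
    std::lock_guard<std::mutex> lk(mu);
    // push pending_diff to the server, then clear it
  }
}
void PullerThread() {   // refreshes shared_weights from the server
  while (running) {
    std::lock_guard<std::mutex> lk(mu);
    // overwrite shared_weights with the server's latest copy
  }
}

int main() {
  std::thread solver(SolverThread), pusher(PusherThread), puller(PullerThread);
  running = false;  // a real shutdown would be triggered by the scheduler
  solver.join(); pusher.join(); puller.join();
}
```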

about usage:

  • script/caffe_local.sh for local testing (not with caffe_async_share); script/caffe_lan.sh for launching in a cluster
  • script/caffe_kill_lan.sh for an (ungraceful) shutdown in the cluster.
  • caffe_lan.sh usage: script/caffe_lan.sh {conf_file}. See ./conf/ for an example conf_file setting up workers/servers.

Pending questions about code

  • probably not going to be merged, at least not before the following issues are solved:
  • haven't rebased onto the current master yet
  • Is there a better way of notifying the server of a worker's push and pull, like the vectorChanged/vectorGetting interfaces introduced in my code? The server needs to update the weights after a worker's diff is pushed, and to synchronize the weights from GPU back to host memory just before a worker pulls (see the sketch after this list).
  • Is there a document explaining SharedParameter, its subclasses, channels, and their usage in the cluster? VVector was added for a simpler implementation: all parameters live in one indivisible vector (so only 1 server is supported as a result), but I guess that could be replaced by KVVector to support multiple servers. I just don't know how.
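
To make the notification question concrete, here is an illustrative paraphrase of the kind of hook it describes; the method names echo the vectorChanged/vectorGetting interfaces mentioned above, but the signatures are guesses, not the PR's actual ones.

```cpp
// Illustrative paraphrase of the server-side hooks described above; the
// signatures are assumptions, not the PR's actual vectorChanged/vectorGetting.
#include <vector>

class ServerHooks {
 public:
  virtual ~ServerHooks() = default;
  // Called after a worker has pushed a diff for the vector identified by key:
  // the server applies its update rule (e.g. AdaDelta with momentum) here.
  virtual void vectorChanged(int key, const std::vector<float>& diff) = 0;
  // Called just before a worker pulls, so the server can synchronize the
  // weights from GPU memory back to host memory first.
  virtual void vectorGetting(int key) = 0;
};
```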

@immars
Author

immars commented Apr 27, 2015

should compile against https://github.com/immars/caffe/tree/mydev

@mli
Member

mli commented Apr 27, 2015

Hi Lisen,

Many thanks for your contributions!

I have a few questions about your code (apologies for the stupid questions; I didn't read your code carefully).

  1. It seems to me that you moved caffe's forward/backward logic out so you know when to do a push or a pull. Is it possible to insert several lines of code into caffe to do that job?
  2. Is it a good idea to let the server node just use CPUs, to avoid copying data into GPUs? Since the update is quite cheap, a CPU is about as fast as a GPU for it.
  3. Have you benchmarked the convergence rate in the distributed setting? I heard that DistBelief has serious convergence problems: 10x the machines for only a 2x speedup to reach the same test accuracy.
  4. How do you divide the data? Each worker only needs to process a part of the training data.
  5. Do you have any comments comparing this to cxxnet's strategy? Assume there are 3 layers: 0, 1, 2. In the back-propagation stage: compute the gradient of layer 2, push the gradient to the servers (which requires moving the GPU memory to the CPU), send the pull request for layer 2's weights, then move on to layers 1 and 0. In the next forward stage, the worker first needs to wait for layer 0's weights to be pulled back, and then for layers 1 and 2. Everything (push, pull, memcpy) is asynchronous to hide the communication cost (a sketch of this schedule follows the list).
  6. Don't worry about SharedParameter. The new API will be much simpler, https://github.com/dmlc/ps-lite/blob/master/src/ps.h, and you can also check cxxnet's implementation: https://github.com/dmlc/mshadow/tree/master/guide/mshadow-ps
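
A rough sketch of the layer-wise schedule described in question 5; the Push/PullReq/PullWait stubs stand in for an asynchronous parameter-server API and are not cxxnet's actual calls.

```cpp
// Sketch of the backward/forward schedule from question 5 (stub functions;
// not cxxnet's actual code).
const int kLayers = 3;  // layers 0, 1, 2

void Backward(int layer) { /* compute the gradient of `layer` on the GPU */ }
void Push(int layer)     { /* async: copy the gradient to CPU, send to servers */ }
void PullReq(int layer)  { /* async: request the updated weights of `layer` */ }
void PullWait(int layer) { /* block until the pull for `layer` has finished */ }
void Forward(int layer)  { /* forward pass using the freshly pulled weights */ }

void BackwardPass() {
  // Top layer first: its gradient is ready earliest, so its communication
  // overlaps with the back-propagation of the lower layers.
  for (int layer = kLayers - 1; layer >= 0; --layer) {
    Backward(layer);
    Push(layer);
    PullReq(layer);
  }
}

void ForwardPass() {
  // Layer 0's weights are needed first, then layers 1 and 2.
  for (int layer = 0; layer < kLayers; ++layer) {
    PullWait(layer);
    Forward(layer);
  }
}
```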

Best
Mu

@immars
Author

immars commented Apr 28, 2015

Mu,
Thanks for your reply,

  • in my caffe branch, an iteration for one minibatch is split into several phases, so that a) different phases can be invoked by the worker and the server, and b) the worker knows the appropriate time to pull/push. It would be possible to add hooks on the caffe side to achieve (b) and simplify the parameter_server side, but (a) is still needed, if that's what you mean, or is it?
  • It is a good idea; no impact on server/worker observed, thanks!
  • haven't tested thoroughly, but here's what has been done:

[figure: training loss vs. worker iteration for n1, n8.sync and n8.async]
Only the losses are meaningful because I had modified the accuracy definition. n1 is single-threaded, n8.sync is synchronized SGD via caffe_synced, and n8.async is async SGD via caffe_async_share. It's fair to say n8.async is only 2x faster than n1, but it's also fair to say n8.async at iteration 20k is better than n1 at iteration 50k, so I guess the speedup is better than 2x, though indeed far from 8x.

Note that all iterations are counted on the worker's side, and n8.sync is much slower at the beginning because the server actually updates less frequently. But it catches up with n1, probably because of the effectively larger batch, so it's not easy to say n8.sync is actually slower than n1...

  • It's a data-parallel setup, so data is partitioned evenly among the GPU devices.
  • Sorry, I haven't looked into cxxnet carefully. According to your description, cxxnet pushes the diff after every iteration and pulls weights before every iteration, with push/pull/forwardBackward pipelined, similar to the 'weird trick' by Alex, is that right? It's a good idea indeed. Aside from the pipelining, it's equivalent to PULLSTEP=1 PUSHSTEP=1 in my implementation, and IMHO that's better for convergence than larger pull/push steps. My problem is that I cannot afford to push/pull in every worker iteration; it would hit the cap of my gigabit Ethernet. So my answer to this problem is a larger push/pull step, and letting momentum fight against the lag of diff/weight updates. As for the pipelining part for concurrency: since the pipeline is most effective only when pull/push step=1, my approach is simply a double buffer, one for the pusher and one for accumulating diffs from the forward/backward threads, swapped at times (see the sketch after this list).
  • So ps-lite will be a replacement for this project? Also, should I use this project as a template and fork my own version, instead of using it as a library? Or will mshadow-ps be refactored out as a library?
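
The double buffer mentioned in the cxxnet bullet could look roughly like this; the names are placeholders and this is a sketch of the idea, not the PR's code.

```cpp
// Rough sketch of the double-buffer scheme: Forward/Backward threads
// accumulate diffs into one buffer while the pusher ships the other; the
// buffers are swapped every PUSHSTEP iterations so neither side blocks
// the other for long.
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

std::mutex mu;
std::vector<float> accumulating;  // written by the Forward/Backward threads
std::vector<float> pushing;       // read by the pusher thread

void AccumulateDiff(const std::vector<float>& grad) {
  std::lock_guard<std::mutex> lk(mu);
  if (accumulating.size() < grad.size()) accumulating.resize(grad.size(), 0.f);
  for (std::size_t i = 0; i < grad.size(); ++i) accumulating[i] += grad[i];
}

// Called every PUSHSTEP worker iterations: hand the accumulated diff over to
// the pusher and continue accumulating into the (now cleared) other buffer.
void SwapForPush() {
  std::lock_guard<std::mutex> lk(mu);
  std::swap(accumulating, pushing);
  accumulating.assign(accumulating.size(), 0.f);
}
```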

Thanks,

@tqchen
Member

tqchen commented Apr 28, 2015

@immars mshadow-ps is a library that implements async copying and communication for GPU threads.

  • so you can view it as a GPU-thread-based PS library
  • the distributed backend is backed by PS

Using mshadow-ps might help you handle some of the communication/computation overlap, and it unifies the multi-card implementation with the multi-node code in one version.

You can find a description here.

https://github.com/dmlc/mshadow/blob/master/guide/neuralnet/nnet_ps.cu is an example of implementing a simple net on mshadow-ps.

@mli
Member

mli commented Apr 29, 2015

I'll be more focused on ps-lite (basically I'll split parameter_server into two repos: one is a pure interface, and the apps move into another repo). But the kv-layer class which mshadow uses doesn't change between ps-lite and the master branch of parameter_server, so it is safe to use it.
