Caffe integration attempt #23

Open · wants to merge 46 commits into master

Conversation


@immars immars commented Apr 27, 2015

Caffe integration attempt with parameter_server.
This pull request is not opened because the code is ready to merge, but to draw attention and ask questions so the implementation can be improved. Thanks.

about algorithm

  • follows Google's NIPS 2012 Downpour SGD approach
  • workers do Forward/Backward only; the server updates the parameters (a minimal sketch of this split follows this list)
  • AdaDelta with momentum SGD was tested and gives good convergence
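
As referenced above, here is a minimal sketch of the worker/server split, assuming plain momentum SGD as the server-side update rule; Server, OnPush, OnPull and WorkerIteration are placeholder names for illustration, not classes from this PR.

```cpp
// Minimal sketch of the worker/server split described above (placeholder
// names, not the PR's actual classes): workers only run Forward/Backward and
// push gradients; the server owns the weights and applies the update rule.
#include <cstddef>
#include <vector>

struct Server {
  std::vector<float> weights, velocity;
  explicit Server(std::size_t n) : weights(n, 0.f), velocity(n, 0.f) {}

  // Invoked when a worker pushes a diff: the weight update happens only here.
  void OnPush(const std::vector<float>& grad, float lr, float momentum) {
    for (std::size_t i = 0; i < weights.size(); ++i) {
      velocity[i] = momentum * velocity[i] - lr * grad[i];
      weights[i] += velocity[i];
    }
  }
  // Invoked when a worker pulls: hand back the current (possibly newer) weights.
  const std::vector<float>& OnPull() const { return weights; }
};

// Worker side: run Forward/Backward on a minibatch, push the diff, refresh
// the local copy of the weights. No weight update happens on the worker.
void WorkerIteration(Server& server, std::vector<float>& local_weights) {
  std::vector<float> grad(local_weights.size(), 0.f);  // Forward/Backward would fill this
  server.OnPush(grad, /*lr=*/0.01f, /*momentum=*/0.9f);
  local_weights = server.OnPull();
}
```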

about implementation

  • uses push/pull as the communication framework
  • src/app/caffe/caffe_main.cc: 1 process per GPU device; computation driven by workers; puller/pusher/solver run in different threads (see the sketch after this list)
  • src/app/caffe/caffe_synced.cc: 1 process per GPU device; computation driven by the server; all workers compute a batch and those batches are accumulated into a larger one, effectively an N-times-larger batch size
  • src/app/caffe/caffe_async_share.cc: 1 process per node, 1 thread per GPU device; weights pulled from the server are shared by the threads; computation driven by workers
  • tested with 2 nodes with 4 GPU devices each; caffe_async_share gets the best overall performance w.r.t. network usage and convergence speed
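
The per-GPU thread layout of caffe_main.cc might look roughly like the following; the names and the mutex-protected buffers are illustrative assumptions, and the real code talks to the parameter server rather than to these stubs.

```cpp
// Schematic of the per-GPU process layout described above (placeholder names):
// the solver thread runs Forward/Backward while separate pusher and puller
// threads exchange diffs/weights with the parameter server.
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

std::mutex mu;
std::vector<float> shared_weights;  // latest weights pulled from the server
std::vector<float> pending_diff;    // gradients accumulated for the next push
std::atomic<bool> running{true};

void SolverThread() {   // Forward/Backward on this process's GPU device
  while (running) {
    std::lock_guard<std::mutex> lk(mu);
    // copy shared_weights into the net, run one iteration,
    // then add the resulting gradient into pending_diff
  }
}
void PusherThread() {   // ships the accumulated diff to the server
  while (running) {
    std::lock_guard<std::mutex> lk(mu);
    // push pending_diff to the server, then clear it
  }
}
void PullerThread() {   // refreshes shared_weights from the server
  while (running) {
    std::lock_guard<std::mutex> lk(mu);
    // overwrite shared_weights with the server's latest copy
  }
}

int main() {
  std::thread solver(SolverThread), pusher(PusherThread), puller(PullerThread);
  running = false;  // a real shutdown would be triggered by the scheduler
  solver.join(); pusher.join(); puller.join();
}
```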

about usage:

  • script/caffe_local.sh for local testing (not with caffe_async_share); script/caffe_lan.sh for launching in a cluster
  • script/caffe_kill_lan.sh for an (ungraceful) shutdown in the cluster.
  • caffe_lan.sh usage: script/caffe_lan.sh {conf_file}. See ./conf/ for an example conf_file setting up workers/servers.

Pending questions about code

  • probably not going to be merged, at least not before the following issues are solved:
  • haven't rebased onto the current master yet
  • Is there a better way of notifying the server of a worker's push and pull, like the vectorChanged/vectorGetting interfaces introduced in my code? The server needs to update the weights after a worker's diff is pushed, and to synchronize the weights from GPU back to host memory just before a worker pulls (see the sketch after this list).
  • Is there a document explaining SharedParameter, its subclasses, channels, and their usage in the cluster? VVector was added for a simpler implementation: all parameters live in one indivisible vector (so only 1 server is supported as a result), but I guess that could be replaced by KVVector to support multiple servers. I just don't know how.
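
To make the notification question concrete, here is an illustrative paraphrase of the kind of hook it describes; the method names echo the vectorChanged/vectorGetting interfaces mentioned above, but the signatures are guesses, not the PR's actual ones.

```cpp
// Illustrative paraphrase of the server-side hooks described above; the
// signatures are assumptions, not the PR's actual vectorChanged/vectorGetting.
#include <vector>

class ServerHooks {
 public:
  virtual ~ServerHooks() = default;
  // Called after a worker has pushed a diff for the vector identified by key:
  // the server applies its update rule (e.g. AdaDelta with momentum) here.
  virtual void vectorChanged(int key, const std::vector<float>& diff) = 0;
  // Called just before a worker pulls, so the server can synchronize the
  // weights from GPU memory back to host memory first.
  virtual void vectorGetting(int key) = 0;
};
```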

@immars
Author

immars commented Apr 27, 2015

should compile against https://github.com/immars/caffe/tree/mydev

@mli
Member

mli commented Apr 27, 2015

Hi Lisen,

Many thanks for your contributions!

I have a few questions about your code (apologies for the stupid questions; I didn't read your code carefully).

  1. It seems to me that you moved caffe's forward/backward logic out so you know when to do a push or a pull. Is it possible to insert several lines of code into caffe to do that job?
  2. Is it a good idea to let the server node just use CPUs, to avoid copying data into GPUs? Since the update is quite cheap, a CPU is about as fast as a GPU for it.
  3. Have you benchmarked the convergence rate in the distributed setting? I heard that DistBelief has serious convergence problems: 10x the machines for only a 2x speedup to reach the same test accuracy.
  4. How do you divide the data? Each worker only needs to process a part of the training data.
  5. Do you have any comments comparing this to cxxnet's strategy? Assume there are 3 layers: 0, 1, 2. In the back-propagation stage: compute the gradient of layer 2, push the gradient to the servers (which requires moving the GPU memory to the CPU), send the pull request for layer 2's weights, then move on to layers 1 and 0. In the next forward stage, the worker first needs to wait for layer 0's weights to be pulled back, and then for layers 1 and 2. Everything (push, pull, memcpy) is asynchronous to hide the communication cost (a sketch of this schedule follows the list).
  6. Don't worry about SharedParameter. The new API will be much simpler, https://github.com/dmlc/ps-lite/blob/master/src/ps.h, and you can also check cxxnet's implementation: https://github.com/dmlc/mshadow/tree/master/guide/mshadow-ps
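
A rough sketch of the layer-wise schedule described in question 5; the Push/PullReq/PullWait stubs stand in for an asynchronous parameter-server API and are not cxxnet's actual calls.

```cpp
// Sketch of the backward/forward schedule from question 5 (stub functions;
// not cxxnet's actual code).
const int kLayers = 3;  // layers 0, 1, 2

void Backward(int layer) { /* compute the gradient of `layer` on the GPU */ }
void Push(int layer)     { /* async: copy the gradient to CPU, send to servers */ }
void PullReq(int layer)  { /* async: request the updated weights of `layer` */ }
void PullWait(int layer) { /* block until the pull for `layer` has finished */ }
void Forward(int layer)  { /* forward pass using the freshly pulled weights */ }

void BackwardPass() {
  // Top layer first: its gradient is ready earliest, so its communication
  // overlaps with the back-propagation of the lower layers.
  for (int layer = kLayers - 1; layer >= 0; --layer) {
    Backward(layer);
    Push(layer);
    PullReq(layer);
  }
}

void ForwardPass() {
  // Layer 0's weights are needed first, then layers 1 and 2.
  for (int layer = 0; layer < kLayers; ++layer) {
    PullWait(layer);
    Forward(layer);
  }
}
```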

Best
Mu

@immars
Author

immars commented Apr 28, 2015

Mu,
Thanks for your reply,

  • in my caffe branch, an iteration for one minibatch is split into several phases, so that a) different phases can be invoked by the worker and the server, and b) the worker knows the appropriate time to pull/push. It would be possible to add hooks on the caffe side to achieve (b) and simplify the parameter_server side, but (a) is still needed, if that's what you mean, or is it?
  • It is a good idea; no impact on server/worker observed, thanks!
  • haven't tested thoroughly, but here's what has been done:

[figure: training loss vs. worker iteration for n1, n8.sync and n8.async]
Only the losses are meaningful because I had modified the accuracy definition. n1 is single-threaded, n8.sync is synchronized SGD via caffe_synced, and n8.async is async SGD via caffe_async_share. It's fair to say n8.async is only 2x faster than n1, but it's also fair to say n8.async at iteration 20k is better than n1 at iteration 50k, so I guess the speedup is better than 2x, though indeed far from 8x.

Note that all iterations are counted on the worker's side, and n8.sync is much slower at the beginning because the server actually updates less frequently. But it catches up with n1, probably because of the effectively larger batch, so it's not easy to say n8.sync is actually slower than n1...

  • It's a data-parallel setup, so data is partitioned evenly among the GPU devices.
  • Sorry, I haven't looked into cxxnet carefully. According to your description, cxxnet pushes the diff after every iteration and pulls weights before every iteration, with push/pull/forwardBackward pipelined, similar to the 'weird trick' by Alex, is that right? It's a good idea indeed. Aside from the pipelining, it's equivalent to PULLSTEP=1 PUSHSTEP=1 in my implementation, and IMHO that's better for convergence than larger pull/push steps. My problem is that I cannot afford to push/pull in every worker iteration; it would hit the cap of my gigabit Ethernet. So my answer to this problem is a larger push/pull step, and letting momentum fight against the lag of diff/weight updates. As for the pipelining part for concurrency: since the pipeline is most effective only when pull/push step=1, my approach is simply a double buffer, one for the pusher and one for accumulating diffs from the forward/backward threads, swapped at times (see the sketch after this list).
  • So ps-lite will be a replacement for this project? Also, should I use this project as a template and fork my own version, instead of using it as a library? Or will mshadow-ps be refactored out as a library?
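
The double buffer mentioned in the cxxnet bullet could look roughly like this; the names are placeholders and this is a sketch of the idea, not the PR's code.

```cpp
// Rough sketch of the double-buffer scheme: Forward/Backward threads
// accumulate diffs into one buffer while the pusher ships the other; the
// buffers are swapped every PUSHSTEP iterations so neither side blocks
// the other for long.
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

std::mutex mu;
std::vector<float> accumulating;  // written by the Forward/Backward threads
std::vector<float> pushing;       // read by the pusher thread

void AccumulateDiff(const std::vector<float>& grad) {
  std::lock_guard<std::mutex> lk(mu);
  if (accumulating.size() < grad.size()) accumulating.resize(grad.size(), 0.f);
  for (std::size_t i = 0; i < grad.size(); ++i) accumulating[i] += grad[i];
}

// Called every PUSHSTEP worker iterations: hand the accumulated diff over to
// the pusher and continue accumulating into the (now cleared) other buffer.
void SwapForPush() {
  std::lock_guard<std::mutex> lk(mu);
  std::swap(accumulating, pushing);
  accumulating.assign(accumulating.size(), 0.f);
}
```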

Thanks,

@tqchen
Member

tqchen commented Apr 28, 2015

@immars mshadow-ps is a library that implements async copying and communication for GPU threads.

  • so you can view it as a GPU-thread-based PS library
  • the distributed backend is backed by PS

Using mshadow-ps might help you handle some of the communication/computation overlap, and it unifies the multi-card implementation with the multi-node code in one version.

You can find a description here.

https://github.com/dmlc/mshadow/blob/master/guide/neuralnet/nnet_ps.cu is an example of implementing a simple net on mshadow-ps.

@mli
Member

mli commented Apr 29, 2015

I'll be more focused on ps-lite (basically I'll split parameter_server into two repos: one is a pure interface, and the apps move into another repo). But the kv-layer class which mshadow uses doesn't change between ps-lite and the master branch of parameter_server, so it is safe to use it.
