ETL does not parallelize tasks #26

marfago · 2016-12-07T10:50:27Z

Hi,

I'm not sure whether this is a problem with GeoTrellis, Spark, or my configuration.
Starting from geodocker-cluster, and after some upgrades to the current chattademo project (basically a port to Scala 2.11), I was able to run the chatta demo.
What I noticed is that, even with multiple Spark workers, all the tasks are always executed sequentially and not in parallel. Are there any parameters I can tune to increase parallelism?

pomadchin · 2016-12-07T10:55:05Z

Hi @marfago, interesting, how can it be executed sequentially across executors? o: Can you attach your spark-submit command and screenshots of the Spark web UI showing jobs, executors and tasks? Btw, we can discuss it in our Gitter channel :)

marfago · 2016-12-07T11:26:25Z

Hi @pomadchin, please find the PNGs attached.

Let me elaborate a little bit.
I have two physical servers, SN and FN: SN hosts the Spark master, a Spark worker, Accumulo, and the HDFS name node and data node, while FN hosts just a Spark worker. Both nodes are in a Docker network.
I have also slightly changed the demo in order to ingest 10 raster images and the mask.

I would expect the ETL to fully allocate all CPUs, but both servers are mostly idle and, in the Spark UI, I can see only one task at a time running on one of the servers.

Any suggestions?

ingest.txt
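
One thing that may be worth checking here, offered as an assumption and not as a diagnosis of this setup: within a stage Spark runs one task per partition, so the number of tasks running at the same time is bounded both by the number of partitions and by the number of task slots (executor cores divided by `spark.task.cpus`). The sketch below is a toy model in plain Python, not Spark or GeoTrellis code; the function name and the example numbers are hypothetical and only illustrate the arithmetic.

```python
def max_concurrent_tasks(num_partitions, executors, cores_per_executor, cpus_per_task=1):
    """Upper bound on simultaneously running tasks in one Spark stage.

    Toy model: each executor offers (cores_per_executor // cpus_per_task)
    task slots, and a stage has exactly one task per partition.
    """
    slots = executors * (cores_per_executor // cpus_per_task)
    return min(num_partitions, slots)


# Hypothetical numbers: 2 workers with 8 cores each.
# A stage with a single partition can only ever run one task at a time,
# no matter how many cores are available.
print(max_concurrent_tasks(num_partitions=1, executors=2, cores_per_executor=8))

# With more partitions than slots, the slots become the limit instead.
print(max_concurrent_tasks(num_partitions=100, executors=2, cores_per_executor=8))
```

If the Spark UI shows stages with only one or a few tasks, the partition count is the more likely limit under this model; if stages have many tasks but only one runs at a time, the cores granted to the application would be the thing to look at.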