An easy tool to parallelize and distribute your Python pandas DataFrame operations across a cluster or your personal machines. Although, in theory, you can distribute arbitrary Python functions with Pandas Farm, it was built and tested to work with pandas DataFrames.
To quickly get started with Pandas Farm you need 3 instances running:
- Master: to manage, schedule, and relay the data
- Slave(s): to compute the functions
- Driver: from which you submit code to the Pandas Farm cluster
In order to use Pandas Farm you need a network-accessible master running. If you have Docker installed on the master machine, just run:
```bash
docker run -p 5555:5555 medo/farm-master
```
Now the master is running and you can schedule operations. To process them you need at least one slave running. On your slave machine, run:
```bash
docker run -e "CL_MASTER_HOST=<MASTER_IP>" -e "CL_MASTER_PORT=5555" medo/farm-slave
```
Now you are ready to play with Pandas Farm. All you need to do is create a function that takes a DataFrame and returns a DataFrame. Here we load the iris dataset (via seaborn) and add a computed column:
```python
import seaborn as sns

# Load the iris dataset as a pandas DataFrame
iris = sns.load_dataset('iris')

def area(df):
    df['sepal_area'] = df['sepal_width'] * df['sepal_length']
    return df
```
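Before shipping the function to the cluster, you can sanity-check it locally on a small slice of the data (a quick sketch, nothing Pandas Farm specific):

```python
# Run the function locally on the first few rows to verify it behaves as expected
sample = area(iris.head().copy())
print(sample[['sepal_length', 'sepal_width', 'sepal_area']])
```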
Let's try our function on the iris dataset:
```python
from pandafarm.driver import PandaFarm

pf = PandaFarm('<MASTER_HOST>')
job = pf.parallelize(iris, area, 10)
```
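Conceptually, `parallelize` splits the DataFrame into partitions, ships each partition to a slave, and applies the function there. A rough local equivalent, assuming the third argument is the number of partitions:

```python
import numpy as np

# Local sketch of parallelize(iris, area, 10): split into 10 chunks and apply
# `area` to each; on the cluster, each chunk is processed by a slave instead.
partitions = np.array_split(iris, 10)
results = [area(chunk) for chunk in partitions]
```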
You can check the progress of the operation:

```python
print("Progress = %d / 100" % pf.progress(job))
```
To get the result of the operation:

```python
result = pf.collect(job)
```
The result we get here is a single DataFrame. Pandas Farm runs a merge function on the partitions to reduce them into a single result; by default this function is `pd.concat`. You can get the raw partition results instead, or pass a different merge function:
```python
pf.parallelize(iris, apply=area, merge=None)
```
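For example, a custom merge that concatenates the partitions with a continuous index (a sketch, assuming the merge function receives the list of partition DataFrames):

```python
import pandas as pd

# Hypothetical custom merge: reduce the list of partition DataFrames
# into a single DataFrame with a fresh, continuous index
def merge_reset(parts):
    return pd.concat(parts, ignore_index=True)

job = pf.parallelize(iris, apply=area, merge=merge_reset)
result = pf.collect(job)
```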
In order to install libraries on your slaves, you will need to create your own Docker image with the dependencies installed, push it to a registry, and then run that image on the slaves.
Create a Dockerfile:

```dockerfile
FROM medo/farm-slave
MAINTAINER <[email protected]>
RUN pip3 install nltk
```
Now build the image:

```bash
docker build -t <image_name> .
```
Push it to the registry:

```bash
docker push <image_name>
```
Now run your image on the slaves:

```bash
docker run -e "CL_MASTER_HOST=<MASTER_IP>" -e "CL_MASTER_PORT=5555" <image_name>
```
Watchtower is a Docker image that automatically updates your running containers; check this post: http://www.ecliptik.com/Automating-Container-Updates-With-Watchtower/
All you need to do is run the Watchtower container on the slaves:

```bash
docker run -d --name watchtower -v /var/run/docker.sock:/var/run/docker.sock centurylink/watchtower --interval 10 <image_name>
```
If you don't specify <image_name>, all the containers on the slave machine will be included in the update.
Now you have everything set up. To install new dependencies for your script, install them in your Docker image and push it under the same name <image_name>; Watchtower will roll the update out to your slaves.
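Once the dependency is baked into the slave image, your distributed functions can use it like any other library. A sketch using nltk from the Dockerfile above, where `docs` and its `text` column are hypothetical:

```python
import nltk

def tokenize(df):
    # nltk is available because it was installed in the slave image;
    # word_tokenize may additionally require nltk.download('punkt') in the image
    df['tokens'] = df['text'].apply(nltk.word_tokenize)
    return df

# docs: a DataFrame with a 'text' column (illustrative)
job = pf.parallelize(docs, tokenize, 10)
```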
I built this tool for cost-efficient split/merge workloads on a small cluster, typically machines that you don't have full access to. You can also use it to parallelize your existing pandas DataFrame transformation code on an AWS spot-instance cluster, for example.

TODO...
- Use Docker-in-Docker for running jobs
- Shadow Master for failure recovery
- Redis store for intermediate job execution
- Security Tokens
- Efficient scheduler based on resource costs