
Pandas Farm

An easy tool to parallelize and distribute your Python pandas dataframe operations across a cluster or your personal machines. Although, in theory, you can distribute arbitrary Python functions with Pandas Farm, it was built and tested to work with pandas DataFrames.

Getting Started

To quickly get started with Pandas Farm you need three instances running:

  • Master: to manage, schedule, and relay the data
  • Slave(s): to compute the functions
  • Driver: from which you submit code to the Pandas Farm cluster

Master

In order to use Pandas Farm you need a network-accessible master running. If you have Docker installed on the master machine, just run

docker run -p 5555:5555 medo/farm-master
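
Before starting any slaves, you can verify from another machine that the master is reachable; a minimal sketch, assuming the default port 5555 (replace <MASTER_IP> with your master's address):

import socket

# confirm the master is accepting connections on its port
with socket.create_connection(('<MASTER_IP>', 5555), timeout=5):
    print('master is reachable')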

Slave

Now that the master is running, you can schedule operations. To actually process them you need at least one slave running. On your slave machine, run

docker run -e "CL_MASTER_HOST=<MASTER_IP>" -e "CL_MASTER_PORT=5555"  medo/farm-slave

Driver

Now you are ready to play with Pandas Farm. All you need to do is create a function that takes a dataframe and returns a dataframe.

import pandas as pd

# pandas.rpy was removed from recent pandas releases; loading iris from the
# Rdatasets mirror keeps the R-style column names used below
iris = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv', index_col=0)

def area(df):
  df['Sepal.Area'] = df['Sepal.Width'] * df['Sepal.Length']
  return df
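
Since the function is plain pandas, you can sanity-check it locally on a few rows before distributing it:

# try the function on a small copy first; .copy() avoids pandas'
# SettingWithCopyWarning on the sliced dataframe
print(area(iris.head().copy()))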

Let's try our function on the iris dataset.

from pandafarm.driver import PandaFarm

pf = PandaFarm('<MASTER_HOST>')

job = pf.parallelize(iris, area, 10)

You can check the progress of the operation

print("Progress = %d / 100" % pf.progress(job))

To get the result of the operation

result = pf.collect(job)

The result we get here is a single dataframe. However, Pandas Farm runs a merge function on the partitions to reduce them into a single result; by default the function is pd.concat. You can get the raw result of the partitions or pass a different merge function

pf.parallelize(iris, apply=area, merge=None)
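
You can also post-process the combined result with your own merge function; a minimal sketch, assuming merge receives the list of partition dataframes:

import pandas as pd

# hypothetical custom merge: concatenate the partitions, then restore
# the original row order of the input dataframe
def merge_sorted(partitions):
    return pd.concat(partitions).sort_index()

job = pf.parallelize(iris, apply=area, merge=merge_sorted)
result = pf.collect(job)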

Manage Dependencies

To install libraries on your slaves, you will need to create your own Docker image, push it to a registry, and then run your image with the dependencies installed.

Run inside your own Container

Create a Dockerfile

FROM medo/farm-slave

MAINTAINER <[email protected]>

RUN pip3 install nltk

Now build the image

docker build -t <image_name> . 

Push it to the registry

docker push <image_name>

Run the image on the slaves

Now you need to run your image on the slaves

docker run -e "CL_MASTER_HOST=<MASTER_IP>" -e "CL_MASTER_PORT=5555"  <image_name>
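
With the custom image running on the slaves, your apply functions can import the installed library. A sketch with a hypothetical tokenize function, assuming a dataframe named documents with a 'text' column (and that the image also downloads the nltk tokenizer data, e.g. via nltk.download('punkt')):

import nltk

# hypothetical apply function using the nltk dependency baked into
# the custom slave image
def tokenize(df):
    df['tokens'] = df['text'].apply(nltk.word_tokenize)
    return df

job = pf.parallelize(documents, apply=tokenize)
result = pf.collect(job)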

Run watchtower

Watchtower is a Docker image that enables you to automatically update your containers. Check this post: http://www.ecliptik.com/Automating-Container-Updates-With-Watchtower/

All you need to do is to run watchtower container on the slaves

docker run -d  --name watchtower  -v /var/run/docker.sock:/var/run/docker.sock centurylink/watchtower --interval 10 <image_name>

If you don't specify <image_name>, then all the containers on the slave machine will be included in the update.

Update your Image

Now you have everything set up. All you need to do to install new dependencies for your script is add them to your Dockerfile, rebuild the image, and push it under the same name <image_name>. Watchtower will then pull the updated image on each slave.

Provision slaves with AWS Spot Instances

TODO...

Architecture

[Architecture diagram]

Why I built this

I built this tool for cost-efficient split/merge workloads on a small cluster, typically machines that you don't have full access to. You can also use it to parallelize your existing pandas dataframe transformation code, for example on an AWS spot-instance cluster.

Future Work

  • Use Docker-In-Docker for running jobs
  • Shadow Master for failure recovery
  • Redis store for intermediate jobs execution
  • Security Tokens
  • Efficient Scheduler based on resource costs