Skip to content

mniehoff/spark-cassandra-playground

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

20 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

spark-cassandra-playground

Playground for Spark anlytics on data stored in a cassandra database. Used for demo purposes.

Data Used

The project uses the data of the movielens project, a non commercial movie recommendations site

You can find different datasets of the project here: http://grouplens.org/datasets/movielens/

For this project I used the latest dataset (as of April 2015): http://files.grouplens.org/datasets/movielens/ml-latest.zip

Set Up

Cassandra

Install Cassandra as you like

  • with as much nodes as you want
  • optional with the opscenter
  • maybe you don't want to use vnodes, as these slow down performance together with spark

Possiblities to install:

Example with the CCM:

  • Creates Cluster "moviedb" with 3 Nodes, using DSE v4.6.5, Opscenter in Version 5.1.1, starts the cluster

      ccm create --dse -v 4.6.5 --dse-username=<<dseUserName>> --dse-password=<<dsePassword>> -o 5.1.1 -n 3 moviedb -s
    
  • Creates Cluster "moviedb" with 3 Nodes, using Cassandra 2.1.5, starts the Cluster

      ccm create moviedb -v binary:2.1.6 -n 3 -s
    

On a Mac you first have to alias the loopback adapter:

sudo ifconfig lo0 alias 127.0.0.2
sudo ifconfig lo0 alias 127.0.0.3

Spark

Just grap the "Prebuild for Hadoop 2.6" tarball from https://spark.apache.org/downloads.html, untar it and you are done

Spark Cassandra Connector

The Spark Cassandra Connector is written by Datastax and is found here at github. Please beware of the version compatibility. At the time of writing the master is at the 1.3-snapshot which is working fine with Spark 1.3.1 core, but not with Spark SQL 1.3.1. If you want to use all features you probably want to use Spark 1.2.x with the connector 1.2.x.

  1. Clone Git Repo from https://github.com/datastax/spark-cassandra-connector, Change into dir
  2. Checkout the Branch you want use (e.g. b1.2)
  3. Build Assembly Jar with ./sbt/sbt assembly

Copy the assembly jar found under spark-cassandra-connector-java/target/scala-2.10 to a place you will find it easily.

Data & Datastructures

Start a CQL Sessions from the repository root, e.g with ccm node1 cqlsh

Cassandra Keyspaces & Tables

Create the Cassandra Keyspace and Tables using the provided schema.cql

SOURCE './schema.cql' USE movie

Load the data

  • Load the latest dataset from the link above and copy the expanded file to /data-original

  • For the movies DON'T use the Data from the project, but the changed dataset provided in this repo: data-changed/movies.zip I added some years and fixed some formatting issues in this file. Just unzip the file.

  • Import the data using the COPY Command. The ratings might need up to 2 hrs for import.

      COPY movies_raw (movieid, title, genres) FROM 'data-changed/movies.csv' with header= true;
      COPY tags_by_user FROM 'data-original/tags.csv' WITH header=true;
      COPY ratings_by_user FROM 'data-original/ratings.csv' with header=true; 
    

Now all the data you need is inside your cassandra cluster and you are ready to go!

Spark

To start the spark shell switch into the spark install dir and run

./bin/spark-shell --jars ~/path/to/jar/spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar --conf spark.cassandra.connection.host=localhost --driver-memory 3g

We need 3g of memory for caching large ratings rdds. If your Cassandra cluster is not running in localhost change the parameter accordingly.

Then import the Cassandra Spark Classes:

```import com.datastax.spark.connector., org.apache.spark.SparkConf, org.apache.spark.SparkContext, org.apache.spark.SparkContext.````

For some examples to use Spark on this data refer to the docs folder.

TODO

  • More automation

About

See README.md for details

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages