
Including Spark DBSCAN in your application


You can use Spark DBSCAN in your own application. To do that, please add the following dependency information to your build.sbt file:

libraryDependencies += "org.alitouka" % "spark_dbscan_2.10" % "0.0.4"

resolvers += "Aliaksei Litouka's repository" at "http://alitouka-public.s3-website-us-east-1.amazonaws.com/"
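For reference, a minimal build.sbt might look like the sketch below. The project name and the Scala and Spark versions are assumptions; adjust them to match your environment (the spark_dbscan_2.10 artifact requires Scala 2.10):

name := "my-app"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.1.0" % "provided"

libraryDependencies += "org.alitouka" % "spark_dbscan_2.10" % "0.0.4"

resolvers += "Aliaksei Litouka's repository" at "http://alitouka-public.s3-website-us-east-1.amazonaws.com/"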

You can now use it in your code. First, create a Spark context:

val sc = new SparkContext("spark://master:7077", "My App")
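The examples below assume the following imports. The org.alitouka package paths follow the library's source layout; treat them as assumptions and adjust if they differ in your version:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // needed for pair-RDD operations such as reduceByKey on older Spark versions
import org.alitouka.spark.dbscan._
import org.alitouka.spark.dbscan.util.io.IOHelper
import org.alitouka.spark.dbscan.spatial.Point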

Read your input data with the IOHelper class. Currently, only CSV files are supported. You can read data from any path that the SparkContext.textFile method accepts.

val data = IOHelper.readDataset(sc, "/path/to/my/data.csv")
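Each line of the input file is assumed to hold the coordinates of a single point, separated by commas, with no header row. For example, a two-dimensional dataset might look like this (the values are made up for illustration):

10.5,12.3
11.0,12.1
85.2,90.7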

Specify the parameters of the DBSCAN algorithm using the DbscanSettings class:

val clusteringSettings = new DbscanSettings().withEpsilon(25).withNumberOfPoints(30)
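Here epsilon is the neighborhood radius, and the number of points is the minimum neighborhood size required for a point to be treated as a core point (minPts in the original DBSCAN paper). The settings object also lets you control how border points are handled; the sketch below uses a flag I believe exists in DbscanSettings, but treat it as an assumption for version 0.0.4:

val strictSettings = new DbscanSettings()
  .withEpsilon(25)
  .withNumberOfPoints(30)
  .withTreatBorderPointsAsNoise(true)  // count border points as noise instead of assigning them to a cluster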

Run the clustering algorithm:

val model = Dbscan.train(data, clusteringSettings)

Save the clustering result. This call creates a folder containing multiple part-XXXXX files. If you concatenate these files, you get a CSV file in which each record contains the coordinates of one point followed by the identifier of the cluster that point belongs to. For noise points, the cluster identifier is 0. Note that the order of records in the resulting CSV file will differ from the order in your input file. You can save the data to any path that the RDD.saveAsTextFile method accepts.

IOHelper.saveClusteringResult(model, "/path/to/output/folder")
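If you want to inspect the result without concatenating the part files by hand, you can read the output folder back as text and, for example, count the size of each cluster. This is a sketch built on standard Spark operations; the column layout (coordinates first, cluster identifier last) follows the description above:

val clusterSizes = sc.textFile("/path/to/output/folder")
  .map(line => line.split(",").last.toLong)  // the cluster identifier is the last field
  .map(clusterId => (clusterId, 1L))
  .reduceByKey(_ + _)

clusterSizes.collect().foreach { case (clusterId, size) =>
  println(s"Cluster $clusterId contains $size points")  // cluster 0 is noise
}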

Predict the cluster for a new point:

val predictedClusterId = model.predict(new Point(100, 100))
println(predictedClusterId)
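predict classifies one point at a time, so to handle a batch of new points you can simply iterate over them. A minimal sketch, assuming the model was trained as above:

val newPoints = Seq(new Point(100, 100), new Point(10, 12))
newPoints.foreach { p =>
  val clusterId = model.predict(p)
  println(s"$p -> cluster $clusterId")  // an identifier of 0 denotes noise, as described above
}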