Using Spark DBSCAN as a standalone application
This page describes how to submit Spark DBSCAN to a Spark cluster. To do that, you need an assembly JAR that contains Spark DBSCAN and all its dependencies. You can download it here. Please make sure that you are familiar with the application submission process described here.
The class that runs the clustering algorithm is named org.alitouka.spark.dbscan.DbscanDriver. Specify this name when you submit the application to Spark. The application creates its own Spark context, so you also have to pass it a master URL and a path to the assembly JAR. These parameters appear twice in the command line below because both the submission program and the driver program require them. The following parameters are also required:
- --ds-input - path to the input data
- --ds-output - path where clustering results will be stored
- --eps - value of the epsilon parameter
- --numPts - value of the minPts parameter
The resulting command line may look like this:
./bin/spark-submit \
--class org.alitouka.spark.dbscan.DbscanDriver \
--master spark://your.spark.master:7077 \
--deploy-mode cluster \
hdfs://your.hdfs:9000/path/to/dbscan_assembly.jar \
--ds-master spark://your.spark.master:7077 \
--ds-jar hdfs://your.hdfs:9000/path/to/dbscan_assembly.jar \
--ds-input hdfs://your.hdfs:9000/path/to/your/data.csv \
--ds-output hdfs://your.hdfs:9000/path/to/output/folder \
--eps 25 \
--numPts 30
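Because the master URL and the JAR path each appear twice, a small wrapper can keep them in one place. The following is a minimal sketch, not part of Spark DBSCAN: the function name run_dbscan and the variables SPARK_SUBMIT, MASTER and JAR are hypothetical, and the URLs are the placeholders from the command above.

```shell
# Hypothetical wrapper; none of these names come from Spark DBSCAN itself.
# SPARK_SUBMIT can be overridden, e.g. SPARK_SUBMIT=echo for a dry run.
SPARK_SUBMIT="${SPARK_SUBMIT:-./bin/spark-submit}"

# Placeholder values; replace them with your own master URL and JAR location.
MASTER="spark://your.spark.master:7077"
JAR="hdfs://your.hdfs:9000/path/to/dbscan_assembly.jar"

# Submits the clustering job.
# $1 = input path, $2 = output path, $3 = eps, $4 = numPts
run_dbscan() {
  "$SPARK_SUBMIT" \
    --class org.alitouka.spark.dbscan.DbscanDriver \
    --master "$MASTER" \
    --deploy-mode cluster \
    "$JAR" \
    --ds-master "$MASTER" \
    --ds-jar "$JAR" \
    --ds-input "$1" \
    --ds-output "$2" \
    --eps "$3" \
    --numPts "$4"
}
```

Calling run_dbscan with the input path, the output path, 25 and 30 passes the same arguments as the command above. Setting SPARK_SUBMIT=echo beforehand prints the arguments instead of submitting the job.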
The following parameters are optional:
- --npp - the approximate number of points in each partition of the data set. This value is used by the density-based partitioning algorithm, which splits your data set into parts of the specified size to speed up further processing. The default value is 50,000;
- --distanceMeasure - the fully qualified name of a class that implements the org.apache.commons.math3.ml.distance.DistanceMeasure interface. Currently, only Euclidean and Manhattan distances are supported.
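As an illustration, the optional parameters are appended after the required ones. The sketch below only prints the extra arguments. The function name dbscan_optional_args is hypothetical and the value 100000 is arbitrary. org.apache.commons.math3.ml.distance.ManhattanDistance is the Commons Math implementation of Manhattan distance, and it is an assumption that Spark DBSCAN expects exactly this class name.

```shell
# Hypothetical helper: prints the optional part of the command line.
# $1 = approximate number of points per partition (--npp)
# $2 = fully qualified distance measure class name (--distanceMeasure)
dbscan_optional_args() {
  echo "--npp $1 --distanceMeasure $2"
}

# Illustrative values only; the class name is an assumption (see above).
dbscan_optional_args 100000 org.apache.commons.math3.ml.distance.ManhattanDistance
```

The printed text can be added to the end of the spark-submit command, after --numPts.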
The class responsible for calculating the distance from each point to its nearest neighbor is named org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver. You have to pass it a master URL, a path to the assembly JAR, an input path and an output path. The resulting command line may look like this:
./bin/spark-submit \
--class org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver \
--master spark://your.spark.master:7077 \
--deploy-mode cluster \
hdfs://your.hdfs:9000/path/to/dbscan_assembly.jar \
--ds-master spark://your.spark.master:7077 \
--ds-jar hdfs://your.hdfs:9000/path/to/dbscan_assembly.jar \
--ds-input hdfs://your.hdfs:9000/path/to/your/data.csv \
--ds-output hdfs://your.hdfs:9000/path/to/output/folder
This program produces a histogram. You can set the number of buckets in the histogram with the optional --numBuckets parameter, whose default value is 16. You can also specify the --npp and --distanceMeasure parameters described above.
Counting the neighbors of each point within a given distance is performed by the org.alitouka.spark.dbscan.exploratoryAnalysis.NumberOfPointsWithinDistanceDriver class. You have to pass it a master URL, a path to the assembly JAR, an input path, an output path and the distance within which it should count neighbors of each point. The resulting command line may look like this:
./bin/spark-submit \
--class org.alitouka.spark.dbscan.exploratoryAnalysis.NumberOfPointsWithinDistanceDriver \
--master spark://your.spark.master:7077 \
--deploy-mode cluster \
hdfs://your.hdfs:9000/path/to/dbscan_assembly.jar \
--ds-master spark://your.spark.master:7077 \
--ds-jar hdfs://your.hdfs:9000/path/to/dbscan_assembly.jar \
--ds-input hdfs://your.hdfs:9000/path/to/your/data.csv \
--ds-output hdfs://your.hdfs:9000/path/to/output/folder \
--eps 25
This program also accepts the optional --numBuckets, --npp and --distanceMeasure parameters described above.
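The two exploratory analysis drivers differ only in the class name and in whether --eps is passed, so one wrapper can cover both. This is a minimal sketch under stated assumptions: run_exploratory, SPARK_SUBMIT, MASTER, JAR and PKG are hypothetical names, and the URLs are the placeholders used throughout this page.

```shell
# Hypothetical wrapper; none of these names come from Spark DBSCAN itself.
# SPARK_SUBMIT can be overridden, e.g. SPARK_SUBMIT=echo for a dry run.
SPARK_SUBMIT="${SPARK_SUBMIT:-./bin/spark-submit}"

# Placeholder values; replace them with your own master URL and JAR location.
MASTER="spark://your.spark.master:7077"
JAR="hdfs://your.hdfs:9000/path/to/dbscan_assembly.jar"
PKG="org.alitouka.spark.dbscan.exploratoryAnalysis"

# Submits one of the exploratory analysis drivers.
# $1 = driver class name inside $PKG, $2 = input path, $3 = output path
# Any remaining arguments (--eps, --numBuckets, ...) are passed through as-is.
run_exploratory() {
  driver="$1"
  input="$2"
  output="$3"
  shift 3
  "$SPARK_SUBMIT" \
    --class "$PKG.$driver" \
    --master "$MASTER" \
    --deploy-mode cluster \
    "$JAR" \
    --ds-master "$MASTER" \
    --ds-jar "$JAR" \
    --ds-input "$input" \
    --ds-output "$output" \
    "$@"
}
```

For example, passing DistanceToNearestNeighborDriver followed by the two paths and --numBuckets 32 requests a 32-bucket histogram, while passing NumberOfPointsWithinDistanceDriver followed by the two paths and --eps 25 counts neighbors within a distance of 25.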