Data formats
Spark DBSCAN reads and writes data in CSV format. Each row of an input CSV file should contain the coordinates of one point. The input file must not include a header row. Example:
```
178.505256857926,13.0232102015073
192.647916294684,-54.9397894514337
221.447056375756,54.7299660721257
27.3622293956578,-1401.0191693902
88.692161857309,59.1680534628075
-10.9376996678452,57.462273892661
1125.10539018549,-99.3853457272053
52.7498085418551,-9.04225181250282
57.0464906188891,63.3980855359247
9.09417446372015,-140.061523108837
```
This example consists of 2-dimensional points, but Spark DBSCAN is not limited to 2 dimensions. You can use as many dimensions as you need; however, every point in the dataset must have the same number of coordinates.
A larger example (1 million points) is available here.
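For illustration, a file in this format can be loaded with the IOHelper class mentioned at the end of this page. This is a minimal sketch: the IOHelper import path is an assumption and may differ between versions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.alitouka.spark.dbscan.util.io.IOHelper // package path assumed; check your version

val sc = new SparkContext(new SparkConf().setAppName("Spark DBSCAN example"))

// Reads the headerless CSV shown above; each line becomes one point,
// one coordinate per column
val data = IOHelper.readDataset(sc, "/path/to/my/data.csv")
```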
Output data is also stored in CSV format. Each row of the output file contains the coordinates of a point followed by a cluster ID. The cluster ID of noise points is always 0. Note that the order of points in the output file differs from their order in the input file. Example:
```
-10.9376996678452,57.462273892661,3181710
9.09417446372015,-140.061523108837,3181710
178.505256857926,13.0232102015073,3181710
192.647916294684,-54.9397894514337,3181710
221.447056375756,54.7299660721257,3181710
27.3622293956578,-1401.0191693902,0
88.692161857309,59.1680534628075,3181710
1125.10539018549,-99.3853457272053,0
52.7498085418551,-9.04225181250282,3181710
57.0464906188891,63.3980855359247,3181710
```
This example contains 2 noise points and 8 points assigned to a cluster with ID 3181710.
A larger example (1 million clustered points) is available here.
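As an illustration, the clustering result can be post-processed with plain Spark. The sketch below (which assumes the SparkContext sc from the earlier snippet and a placeholder output path) counts the points in each cluster and separates the noise:

```scala
// Extract the cluster ID (the last column) from each output row
// and count how many points fall into each cluster
val clusterSizes = sc.textFile("/path/to/output/folder")
  .map(line => (line.split(",").last.toLong, 1L))
  .reduceByKey(_ + _)
  .collect()

// Cluster ID 0 denotes noise, so separate it from the real clusters
val (noise, clusters) = clusterSizes.partition { case (id, _) => id == 0L }
```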
If your data is stored in a different format, you have to create an application which implements the clustering workflow. You also have to implement 2 functions:
- A function which reads a dataset in your format and returns an RDD[Point]
- A function which saves a DbscanModel object
Use your custom functions instead of the IOHelper class, i.e. instead of these calls:
```scala
val data = IOHelper.readDataset(sc, "/path/to/my/data.csv")
IOHelper.saveClusteringResult(model, "/path/to/output/folder")
```
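A sketch of two such functions is shown below. Everything in it is hypothetical: the function names, the ';'-separated input format, the assumption that Point can be constructed from an array of coordinates, and the assumption that a DbscanModel behaves as an RDD of points carrying coordinates and a clusterId. Check these details against the version of Spark DBSCAN you use.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.alitouka.spark.dbscan.DbscanModel
import org.alitouka.spark.dbscan.spatial.Point

// Hypothetical reader for a ';'-separated format.
// The Point constructor is assumed to accept an array of coordinates.
def readMyDataset(sc: SparkContext, path: String): RDD[Point] = {
  sc.textFile(path).map { line =>
    new Point(line.split(";").map(_.toDouble))
  }
}

// Hypothetical writer mirroring the CSV output format described above:
// coordinates followed by the cluster ID, one point per line.
// The model is assumed to behave as an RDD of clustered points.
def saveMyModel(model: DbscanModel, path: String): Unit = {
  model
    .map(p => p.coordinates.mkString(",") + "," + p.clusterId)
    .saveAsTextFile(path)
}
```

With these in place, the rest of the workflow stays the same: readMyDataset takes the place of IOHelper.readDataset and saveMyModel takes the place of IOHelper.saveClusteringResult.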