
Data formats


What is supported out of the box

Input

Spark DBSCAN reads and writes data in CSV format. Each row of an input CSV file should contain the coordinates of one point. A header row should not be included in the input file. Example:

178.505256857926,13.0232102015073
192.647916294684,-54.9397894514337
221.447056375756,54.7299660721257
27.3622293956578,-1401.0191693902
88.692161857309,59.1680534628075
-10.9376996678452,57.462273892661
1125.10539018549,-99.3853457272053
52.7498085418551,-9.04225181250282
57.0464906188891,63.3980855359247
9.09417446372015,-140.061523108837

This example consists of 2-dimensional points, but Spark DBSCAN is not limited to 2 dimensions. You can use as many dimensions as you need; however, every point in the dataset must have the same number of coordinates.

A larger example (1 million points) is available here.
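Reading a file in this format takes a single call. A minimal sketch, assuming IOHelper is found at org.alitouka.spark.dbscan.util.io; the application name and file path are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.alitouka.spark.dbscan.util.io.IOHelper

// Create a Spark context and read a headerless CSV of coordinates.
// Each row becomes one point; the result is an RDD[Point].
val sc = new SparkContext(new SparkConf().setAppName("ReadPoints"))
val data = IOHelper.readDataset(sc, "/path/to/my/data.csv")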

Output

Output data is also stored in CSV format. Each row of the output file contains the coordinates of a point followed by a cluster ID. The cluster ID for noise points is always 0. Note that the order of points in the output file differs from the order of points in the input file. Example:

-10.9376996678452,57.462273892661,3181710
9.09417446372015,-140.061523108837,3181710
178.505256857926,13.0232102015073,3181710
192.647916294684,-54.9397894514337,3181710
221.447056375756,54.7299660721257,3181710
27.3622293956578,-1401.0191693902,0
88.692161857309,59.1680534628075,3181710
1125.10539018549,-99.3853457272053,0
52.7498085418551,-9.04225181250282,3181710
57.0464906188891,63.3980855359247,3181710

This example contains two noise points and eight points assigned to a cluster with ID 3181710.

A larger example (1 million clustered points) is available here.
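For context, an output file like the one above is produced by training a model and then saving it. A minimal end-to-end sketch, assuming the Dbscan and DbscanSettings entry points of this library; the epsilon and minimum-number-of-points values are placeholders you should tune for your data:

import org.alitouka.spark.dbscan.{Dbscan, DbscanSettings}
import org.alitouka.spark.dbscan.util.io.IOHelper

// Read the input (sc is an existing SparkContext), configure DBSCAN,
// train, and save. Epsilon 25 and 30 points are placeholder values.
val data = IOHelper.readDataset(sc, "/path/to/my/data.csv")
val clusteringSettings = new DbscanSettings()
  .withEpsilon(25)
  .withNumberOfPoints(30)
val model = Dbscan.train(data, clusteringSettings)

// Writes rows of "coordinate,...,clusterId"; noise points get ID 0.
IOHelper.saveClusteringResult(model, "/path/to/output/folder")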

What if my data format is not supported?

You have to create an application that implements this workflow. You also have to implement two functions:

  1. A function which reads a dataset in your format and returns an RDD[Point]
  2. A function which saves a DbscanModel object

Use your custom functions in place of the IOHelper class, i.e., instead of these calls:

val data = IOHelper.readDataset(sc, "/path/to/my/data.csv")

IOHelper.saveClusteringResult(model, "/path/to/output/folder")
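As an illustration only, below is a hedged sketch of two such replacements for a hypothetical tab-separated format. It assumes that Point can be constructed from an array of Double coordinates, that DbscanModel is (or behaves like) an RDD[Point], and that each clustered Point exposes coordinates and clusterId fields; verify these details against the actual sources before relying on them:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.alitouka.spark.dbscan.DbscanModel
import org.alitouka.spark.dbscan.spatial.Point

// Reads a headerless tab-separated file into an RDD[Point].
// ASSUMPTION: Point accepts an array of Double coordinates.
def readTsvDataset(sc: SparkContext, path: String): RDD[Point] =
  sc.textFile(path).map(line => new Point(line.split("\t").map(_.toDouble)))

// Saves a clustering result in the same tab-separated format.
// ASSUMPTION: the model is a distributed collection of clustered points,
// each carrying its coordinates and an assigned clusterId.
def saveTsvResult(model: DbscanModel, path: String): Unit =
  model
    .map(p => p.coordinates.mkString("\t") + "\t" + p.clusterId)
    .saveAsTextFile(path)

Your application would then call readTsvDataset where it previously called IOHelper.readDataset, and saveTsvResult in place of IOHelper.saveClusteringResult.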