See https://github.com/GregorySchwartz/hierarchical-spectral-clustering for latest version.
hierarchical-spectral-clustering
is a program (cluster-tree
) and library for
hierarchical spectral clustering of sparse and dense matrices. Outputted JSON
trees can be used with birch-beer for pretty trees!
See https://docs.haskellstack.org/en/stable/README/ for more details.
curl -sSL https://get.haskellstack.org/ | sh
stack setup
Probably easiest if you don’t want to mess with dependencies:
git clone https://github.com/GregorySchwartz/hierarchical-spectral-clustering.git
cd hierarchical-spectral-clustering
stack install
We only require stack
(or cabal
), you do not need to download any source
code (but you might need the stack.yaml dependency versions), just run the
following command to place cluster-tree
in your ~/.local/bin/
:
stack install hierarchical-spectral-clustering
If you run into errors like Error: While constructing the build plan, the
following exceptions were encountered:
, then follow it’s advice. Usually you
just need to follow the suggestion and add the dependencies to the specified
file. For a quick yaml
configuration, refer to
https://github.com/GregorySchwartz/hierarchical-spectral-clustering/blob/master/stack.yaml.
Relies on eigen-3.3.4.1
right now.
We first need our data in the format of “vertex,vertex,value”, such that a
vertex “x” connects with vertex “y” with an edge weight of value. This edge list
can be used as input into the program. Make sure there is a newline at the
end, otherwise, the program may expect more from stdin
.
a,b,1 a,c,2 a,d,4.5 b,d,2.1 c,d,1.1 e,f,1
Then we can give this file to the program:
cat input.csv | cluster-tree --clustering-type Dense
| item | cluster | | a | 1 | | b | 1 | | c | 1 | | d | 1 | | e | 2 | | f | 2 |
As a more advanced example, we can get:
- Output clustering
- Output tree structure in
JSON
format. - First isolate each connected component before clustering and generating their own trees.
- Use k-means instead of sign on the eigenvectors to determine cluster assignment.
- Use k-means on two eigenvectors rather than one.
cat advanced_input.csv | cluster-tree --output-tree "output_trees/tree.json" --clustering-type Dense --eigen-group KMeansGroup --num-eigen 2 --separate-components > "clusters.csv"
cluster-tree -h
cluster-tree, Gregory W. Schwartz. Hierarchical spectral clustering of data Computes real symmetric part of matrix, so ensure the input is real and symmetric. Diagonal should be 0s for adjacency matrix. Format is row,column,value with no header. Must end with a newline. Usage: cluster-tree [-c|--clustering-type STRING] [-d|--delimiter CHAR] [-S|--min-size INT] [-n|--num-eigen INT] [-m|--min-modularity DOUBLE] [-e|--eigen-group STRING] [-s|--separate-components] [-o|--output-tree STRING] Available options: -h,--help Show this help text -c,--clustering-type STRING ([Sparse] | Dense) Method for clustering data. -d,--delimiter CHAR ([,] | CHAR) The delimiter of the CSV file. Format is row,column,value with no header. -S,--min-size INT ([Nothing] | INT) Minimum size of a cluster. -n,--num-eigen INT ([1] | INT) Number of eigenvectors to use while clustering with kmeans. Takes from the first eigenvector. Recommended to start at 2 and work up from there if needed. -m,--min-modularity DOUBLE ([0] | DOUBLE) Minimum modularity to be over to continue recursion. -e,--eigen-group STRING ([SignGroup] | KMeansGroup) Whether to group the eigenvector using the sign or kmeans while clustering. While the default is sign, kmeans may be more accurate (but starting points are arbitrary). -s,--separate-components Whether to first separate connected components of the graph first. Will output a dendrogram for each component with the name of the tree and the number of nodes within the tree, along with the base set by --output-tree. -o,--output-tree STRING ([Nothing] | FILE) The name of the file to output the tree in JSON format.