Note

According to my tests, K-means is only roughly balanced when n_cluster is small. Thus a possible solution is to use hierarchical K-means.

faiss has a good implementation.
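
A minimal sketch of that two-level approach with the faiss Python API; the sizes and the k1 * k2 split below are assumptions for illustration:

```python
import numpy as np
import faiss

d, n = 64, 100_000                     # assumed dimension and dataset size
k1, k2 = 100, 100                      # 100 coarse cells, each split into 100
x = np.random.rand(n, d).astype("float32")

# Level 1: a small n_cluster, which stays roughly balanced.
km1 = faiss.Kmeans(d, k1, niter=25, seed=1234)
km1.train(x)
_, coarse = km1.index.search(x, 1)     # assign each point to its coarse cell

# Level 2: cluster every coarse cell independently.
centroids = []
for c in range(k1):
    sub = x[coarse.ravel() == c]
    if len(sub) < k2:                  # tiny cell: keep its points as centroids
        centroids.append(sub)
        continue
    km2 = faiss.Kmeans(d, k2, niter=25, seed=1234)
    km2.train(sub)
    centroids.append(km2.centroids)
centroids = np.vstack(centroids)       # up to k1 * k2 leaf centroids
```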

Parameters

Progressive Dim Clustering Parameters

  • progressive_dim_steps = 10
  • apply_pca = true
  • niter = 10
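
A hedged sketch of setting these from Python; whether the struct is exposed depends on the faiss build, but the field names match the C++ header:

```python
import faiss

# Assumes the Python wrapper exposes the C++ ProgressiveDimClusteringParameters.
pp = faiss.ProgressiveDimClusteringParameters()
pp.progressive_dim_steps = 10   # number of dimension steps
pp.apply_pca = True             # rotate the data with PCA before slicing dimensions
pp.niter = 10                   # k-means iterations per step (inherited field)
```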

Clustering Parameters

  • niter = 25

  • nredo = 1

  • spherical = false (normalize centroids after each iteration, for the inner-product metric)

    • true: use IVF flat IP
    • false: use IVF flat L2
  • int_centroids = false (round centroids after each iteration)

  • update_index = false (re-train index after each iteration)

  • frozen_centroids = false (when true, use the provided centroids as-is and do not update them)

  • min_points_per_centroid = 39

  • max_points_per_centroid = 256 (subsample if exceeded)

  • seed = 1234

  • decode_block_size = 32768 (batch size of the codec decoder when the training set is encoded)

  • check_input_data_for_NaNs = true

  • use_faster_subsampling = false (splitmix64-based RNG, faster but could pick duplicates)
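
These fields live on ClusteringParameters and can be set through the low-level Clustering object. A sketch with assumed data; note that check_input_data_for_NaNs and use_faster_subsampling only exist in recent faiss versions, so only the long-standing fields are set here:

```python
import numpy as np
import faiss

d, k = 64, 1000                        # assumed dimension and cluster count
x = np.random.rand(50_000, d).astype("float32")

clus = faiss.Clustering(d, k)          # Clustering derives from ClusteringParameters
clus.niter = 25
clus.nredo = 1
clus.spherical = False                 # True would call for an inner-product index
clus.int_centroids = False
clus.min_points_per_centroid = 39      # fewer points per centroid triggers a warning
clus.max_points_per_centroid = 256     # more triggers subsampling of the training set
clus.seed = 1234

index = faiss.IndexFlatL2(d)           # L2 assignment index (IndexFlatIP if spherical)
clus.train(x, index)
centroids = faiss.vector_float_to_array(clus.centroids).reshape(k, d)
```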

Training

Clustering

  • clustering (a numpy sketch of this whole procedure follows the list)
    • assert (num >= k)
    • cast float* to uint8_t* (for convenient row indexing; line_size is sizeof(float) * d)
    • subsample (if num > max_points_per_centroid * k)
      • shuffle and pick the top-(k * max_points_per_centroid)
    • randomly select k points as the initial centroids
    • post process centroids
      • normalize (if spherical)
      • round (if int_centroids)
    • index (IVF flat) add centroids
    • niter
      • search and assign
      • compute centroids
        • new centroids = sum(points) / count(points)
      • split clusters
        • for each empty cluster ci
          • iterate over clusters cj = 0, 1, ...
            • pick cj as the donor with probability (cluster_size(cj) - 1) / (n - k) // here n is the sampled points number
            • copy centroid cj into the empty cluster ci and apply a small symmetric perturbation (EPS = 1 / 1024.0)
              • d_i % 2 == 0: centroid[ci][d_i] *= 1 + EPS, centroid[cj][d_i] *= 1 - EPS
              • d_i % 2 == 1: centroid[ci][d_i] *= 1 - EPS, centroid[cj][d_i] *= 1 + EPS
            • split the assignment counts of the two clusters evenly
      • post process centroids
      • update index
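
The sketch below re-implements this loop in numpy, including the empty-cluster split. It is an illustration of the steps listed above, not faiss's actual C++ code; the init parameter is an addition used by the progressive-dim sketch further down:

```python
import numpy as np

EPS = 1.0 / 1024.0

def kmeans(x, k, niter=25, seed=1234, spherical=False, int_centroids=False,
           init=None):
    rng = np.random.default_rng(seed)
    n, d = x.shape
    assert n >= k
    # randomly select k points as the initial centroids (unless warm-started)
    centroids = x[rng.permutation(n)[:k]].copy() if init is None else init.copy()

    def post_process(c):
        if spherical:       # normalize centroids for the inner-product metric
            c /= np.linalg.norm(c, axis=1, keepdims=True)
        if int_centroids:   # round centroids to integer coordinates
            np.rint(c, out=c)
        return c

    centroids = post_process(centroids)
    for _ in range(niter):
        # search and assign: nearest centroid per point (brute-force L2 here,
        # where faiss uses a flat index)
        dist = ((x ** 2).sum(1)[:, None] - 2.0 * x @ centroids.T
                + (centroids ** 2).sum(1)[None, :])
        assign = dist.argmin(1)
        counts = np.bincount(assign, minlength=k)
        # compute centroids: new centroid = sum(points) / count(points)
        sums = np.zeros((k, d), dtype=x.dtype)
        np.add.at(sums, assign, x)
        nonempty = counts > 0
        centroids[nonempty] = sums[nonempty] / counts[nonempty, None]
        # split clusters: repurpose a populated cluster for every empty one
        for ci in np.flatnonzero(counts == 0):
            cj = 0
            while True:     # pick donor cj with probability (size - 1) / (n - k)
                if rng.random() < (counts[cj] - 1) / (n - k):
                    break
                cj = (cj + 1) % k
            centroids[ci] = centroids[cj]
            # small symmetric perturbation so the two copies drift apart
            centroids[ci, 0::2] *= 1 + EPS
            centroids[ci, 1::2] *= 1 - EPS
            centroids[cj, 0::2] *= 1 - EPS
            centroids[cj, 1::2] *= 1 + EPS
            counts[ci] = counts[cj] // 2    # split the counts evenly
            counts[cj] -= counts[ci]
        centroids = post_process(centroids)
    return centroids
```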

Progressive Dim Clustering

  • train PCA for (n, x) (a sketch of this whole loop follows the list)
  • iteration: di = int(pow(d, (1.0 + i) / progressive_dim_steps))
    • copy the centroids from the previous iteration
    • clustering for (n, xsub(di)), i.e. the first di dimensions of the rotated data
  • revert PCA transform on centroids
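
A sketch of this loop, reusing the kmeans helper from the previous sketch; numpy's SVD stands in for faiss's PCAMatrix, and n >= d is assumed:

```python
import numpy as np

def progressive_dim_kmeans(x, k, steps=10, niter=10, seed=1234):
    n, d = x.shape
    # train PCA for (n, x): SVD of the centered data as a stand-in
    mean = x.mean(0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    xt = (x - mean) @ vt.T                  # rotated training set

    centroids = None
    for i in range(steps):
        di = int(d ** ((1.0 + i) / steps))  # grows from ~d**(1/steps) up to d
        init = None
        if centroids is not None:
            # copy the centroids from the previous iteration,
            # zero-padded to the new dimensionality
            init = np.zeros((k, di), dtype=x.dtype)
            init[:, :centroids.shape[1]] = centroids
        centroids = kmeans(xt[:, :di], k, niter=niter, seed=seed, init=init)
    # revert the PCA transform on the (now full-dimensional) centroids
    return centroids @ vt + mean
```

For d = 64 and steps = 10, di grows roughly as 1, 2, 3, 5, 8, 12, 19, 28, 43, 64, so early iterations are cheap and later ones only refine already-placed centroids.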