Clustering

KMeans([n_clusters, init, …]) Scalable KMeans for clustering
PartialMiniBatchKMeans(**kwargs) Mini-Batch K-Means clustering
SpectralClustering([n_clusters, …]) Apply parallel Spectral Clustering

The dask_ml.cluster module implements several algorithms for clustering unlabeled data.

Spectral Clustering

Spectral Clustering finds a low-dimensional embedding on the affinity matrix between samples. The embedded dataset is then clustered, typically with KMeans.

Typically, spectral clustering algorithms do not scale well. Computing the \(n\_samples \times n\_samples\) affinity matrix becomes prohibitively expensive when the number of samples is large. Several algorithms have been proposed to work around this limitation.

In dask-ml, we use the Nyström method to approximate the large affinity matrix. This involves sampling n_components rows from the entire training set. The exact affinity is computed for this subset ( \(n\_components \times n\_components\) ), and between this small subset and the rest of the data ( \(n\_components \times (n\_samples - n\_components)\) ). We avoid the direct computation of the rest of the affinity matrix.

Let \(S\) be our \(n \times n\) affinity matrix. We can rewrite that as

\[\begin{split}S_d = \left[ \begin{array}\ A & B \\ B^T & C \\ \end{array} \right]\end{split}\]

Where \(A\) is the \(n \times n\) affinity matrix of the \(n\_components\) that we sampled, and \(B\) is the \(n \times (n - n\_components)\) affinity matrix between the sample and the rest of the dataset. Instead of computing \(C\) directly, we approximate it with \(B^T A^{-1} B\).

See the spectral clustering benchmark for an example showing how dask_ml.cluster.SpectralClustering scales in the number of samples.