Today we will dive into unsupervised learning and experiment with clustering. k-means(++) is a popular choice for clustering data, perhaps even the most widely used clustering algorithm. Therefore, we will explore this algorithm a bit more in the following three tasks. In this exercise, we will continue to use sklearn, which implements a variety of clustering algorithms in the `sklearn.cluster` package.
The goal of this exercise part is to peek under the hood of this algorithm in order to empirically explore its strengths and weaknesses. Initially, we will use synthetic data to develop a basic understanding of the algorithm's performance characteristics.
In the literature on cluster analysis, k-means often refers not only to the clustering algorithm but also to the underlying optimization problem:
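Given a finite set of points $P \subset \mathbb{R}^d$ and a number of clusters $k$, the task is to find a set $C$ of $k$ centers minimizing the cost

$$\mathrm{cost}(P, C) = \sum_{x \in P} \min_{c \in C} \lVert x - c \rVert^2 .$$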
In cases where k-means refers to the problem formulation, the algorithm itself is sometimes called Lloyd's algorithm. The algorithm repeats two steps until it converges:
- Assign each data point to its nearest cluster center based on the squared Euclidean norm.
- Update each center by calculating the mean of the points assigned to its cluster and using it as the new cluster center.
This algorithm will always converge and find a solution. However, there is no guarantee that this solution is the best one (a global optimum of the k-means cost), and the algorithm may converge slowly. Also, the algorithm requires an initial guess for the cluster centers, which is usually obtained by randomly selecting some of the data points. Therefore, it is good practice to run the algorithm several times with different initial center guesses and keep the solution that is hopefully close to the best one.
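To make the two steps concrete, here is a minimal NumPy sketch of one Lloyd iteration (the function and variable names are illustrative and not part of the exercise code):

```python
import numpy as np

def lloyd_step(points: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """One Lloyd iteration: assign points to nearest center, then recompute means."""
    # Step 1: squared Euclidean distance of every point to every center.
    dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    # Step 2: each new center is the mean of the points assigned to it.
    # (Real implementations also handle clusters that become empty.)
    return np.stack([points[labels == i].mean(axis=0) for i in range(len(centers))])
```

Iterating this step until the assignments stop changing yields the full algorithm.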
Fortunately, the implementation of k-means in sklearn takes care of these details and provides a simple interface to control all of these aspects.
Navigate to `src/ex1_kmeans.py`. Implement the first part of the `plot_kmeans_clustering` function as follows:
- Load the input data from the given path. You can now run the file and examine the data.
- k-means clustering is scale sensitive, so we generally need to rescale our input data before performing clustering. Note that our `plot_kmeans_clustering` function has a `standardize` parameter that is set to `False` by default. In case `standardize` is set to `True`, standardize the data according to $x_i = \frac{x_i - \mu}{\sigma}$, where $\mu$ is the sample mean and $\sigma$ is the sample standard deviation. `sklearn.preprocessing` may be helpful; see the sketch below.
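A minimal sketch of the standardization step, assuming the data is a NumPy array (`StandardScaler` implements exactly the per-feature $\frac{x - \mu}{\sigma}$ transform):

```python
from sklearn.preprocessing import StandardScaler

def maybe_standardize(data, standardize=False):
    """Return the data standardized to zero mean and unit variance per feature."""
    if standardize:
        # fit_transform subtracts the per-column mean and divides by the per-column std.
        return StandardScaler().fit_transform(data)
    return data
```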
Now we want to perform k-means clustering. Implement the `perform_kmeans_clustering` function following these steps:
- Use `sklearn.cluster.KMeans` to train on the given data. Set the parameter `init`, which controls the initialization of the cluster centers, to `"random"`. There is a better way to set this value, but we will discuss that in Task 3.
- Retrieve the cluster centers and predict the cluster index for each point.
- Return the inertia as a float, and the cluster centers and the predicted cluster indices as an array each. A sketch of these steps follows after this list.
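A minimal sketch of these steps, assuming `data` is an array of shape `(n_samples, n_features)`; treat it as one possible implementation, not the reference solution (the keyword arguments beyond `init` are illustrative):

```python
from sklearn.cluster import KMeans

def perform_kmeans_clustering(data, k):
    """Cluster the data into k clusters and return inertia, centers, labels."""
    kmeans = KMeans(n_clusters=k, init="random", n_init=10).fit(data)
    centers = kmeans.cluster_centers_   # shape (k, n_features)
    labels = kmeans.predict(data)       # cluster index for each point
    return float(kmeans.inertia_), centers, labels
```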
Go back to the `plot_kmeans_clustering` function and finish the remaining TODOs:
- Call the `perform_kmeans_clustering` function three times. Visualize the data points, cluster centers, and the assignment of data points to cluster centers in a single scatter plot. To do this, use a for-loop and `scatter_clusters_2d` to plot all the results in one plot (have a closer look at matplotlib subplots and axes); a sketch follows after this list.
- Return the figure object.
- Now have a look at the `main` function. The default number $k$ of clusters is not optimal. Experiment with different values and set the number of k-means clusters you want to use.
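One possible structure for this loop, assuming `data`, `k`, `perform_kmeans_clustering`, and `scatter_clusters_2d` from the exercise file are in scope, and that `scatter_clusters_2d` accepts a matplotlib axes object (check its actual signature in the code):

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax in axes:
    # Each run may converge to a different local optimum due to random init.
    inertia, centers, labels = perform_kmeans_clustering(data, k)
    scatter_clusters_2d(data, labels, centers, ax)
    ax.set_title(f"inertia = {inertia:.1f}")
```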
Note: Data preprocessing benefits greatly from expert knowledge of the field/application in which the data was measured. Some preprocessing methods may not be applicable in certain settings.
Recall that k-means assigns a given point to its nearest cluster center. The resulting partition of the plane can be drawn as a Voronoi diagram: each cell in this diagram is the set of points which are closest to one particular center. A Voronoi diagram can therefore be used to visualize the boundaries of the k-means clusters, but it is also useful as a tool to understand the algorithm.
- Navigate to the `plot_decision_boundaries` function and load, preprocess, and cluster the synthetic data using the function `perform_kmeans_clustering` again.
- Use `Voronoi` and `voronoi_plot_2d` from the `scipy.spatial` package to visualize the boundaries of the k-means clusters (see the sketch after this list). Again use the `ax` object of the plot and `scatter_clusters_2d`.
- Test your code with the test framework of VS Code or by typing `nox -r -s test` in your terminal.
- (Optional) Which assumptions/limitations of the k-means algorithm are illustrated by this visualization?
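A minimal sketch of the Voronoi part, assuming `centers` comes from `perform_kmeans_clustering` and `ax` is an existing matplotlib axes:

```python
from scipy.spatial import Voronoi, voronoi_plot_2d

# The cluster centers are the Voronoi sites; each cell is one cluster's region.
vor = Voronoi(centers)
voronoi_plot_2d(vor, ax=ax, show_points=False, show_vertices=False)
```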
A common application of clustering is data compression, where large amounts of data are reduced by extracting and keeping only the "important" aspects (e.g., cluster centers, covariance matrices, and weights). You may want to use this technique if you cannot store or transmit all the measurements. It can also be useful when a tool/function you want to use in your analysis has a runtime that makes it infeasible to apply to thousands or millions of data points. k-means can be applied to large datasets, but the clustering itself may then take a long time. If you want to speed things up, you can consider using `MiniBatchKMeans`, which performs k-means clustering by drawing multiple random subsets (mini-batches) of the entire data.
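As a quick illustration, a minimal sketch of the drop-in replacement (the dataset and parameter values below are made up for demonstration):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(100_000, 2))  # stand-in for a large dataset

# Each fit step uses a random mini-batch instead of the full dataset,
# trading a slightly higher inertia for a much faster runtime.
mbk = MiniBatchKMeans(n_clusters=8, batch_size=1024, n_init=3).fit(data)
print(mbk.inertia_)
```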
In this task, the goal is to reduce the storage requirement of an image with width $w$, height $h$, and three color channels by reducing the number of distinct colors.
- Open the file `src/ex2_image_compression.py`. The image is loaded in the `main` function using the `load_image` function. Inspect the `input_img` variable and print the information about its dimensions.
Implement the `compress_colorspace` function using the k-means algorithm:
- Reshape the input image into $(w \cdot h, 3)$ to perform clustering on colors.
- Use `MiniBatchKMeans` to cluster the image into $k$ clusters.
- Return a compressed image in which the number of unique colors was reduced from $256^3$ to $k$ via k-means clustering. The compressed image must have the same shape as the original one. A sketch of these three steps follows after this list.
- Use `compress_colorspace` in your `main` function to compress the image for $k \in \{2, 8, 64, 256\}$ and plot the results using `imshow`. Set the corresponding value of $k$ as the title of each result.
- Test your code with the test framework of VS Code or by typing `nox -r -s test` in your terminal.
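A minimal sketch of `compress_colorspace`, assuming the image is an `(h, w, 3)` array; treat it as one possible implementation, not the reference solution:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def compress_colorspace(img: np.ndarray, k: int) -> np.ndarray:
    """Quantize the image colors down to the k cluster-center colors."""
    h, w, c = img.shape
    pixels = img.reshape(-1, c).astype(float)  # (w*h, 3): one row per pixel
    kmeans = MiniBatchKMeans(n_clusters=k).fit(pixels)
    # Replace every pixel by the center of its cluster, then restore the shape.
    # Values stay in the original range; clip/cast as needed for imshow.
    compressed = kmeans.cluster_centers_[kmeans.predict(pixels)]
    return compressed.reshape(h, w, c)
```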
As mentioned above, Lloyd's algorithm requires an initial set of centers. Looking more closely at the sklearn documentation, the `init` argument allows us to use either uniformly randomly selected points, a strategy called `"k-means++"`, or a user-defined array as the initial set of centers. The main contribution of k-means++ is a clever strategy for choosing the initial centers:
- Choose a point $x_1 \in P$ uniformly at random, set $C^1 = \{ x_1 \}$.
- for $i = 2$ to $k$ do:
- $\qquad$ Draw a point $x_i \in P$ according to the probability distribution $$\Pr(x_i = x) = \frac{\min_{c \in C^{i-1}} \lVert x - c \rVert^2}{\sum_{y \in P} \min_{c \in C^{i-1}} \lVert y - c \rVert^2}$$
- $\qquad$ Set $C^i = C^{i-1} \cup \{ x_i \}$.
- end for
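A minimal NumPy sketch of the $D^2$ sampling loop above (array conventions are illustrative; `points` has shape `(n, d)`):

```python
import numpy as np

def d2_sampling(points: np.ndarray, k: int, rng=np.random.default_rng()):
    """Pick k initial centers via D^2 sampling."""
    n = len(points)
    centers = [points[rng.integers(n)]]  # first center: uniform at random
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen center.
        d2 = ((points[:, None, :] - np.asarray(centers)[None, :, :]) ** 2).sum(2).min(1)
        probs = d2 / d2.sum()            # the D^2 distribution
        centers.append(points[rng.choice(n, p=probs)])
    return np.asarray(centers)
```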
Navigate into `src/ex3_kmeans_plus_plus.py` and have a look at the code.
- Implement the `uniform_sampling` function by drawing points uniformly from the datasets.
- Implement the `d2_sampling` function using the $D^2$ sampling algorithm described above.
- Compare the results on the two datasets by executing the script with `python ./src/ex3_kmeans_plus_plus.py`. Which advantages does $D^2$ sampling provide as an initialization? Hint: (Weighted) sampling with and without replacement can be performed using `np.random.choice`.
- Test your code with the test framework of VS Code or by typing `nox -r -s test` in your terminal.
Navigate into `src/ex4_gmm.py` and have a look at the code. We create a synthetic dataset with three classes (the same one we used in the lecture) and want to compare k-means clustering and GMMs. Implement the TODOs in the file.
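For orientation, a minimal sketch of fitting a Gaussian mixture model with sklearn next to k-means (the dataset here is a synthetic stand-in, not the one from the exercise):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-in: three Gaussian blobs with different spreads.
data = np.vstack([
    rng.normal([0, 0], 0.5, size=(100, 2)),
    rng.normal([3, 0], 1.0, size=(100, 2)),
    rng.normal([0, 3], 1.5, size=(100, 2)),
])

kmeans_labels = KMeans(n_clusters=3).fit_predict(data)
# GMMs additionally learn per-component covariances and mixture weights,
# so clusters need not be spherical and equally sized.
gmm = GaussianMixture(n_components=3).fit(data)
gmm_labels = gmm.predict(data)
```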