k-means: An Introduction to the Simple Yet Powerful Unsupervised Learning Algorithm for Clustering


k-means is an unsupervised learning algorithm that is commonly used for clustering. At first, the term "unsupervised learning" may sound complicated and daunting, as if the algorithm were some kind of magical black box that can learn without any help or guidance. It was daunting to me, to be honest. However, k-means is actually a fairly straightforward algorithm that is easy to understand.

To illustrate the basic idea behind k-means, let's consider a two-dimensional sample set and imagine that we want to divide it into two clusters. One way to do this is to randomly choose two points in the sample set, called centroids, and assign each data point to the cluster whose centroid is closest to it. We can then move each centroid to the centre of its cluster and repeat the process: reassign each data point to the nearest centroid, then recompute the centroids. By iterating these two steps, the assignments eventually stop changing and the sample set separates into two distinct clusters.
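
Here is a minimal sketch of that loop in Python with NumPy. The function name `kmeans` and its parameters are my own choices for illustration, not a reference implementation:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means sketch: X is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 1: assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 2: move each centroid to the mean of the points assigned to it.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # The algorithm has converged once the centroids stop moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Calling `kmeans(X, 2)` on a two-dimensional sample set returns a cluster label for every point along with the final two centroids.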

The k-means algorithm can be applied to sample sets with any number of dimensions, not just two, and the number of clusters doesn't have to be two either: it can be any positive integer k. k-means also has some limitations, such as the fact that it can get stuck in local optima and may not always find the "best" clustering solution. Overall, though, k-means is a useful and widely used algorithm for clustering data.
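
A common way to soften the local-optima problem is to rerun the algorithm from several random starting points and keep the best result. As a sketch, scikit-learn's `KMeans` exposes this through its `n_init` parameter; the toy data below is purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two well-separated blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

# n_init reruns k-means from several random initialisations and keeps the
# run with the lowest inertia (within-cluster sum of squared distances),
# which reduces the chance of settling in a poor local optimum.
model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(model.cluster_centers_)
```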