Random initialization trap
When the centroids are randomly initialized, each run of k-means can produce a different WCSS (within-cluster sum of squares). A poor choice of initial centroids leads to suboptimal clustering.
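For reference, WCSS is commonly written as below. The symbols are introduced here for illustration and are not from the original post: $C_j$ is the set of points assigned to cluster $j$, $\mu_j$ is that cluster's centroid, and $k$ is the number of clusters.

```latex
\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
```

A lower WCSS means the points sit closer to their assigned centroids.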
To reduce the risk of poorly placed initial centroids, we use k-means++, which picks the initial centroids so that they tend to be far apart from one another.
The idea is that well-separated initial centroids give distinct cluster centers, which leads to better clustering and faster convergence.
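A minimal sketch of k-means++ style seeding may make the "far apart" idea more concrete. It uses only the Python standard library. The function names `squared_distance` and `kmeans_pp_init`, the `seed` parameter, and the sample points are assumptions made for illustration, and the sketch assumes the data contains at least k distinct points.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, seed=None):
    """Pick k initial centroids with k-means++ style seeding (illustrative sketch)."""
    rng = random.Random(seed)
    # First centroid: chosen uniformly at random from the data.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # For each point, squared distance to its nearest already-chosen centroid.
        weights = [min(squared_distance(p, c) for c in centroids) for p in points]
        # Sample the next centroid with probability proportional to that distance,
        # so points far from the existing centroids are more likely to be picked.
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


# Hypothetical toy data: three well-separated pairs of points.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (10.0, 0.0), (9.9, 0.1)]
print(kmeans_pp_init(points, 3, seed=0))
```

A point that is already a centroid has weight zero and cannot be drawn again, which is why the chosen centroids end up distinct.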
Let's walk through an example.
We have a dataset, shown in the scatterplot below, that we need to group into three clusters.
Depending on the random initialization of the centroids, we get either clustering 1 or clustering 2, shown below.
This shows that the resulting clusters depend on how the centroids are initialized. The circled point shows how a data point is grouped differently under different centroid initializations.
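K-means++ addresses this by changing only how the initial centroids are picked; the rest of the k-means algorithm runs as usual:
Step 1: pick the initial centroids for the k clusters. Plain k-means picks them at random; k-means++ picks the first at random and each later one with probability proportional to its squared distance from the nearest centroid already chosen.
Step 2: calculate the squared distance from each point to each centroid.
Step 3: assign each data point in the dataset to its closest centroid, that is, the one at the smallest distance.
Step 4: collect the points assigned to each cluster and calculate their mean; these means become the new centroids.
We repeat steps 2 to 4 until the assignments stop changing or a configurable maximum number of iterations is reached.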
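The steps above can be sketched in plain Python as follows. This version uses plain random initialization in step 1, so running it with different seeds can show the initialization trap on a small dataset. The function name `kmeans`, the parameters `max_iter` and `seed`, and the toy points are assumptions for illustration; it is a sketch, not a reference implementation, and it expects `max_iter` to be at least 1.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=None):
    """Plain k-means with random initialization; returns (centroids, labels, wcss)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 2 and 3: assign each point to the centroid at the smallest
        # squared distance.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids will not move
        labels = new_labels
        # Step 4: move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # WCSS: total squared distance from each point to its assigned centroid.
    wcss = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, wcss


# Hypothetical toy data: three well-separated pairs of points.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (10.0, 0.0), (9.9, 0.1)]
for seed in range(5):
    _, _, wcss = kmeans(points, 3, seed=seed)
    print(f"seed={seed} WCSS={wcss:.3f}")
```

If two of the initial centroids happen to land in the same pair of points, the run can settle on a clustering with a much higher WCSS than the best one, which is the situation k-means++ seeding is meant to make less likely.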