Posts

Showing posts from July, 2019

Random initialization trap

When the centroids are randomly initialized, each run of k-means produces a different WCSS (within-cluster sum of squares). An unlucky choice of initial centroids leads to suboptimal clustering. To avoid this, we use k-means++, which selects initial centroids that are as far apart as possible. The idea is that well-separated starting centroids produce distinct cluster centers, yield an optimal clustering, and converge faster.

Let’s explain that with an example. We have a dataset, shown in the scatterplot below, that we want to group into three clusters.

[Figure: data set for grouping into 3 clusters]

Based on the random initialization of the centroids, we get clustering 1 and clustering 2 shown below.

[Figure: different clusters based on different initializations of the centroids]

This shows that the clustering differs depending on how the centroids are initialized. The circled point displays how data points are grouped differently based on different initializations of the centroids.
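A minimal sketch of the trap using scikit-learn (the variable X and the blob parameters here are illustrative, not from the post): with purely random initialization, the WCSS (exposed as inertia_) varies from run to run, while init='k-means++' spreads the initial centroids apart.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative dataset with three natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Purely random initialization: WCSS (inertia) changes with the seed
for seed in range(3):
    km = KMeans(n_clusters=3, init='random', n_init=1, random_state=seed)
    km.fit(X)
    print('random init, seed', seed, '-> WCSS:', km.inertia_)

# k-means++ picks initial centroids far apart, giving more stable results
km_pp = KMeans(n_clusters=3, init='k-means++', n_init=1, random_state=0)
km_pp.fit(X)
print('k-means++ -> WCSS:', km_pp.inertia_)

Running this a few times makes the point: the random-init WCSS values scatter, while the k-means++ runs land near the same (lower) value.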

Encoding

Label Encoding

In this example, the first column is the country column, which is all text. As you might know by now, we can’t have text in our data if we’re going to run any kind of model on it, so we need to make the data model-ready first. To convert this kind of categorical text data into model-understandable numerical data, we use the LabelEncoder class. All we have to do to label encode the first column is import the LabelEncoder class from the sklearn library, fit and transform the first column of the data, and then replace the existing text data with the new encoded data. Let’s have a look at the code.

from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
x[:, 0] = labelencoder.fit_transform(x[:, 0])

We’ve assumed that the data is in a variable called ‘x’. After running this piece of code, if you check the value of x, you’ll see that the three countries in the first column have been replaced by the numbers 0, 1, and 2.
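Here is a self-contained version of the same idea; the country names and ages are made up for illustration.

import numpy as np
from sklearn.preprocessing import LabelEncoder

# Illustrative data: country in the first column, a numeric feature in the second
x = np.array([['France', 44.0],
              ['Spain', 27.0],
              ['Germany', 30.0],
              ['Spain', 38.0]], dtype=object)

labelencoder = LabelEncoder()
x[:, 0] = labelencoder.fit_transform(x[:, 0])

print(x)
# LabelEncoder assigns integers in alphabetical order of the labels:
# France -> 0, Germany -> 1, Spain -> 2
print(labelencoder.classes_)  # ['France' 'Germany' 'Spain']

Note that the integers are assigned in alphabetical order of the labels, and the classes_ attribute lets you map the numbers back to the original country names.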