Load the previously downloaded dataset by using the Pandas function read_csv(). Store the dataset in a Pandas DataFrame named data.
First, import the required libraries:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(0)
Then, pass the dataset path to the Pandas read_csv() function:
data = pd.read_csv("datasets/wholesale_customers_data.csv")
Check for missing values in your DataFrame. Chain the isnull() and sum() functions to count the missing values in every column of the dataset at once:
data.isnull().sum()
As you can see from the preceding screenshot, there are no missing values in the dataset.
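To make the check concrete, here is a minimal sketch on a small made-up DataFrame. The column names and values are illustrative assumptions, not rows from the wholesale dataset. isnull() flags each missing cell as True, and sum() adds those flags column by column:

```python
import numpy as np
import pandas as pd

# Toy data with deliberately missing entries (illustrative values only)
toy = pd.DataFrame({
    "Fresh": [100.0, np.nan, 300.0],
    "Milk": [10.0, 20.0, 30.0],
    "Grocery": [np.nan, np.nan, 5.0],
})

# One count of missing cells per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Total number of missing cells in the whole DataFrame
total_missing = int(missing_per_column.sum())
print(total_missing)
```

A column that reports zero, such as Milk here, has no missing values, which is the result described above for every column of the real dataset.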
Check for outliers in your DataFrame. Using the technique you learned in the previous chapter, label those values that fall outside of three standard deviations from the mean as outliers. The following code snippet allows you to look for outliers in the entire set of features at once. However, another valid method would be to check for outliers one feature at a time:
outliers = {}
for i in range(data.shape[1]):
    min_t = data[data.columns[i]].mean() - (3 * data[data.columns[i]].std())
    max_t = data[data.columns[i]].mean() + (3 * data[data.columns[i]].std())
    count = 0
    for j in data[data.columns[i]]:
        if j < min_t or j > max_t:
            count += 1
    outliers[data.columns[i]] = [count, data.shape[0] - count]
print(outliers)
The count of outliers for each of the features is shown in the following figure:
As you can see from the preceding screenshot, some features do have outliers. Considering that there are only a few outliers for each feature, there are two possible ways to handle them.
First, you could decide to delete the outliers. This decision can be supported by displaying a histogram for the features with outliers:
plt.hist(data["Fresh"])
plt.show()
For instance, the histogram for the feature named Fresh shows that most instances have values below 40,000. Hence, deleting the few instances above that value is unlikely to noticeably affect the performance of the model.
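If you choose deletion, one possible way to do it is with a Boolean mask. The sketch below uses a made-up stand-in for the Fresh column, and the variable names toy, threshold, and trimmed are hypothetical; only the 40,000 cut-off comes from the text:

```python
import pandas as pd

# Toy stand-in for the Fresh column (illustrative values only)
toy = pd.DataFrame({"Fresh": [3000, 12000, 8000, 56000, 41000, 500]})

threshold = 40000  # cut-off read from the histogram in the text

# Keep only the rows at or below the threshold and renumber the index
trimmed = toy[toy["Fresh"] <= threshold].reset_index(drop=True)

print(len(toy), len(trimmed))
```

Filtering creates a new DataFrame, so the original data stays available if you later decide to keep the outliers.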
On the other hand, the second approach would be to leave the outliers as they are, considering that they do not represent a large portion of the dataset, which can be supported with data visualization tools using a pie chart. See the code and the output that follow:
plt.figure(figsize=(8,8))
plt.pie(outliers["Detergents_Paper"], autopct="%.2f")
plt.show()
The preceding diagram shows the proportion of outliers in the Detergents_Paper feature, which was the feature with the most outliers in the dataset. Only 2.27% of the values are outliers, a proportion so low that it is unlikely to affect the performance of the model either.
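As noted earlier, checking one feature at a time is not the only option. The three-standard-deviation rule can also be written without explicit loops. The following is a sketch on synthetic data, where the columns a and b and the injected value are assumptions made for illustration. It yields both the count and the percentage of outliers per feature:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the dataset: 200 rows, two features
toy = pd.DataFrame({
    "a": rng.normal(0, 1, 200),
    "b": rng.normal(5, 2, 200),
})
# Inject one obvious outlier into column "a"
toy.loc[0, "a"] = 50.0

mean, std = toy.mean(), toy.std()

# True where a value lies more than three standard deviations from its column mean
is_outlier = (toy - mean).abs() > 3 * std

outlier_counts = is_outlier.sum()      # outliers per feature
outlier_pct = 100 * is_outlier.mean()  # share of outliers per feature, in percent
print(outlier_counts)
print(outlier_pct.round(2))
```

The percentage is simply the count divided by the number of rows, which is the same quantity the pie chart displays.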
Rescale the data. For this solution, the formula for standardization has been used. Note that the formula can be applied to the entire dataset at once, instead of being applied individually to each feature:
data_standardized = (data - data.mean())/data.std()
data_standardized.head()
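A quick way to convince yourself that the formula works on all columns at once is to try it on a tiny DataFrame. In this sketch the columns x and y are made up for illustration. After standardization, each column should have a mean of 0 and a standard deviation of 1:

```python
import numpy as np
import pandas as pd

# Two toy features on very different scales (illustrative values only)
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [10.0, 20.0, 30.0, 40.0],
})

# Pandas aligns the per-column mean and std with the matching columns
toy_standardized = (toy - toy.mean()) / toy.std()

print(toy_standardized.mean().round(6))
print(toy_standardized.std().round(6))
```

Because y is just x multiplied by 10, both columns end up with identical standardized values, which shows how rescaling removes differences in units.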
Open the Jupyter Notebook that you used for the previous activity. There, you should have imported all the required libraries and stored the dataset in a variable named data. The standardized data should look as follows:
data_standardized = (data - data.mean())/data.std()
data_standardized.head()
Calculate the average distance of the data points from their centroids in relation to the number of clusters. Based on this distance, select the appropriate number of clusters to train the model with.
First, import the algorithm class:
from sklearn.cluster import KMeans
Next, using the code in the following snippet, calculate the average distance of the data points from their centroids based on the number of clusters created:
ideal_k = []
for i in range(1,21):
    est_kmeans = KMeans(n_clusters=i)
    est_kmeans.fit(data_standardized)
    ideal_k.append([i, est_kmeans.inertia_])
ideal_k = np.array(ideal_k)
Finally, plot the relation to find the breaking point of the line, and select the number of clusters:
plt.plot(ideal_k[:,0], ideal_k[:,1])
plt.show()
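To see what a clear breaking point looks like, it can help to run the same procedure on data where the number of groups is known. The sketch below assumes three well-separated synthetic groups generated with make_blobs; the centers, n_init, and random_state values are illustrative choices, not settings from the activity:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (an assumption for illustration)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=0)

inertias = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

# Reduction in inertia gained by each additional cluster
drops = -np.diff(inertias)
for k, drop in zip(range(2, 7), drops):
    print(k, round(drop, 1))
```

In this toy case the reduction should shrink sharply once k passes 3, which is the elbow. On the wholesale data the curve is smoother, which is why a range of values is reasonable there.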
Train the model and assign a cluster to each data point in your dataset. Plot the results.
To train the model, use the following code:
est_kmeans = KMeans(n_clusters=6)
est_kmeans.fit(data_standardized)
pred_kmeans = est_kmeans.predict(data_standardized)
The number of clusters selected is 6; however, since there is no exact breaking point, values between 5 and 10 are also acceptable.
Finally, plot the results of the clustering process. As the dataset contains eight different features, choose two features to draw at once, as shown in the following code:
plt.subplots(1, 2, sharex='col', sharey='row', figsize=(16,8))
plt.scatter(data.iloc[:,5], data.iloc[:,3], c=pred_kmeans, s=20)
plt.xlim([0, 20000])
plt.ylim([0,20000])
plt.xlabel('Frozen')
plt.subplot(1, 2, 1)
plt.scatter(data.iloc[:,4], data.iloc[:,3], c=pred_kmeans, s=20)
plt.xlim([0, 20000])
plt.ylim([0,20000])
plt.xlabel('Grocery')
plt.ylabel('Milk')
plt.show()
The subplots() function from Matplotlib has been used to plot two scatter graphs at a time.
As the plots show, no obvious visual relationship emerges, because only two of the eight features can be drawn at once. Nevertheless, the model produces six clusters, which represent six different client profiles.
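Since a two-feature plot hides most of the information, one possible complement is to summarize each cluster numerically. This is not part of the original solution. The sketch uses synthetic data with hypothetical feature names f1 to f4 and averages every feature within each cluster, giving one row per profile:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in with four features and three groups (illustrative only)
X, _ = make_blobs(n_samples=150, centers=3, n_features=4, random_state=0)
toy = pd.DataFrame(X, columns=["f1", "f2", "f3", "f4"])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(toy)

# Mean of every feature within each cluster: one row per profile
profiles = toy.groupby(labels).mean()
# Number of data points assigned to each cluster
sizes = pd.Series(labels).value_counts().sort_index()

print(profiles.round(2))
print(sizes)
```

Applied to the activity, the same idea would group data by pred_kmeans, so that each row describes the average spending pattern of one client profile.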
Open the Jupyter Notebook that you used for the previous activity.
Train the model and assign a cluster to each data point in your dataset. Plot the results.
First, do not forget to import the algorithm class:
from sklearn.cluster import MeanShift
To train the model, use the following code:
est_meanshift = MeanShift(bandwidth=0.4)
est_meanshift.fit(data_standardized)
pred_meanshift = est_meanshift.predict(data_standardized)
The model was trained using a bandwidth of 0.4. However, feel free to test other values to see how the result changes.
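To get a feel for how the bandwidth changes the result, the following sketch compares a small and a very large bandwidth on synthetic data. The three blob centers and both bandwidth values are assumptions chosen to make the contrast obvious:

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs

# Three tight, well-separated synthetic groups (illustrative only)
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

n_clusters_by_bandwidth = {}
for bandwidth in (2.0, 100.0):
    model = MeanShift(bandwidth=bandwidth).fit(X)
    # Count the distinct cluster labels assigned to the data points
    n_clusters_by_bandwidth[bandwidth] = len(np.unique(model.labels_))

print(n_clusters_by_bandwidth)
```

A bandwidth larger than the whole dataset pulls every point toward the same mean and leaves a single cluster, while a smaller bandwidth keeps the separate groups apart. Smaller values generally produce more clusters.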
Finally, plot the results of the clustering process. As the dataset contains eight different features, choose two features to draw at once, as shown in the snippet below. As in the previous activity, the separation between clusters is hard to see because only two of the eight features can be drawn at once:
plt.subplots(1, 2, sharex='col', sharey='row', figsize=(16,8))
plt.scatter(data.iloc[:,5], data.iloc[:,3], c=pred_meanshift, s=20)
plt.xlim([0, 20000])
plt.ylim([0,20000])
plt.xlabel('Frozen')
plt.subplot(1, 2, 1)
plt.scatter(data.iloc[:,4], data.iloc[:,3], c=pred_meanshift, s=20)
plt.xlim([0, 20000])
plt.ylim([0,20000])
plt.xlabel('Grocery')
plt.ylabel('Milk')
plt.show()
Open the Jupyter Notebook that you used for the previous activity.
Train the model and assign a cluster to each data point in your dataset. Plot the results.
First, do not forget to import the algorithm class:
from sklearn.cluster import DBSCAN
To train the model, use the following code:
est_dbscan = DBSCAN(eps=0.8)
pred_dbscan = est_dbscan.fit_predict(data_standardized)
The model was trained using an epsilon value of 0.8. However, feel free to test other values to see how the results change.
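One way to observe the effect of epsilon is to count how many points DBSCAN leaves unassigned. In the sketch below, the synthetic data and the three eps values are illustrative assumptions, and min_samples is left at its default:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic data with three groups (illustrative only)
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=1.0, random_state=0)

noise_by_eps = {}
for eps in (0.2, 0.8, 3.0):
    labels = DBSCAN(eps=eps).fit_predict(X)
    # DBSCAN marks points that belong to no cluster with the label -1
    noise_by_eps[eps] = int(np.sum(labels == -1))

print(noise_by_eps)
```

With min_samples fixed, a larger epsilon can only turn noise points into cluster members, never the reverse, so the noise count does not increase as eps grows.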
Finally, plot the results of the clustering process. As the dataset contains eight different features, choose two features to draw at once, as shown in the following code:
plt.subplots(1, 2, sharex='col', sharey='row', figsize=(16,8))
plt.scatter(data.iloc[:,5], data.iloc[:,3], c=pred_dbscan, s=20)
plt.xlim([0, 20000])
plt.ylim([0,20000])
plt.xlabel('Frozen')
plt.subplot(1, 2, 1)
plt.scatter(data.iloc[:,4], data.iloc[:,3], c=pred_dbscan, s=20)
plt.xlim([0, 20000])
plt.ylim([0,20000])
plt.xlabel('Grocery')
plt.ylabel('Milk')
plt.show()
As in the previous activity, the separation between clusters is hard to see because only two of the eight features can be drawn at once.
Open the Jupyter Notebook that you used for the previous activity.
Calculate both the Silhouette Coefficient score and the Calinski–Harabasz index for all the models that you trained previously.
First, do not forget to import the metrics:
from sklearn.metrics import silhouette_score
from sklearn.metrics import calinski_harabasz_score
Calculate the Silhouette Coefficient score for all the algorithms, as shown in the following code:
kmeans_score = silhouette_score(data_standardized, pred_kmeans, metric='euclidean')
meanshift_score = silhouette_score(data_standardized, pred_meanshift, metric='euclidean')
dbscan_score = silhouette_score(data_standardized, pred_dbscan, metric='euclidean')
print(kmeans_score, meanshift_score, dbscan_score)
The scores are around 0.355, 0.093, and 0.168 for the k-means, Mean-Shift, and DBSCAN algorithms, respectively.
Finally, calculate the Calinski–Harabasz index for all the algorithms. The following is a snippet of the code:
kmeans_score = calinski_harabasz_score(data_standardized, pred_kmeans)
meanshift_score = calinski_harabasz_score(data_standardized, pred_meanshift)
dbscan_score = calinski_harabasz_score(data_standardized, pred_dbscan)
print(kmeans_score, meanshift_score, dbscan_score)
The scores are approximately 139.8, 112.9, and 42.45 for the k-means, Mean-Shift, and DBSCAN algorithms, respectively.
By quickly looking at the results obtained for both metrics, it is possible to conclude that the k-means algorithm outperforms the other models, and hence, should be the one selected to solve the data problem.
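The comparison rests on the fact that, for both metrics, higher values indicate better-defined clusters. The following sketch illustrates this on synthetic data by scoring a sensible clustering against deliberately random labels; the data, the random labels, and the scores dictionary are assumptions made for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Three well-separated synthetic groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=0)

good_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
# Random labels act as a deliberately poor clustering for comparison
random_labels = np.random.default_rng(0).integers(0, 3, size=len(X))

scores = {}
for name, labels in {"k-means": good_labels, "random": random_labels}.items():
    scores[name] = (
        silhouette_score(X, labels, metric="euclidean"),
        calinski_harabasz_score(X, labels),
    )
    print(name, scores[name])
```

The clustering that matches the structure of the data should score higher on both metrics, which is the same reasoning used above to prefer k-means.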