k-means Cluster Analysis in Excel using Python

One of the major limitations of Excel has always been that in order to do anything more than simple analysis you either needed add-ins (which varied a lot in quality) or to code things up in VBA (which was clunky and not what it was designed for.

The addition of Python to Excel (currently for beta users but to eventually be rolled out to everyone) now opens up all of the power of the Python programming language and all of the statistics/data science libraries.

In this video I use Python within Excel to conduct a k-means cluster analysis on the famous Iris data set, a very common activity in data science classes, first using a built in version of the dataset, and then using the data within Excel.

Subscribe to stay up to date on my latest videos, courses, and content

from sklearn import datasets
from sklearn.cluster import KMeans
iris = datasets.load_iris()
X = iris.data
k = 3  # You can adjust the number of clusters as needed
kmeans = KMeans(n_clusters=k)
kmeans.fit(X)
labels = kmeans.labels_
colors = ['red', 'green', 'blue']

# Create an empty scatterplot
plt.figure(figsize=(8, 6))

# Loop through each cluster and plot the data points
for i in range(k):
    plt.scatter(X[labels == i, 0], X[labels == i, 1], c=colors[i], label=f'Cluster {i+1}')

# Plot the cluster centers (centroids)
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1], marker='X', s=200, c='black', label='Centroids')

# Add labels and legend
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.title('Iris Data Clustering')
plt.legend()

# Show the plot
plt.show()