K-Means & Other Clustering Algorithms: A Quick Intro with Python

This post was originally published here

Clustering is the grouping of objects so that objects in the same group (cluster) are more similar to each other than to those in other groups (clusters). In this introductory cluster analysis tutorial, we’ll check out a few algorithms in Python so you can get a basic understanding of the fundamentals of clustering on a real dataset.

The Dataset

For the clustering problem, we will use the famous Zachary’s Karate Club dataset. The story behind the dataset is quite simple: there was a karate club that had an administrator, “John A”, and an instructor, “Mr. Hi” (both pseudonyms). Then a conflict arose between them, causing the students (nodes) to split into two groups: one that followed John and one that followed Mr. Hi.

Source: Wikipedia

Getting Started with Clustering in Python

But enough with the introductory talk; let’s get to the main reason you are here: the code itself. First of all, you need to install both the scikit-learn and networkx libraries to complete this tutorial. If you don’t know how, the links above should help you. Also, feel free to follow along by grabbing the source code for this tutorial over on GitHub.

Usually, the datasets that we want to examine are available in text form (JSON, Excel, a simple txt file, etc.), but in our case networkx provides it for us. Also, to compare our algorithms, we want the truth about the members (who followed whom), which unfortunately is not provided. But with the few lines of code below, you will be able to load the data and store the truth (from now on we will refer to it as the ground truth):

# Load and store both the data and the ground truth of Zachary's Karate Club
import networkx as nx

G = nx.karate_club_graph()
groundTruth = [0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,1,0,0,1,0,1,0,1,1,1,1,1,1,1,1,1,1,1,1]
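As an aside, depending on your networkx version, the graph may in fact carry a `club` attribute on every node (either "Mr. Hi" or "Officer"), so the labels can also be read from the graph itself. The sketch below assumes such a version; the helper name `labelsFromClubAttribute` is made up for illustration, and the labels it returns are not guaranteed to match the hard-coded list above node for node.

```python
import networkx as nx

def labelsFromClubAttribute(G):
    # Hypothetical helper: 0 for Mr. Hi's faction, 1 for the other faction.
    # Assumes every node carries a "club" attribute.
    return [0 if G.nodes[node]["club"] == "Mr. Hi" else 1 for node in sorted(G)]

G = nx.karate_club_graph()
attributeLabels = labelsFromClubAttribute(G)
print(len(attributeLabels), "members,", sum(attributeLabels), "outside Mr. Hi's faction")
```

If the two label lists disagree for some node, it is worth deciding up front which one you treat as the ground truth.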

The final step of the data preprocessing is to transform the graph into a matrix (the desirable input for our algorithms). This is also quite simple:

def graphToEdgeMatrix(G):
    # Initialize Edge Matrix
    edgeMat = [[0 for x in range(len(G))] for y in range(len(G))]
    # For loop to set 0 or 1 (diagonal elements are set to 1)
    for node in G:
        tempNeighList = G.neighbors(node)
        for neighbor in tempNeighList:
            edgeMat[node][neighbor] = 1
        edgeMat[node][node] = 1
    return edgeMat
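For comparison, networkx can build the same kind of matrix directly. The following is a minimal sketch, assuming a networkx version that ships `nx.to_numpy_array` and that NumPy is installed. Any nonzero entry (weighted or not) is turned into 1, and the diagonal is then set to 1 to mirror the function above.

```python
import networkx as nx
import numpy as np

G = nx.karate_club_graph()
# Build the adjacency matrix and reduce every nonzero entry to 1
adjacency = (nx.to_numpy_array(G, nodelist=sorted(G)) > 0).astype(int)
# Mirror graphToEdgeMatrix: every node is connected to itself
np.fill_diagonal(adjacency, 1)
print(adjacency.shape)
```

The result is a NumPy array rather than a list of lists, which scikit-learn accepts just as well.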

Before we get going with the clustering techniques, I would like you to get a visualization of our data. So, let’s write a simple function to do that:

from collections import defaultdict
import matplotlib.pyplot as plt

def drawCommunities(G, partition, pos):
    # G is the graph in networkx form
    # partition is a dict mapping each node to its cluster id (0, 1, 2, ...)
    # pos is based on the networkx spring layout (nx.spring_layout(G))
    # Group the nodes by community so each group gets its own color
    dictList = defaultdict(list)
    for node, com in partition.items():
        dictList[com].append(node)
    # Get the number of communities
    size = len(set(partition.values()))
    # For loop to assign communities colors
    for i in range(size):
        amplifier = i % 3
        multi = (i / 3) * 0.3
        red = green = blue = 0
        if amplifier == 0:
            red = 0.1 + multi
        elif amplifier == 1:
            green = 0.1 + multi
        else:
            blue = 0.1 + multi
        # Draw nodes; the color is a single RGB row so it applies to the whole group
        nx.draw_networkx_nodes(G, pos,
                               nodelist=dictList[i],
                               node_color=[[red, green, blue]],
                               node_size=500,
                               alpha=0.8)
    # Draw edges and final plot
    plt.title("Zachary's Karate Club")
    nx.draw_networkx_edges(G, pos, alpha=0.5)

That function simply extracts the number of clusters in our result and assigns a different color to each of them before plotting (the color scheme covers up to 10 clusters, which is fine for now).
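To see which colors the loop produces without drawing anything, the color arithmetic can be pulled out on its own. This is a small sketch; `communityColor` is a hypothetical helper name, not part of the original code.

```python
def communityColor(i):
    # Same arithmetic as the loop in drawCommunities: cycle through
    # red, green, blue and brighten the chosen channel as i grows
    amplifier = i % 3
    multi = (i / 3) * 0.3
    color = [0.0, 0.0, 0.0]
    color[amplifier] = 0.1 + multi
    return color

# Print the colors of the first 10 communities (no channel exceeds 1.0)
for i in range(10):
    print(i, [round(channel, 2) for channel in communityColor(i)])
```

From the eleventh community onward the brightened channel goes above 1.0, which is not a valid color value, so the scheme would need extending for larger cluster counts.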

Clustering Algorithms

Some clustering algorithms will cluster your data quite nicely and others will end up failing to do so. That is one of the main reasons why clustering is such a difficult problem. But don’t worry, we won’t let you drown in an ocean of choices. We’ll go through a few algorithms that are known to perform very well.

K-Means Clustering

[Figure: interactive K-means visualization with adjustable N (the number of nodes) and K (the number of clusters). Source: github.com/nitoyon/tech.nitoyon.com]

K-means is considered by many the gold standard when it comes to clustering due to its simplicity and performance, and it’s the first one we’ll try out. When you have no idea at all what algorithm to use, K-means is usually the first choice. Bear in mind that K-means might underperform sometimes because of its underlying assumption: spherical clusters that are separable in a way so that the mean value converges towards the cluster center. Constructing and training a K-means model takes only a few lines.
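Before handing the work to scikit-learn, it may help to see what the algorithm does under the hood. The block below is a minimal sketch of the standard K-means iteration (Lloyd's algorithm) written with NumPy: assign every point to its nearest center, move each center to the mean of its points, and repeat until nothing changes. The function name `simpleKMeans`, its parameters, and the toy data are assumptions for illustration; this is not scikit-learn's implementation, which adds smarter initialization and multiple restarts (`n_init`).

```python
import numpy as np

def simpleKMeans(points, k, n_iter=100, seed=0):
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every point
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        # (a center that lost all its points stays where it is)
        newCenters = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(newCenters, centers):
            break
        centers = newCenters
    return labels, centers

# Toy data: two well-separated blobs of three points each
toy = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = simpleKMeans(toy, k=2)
print(labels)
```

With that picture in mind, the scikit-learn model itself looks like this: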

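# K-means Clustering Model
from sklearn import cluster

kClusters = 2  # assumed here: one cluster per faction
edgeMat = graphToEdgeMatrix(G)
results = []

kmeans = cluster.KMeans(n_clusters=kClusters, n_init=200)
kmeans.fit(edgeMat)
# Transform our data to list form and store them in results list
results.append(list(kmeans.labels_))

Agglomerative Clustering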

The main idea behind agglomerative clustering is that each node starts in its own cluster, and the algorithm then repeatedly merges the pair of clusters that minimally increases a given linkage distance. The main advantage of agglomerative clustering (and hierarchical clustering in general) is that you don’t need to specify the number of clusters. That, of course, comes at a price: performance. In scikit-learn’s implementation, however, you can specify the number of clusters to assist the algorithm’s performance. To create and train an agglomerative model, use the following code:

# Agglomerative Clustering Model
agglomerative = cluster.AgglomerativeClustering(n_clusters=kClusters, linkage="ward")
agglomerative.fit(edgeMat)
# Transform our data to list form and store them in results list
results.append(list(agglomerative.labels_))

Spectral

The Spectral clustering technique applies clustering to a projection of the normalized Laplacian. When it comes to image clustering, spectral clustering works quite well. See the next few lines of Python for all the magic:

# Spectral Clustering Model
spectral = cluster.SpectralClustering(n_clusters=kClusters, affinity="precomputed", n_init=200)
spectral.fit(edgeMat)
# Transform our data to list form and store them in results list
results.append(list(spectral.labels_))

Affinity Propagation

Well, this one is a bit different. Unlike the previous algorithms, Affinity Propagation (AP) does not require the number of clusters to be determined before running the algorithm. AP performs really well on several computer vision and biology problems,
