
K-Means & Other Clustering Algorithms: A Quick Intro with Python



Clustering is the grouping of objects together so that objects belonging to the same group (cluster) are more similar to each other than to those in other groups (clusters). In this intro cluster analysis tutorial, we’ll check out a few algorithms in Python so you can get a basic understanding of the fundamentals of clustering on a real dataset.

The Dataset

For the clustering problem, we will use the famous Zachary’s Karate Club dataset. The story behind the dataset is quite simple: there was a karate club that had an administrator, “John A”, and an instructor, “Mr. Hi” (both pseudonyms). Then a conflict arose between them, causing the students (nodes) to split into two groups: one that followed John and one that followed Mr. Hi.


[Figure: Zachary’s Karate Club network. Source: Wikipedia]

Getting Started with Clustering in Python

But enough with the introductory talk; let’s get to the main reason you are here: the code itself. First of all, you need to install both the scikit-learn and networkx libraries to complete this tutorial. If you don’t know how, the links above should help you. Also, feel free to follow along by grabbing the source code for this tutorial over on GitHub.

Usually, the datasets that we want to examine are available in text form (JSON, Excel, a simple txt file, etc.), but in our case, networkx provides it for us. Also, to compare our algorithms, we need the truth about the members (who followed whom), which unfortunately is not provided. But with these two lines of code, you will be able to load the data and store the truth (from now on we will refer to it as the ground truth):

import networkx as nx

# Load and store both the data and the ground truth of Zachary's Karate Club
G = nx.karate_club_graph()
groundTruth = [0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,1,0,0,1,0,1,0,1,1,1,1,1,1,1,1,1,1,1,1]
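As a side note, recent networkx versions attach a 'club' attribute ('Mr. Hi' or 'Officer') to every node of this graph, so a comparable label list can be derived programmatically. A sketch (the derived labels may disagree with the hand-coded list above on a borderline member or two):

# Derive labels from the 'club' node attribute (recent networkx versions)
labelsFromClub = [0 if G.nodes[v]["club"] == "Mr. Hi" else 1 for v in G]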

The final step of the data preprocessing is to transform the graph into a matrix (the desirable input for our algorithms). This is also quite simple:

def graphToEdgeMatrix(G):
    # Initialize the edge matrix with zeros
    edgeMat = [[0 for x in range(len(G))] for y in range(len(G))]

    # Set each entry to 0 or 1 (diagonal elements are set to 1)
    for node in G:
        for neighbor in G.neighbors(node):
            edgeMat[node][neighbor] = 1
        edgeMat[node][node] = 1

    return edgeMat
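As a shortcut, networkx can build the same matrix for you. A minimal sketch using its built-in helper (nx.to_numpy_array is the networkx 2.x name; older releases call it nx.to_numpy_matrix), with the diagonal filled in to match our function:

import numpy as np

# Same adjacency matrix via networkx, with ones on the diagonal
edgeMat = nx.to_numpy_array(G)
np.fill_diagonal(edgeMat, 1)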

Before we get going with the clustering techniques, I would like you to get a visualization of our data. So, let’s write a simple function to do that:

from collections import defaultdict
import matplotlib.pyplot as plt

def drawCommunities(G, partition, pos):
    # G is the graph in networkx form
    # partition is a dict mapping each node to its community
    # pos is based on the networkx spring layout (nx.spring_layout(G))

    # Group nodes by community so each one can get its own color
    dictList = defaultdict(list)
    for node, com in partition.items():
        dictList[com].append(node)

    # Number of communities found
    size = len(set(partition.values()))

    # Assign each community a color (cycling red/green/blue, brightening
    # every third community) and draw its nodes
    for i in range(size):
        amplifier = i % 3
        multi = (i // 3) * 0.3
        red = green = blue = 0
        if amplifier == 0:
            red = 0.1 + multi
        elif amplifier == 1:
            green = 0.1 + multi
        else:
            blue = 0.1 + multi

        nx.draw_networkx_nodes(G, pos,
                               nodelist=dictList[i],
                               node_color=[(red, green, blue)],
                               node_size=500,
                               alpha=0.8)

    # Draw edges and the final plot
    plt.title("Zachary's Karate Club")
    nx.draw_networkx_edges(G, pos, alpha=0.5)
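To see it in action, here is a quick usage sketch; coloring by the ground truth is my choice, and any node-to-cluster dict would work:

# Plot the club colored by the ground-truth split
partition = {node: groundTruth[node] for node in G}
pos = nx.spring_layout(G)
drawCommunities(G, partition, pos)
plt.show()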

What this function does is simply extract the number of clusters in our result and then assign a different color to each of them (up to 10, which is plenty for our purposes) before plotting them.


[Figure: the Karate Club graph as drawn by drawCommunities]

Clustering Algorithms

Some clustering algorithms will cluster your data quite nicely and others will end up failing to do so. That is one of the main reasons why clustering is such a difficult problem. But don’t worry, we won’t let you drown in an ocean of choices. We’ll go through a few algorithms that are known to perform very well.

K-Means Clustering

[Interactive demo of K-means, with adjustable N (the number of nodes) and K (the number of clusters). Source: github.com/nitoyon/tech.nitoyon.com]

K-means is considered by many to be the gold standard when it comes to clustering, due to its simplicity and performance, and it’s the first one we’ll try out. When you have no idea at all which algorithm to use, K-means is usually the first choice. Bear in mind that K-means might under-perform sometimes due to its underlying assumption: spherical clusters that are separable in such a way that the mean value converges towards the cluster center. To construct and train a K-means model, use the lines below.
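The snippets in this article share a bit of setup that the excerpts leave implicit; here is a minimal sketch of my assumptions (the names kClusters, results, and edgeMat are taken from the snippets themselves, and kClusters = 2 matches the two-way club split):

from sklearn import cluster

# Shared setup assumed by the model snippets below
kClusters = 2                    # the club split into two groups
results = []                     # collected label lists, one per algorithm
edgeMat = graphToEdgeMatrix(G)   # adjacency matrix from the earlier helper

With that in place, the K-means model itself: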

# K-means clustering model
kmeans = cluster.KMeans(n_clusters=kClusters, n_init=200)
kmeans.fit(edgeMat)

# Transform our data to list form and store it in the results list
results.append(list(kmeans.labels_))
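It’s worth a quick check of how close those labels come to the ground truth. This evaluation is my addition rather than part of the original walkthrough; it uses scikit-learn’s normalized mutual information score, where 1.0 means the two partitions match exactly:

from sklearn import metrics

# Compare the K-means labels against the ground truth (1.0 = perfect match)
nmi = metrics.normalized_mutual_info_score(groundTruth, kmeans.labels_)
print("K-means NMI: %.3f" % nmi)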

Agglomerative Clustering

The main idea behind agglomerative clustering is that each node starts in its own cluster, and clusters are recursively merged in the way that minimally increases a given linkage distance. The main advantage of agglomerative clustering (and of hierarchical clustering in general) is that you don’t need to specify the number of clusters. That, of course, comes at a price: performance. But in scikit-learn’s implementation, you can specify the number of clusters to assist the algorithm’s performance. To create and train an agglomerative model, use the following code:

# Agglomerative clustering model
agglomerative = cluster.AgglomerativeClustering(n_clusters=kClusters, linkage="ward")
agglomerative.fit(edgeMat)

# Transform our data to list form and store it in the results list
results.append(list(agglomerative.labels_))
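If you would rather let the algorithm pick the number of clusters itself, newer scikit-learn versions (0.21+) accept a distance_threshold in place of n_clusters. A minimal sketch of that variant (the threshold value 5.0 is an arbitrary choice of mine):

# Merge clusters until no pair is closer than the given linkage distance;
# this requires n_clusters=None (scikit-learn 0.21+)
aggloAuto = cluster.AgglomerativeClustering(n_clusters=None,
                                            distance_threshold=5.0,
                                            linkage="ward")
aggloAuto.fit(edgeMat)
print("Clusters found:", aggloAuto.n_clusters_)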

Spectral

The spectral clustering technique applies clustering to a projection of the normalized Laplacian. When it comes to image clustering, spectral clustering works quite well. See the next few lines of Python for all the magic:

# Spectral clustering model; edgeMat serves as the precomputed affinity matrix
spectral = cluster.SpectralClustering(n_clusters=kClusters, affinity="precomputed", n_init=200)
spectral.fit(edgeMat)

# Transform our data to list form and store it in the results list
results.append(list(spectral.labels_))

Affinity Propagation

Well, this one is a bit different. Unlike the previous algorithms, affinity propagation does not require the number of clusters to be determined before running it. Affinity propagation performs really well on several computer vision and biology problems.
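A minimal sketch of such a model, mirroring the pattern of the snippets above (damping=0.6 is an assumed value, not one from the article):

# Affinity propagation clustering model (sketch; damping is an assumption)
affinity = cluster.AffinityPropagation(damping=0.6)
affinity.fit(edgeMat)

# Transform our data to list form and store it in the results list
results.append(list(affinity.labels_))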
