
K-nearest-neighbor algorithm implementation in Python from scratch
In the introduction to the k-nearest-neighbor algorithm article, we learned the key aspects of the KNN algorithm, as well as some applications that use KNN to solve real-world problems. Now, in this article, we are going to implement the K-Nearest Neighbors algorithm from scratch in the Python programming language.

[Image: K-nearest-neighbor black-box model]
After learning the KNN algorithm, we could use pre-packaged Python machine learning libraries to build KNN classifier models directly. But then everything becomes a black-box approach: we feed the input data to an inbuilt k-nearest-neighbor model, train it, and use the trained classifier to predict results for a new dataset. This approach is easy and widely applied in the machine learning era, but using inbuilt Python packages to solve a problem will not give you much insight into the KNN algorithm itself.

Image Credit: Scikit-Learn
Let's consider the above image. Here, we have three classes: Red, Green, and Blue, and the data points are dispersed. Our goal is to draw a K-Nearest Neighbor decision boundary map.

First of all, we have to divide the dataset into training and testing data. Train and test data can be split in any ratio, like 60:40, 70:30, or 80:20. For each test data point, we need to calculate a metric like the Euclidean distance to every training point, and we need to choose a value of K; for the above classification, we used K = 15. If the Euclidean distance between two points is small, it means they are close together. By sorting the Euclidean distances in increasing order and selecting the class with the maximum number of data points among the top 15 nearest neighbors, we assign a class to that test data point. This is how we can build classifiers using the KNN algorithm.
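The sort-and-vote step described above can be sketched in a few lines. This is only an illustration: the (distance, class) pairs below are invented, and we use k = 3 so the toy list is large enough (the article itself uses K = 15).

```python
from collections import Counter

# Hypothetical (euclidean_distance, class) pairs for one test point;
# the values are invented for illustration, not taken from the article's data.
neighbors = [(0.2, 'Red'), (0.5, 'Blue'), (0.1, 'Red'),
             (0.9, 'Green'), (0.4, 'Red'), (0.7, 'Blue')]

k = 3  # a small k for this toy example; the article uses K = 15

# Sort by Euclidean distance (increasing) and keep the k nearest neighbors.
nearest = sorted(neighbors, key=lambda pair: pair[0])[:k]

# Majority vote: the class with the most points among the k nearest wins.
votes = Counter(label for _, label in nearest)
predicted = votes.most_common(1)[0][0]
```

Here the three smallest distances (0.1, 0.2, 0.4) all belong to Red, so the vote assigns the test point to the Red class.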
Why we need to implement the KNN algorithm from scratch

Implementing the K-Nearest Neighbor algorithm in Python from scratch will help you learn the core concepts of the KNN algorithm, as we are going to implement each and every component of it, along with supporting pieces such as loading the dataset and measuring the accuracy of our implemented model.
The components will be:

How to load the dataset.
Splitting the dataset for training and testing purposes.
A Euclidean distance calculation function.
Prediction of classes for records with unknown labels.
Accuracy calculation for our prediction model.

Prerequisite Note: This tutorial will help you get started with the implementation of machine learning algorithms. We expect you to have at least basic programming experience in the Python programming language.
Before implementing the k-nearest neighbor algorithm in Python from scratch, let's quickly look at the k-nearest neighbor algorithm pseudocode from our previous article, Introduction to the k-nearest neighbor algorithm.
If you have any doubts about the KNN algorithm or want to revise it, you can read our article Introduction to K-nearest neighbor classifier.
K-Nearest Neighbor Algorithm Pseudocode:

Let (X_i, C_i), where i = 1, 2, ..., n, be the data points. X_i denotes the feature values and C_i denotes the label of X_i for each i.

Assuming the number of classes is c:

C_i ∈ {1, 2, 3, ..., c} for all values of i

Let x be a point whose label is not known, and suppose we would like to find its label class using the k-nearest neighbor algorithm.

Procedure:

1. Calculate d(x, X_i) for i = 1, 2, ..., n, where d denotes the Euclidean distance between the points.
2. Arrange the calculated n Euclidean distances in non-decreasing order.
3. Let k be a positive integer; take the first k distances from this sorted list.
4. Find the k points corresponding to these k distances.
5. Let k_i denote the number of points belonging to the i-th class among the k points, so that k_i ≥ 0.
6. If k_i > k_j for all i ≠ j, then put x in class i.

Let's use the above pseudocode for implementing the KNN algorithm in Python.
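The pseudocode above translates almost line for line into Python. The following is a minimal sketch, not the article's later implementation; the function names and the toy data in the comment are assumptions made for illustration.

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    # d(x, X_i): straight-line distance between two feature vectors.
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def knn_predict(train_points, train_labels, x, k):
    # Steps 1-2: compute all n distances and sort them in non-decreasing order.
    distances = sorted(
        (euclidean_distance(x, xi), ci)
        for xi, ci in zip(train_points, train_labels)
    )
    # Steps 3-4: take the k points corresponding to the k smallest distances.
    nearest = distances[:k]
    # Steps 5-6: count class membership among the k points and pick the majority.
    counts = Counter(label for _, label in nearest)
    return counts.most_common(1)[0][0]
```

For example, with invented 2-D training points [(0, 0), (0, 1), (5, 5)] labeled ['b', 'b', 'g'], querying the point (0.2, 0.4) with k = 3 would assign it the majority label 'b'.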
Description of the Dummy Dataset

We will use a sample dataset extracted from the ionosphere database by Johns Hopkins University. We have converted the database into a small dataset so as to simplify the learning curve for our readers.
This dummy dataset consists of 6 attributes and 30 records. Of these, 5 attributes are continuous variables with values ranging from -1 to +1, i.e., [-1, +1]. The last (6th) attribute is a categorical variable with the values "g" (good) or "b" (bad), according to the definition summarized above. This is a binary classification task.

Python modules we are going to use

We are importing only four Python modules.
Unicodecsv: Our dataset is in CSV format, i.e., Comma Separated Values. We can open this dataset with any text editor, like Notepad++, Sublime, or Emacs. But for data analysis, we need to import the data into Python. For importing CSV data into Python lists or arrays, we can use Python's unicodecsv module. It can handle Unicode characters and parses every element of the CSV into a list.

Random: This module implements pseudo-random number generators for various distributions. We want to split our dataset into train and test datasets; for splitting the data in random order, we can use the random module.

Operator: The operator module exports a set of efficient functions corresponding to the intrinsic operators of Python. We use the itemgetter() method of this module to return a callable object that fetches an item from its operand.

Math: It provides access to the mathematical functions defined by the C standard. We use this module to calculate the Euclidean distance.

K-Nearest neighbor algorithm implementation in Python from scratch

We are going to follow the workflow below for implementing the KNN algorithm in Python.
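As a quick illustration of the itemgetter() usage mentioned above, here is how sorting (record index, distance) pairs by distance would look; the distance values are invented for this sketch.

```python
import operator

# Hypothetical (record_index, euclidean_distance) pairs for one test point.
distances = [(0, 0.82), (1, 0.15), (2, 0.47)]

# itemgetter(1) returns a callable that fetches element 1 of each pair
# (the distance), so sort() orders the pairs by increasing distance.
distances.sort(key=operator.itemgetter(1))
# distances is now [(1, 0.15), (2, 0.47), (0, 0.82)]
```

This is the same sorting step the KNN procedure needs before picking the k nearest neighbors.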
Getting the data
Train and test data split
Euclidean distance calculation
KNN prediction function
Accuracy calculation

Let's get our hands dirty and start the coding stuff.
Getting Data: For any type of programmatic implementation on the dataset, we first need to import it. Using the unicodecsv reader, we loaded the dataset into a Python list.
with open('i_data_sample.csv', 'rb') as f:
    reader = unicodecsv.reader(f)
    i_data = list(reader)

We can check the length of the dataset using len():
>>> len(i_data)
30

Train & Test Data Split: We have written a shuffle() method. It takes da
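The article's own shuffle() method is cut off above, but a minimal sketch of a random train/test split in the same spirit might look like the following. The helper name, the 80:20 ratio, and the seed parameter are assumptions made for this sketch, not the article's code.

```python
import random

def shuffle_split(records, train_fraction=0.8, seed=None):
    # Copy so the caller's list is left untouched, then shuffle in random order.
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Slice the shuffled records into train and test portions, e.g. 80:20.
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# With the article's 30 records, an 80:20 split gives 24 training rows
# and 6 testing rows.
train, test = shuffle_split(list(range(30)), train_fraction=0.8, seed=42)
```

Shuffling before slicing matters: if the CSV happens to store all "g" records before all "b" records, an unshuffled split would put only one class in the test set.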