
K-nearest neighbor implementation with scikit learn Knn classifier implementation in scikit learn
In the introduction to k nearest neighbor and knn classifier implementation in python from scratch, We discussed the key aspects of knn algorithms and implementing knn algorithms in an easy way for few observations dataset.
However in K-nearest neighbor classifier implementation in scikit learn post, we are going to examine the Breast Cancer Dataset using python sklearnmachine learning library to model K-nearest neighbor algorithm. After modeling the knn classifier, we are going use the trained knn model to predict whether the patient is suffering from the benign tumor or malignant tumor.The greatness of using Sklearn is it provides us the functionality to implement machine learning algorithms in a few lines of code.
As we discussed the principle behind KNN classifier (K-Nearest Neighbor) algorithm is to find K predefined number of training samples closest in the distance to new point & predict the label from these. The distance measure is commonlyconsidered to beEuclidean distance.
Euclidean distance
Euclidean Distance
Euclidean distance is the most commonly used distance measure. Euclidean distance also called as simply distance. The usage of Euclidean distance measure is highly recommended when data is dense or continuous. Euclidean distance is the best proximity measure. The Euclidean distance between two points is the length of the path connecting them.The Pythagorean theorem gives this distance between two points.
Euclidean distance implementation in python #!/usr/bin/env python from math import* def euclidean_distance(x,y): return sqrt(sum(pow(a-b,2) for a, b in zip(x, y))) print euclidean_distance([0,3,4,5],[7,6,3,-1]) Script Output 9.74679434481 [Finishedin 0.0s]You can learn how to implement different similarity measure from the below post
Most popular similarity measures implementation in python
KNN classifier is also considered to be an instance based learning / non-generalizing algorithm. It stores records of training data in a multidimensional space. For each new sample & particular value of K, it recalculates Euclidean distances and predicts the target class. So, it does not createa generalized internal model.
Similar to KNN classifier, we can use Radius Neighbor Classifier forclassification tasks. As in KNNclassifier, we specify the value of K, similarly, in Radius neighbor classifier the value of R should be defined. The RNC classifier determines the target class based on the number of neighbors within a fixed radius for each training point. In this tutorial, we are going to use only KNN.
Knn implementation with Sklearn Wisconsin Breast Cancer Data SetThe Wisconsin Breast Cancer Database was collected byDr. William H. Wolberg (physician), University of Wisconsin Hospitals, USA. This dataset consists of 10 continuous attributes and 1 target class attributes. Class attribute shows the observation result, whether the patient is suffering from the benign tumor or malignant tumor. Benign tumors do not spread to other parts while the malignant tumor is cancerous. The dataset was collected & openly distributed so as to find out some patterns from this data.
Class attribute shows the observation result, whether the patient is suffering from the benign tumor or malignant tumor. Benign tumors do not spread to other parts while the malignant tumor is cancerous. The dataset was collected & openly distributed so as to find out some patterns from this data.

Cancer tumor detection with k-nearest neighbor Breast Cancer Data Set Attribute Information:
1. Sample code number: id number
2. Clump Thickness: 1 10
3. Uniformity of Cell Size: 1 10
4. Uniformity of Cell Shape: 1 10
5. Marginal Adhesion: 1 10
6. Single Epithelial Cell Size: 1 10
7. Bare Nuclei: 1 10
8. Bland Chromatin: 1 10
9. Normal Nucleoli: 1 10
10. Mitoses: 1 10
11. Class: (2 for benign, 4 for malignant)
Problem Statement:To modeled the knn classifier using the Brest Cancer data to predicting whether a patient is suffering from the benign tumor or malignant tumor.
KNN Model for Cancerous tumor detection:To diagnose Breast Cancer, the doctor uses his experience by analyzing details provided by
Patient’s Past Medical History Reports of all the tests performed.Using the modeled KNN classifier, we will solve the problem in a way similar to the procedure used by doctors. The modeled KNN classifier will compare the new patient’s test reports, observation metrics with the records of patients(training data) that correctly classified as benign or malignant.
Python packages used: NumPy NumPy is a Numeric Python module. It provides fast mathematical functions. Numpyprovides robust data structures for efficient computation of multi-dimensional arrays & matrices. We used numpy to read data files into numpy arrays and data manipulation. Scikit-Learn It’s a machine learning library. It includes various machine learning algorithms. We are using its Imputer, train_test_split, KNeighborsClassifier, accuracy_score algorithms.If you haven’t setup the machine learning setup in your system the below posts will helpful.
Python Machine learning setup in ubuntu
Python machine learningvirtual environment setup
K-nearest neighbor classifier implementation with scikit-learnAll the code snippets can be typed directly to jupyterIpython notebook.
Libraries:Thissectioninvolves importing all the libraries. We are importing numpy and sklearn imputer, train_test_split, KneighborsClassifier & accuracy_score modules.
import numpyas np from sklearn.preprocessingimport Imputer from sklearn.cross_validationimport train_test_split from sklearn.neighborsimport KNeighborsClassifier from sklearn.metricsimport accuracy_score Data Import:We are using breast cancer data. You can download it from archive.ics.uci.edu website. For importing the data and manipulating it, we are going to use numpy arrays.
Using
genfromtxt()
method, weare importing our dataset into the