
Decision Tree Algorithm in Python: Implementation with scikit-learn
One of the most approachable and well-liked supervised learning algorithms is the decision tree. It can be used for both classification and regression.
In the previous article, how the decision tree algorithm works, we gave an introduction to the working aspects of the decision tree algorithm. In this article, we are going to build a decision tree classifier in Python, using the scikit-learn machine learning package, for the balance scale dataset.
In short, this article explains how to implement a decision tree classifier on the Balance Scale data set. We will program the classifier in Python and use its sklearn library.
How we can implement Decision Tree classifier in Python with Scikit-learn
Decision tree algorithm prerequisites
Before you start building the decision tree classifier in Python, make sure you understand how the decision tree algorithm works. If you don't have that basic understanding yet, spend some time on the article how the Decision Tree Algorithm works.
Once we have finished modeling the decision tree classifier, we will use the trained model to predict whether the balance scale tips to the right, tips to the left, or stays balanced. The great thing about sklearn is that it provides the functionality to implement machine learning algorithms in a few lines of code.
Before getting started, let's quickly look at the assumptions we make while creating a decision tree and at the decision tree algorithm pseudocode.
Assumptions we make while using a decision tree
- In the beginning, the whole training set is considered at the root.
- Feature values are preferred to be categorical. If the values are continuous, they are discretized prior to building the model.
- Records are distributed recursively on the basis of attribute values.
- The order in which attributes are placed as the root or as internal nodes of the tree is decided by using some statistical approach.

Decision Tree Algorithm Pseudocode
1. Place the best attribute of our dataset at the root of the tree.
2. Split the training set into subsets. Subsets should be made in such a way that each subset contains data with the same value for an attribute.
3. Repeat step 1 and step 2 on each subset until you find leaf nodes in all the branches of the tree.

While building our decision tree classifier, we can improve its accuracy by tuning it with different parameters. But this tuning should be done carefully, since the algorithm can overfit the training data and ultimately produce a model that generalizes poorly.
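The three pseudocode steps above can be sketched in plain Python. This is only a minimal illustration, not how scikit-learn builds its trees: it assumes categorical features, it assumes information gain (entropy) as the "statistical approach" for picking the best attribute, and the helper names `entropy`, `best_attribute`, `build_tree` and `predict`, as well as the tiny data set, are made up for this example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_attribute(rows, labels, attributes):
    """Pick the attribute whose split gives the highest information gain."""
    base = entropy(labels)

    def gain(attr):
        remainder = 0.0
        for value in set(row[attr] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        return base - remainder

    return max(attributes, key=gain)


def build_tree(rows, labels, attributes):
    """Recursively build a tree; leaves are class labels, nodes are dicts."""
    # Stop when the subset is pure or no attributes remain: emit a leaf.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]

    # Step 1: place the best attribute at the root of this (sub)tree.
    attr = best_attribute(rows, labels, attributes)
    remaining = [a for a in attributes if a != attr]

    # Step 2: split into subsets that share one value of that attribute.
    branches = {}
    for value in set(row[attr] for row in rows):
        sub_rows = [row for row in rows if row[attr] == value]
        sub_labels = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        # Step 3: repeat steps 1 and 2 on each subset.
        branches[value] = build_tree(sub_rows, sub_labels, remaining)
    return {"attribute": attr, "branches": branches}


def predict(tree, row, default=None):
    """Walk the tree until a leaf (a plain label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["attribute"]], default)
    return tree


# Tiny made-up data set: each row is a dict of categorical features.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "play", "stay"]

tree = build_tree(rows, labels, ["outlook", "windy"])
print(tree)
print(predict(tree, {"outlook": "rainy", "windy": "yes"}))
```

Nothing in this sketch limits how deep the tree grows, so it will keep splitting until every leaf is pure. That is exactly the kind of behavior that tuning parameters are meant to keep in check.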
Sklearn Library Installation
Python's sklearn library holds tons of modules that help to build predictive models. It contains tools for data splitting, pre-processing, feature selection, tuning, and supervised and unsupervised learning algorithms, among others. It is similar to the caret library in R.
To use it, we first need to install it. A convenient way to install data science libraries and their dependencies is to install the Anaconda distribution. You can also install only the most popular machine learning Python libraries.
The sklearn library gives us direct access to separate modules for training models with different machine learning algorithms, such as the K-nearest neighbor classifier, the support vector machine classifier, decision trees, and linear regression.
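As a rough illustration of what "separate modules" means in practice, the sketch below imports one estimator for each algorithm named above and trains the classifiers through the same `fit`/`predict` calls. The toy data and the chosen parameters (`n_neighbors=3`, `kernel="linear"`, `random_state=0`) are assumptions made up for this example.

```python
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Toy data (made up for illustration): two well-separated groups of points.
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = [0, 0, 0, 1, 1, 1]

# Each estimator lives in its own submodule but exposes the same fit/predict API.
classifiers = {
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=3),
    "support vector machine": SVC(kernel="linear"),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

predictions = {}
for name, model in classifiers.items():
    model.fit(X, y)
    predictions[name] = model.predict([[0.5, 0.5], [5.5, 5.5]]).tolist()
    print(name, predictions[name])

# Regressors follow the same pattern, but predict a continuous value.
regressor = LinearRegression().fit([[1], [2], [3]], [2.0, 4.0, 6.0])
print(regressor.predict([[4]]))
```

Because the interface is shared, swapping one algorithm for another usually means changing only the import and the constructor call.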
Balance Scale Data Set Description
The Balance Scale data set consists of 5 attributes: 4 feature attributes and 1 target attribute. We will try to build a classifier for predicting the Class attribute. The target attribute is at the first index. The attributes, each with the number of values it can take, are:
1. Class Name: 3 (L, B, R)
2. Left-Weight: 5 (1, 2, 3, 4, 5)
3. Left-Distance: 5 (1, 2, 3, 4, 5)
4. Right-Weight: 5 (1, 2, 3, 4, 5)
5. Right-Distance: 5 (1, 2, 3, 4, 5)
Index  Variable Name                  Variable Values
1.     Class Name (target variable)   "R": balance scale tips to the right
                                      "L": balance scale tips to the left
                                      "B": balance scale is balanced
2.     Left-Weight                    1, 2, 3, 4, 5
3.     Left-Distance                  1, 2, 3, 4, 5
4.     Right-Weight                   1, 2, 3, 4, 5
5.     Right-Distance                 1, 2, 3, 4, 5
The table above shows all the details of the data.
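Since each of the four features takes the values 1 to 5, the full data set has 5 × 5 × 5 × 5 = 625 records, and it can be rebuilt without downloading anything. The sketch below relies on one assumption that this article does not state: the labeling rule from the UCI description of the data set, which compares left weight × left distance with right weight × right distance and calls the scale balanced when the two are equal. The names `balance_class`, `records` and `class_counts` are introduced here for illustration only.

```python
from collections import Counter
from itertools import product


def balance_class(left_weight, left_distance, right_weight, right_distance):
    """Label one record by comparing the torque on each side of the scale."""
    left = left_weight * left_distance
    right = right_weight * right_distance
    if left > right:
        return "L"
    if right > left:
        return "R"
    return "B"


# Every combination of the four features, each taking the values 1..5.
# The class label comes first, matching the column order in the table.
records = [
    (balance_class(lw, ld, rw, rd), lw, ld, rw, rd)
    for lw, ld, rw, rd in product(range(1, 6), repeat=4)
]

class_counts = Counter(record[0] for record in records)
print(len(records), dict(class_counts))
```

Under this rule the classes are unbalanced: 288 records tip left, 288 tip right, and only 49 are balanced, which is worth keeping in mind when reading an accuracy score.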
Balance Scale Problem Statement
The problem we are going to address is to model a classifier for evaluating the direction in which the balance scale tips.
Decision Tree classifier implementation in Python with sklearn Library
The modeled decision tree will compare the metrics of new records with the prior records (training data) that correctly classified the balance scale's tip direction.
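A minimal end-to-end sketch of that idea might look as follows. It is not the article's final program: the 625 records are regenerated from the labeling rule in the UCI description of the data set (left weight × left distance versus right weight × right distance, an assumption not stated here) instead of being read from a file, and the 70/30 split, `random_state=100`, `max_depth=3` and `min_samples_leaf=5` are illustrative choices.

```python
from itertools import product

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Rebuild the 625 records from the assumed labeling rule (see lead-in).
X, y = [], []
for lw, ld, rw, rd in product(range(1, 6), repeat=4):
    left, right = lw * ld, rw * rd
    X.append([lw, ld, rw, rd])
    y.append("L" if left > right else "R" if right > left else "B")

# Hold out 30% of the records to measure how well the tree generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100
)

# max_depth and min_samples_leaf limit tree growth to reduce overfitting.
clf = DecisionTreeClassifier(
    criterion="gini", max_depth=3, min_samples_leaf=5, random_state=100
)
clf.fit(X_train, y_train)

# Compare predictions on the held-out records with their true labels.
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.3f}")

# Predict the tip direction for one new record:
# left weight 4, left distance 4, right weight 3, right distance 3.
new_prediction = clf.predict([[4, 4, 3, 3]])
print(new_prediction)
```

Raising `max_depth` or lowering `min_samples_leaf` lets the tree fit the training records more closely. Comparing accuracy on `X_train` with accuracy on `X_test` is one simple way to notice when that starts to hurt generalization.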
Python packages used
NumPy
- NumPy is a numeric Python module. It provides fast mathematical functions.
- NumPy provides robust data structures for efficient computation on multi-dimensional arrays and matrices.
- We used NumPy to read data files into NumPy arrays and for data manipulation.
Pandas
- Provides the DataFrame object for data manipulation.
- Provides reading and writing of data between different file formats.
- DataFrames can hold different types of data in multidimensional arrays.
Scikit-Learn
- It's a machine learning library that includes various machine learning algorithms.
- We are using its train_test_split, DecisionTreeClassifier, and accuracy_score utilities.

If you haven't set up a machine learning environment on your system yet, the posts below will be helpful.
Python Machine learning setup in ubuntu