How To Implement Baseline Machine Learning Algorithms From Scratch With Python

It is important to establish baseline performance on a predictive modeling problem.

A baseline provides a point of comparison for the more advanced methods that you evaluate later.

In this tutorial, you will discover how to implement baseline machine learning algorithms from scratch in Python.

After completing this tutorial, you will know:

How to implement the random prediction algorithm.
How to implement the zero rule prediction algorithm.

Let’s get started.


Description

There are many machine learning algorithms to choose from. Hundreds in fact.

You must know whether the predictions for a given algorithm are good or not. But how do you know?

The answer is to use a baseline prediction algorithm. A baseline prediction algorithm provides a set of predictions that you can evaluate as you would any other predictions for your problem, using a measure such as classification accuracy or RMSE.

The scores from these algorithms provide the required point of comparison when evaluating all other machine learning algorithms on your problem.

Once established, you can comment on how much better a given algorithm is as compared to the naive baseline algorithm, providing context on just how good a given method actually is.
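As a minimal sketch of that comparison, the two measures mentioned above can be written as small helper functions. The names accuracy_metric() and rmse_metric() and the example labels are assumptions made for illustration; they are not part of this tutorial's own code.

```python
from math import sqrt

# Hypothetical helper: percentage of predictions that match the actual labels
def accuracy_metric(actual, predicted):
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / float(len(actual)) * 100.0

# Hypothetical helper: root mean squared error for numeric predictions
def rmse_metric(actual, predicted):
    squared_errors = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / float(len(actual)))

# Made-up labels: a naive baseline that always predicts 0, and some other model
actual = [0, 0, 0, 1, 1]
baseline_predictions = [0, 0, 0, 0, 0]
model_predictions = [0, 0, 1, 1, 1]

baseline_accuracy = accuracy_metric(actual, baseline_predictions)
model_accuracy = accuracy_metric(actual, model_predictions)
print('Baseline: %.1f%%, Model: %.1f%%' % (baseline_accuracy, model_accuracy))
```

Reporting both numbers side by side shows how much of a model's score is already achievable without learning anything from the input data.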

The two most commonly used baseline algorithms are:

Random Prediction Algorithm.
Zero Rule Algorithm.

When starting on a new problem that is trickier than a conventional classification or regression problem, it is a good idea to first devise a random prediction algorithm that is specific to your prediction problem. Later you can improve upon this and devise a zero rule algorithm.

Let’s implement these algorithms and see how they work.

Tutorial

This tutorial is divided into 2 parts:

Random Prediction Algorithm.
Zero Rule Algorithm.

These steps will provide the foundations you need to implement baseline algorithms and calculate baseline performance on your own machine learning problems.

1. Random Prediction Algorithm

The random prediction algorithm predicts a random outcome as observed in the training data.

It is perhaps the simplest algorithm to implement.

It requires that you store all of the distinct outcome values in the training data, which could be large on regression problems with lots of distinct values.

Because random numbers are used to make decisions, it is a good idea to fix the random number seed prior to using the algorithm. This is to ensure that we get the same set of random numbers, and in turn the same decisions each time the algorithm is run.
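The effect of fixing the seed can be seen in a small sketch. The seed value of 1 and the range of 10 are arbitrary choices for illustration.

```python
from random import seed
from random import randrange

# Draw five pseudo-random integers in [0, 10) after fixing the seed
seed(1)
first_run = [randrange(10) for _ in range(5)]

# Re-seeding with the same value restarts the same sequence
seed(1)
second_run = [randrange(10) for _ in range(5)]

print(first_run == second_run)
```

The comparison prints True because both runs start from the same seed. The exact numbers drawn can differ between Python versions, so only the repeatability is shown here.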

Below is an implementation of the Random Prediction Algorithm in a function named random_algorithm().

The function takes both a training dataset that includes output values and a test dataset for which output values must be predicted.

The function will work for both classification and regression problems. It assumes that the output value in the training data is the final column for each row.

First, the set of unique output values is collected from the training data. Then an output value is chosen at random from that set for each row in the test set.

# Generate random predictions
def random_algorithm(train, test):
    output_values = [row[-1] for row in train]
    unique = list(set(output_values))
    predicted = list()
    for row in test:
        index = randrange(len(unique))
        predicted.append(unique[index])
    return predicted

We can test this function with a small dataset that only contains the output column for simplicity.

The output values in the training dataset are either “0” or “1”, meaning that the set of predictions the algorithm will choose from is {0, 1}. The test set also contains a single column, with no data as the predictions are not known.

from random import seed
from random import randrange

# Generate random predictions
def random_algorithm(train, test):
    output_values = [row[-1] for row in train]
    unique = list(set(output_values))
    predicted = list()
    for row in test:
        index = randrange(len(unique))
        predicted.append(unique[index])
    return predicted

seed(1)
train = [[0], [1], [0], [1], [0], [1]]
test = [[None], [None], [None], [None]]
predictions = random_algorithm(train, test)
print(predictions)

Running the example calculates random predictions for the test dataset and prints those predictions.

[0, 1, 1, 0]
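Because the function only samples from the output values it has seen, the same code can be applied to a regression problem. The sketch below repeats random_algorithm() so that it runs on its own; the numeric training values are made up for illustration.

```python
from random import seed
from random import randrange

# Same logic as random_algorithm() in this tutorial, repeated so the sketch is self-contained
def random_algorithm(train, test):
    output_values = [row[-1] for row in train]
    unique = list(set(output_values))
    predicted = list()
    for row in test:
        index = randrange(len(unique))
        predicted.append(unique[index])
    return predicted

seed(1)
# Made-up regression outputs (assumed values, for illustration only)
train = [[10.5], [12.0], [9.8], [12.0], [11.1]]
test = [[None], [None], [None]]
predictions = random_algorithm(train, test)
print(predictions)
```

Each prediction is one of the distinct values observed in the training data (here 9.8, 10.5, 11.1 or 12.0), never a value in between.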

The random prediction algorithm is easy to implement and fast to run, but we could do better as a baseline.

2. Zero Rule Algorithm

The Zero Rule Algorithm is a better baseline than the random algorithm.

It uses more information about a given problem to create one rule in order to make predictions. This rule is different depending on the problem type.

Let’s start with classification problems, predicting a class label.

Classification

For classification problems, the one rule is to predict the class value that is most common in the training dataset. This means that if a training dataset has 90 instances of class “0” and 10 instances of class “1”, the algorithm will predict “0” for every row and achieve a baseline accuracy of 90/100 or 90%.

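This is much better than a random prediction algorithm that guesses each class in proportion to its frequency in the training data, which would achieve only 82% accuracy on average. (Choosing uniformly between the two class values, as random_algorithm() above does, would average 50%.) The 82% estimate is calculated as follows:

accuracy = ((0.9 * 0.9) + (0.1 * 0.1)) * 100 = 82%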
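The arithmetic can be reproduced with a short sketch. It assumes the random predictor draws each class with the same probability as its frequency in the data, which is what the 0.9 * 0.9 + 0.1 * 0.1 calculation models; a predictor that picks uniformly between the two class values would instead be right half the time. The function names are illustrative only.

```python
# Expected accuracy when guesses are drawn in proportion to class frequencies
# (assumption: this is the model behind the 82% figure)
def expected_accuracy_proportional(class_fractions):
    return sum(p * p for p in class_fractions) * 100.0

# Expected accuracy when each distinct class is guessed with equal probability
def expected_accuracy_uniform(class_fractions):
    k = len(class_fractions)
    return sum(p / k for p in class_fractions) * 100.0

# Accuracy of always predicting the most common class (zero rule)
def expected_accuracy_zero_rule(class_fractions):
    return max(class_fractions) * 100.0

fractions = [0.9, 0.1]
print('Zero rule: %.0f%%' % expected_accuracy_zero_rule(fractions))
print('Proportional random: %.0f%%' % expected_accuracy_proportional(fractions))
print('Uniform random: %.0f%%' % expected_accuracy_uniform(fractions))
```

For the 90/10 split this prints 90%, 82% and 50% respectively, so the zero rule is the strongest of the three on imbalanced data.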

Below is a function named zero_rule_algorithm_classification() that implements this for the classification case.

# zero rule algorithm for classification
def zero_rule_algorithm_classification(train, test):
    output_values = [row[-1] for row in train]
    prediction = max(set(output_values), key=output_values.count)
    # one prediction per row in the test set
    predicted = [prediction for i in range(len(test))]
    return predicted

The function makes use of the max() function with the key argument, which is a little clever.

Given a list of class values observed in the training data, the max() function takes a set of unique class values and calls count() on the list of class values for each class value in the set.

The result is that it returns the class value that has the highest count of observed values in the list of class values observed in the training dataset.

If several class values share the highest count, max() returns the first of them that it encounters while iterating over the set, which is not necessarily the first class value observed in the dataset.
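To see the selection step in isolation, here is a small sketch on a made-up list of labels. It also shows collections.Counter from the standard library as an alternative way to find the most common value; that alternative is an addition for comparison and is not part of this tutorial's code.

```python
from collections import Counter

# Made-up class labels: three of class '0' and two of class '1'
output_values = ['0', '0', '1', '0', '1']

# The approach used in this tutorial: count each unique value in the full list
by_max = max(set(output_values), key=output_values.count)

# A standard-library alternative: Counter tallies every value in one pass
by_counter = Counter(output_values).most_common(1)[0][0]

print(by_max, by_counter)
```

Both approaches select '0' here. The list.count approach rescans the list once per unique value, which is fine for a handful of classes; Counter may be preferable when there are many distinct values.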

Once we select a class value, it is used to make a prediction for each row in the test dataset.

Below is a worked example with a contrived dataset that contains 4 examples of class “0” and 2 examples of class “1”. We would expect the algorithm to choose the class value “0” as the prediction for each row in the test dataset.

from random import seed
from random import randrange

# zero rule algorithm for classification
def zero_rule_algorithm_classification(train, test):
    output_values = [row[-1] for row in train]
    prediction = max(set(output_values), key=output_values.count)
    # one prediction per row in the test set
    predicted = [prediction for i in range(len(test))]
    return predicted

seed(1)
train = [['0'], ['0'], ['0'], ['0'], ['1'], ['1']]
test = [[None], [None], [None], [None]]
predictions = zero_rule_algorithm_classification(train, test)
print(predictions)

Running this example makes the predictions and prints them to screen. As expected, the class value of “0” was chosen and
