Establishing a baseline is essential on any time series forecasting problem.
A performance baseline gives you an idea of how well all other models actually perform on your problem.
In this tutorial, you will discover how to develop a persistence forecast that you can use to calculate a baseline level of performance on a time series dataset with Python.
After completing this tutorial, you will know:
- The importance of calculating a baseline of performance on time series forecast problems.
- How to develop a persistence model from scratch in Python.
- How to evaluate the forecast from a persistence model and use it to establish a baseline in performance.

Let's get started.

How to Make Baseline Predictions for Time Series Forecasting with Python
Photo by Bernard Spragg. NZ, some rights reserved.
Forecast Performance Baseline

A baseline in forecast performance provides a point of comparison.
It is a point of reference for all other modeling techniques on your problem. If a model achieves performance at or below the baseline, the technique should be fixed or abandoned.
The technique used to generate a forecast to calculate the baseline performance must be easy to implement and naive of problem-specific details.
Before you can establish a performance baseline on your forecast problem, you must develop a test harness. This is comprised of:
- The dataset you intend to use to train and evaluate models.
- The resampling technique you intend to use to estimate the performance of the technique (e.g. train/test split).
- The performance measure you intend to use to evaluate forecasts (e.g. mean squared error).

Once prepared, you then need to select a naive technique that you can use to make a forecast and calculate the baseline performance.
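The three test harness components can be sketched in a few lines. This is a minimal illustration with placeholder observations (not the Shampoo Sales data), and the function name is my own, not part of the tutorial:

```python
# Minimal test harness sketch: a train/test split plus mean squared error.
# The observations below are placeholder values, not the Shampoo Sales data.
data = [266.0, 145.9, 183.1, 119.3, 180.3, 168.5]

# resampling: keep the first 66% of observations for training, the rest for evaluation
train_size = int(len(data) * 0.66)
train, test = data[:train_size], data[train_size:]

# performance measure: mean squared error between expected and predicted values
def mean_squared_error(expected, predicted):
    return sum((e - p) ** 2 for e, p in zip(expected, predicted)) / len(expected)

print(len(train), len(test))                          # 3 3
print(mean_squared_error([1.0, 2.0], [1.0, 3.0]))     # 0.5
```

With the harness fixed, any forecasting technique can be dropped in and scored the same way.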
The goal is to get a baseline performance on your time series forecast problem as quickly as possible so that you can get to work better understanding the dataset and developing more advanced models.
Three properties of a good technique for making a baseline forecast are:
- Simple: A method that requires little or no training or intelligence.
- Fast: A method that is fast to implement and computationally trivial to make a prediction.
- Repeatable: A method that is deterministic, meaning that it produces an expected output given the same input.

A common algorithm used in establishing a baseline performance is the persistence algorithm.
Persistence Algorithm (the "naive" forecast)

The most common baseline method for supervised machine learning is the Zero Rule algorithm.
This algorithm predicts the majority class in the case of classification, or the average outcome in the case of regression. This could be used for time series, but does not respect the serial correlation structure in time series datasets.
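For concreteness, here is a minimal sketch of the Zero Rule algorithm for regression (the helper name is my own, not from the tutorial): it predicts the mean of the training outcomes for every test case, ignoring time order entirely.

```python
# Zero Rule baseline for regression: always predict the mean of the
# training observations, regardless of any ordering in the data.
def zero_rule_forecast(train, n_predictions):
    mean = sum(train) / len(train)
    return [mean] * n_predictions

print(zero_rule_forecast([10.0, 20.0, 30.0], 2))  # [20.0, 20.0]
```

Because it discards the serial correlation in a time series, it usually sets a weaker (easier to beat) baseline than persistence.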
The equivalent technique for use with time series datasets is the persistence algorithm.
The persistence algorithm uses the value at the previous time step (t-1) to predict the expected outcome at the next time step (t+1).
This satisfies the three above conditions for a baseline forecast.
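As a quick toy illustration (a sketch of the idea only, separate from the worked example that follows), persistence amounts to shifting the series by one step:

```python
# Persistence forecast on a toy series: the forecast for each step
# is simply the observation from the previous time step.
observations = [266.0, 145.9, 183.1, 119.3]
predictions = observations[:-1]   # value at t-1 used as the forecast
expected = observations[1:]       # actual values at the forecast steps

print(predictions)  # [266.0, 145.9, 183.1]
print(expected)     # [145.9, 183.1, 119.3]
```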
To make this concrete, we will look at how to develop a persistence model and use it to establish a baseline performance for a simple univariate time series problem. First, let’s review the Shampoo Sales dataset.
Shampoo Sales Dataset

This dataset describes the monthly number of shampoo sales over a 3-year period.
The units are a sales count and there are 36 observations. The original dataset is credited to Makridakis, Wheelwright, and Hyndman (1998).
Below is a sample of the first 5 rows of data, including the header row.
"Month","Sales" "1-01",266.0 "1-02",145.9 "1-03",183.1 "1-04",119.3 "1-05",180.3Below is a plot of the entire dataset taken from Data Market where you can download the dataset and learn more about it.

Shampoo Sales Dataset
The dataset shows an increasing trend, and possibly some seasonal component.
Download the dataset and place it in the current working directory with the filename "shampoo-sales.csv".
The following snippet of code will load the Shampoo Sales dataset and plot the time series.
```python
from datetime import datetime
from pandas import read_csv
from matplotlib import pyplot

# parse year-month strings like '1-01' as dates in the 1900s
def parser(x):
    return datetime.strptime('190' + x, '%Y-%m')

series = read_csv('shampoo-sales.csv', header=0, parse_dates=[0],
                  index_col=0, date_parser=parser).squeeze('columns')
series.plot()
pyplot.show()
```

Running the example plots the time series, as follows:

Shampoo Sales Dataset Plot
Persistence Algorithm

A persistence model can be implemented easily in Python.
We will break this section down into 5 steps:

1. Transform the univariate dataset into a supervised learning problem.
2. Establish the train and test datasets for the test harness.
3. Define the persistence model.
4. Make a forecast and establish a baseline performance.
5. Review the complete example and plot the output.

Let's dive in.
Step 1: Define the Supervised Learning Problem

The first step is to load the dataset and create a lagged representation. That is, given the observation at t-1, predict the observation at t+1.
```python
from pandas import DataFrame, concat

# Create lagged dataset
values = DataFrame(series.values)
dataframe = concat([values.shift(1), values], axis=1)
dataframe.columns = ['t-1', 't+1']
print(dataframe.head(5))
```

This snippet creates the dataset and prints the first 5 rows of the new dataset.
We can see that the first row (index 0) will have to be discarded as there was no observation prior to the first observation to use to make the prediction.
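As an aside, the NaN row can also be removed with pandas' dropna() rather than sliced off later; a small sketch with made-up values:

```python
from pandas import DataFrame, concat

# build a tiny lagged dataset, then drop the row whose input is NaN
values = DataFrame([266.0, 145.9, 183.1])
dataframe = concat([values.shift(1), values], axis=1)
dataframe.columns = ['t-1', 't+1']
dataframe = dataframe.dropna()
print(dataframe.shape)  # (2, 2)
```

The tutorial instead discards the first row during the train/test split, which has the same effect.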
From a supervised learning perspective, the t-1 column is the input variable, or X, and the t+1 column is the output variable, or y.
     t-1    t+1
0    NaN  266.0
1  266.0  145.9
2  145.9  183.1
3  183.1  119.3
4  119.3  180.3

Step 2: Train and Test Sets

The next step is to separate the dataset into train and test sets.
We will keep the first 66% of the observations for “training” and the remaining 34% for evaluation. During the split, we are careful to exclude the first row of data with the NaN value.
No training is actually required in this case, because the persistence model has nothing to learn; we keep the split for consistency with the test harness. Each of the train and test sets is then split into the input and output variables.
```python
# split into train and test sets
X = dataframe.values
train_size = int(len(X) * 0.66)
train, test = X[1:train_size], X[train_size:]
train_X, train_y = train[:, 0], train[:, 1]
test_X, test_y = test[:, 0], test[:, 1]
```

Step 3: Persistence Algorithm

We can define our persistence model as a function that returns the value provided as input.
For example, if the t-1 value of 266.0 was provided, then this is returned as the prediction, whereas the actual expected value happens to be 145.9 (taken from the first usable row in our lagged dataset).
```python
# persistence model
def model_persistence(x):
    return x
```