More examples on fast manipulations with data using numpy.
The first part may be found here.
In[1]:
from __future__ import print_function
import numpy as np
Rolling window, strided tricks
When working with time series or images, it is frequently necessary to perform operations on windows.
The simplest case is taking the mean over a running window:
In[2]:
sequence = np.random.normal(size=10000) + np.arange(10000)
A very bad idea is to do this in pure Python:
In[3]:
def running_average_simple(seq, window=100):
    # note: this allocates len(seq) - window entries, so the final window is skipped
    # (the versions below return len(seq) - window + 1 values)
    result = np.zeros(len(seq) - window)
    for i in range(len(result)):
        result[i] = np.mean(seq[i:i + window])
    return result

running_average_simple(sequence)
Out[3]:
array([ 49.43051858, 50.42845047, 51.43946518, ..., 9946.35091814, 9947.34962938, 9948.35901262])
A bit better is to use as_strided:
In[4]:
from numpy.lib.stride_tricks import as_strided

def running_average_strides(seq, window=100):
    # byte distance between consecutive elements of seq
    stride = seq.strides[0]
    # 2D view: row i holds seq[i:i + window]
    sequence_strides = as_strided(seq, shape=[len(seq) - window + 1, window], strides=[stride, stride])
    return sequence_strides.mean(axis=1)
In[5]:
running_average_strides(sequence)
Out[5]:
array([ 49.43051858, 50.42845047, 51.43946518, ..., 9947.34962938, 9948.35901262, 9949.35015443])
Computationally, as_strided does nothing: there are no copies and no computations. It only returns a new view of the data, which is two-dimensional this time.
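To see that as_strided only creates a view, one can check that the windowed array shares memory with the original. The following is a minimal sketch on a small illustrative array `a` (hypothetical, not part of the original notebook), using `np.shares_memory` as the check:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

# Small illustrative array (hypothetical, not the `sequence` used above)
a = np.arange(6, dtype=np.float64)
window = 3
stride, = a.strides  # bytes between consecutive elements (8 for float64)

# Row i of the view starts one element later than row i - 1
windows = as_strided(a, shape=(len(a) - window + 1, window),
                     strides=(stride, stride))

print(windows.shape)                 # (4, 3)
print(np.shares_memory(a, windows))  # True: no data was copied
print(windows[1])                    # [1. 2. 3.]
```

Because the view aliases the original buffer, writing through it modifies `a`, and an incorrect shape or strides can read outside the array, so it is usually safest to treat such views as read-only.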
However, the right way to compute the mean over a rolling window is to use numpy.cumsum
(this one is unbeatable in speed if n is not small):
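In[6]:
def running_average_cumsum(seq, window=100):
    # prepend 0 so that s[k] is the sum of the first k elements
    s = np.insert(np.cumsum(seq), 0, [0])
    # the sum over each window is the difference of two prefix sums
    return (s[window:] - s[:-window]) * (1. / window)
In[7]:
running_average_cumsum(sequence)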
Out[7]:
array([ 49.43051858, 50.42845047, 51.43946518, ..., 9947.34962938, 9948.35901262, 9949.35015443])
See also for this purpose:
scipy.signal.smooth
pandas.rolling_mean and similar functions (Series.rolling(window).mean() in current pandas versions)
Remark: numpy.cumsum is equivalent to numpy.add.accumulate, but there are also:
numpy.maximum.accumulate, numpy.minimum.accumulate: running max and min
numpy.multiply.accumulate, which is equivalent to numpy.cumprod
Remark: for computing a rolling mean, numpy.cumsum is best; however, for other window statistics such as min/max/percentile, use the strides trick.
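As an illustration of the last remark, a rolling maximum can be sketched with the same strided view used for the mean. The helper name `running_max_strides` and the sample data are hypothetical, chosen only for this example:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def running_max_strides(seq, window=100):
    # Hypothetical helper mirroring running_average_strides above
    seq = np.asarray(seq)
    stride, = seq.strides
    # 2D view without copying: row i holds seq[i:i + window]
    windows = as_strided(seq, shape=(len(seq) - window + 1, window),
                         strides=(stride, stride))
    # Reduce each window to its maximum
    return windows.max(axis=1)

values = np.array([3., 1., 4., 1., 5., 9., 2., 6.])
result = running_max_strides(values, window=3)
print(result)  # [4. 4. 5. 9. 9. 9.]
```

Replacing `max` with `min`, or applying `np.percentile(windows, q, axis=1)` to the view, gives the other window statistics in the same way.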
Strides and training on sequences
ML algorithms in Python often take numpy arrays as input. In many cases when working with sequences, you need to pass the same data many times as part of different chunks.
Example: you have exchange rates for a year, and you want GBDT to predict the next exchange rate based on the previous 10.
In[8]:
window = 10
rates = np.random.normal(size=1000)
# target in training
y = rates[window:]
Typically the solution used is:
In[9]:
X1 = np.zeros([len(rates) - window, window])
for day in range(len(X1)):
    X1[day, :] = rates[day:day + window]
But the strides trick is a better way, since it doesn't need additional space:
In[10]:
stride, = rates.strides
X2 = as_strided(