
Project author: Raja Sekhar Vinnakota
Project mentor: Nitesh Khandelwal
This article outlines the work undertaken by the author as part of his final project for the Executive Programme in Algorithmic Trading (EPAT) at QuantInsti. You can view the author's entire project work by clicking on the download button.
In the project, the author demonstrates the use of various machine learning techniques for forecasting mid-price movements from limit order book (LOB) dynamics. A simple trading strategy was also tested and shown to be profitable on the sample data. The sample data consists of the two trading days of NYSE TAQ OpenBook data that are available for free.
The following machine learning techniques/tools were tested on the sample data:
Simple baseline models (stratified, most_frequent): Python sklearn (DummyClassifier)
RandomForest: H2O, Python, R, Spark, XGBoost
Support Vector Machines: Python (sklearn)
Logistic Regression: Python (sklearn)
LDA: Python (sklearn)
KNN: Python (sklearn)
CART: Python (sklearn)
NB: Python (sklearn)
Autosklearn
TPOT
LSTM: Keras/Theano
This exhaustive project work was carried out on the following Amazon EC2/GPU instances:
ML model/tool | Amazon instance type | vCPU | Mem (GiB) | Storage
RandomForest/baseline/other classification models | c4.2xlarge | 8 | 15 | EBS-only
Auto-sklearn, TPOT | r3.xlarge | 4 | 30.5 | SSD, 1 x 80 GB
Deep Learning | g2.2xlarge | 32 (4 GPUs) | 15 | SSD, 1 x 60 GB
Model framework
The model framework is shown below.
The scala-openbook library (Eugene Z.) was used for parsing the NYSE TAQ data.
The orderbook-dynamics library (Eugene Z.) was used for order book construction and feature extraction. The code base was upgraded to Spark 1.6.1, and the relevant Spark ML changes were added.
The training/test dataset obtained from the feature extraction step was used for training and validating a Random Forest classification model with H2O, R, XGBoost, scikit-learn and Spark.
Various other classification models were also tested using Python's scikit-learn.
An LSTM RNN model was tested using Keras.
AutoML was tested using auto-sklearn and TPOT.
Methodology
The feature space was chosen as a subset of the feature vector set shown in the table below. Feature vectors are calculated from the LOB snapshots over a configured time window (Δ). The movement of the mid-price (the average of the best bid and best ask) was used as the class label. An upward label (0) is assigned to a data point if the mid-price one label duration (4Δ) later is larger than the mid-price at the current data point. Similarly, a downward label (1) is assigned if it is smaller, and a stationary label (2) if it is unchanged. This is implemented using two cursors (attribute + label), as shown below.
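The two-cursor labelling can be sketched in Python as follows. This is a minimal illustration, not the author's Scala/Spark implementation: it assumes one mid-price observation per time window Δ (so a label duration of 4Δ becomes a look-ahead of 4 positions), it quotes prices in integer cents to avoid floating-point ties, and the names mid_price and label_movements are hypothetical.

```python
from typing import List, Optional

# Label codes as stated in the text: up = 0, down = 1, stationary = 2.
UP, DOWN, STATIONARY = 0, 1, 2


def mid_price(best_bid: float, best_ask: float) -> float:
    """Mid-price: the average of the best bid and the best ask."""
    return (best_bid + best_ask) / 2.0


def label_movements(mids: List[float], horizon: int = 4) -> List[Optional[int]]:
    """Label each point by comparing it with the mid-price `horizon` windows later.

    Assumption: `mids` holds one mid-price per window Δ, so horizon=4
    corresponds to a label duration of 4Δ. Points near the end that have
    no future observation are labelled None.
    """
    labels: List[Optional[int]] = []
    for attr_cursor in range(len(mids)):
        label_cursor = attr_cursor + horizon  # second cursor, 4Δ ahead
        if label_cursor >= len(mids):
            labels.append(None)
            continue
        current, future = mids[attr_cursor], mids[label_cursor]
        if future > current:
            labels.append(UP)
        elif future < current:
            labels.append(DOWN)
        else:
            labels.append(STATIONARY)
    return labels


# Made-up (best bid, best ask) quotes in cents, one per window Δ.
quotes = [(1000, 1002), (1001, 1003), (1000, 1002), (999, 1001),
          (1001, 1003), (1001, 1003), (999, 1001)]
mids = [mid_price(bid, ask) for bid, ask in quotes]
labels = label_movements(mids, horizon=4)
print(labels)  # [0, 2, 1, None, None, None, None]
```

In practice the unlabelled tail would be dropped before training, since those points have no observation 4Δ ahead.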


Feature Vector Set 5 levels of the LOB
The author used the training/test set to measure performance. To validate the model, performance was measured using the following measures:
1. Precision: P = #(correctly labeled y)/ #(y in the predictions)
2. Recall: R = #(correctly labeled y)/ #(y in the sample)
3. F1 measure: F1 = 2PR/(P + R)
4. Balanced Accuracy
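The four measures above can be computed by hand for a three-class problem, as in the sketch below. It is standard-library only and is not the author's evaluation code. The text does not define balanced accuracy, so the sketch assumes the common definition (the mean of the per-class recalls). The function names and the toy labels are hypothetical.

```python
from typing import Dict, List, Tuple


def per_class_scores(y_true: List[int], y_pred: List[int]) -> Dict[int, Tuple[float, float, float]]:
    """Return {label: (precision, recall, f1)} following the formulas in the text."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
        predicted = sum(1 for p in y_pred if p == label)  # #(y in the predictions)
        actual = sum(1 for t in y_true if t == label)     # #(y in the sample)
        precision = correct / predicted if predicted else 0.0
        recall = correct / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def balanced_accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Assumed definition: mean recall over the classes present in y_true."""
    scores = per_class_scores(y_true, y_pred)
    classes = sorted(set(y_true))
    return sum(scores[c][1] for c in classes) / len(classes)


# Toy labels using the codes from the text: 0 = up, 1 = down, 2 = stationary.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

scores = per_class_scores(y_true, y_pred)
for label, (p, r, f1) in scores.items():
    print(f"class {label}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
print(f"balanced accuracy={balanced_accuracy(y_true, y_pred):.3f}")
```

Unlike plain accuracy, the mean of per-class recalls is not dominated by the most frequent class.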
The author used various machine learning techniques on the ORCL sample data. We are listing some of the important results and comparisons. To view the complete analysis, check the attached project report.
The RandomForest had the best accuracy measures (balanced dataset), given that the feature space was nonlinear.
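The article does not say how the balanced dataset was produced. One common approach, shown here purely as an assumption, is to downsample every class to the size of the smallest one. The sketch below uses only the standard library; the function name downsample_to_balance, the seed, and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict
from typing import Any, List, Tuple


def downsample_to_balance(rows: List[Any], labels: List[int], seed: int = 0) -> List[Tuple[Any, int]]:
    """Randomly downsample each class to the size of the smallest class.

    Assumption: this is one possible way to balance the classes, not
    necessarily the method used in the project.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    smallest = min(len(members) for members in by_class.values())
    balanced = []
    for label, members in by_class.items():
        for row in rng.sample(members, smallest):
            balanced.append((row, label))
    rng.shuffle(balanced)  # avoid leaving the output grouped by class
    return balanced


# Toy data: stationary (2) dominates, as an unbalanced label set might.
labels = [2] * 6 + [0] * 3 + [1] * 2
rows = list(range(len(labels)))  # stand-ins for feature vectors

balanced = downsample_to_balance(rows, labels)
counts = Counter(label for _, label in balanced)
print(sorted(counts.items()))  # [(0, 2), (1, 2), (2, 2)]
```

Downsampling discards data from the larger classes. Class weighting is a common alternative when the dataset is small.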

XGBoost had the best accuracy measures for Random Forest (10 trees) among the different tools tested on the balanced dataset.

The precision, recall and F1 measure obtained using sklearn (Random Forest, optimized parameters, balanced dataset), which had the best balanced accuracy, are shown below.

The author tested a simple strategy using ORCL data. The following table lists the rules and assumptions.


Next Step: Click on the download button to view the entire 50-page project report. You can also check our EPAT Project Work page and have a look at what our students are building. If you want to learn various aspects of Algorithmic Trading, check out the Executive Programme in Algorithmic Trading (EPAT). EPAT equips you with the skill sets required to be a successful algo trader.