From 3c18df515791d89ffcccbd02b28b959753ce5e44 Mon Sep 17 00:00:00 2001
From: Markus Bilz <github@markusbilz.com>
Date: Tue, 27 Feb 2024 18:24:36 +0100
Subject: [PATCH] feat: add section about data

---
 reports/Content/main-summary.tex | 27 +++++++++++++++------------
 reports/summary.tex              |  6 +++++-
 2 files changed, 20 insertions(+), 13 deletions(-)

diff --git a/reports/Content/main-summary.tex b/reports/Content/main-summary.tex
index 424d0d70..fdb872f3 100644
--- a/reports/Content/main-summary.tex
+++ b/reports/Content/main-summary.tex
@@ -22,15 +22,22 @@ \section{Background and Motivation}
 
 Popular heuristic to sign trades are the tick test \autocite[][]{hasbrouckTradesQuotesInventories1988}, quote rule \autocite[][]{harrisDayEndTransactionPrice1989}, and hybrids thereof such as the \gls{LR} algorithm \autocite[][]{leeInferringTradeDirection1991}. These rules have initially been proposed and tested in the stock market. For option markets, the works of \textcites[][]{savickasInferringDirectionOption2003}[][]{grauerOptionTradeClassification2022} raise concerns about the transferability of trade signing rules due to deteriorating classification accuracies and systematic misclassifications. The latter is crutial, as non-random misclassifications bias the dependent research \autocites[][]{odders-whiteOccurrenceConsequencesInaccurate2000}[][]{theissenTestAccuracyLee2001}.
 
-A contending body of research \autocites{blazejewskiLocalNonParametricModel2005}{rosenthalModelingTradeDirection2012}{ronenMachineLearningTrade2022} improves trade classification performance through machine learning. The scope of current research is yet bound to the stock market and the \textit{artificial} setting, where fully-labelled trades are available. 
+A contending body of research \autocites{blazejewskiLocalNonParametricModel2005}{rosenthalModelingTradeDirection2012}{ronenMachineLearningTrade2022} improves trade classification performance through \gls{ML}. The scope of current research is yet bound to the stock market and the \textit{artificial} setting, where fully-labelled trades are available. 
 
-The goal of our empirical study is to investigate if  machine learning-based classifier improve upon the accuracy of state-of-the-art approaches for option trade classification?
+The goal of our empirical study is to investigate whether machine learning-based classifiers improve upon the accuracy of state-of-the-art approaches for option trade classification.
 
 \section{Contributions}
 
 % Thereby, our work addresses several addressed shortcomings.
-% TODO: by how much?
-Our contributions are as follows: (I) By employing gradient-boosted trees and transformers we are able to establish a new state-of-the-art in terms of classification accuracy. (II) Our work is the first to consider both the supervised and the semi-supervised setting, where trades are partially-labelled. (III) Through a feature importance analysis based on Shapley values, we consistently attribute performance gains of rule-based and machine learning-based classifiers to feature groups. We discover that both paradigms share common features, but machine learning-based more effectively exploits the data. % Additional insights are gained from probing the Transformers' attention heads.
+% TODO: by how much? 
+% Our approaches outperform all rule-based approaches on International Securities Exchange (ISE) and Chicago Board Options Exchange (CBOE) data with comparable data requirements.
+
+Our contributions are three-fold: 
+\begin{enumerate}[label=(\roman*),noitemsep]
+\item By employing gradient-boosted trees and transformers, we establish a new state-of-the-art in terms of classification accuracy. 
+\item Our work is the first to consider both the supervised and the semi-supervised setting, where trades are partially-labelled.
+\item Through a feature importance analysis based on Shapley values, we can consistently attribute performance gains of rule-based and \gls{ML}-based classifiers to feature groups. We discover that both paradigms share common features, but \gls{ML}-based approaches more effectively exploit the data. % Additional insights are gained from probing the Transformers' attention heads.
+\end{enumerate}
 
 % consistently attribute probing attention heads
 
@@ -68,19 +75,15 @@ \section{Contributions}
 % Recent work of \textcite[][13--16]{grauerOptionTradeClassification2022} made significant progress in classification accuracy by proposing explicit overrides for order types and by combining multiple heuristics, thereby advancing the state-of-the-art performance in option trade classification. By this means, their approach enforces a more sophisticated decision boundary eventually leading to a more accurate classification. The fundamental constraint is, that overrides apply only to subsets of trades. Beyond heuristics, it remains open, if classifiers \emph{learned} on trade data can improve upon \emph{static} classification rules in terms of performance and robustness.
 
 
-
-
-
-% \section{State-of-the-Art}
-
 \section{Data}
 
-on a large-scale dataset of option trades recorded at the ISE and  previously studied in (Grauer). After a time-based train-validation-test split, we are left
+% We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding [3], which has a shared source-target vocabulary of about 37000 tokens.
 
-We train
+We perform the empirical analysis on two large-scale datasets of option trades recorded at the \gls{ISE} and \gls{CBOE}. Our sample construction follows \textcite[][]{grauerOptionTradeClassification2022}, which fosters comparability between both works. 
 
-We distinguish between three feature sets with increasing data requirement
+After a time-based train-validation-test split (60-20-20), required by the \gls{ML} estimators, we are left with two test sets spanning November 2015 to May 2017 at the \gls{ISE} and November 2015 to October 2017 at the \gls{CBOE}, respectively. Each test set contains between 9.8 and 12.8 million labelled option trades. An additional, unlabelled training set of \gls{ISE} trades executed between October 2012 and October 2013 is reserved for semi-supervised learning.
 
+To establish a common ground with rule-based classification, we distinguish three feature sets with increasing data requirements and apply minimal feature engineering. The first set is based on the data requirements of tick- and quote-based algorithms, the second on those of complex, hybrid algorithms with additional dependencies on trade size data, such as the \gls{GSU} method, and the third feature set includes option characteristics, such as the option's $\Delta$. 
 
 \section{Methodology}
 
diff --git a/reports/summary.tex b/reports/summary.tex
index b06987c9..8a6b681e 100644
--- a/reports/summary.tex
+++ b/reports/summary.tex
@@ -30,7 +30,9 @@
 	\usepackage{graphicx} % Allows to implement graphics.
 	\usepackage{subfig} % Enables graphs consisting of several figures.
 	\graphicspath{{./Graphs/}} % Tells LATEX that the images are kept in a folder named images under the directory of the main document.
-	\usepackage[hypcap=false]{caption} % Provides many ways to customize captions.	
+	\usepackage[hypcap=false]{caption} % Provides many ways to customize captions.
+	
+	\usepackage{enumitem} % enumerate with letters https://tex.stackexchange.com/a/129960
 
 % Mathematics
     \usepackage{amscd,amsfonts,amsmath,amssymb,amsthm,amscd,bbm} % Extends the math set.
@@ -74,11 +76,13 @@
 	\newacronym{EMO}{EMO}{Ellis-Michaely-O’Hara}
 	\newacronym{FFN}{FFN}{feed-forward network}
 	\newacronym{GBM}{GBM}{gradient boosting machine}
+	\newacronym{GSU}{GSU}{Grauer-Schuster-Uhrig-Homburg}
 	\newacronym{ISE}{ISE}{International Securities Exchange}
 	\newacronym{LR}{LR}{Lee-Ready}
 	\newacronym[firstplural=long short-term memories (LSTMs)]{LSTM}{LSTM}{long short-term memory}
 	\newacronym{MAE}{MAE}{mean absolute error}
 	\newacronym{MSE}{MSE}{mean squared error}
+	\newacronym{ML}{ML}{machine learning}
 	\newacronym{RMSE}{RMSE}{root mean squared error}
 	\newacronym{RF}{RF}{random forest}
 	\newacronym{SSE}{SSE}{sum of squared errors}