From 3c18df515791d89ffcccbd02b28b959753ce5e44 Mon Sep 17 00:00:00 2001
From: Markus Bilz
Date: Tue, 27 Feb 2024 18:24:36 +0100
Subject: [PATCH] feat: add section about data
---
 reports/Content/main-summary.tex | 27 +++++++++++++++------------
 reports/summary.tex | 6 +++++-
 2 files changed, 20 insertions(+), 13 deletions(-)
diff --git a/reports/Content/main-summary.tex b/reports/Content/main-summary.tex
index 424d0d70..fdb872f3 100644
--- a/reports/Content/main-summary.tex
+++ b/reports/Content/main-summary.tex
@@ -22,15 +22,22 @@ \section{Background and Motivation}
 Popular heuristic to sign trades are the tick test \autocite[][]{hasbrouckTradesQuotesInventories1988}, quote rule \autocite[][]{harrisDayEndTransactionPrice1989}, and hybrids thereof such as the \gls{LR} algorithm \autocite[][]{leeInferringTradeDirection1991}. These rules have initially been proposed and tested in the stock market. For option markets, the works of \textcites[][]{savickasInferringDirectionOption2003}[][]{grauerOptionTradeClassification2022} raise concerns about the transferability of trade signing rules due to deteriorating classification accuracies and systematic misclassifications. The latter is crutial, as non-random misclassifications bias the dependent research \autocites[][]{odders-whiteOccurrenceConsequencesInaccurate2000}[][]{theissenTestAccuracyLee2001}.
-A contending body of research \autocites{blazejewskiLocalNonParametricModel2005}{rosenthalModelingTradeDirection2012}{ronenMachineLearningTrade2022} improves trade classification performance through machine learning. The scope of current research is yet bound to the stock market and the \textit{artificial} setting, where fully-labelled trades are available.
+A contending body of research \autocites{blazejewskiLocalNonParametricModel2005}{rosenthalModelingTradeDirection2012}{ronenMachineLearningTrade2022} improves trade classification performance through \gls{ML}.
The scope of current research is still limited to the stock market and the \textit{artificial} setting, where fully-labelled trades are available.
-The goal of our empirical study is to investigate if machine learning-based classifier improve upon the accuracy of state-of-the-art approaches for option trade classification?
+The goal of our empirical study is to investigate whether machine learning-based classifiers improve upon the accuracy of state-of-the-art approaches for option trade classification.
 \section{Contributions}
 % Thereby, our work addresses several addressed shortcomings.
-% TODO: by how much?
-Our contributions are as follows: (I) By employing gradient-boosted trees and transformers we are able to establish a new state-of-the-art in terms of classification accuracy. (II) Our work is the first to consider both the supervised and the semi-supervised setting, where trades are partially-labelled. (III) Through a feature importance analysis based on Shapley values, we consistently attribute performance gains of rule-based and machine learning-based classifiers to feature groups. We discover that both paradigms share common features, but machine learning-based more effectively exploits the data. % Additional insights are gained from probing the Transformers' attention heads.
+% TODO: by how much?
+% Our approaches outperform all rule-based approaches on International Securities Exchange (ISE) and Chicago Board Options Exchange (CBOE) data with comparable data requirements.
+
+Our contributions are three-fold:
+\begin{enumerate}[label=(\roman*),noitemsep]
+\item By employing gradient-boosted trees and transformers, we establish a new state-of-the-art in terms of classification accuracy.
+\item Our work is the first to consider both the supervised and the semi-supervised setting, where trades are partially-labelled.
+\item Through a feature importance analysis based on Shapley values, we can consistently attribute performance gains of rule-based and \gls{ML}-based classifiers to feature groups. We discover that both paradigms share common features, but \gls{ML}-based approaches more effectively exploit the data. % Additional insights are gained from probing the Transformers' attention heads.
+\end{enumerate}
 % consistently attribute probing attention heads
@@ -68,19 +75,15 @@ \section{Contributions}
 % Recent work of \textcite[][13--16]{grauerOptionTradeClassification2022} made significant progress in classification accuracy by proposing explicit overrides for order types and by combining multiple heuristics, thereby advancing the state-of-the-art performance in option trade classification. By this means, their approach enforces a more sophisticated decision boundary eventually leading to a more accurate classification. The fundamental constraint is, that overrides apply only to subsets of trades. Beyond heuristics, it remains open, if classifiers \emph{learned} on trade data can improve upon \emph{static} classification rules in terms of performance and robustness.
-
-
-
-% \section{State-of-the-Art}
-
 \section{Data}
-on a large-scale dataset of option trades recorded at the ISE and previously studied in (Grauer). After a time-based train-validation-test split, we are left
+% We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding [3], which has a shared source-target vocabulary of about 37000 tokens.
-We train
+We perform the empirical analysis on two large-scale datasets of option trades recorded at the \gls{ISE} and \gls{CBOE}. Our sample construction follows \textcite[][]{grauerOptionTradeClassification2022}, which fosters comparability between both works.
-We distinguish between three feature sets with increasing data requirement
+After a time-based train-validation-test split (60-20-20), required by the \gls{ML} estimators, we are left with two test sets spanning November 2015 -- May 2017 at the \gls{ISE} and November 2015 -- October 2017 at the \gls{CBOE}, respectively. Each test set contains between 9.8 and 12.8 million labelled option trades. An additional unlabelled training set of \gls{ISE} trades executed between October 2012 and October 2013 is reserved for semi-supervised learning.
+To establish a common ground with rule-based classification, we distinguish three feature sets with increasing data requirements and apply minimal feature engineering. The first set is based on the data requirements of tick/quote-based algorithms; the second on those of complex hybrid algorithms with additional dependencies on trade size data, such as the \gls{GSU} method; and the third feature set includes option characteristics, like the option's $\Delta$.
 \section{Methodology}
diff --git a/reports/summary.tex b/reports/summary.tex
index b06987c9..8a6b681e 100644
--- a/reports/summary.tex
+++ b/reports/summary.tex
@@ -30,7 +30,9 @@
  \usepackage{graphicx} % Allows to implement graphics.
  \usepackage{subfig} % Enables graphs consisting of several figures.
  \graphicspath{{./Graphs/}} % Tells LATEX that the images are kept in a folder named images under the directory of the main document.
- \usepackage[hypcap=false]{caption} % Provides many ways to customize captions.
+ \usepackage[hypcap=false]{caption} % Provides many ways to customize captions.
+
+ \usepackage{enumitem} % enumerate with letters https://tex.stackexchange.com/a/129960
  % Mathematics
  \usepackage{amscd,amsfonts,amsmath,amssymb,amsthm,amscd,bbm} % Extends the math set.
@@ -74,11 +76,13 @@
  \newacronym{EMO}{EMO}{Ellis-Michaely-O’Hara}
  \newacronym{FFN}{FFN}{feed-forward network}
  \newacronym{GBM}{GBM}{gradient boosting machine}
+ \newacronym{GSU}{GSU}{Grauer-Schuster-Uhrig-Homburg}
  \newacronym{ISE}{ISE}{International Securities Exchange}
  \newacronym{LR}{LR}{Lee-Ready}
  \newacronym[firstplural=long short-term memories (LSTMs)]{LSTM}{LSTM}{long short-term memory}
  \newacronym{MAE}{MAE}{mean absolute error}
  \newacronym{MSE}{MSE}{mean squared error}
+ \newacronym{ML}{ML}{machine learning}
  \newacronym{RMSE}{RMSE}{root mean squared error}
  \newacronym{RF}{RF}{random forest}
  \newacronym{SSE}{SSE}{sum of squared errors}