feat: add section about data
KarelZe committed Feb 27, 2024
1 parent 27ff5b0 commit 3c18df5
Showing 2 changed files with 20 additions and 13 deletions.
27 changes: 15 additions & 12 deletions reports/Content/main-summary.tex
@@ -22,15 +22,22 @@ \section{Background and Motivation}

Popular heuristics for signing trades are the tick test \autocite[][]{hasbrouckTradesQuotesInventories1988}, the quote rule \autocite[][]{harrisDayEndTransactionPrice1989}, and hybrids thereof such as the \gls{LR} algorithm \autocite[][]{leeInferringTradeDirection1991}. These rules were initially proposed and tested in the stock market. For option markets, the works of \textcites[][]{savickasInferringDirectionOption2003}[][]{grauerOptionTradeClassification2022} raise concerns about the transferability of trade signing rules due to deteriorating classification accuracies and systematic misclassifications. The latter is crucial, as non-random misclassifications bias dependent research \autocites[][]{odders-whiteOccurrenceConsequencesInaccurate2000}[][]{theissenTestAccuracyLee2001}.
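
For illustration, the three rules named above can be sketched in a few lines of Python. This is a simplified sketch under stated assumptions, not the implementation used in the cited works: the function names, the encoding (1 for buy, -1 for sell, 0 for unclassified), and the use of contemporaneous quotes without any quote lag are all choices made for this example.

```python
def quote_rule(price, bid, ask):
    """Classify a trade by comparing its price to the quote midpoint."""
    mid = (bid + ask) / 2
    if price > mid:
        return 1   # above the midpoint: buyer-initiated
    if price < mid:
        return -1  # below the midpoint: seller-initiated
    return 0       # at the midpoint: the quote rule cannot decide


def tick_test(price, prev_prices):
    """Classify a trade by comparing its price to the last different price.

    prev_prices holds earlier trade prices in chronological order.
    """
    for prev in reversed(prev_prices):
        if price > prev:
            return 1   # uptick: buyer-initiated
        if price < prev:
            return -1  # downtick: seller-initiated
    return 0           # no earlier price differs: unclassified


def lee_ready(price, bid, ask, prev_prices):
    """Hybrid rule: quote rule first, tick test for midpoint trades."""
    side = quote_rule(price, bid, ask)
    return side if side != 0 else tick_test(price, prev_prices)
```

A trade at the midpoint, such as `lee_ready(1.5, 1.0, 2.0, [2.0, 1.5])`, falls through to the tick test, which looks past the equal price to the last price that differs.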

A contending body of research \autocites{blazejewskiLocalNonParametricModel2005}{rosenthalModelingTradeDirection2012}{ronenMachineLearningTrade2022} improves trade classification performance through machine learning. The scope of current research is yet bound to the stock market and the \textit{artificial} setting, where fully-labelled trades are available.
A contending body of research \autocites{blazejewskiLocalNonParametricModel2005}{rosenthalModelingTradeDirection2012}{ronenMachineLearningTrade2022} improves trade classification performance through \gls{ML}. However, the scope of current research is still confined to the stock market and to the \textit{artificial} setting, in which fully-labelled trades are available.

The goal of our empirical study is to investigate whether machine learning-based classifiers improve upon the accuracy of state-of-the-art approaches for option trade classification.

\section{Contributions}

% Thereby, our work addresses several addressed shortcomings.
% TODO: by how much?
Our contributions are as follows: (I) By employing gradient-boosted trees and transformers we are able to establish a new state-of-the-art in terms of classification accuracy. (II) Our work is the first to consider both the supervised and the semi-supervised setting, where trades are partially-labelled. (III) Through a feature importance analysis based on Shapley values, we consistently attribute performance gains of rule-based and machine learning-based classifiers to feature groups. We discover that both paradigms share common features, but machine learning-based more effectively exploits the data. % Additional insights are gained from probing the Transformers' attention heads.
% TODO: by how much?
% Our approaches outperform all rule-based approaches on International Securities Exchange (ISE) and Chicago Board Options Exchange (CBOE) data with comparable data requirements.

Our contributions are three-fold:
\begin{enumerate}[label=(\roman*),noitemsep]
    \item By employing gradient-boosted trees and transformers, we establish a new state of the art in classification accuracy.
    \item Our work is the first to consider both the supervised setting and the semi-supervised setting, in which trades are only partially labelled.
    \item Through a feature importance analysis based on Shapley values, we can consistently attribute performance gains of rule-based and \gls{ML}-based classifiers to feature groups. We discover that both paradigms share common features, but \gls{ML}-based approaches exploit the data more effectively. % Additional insights are gained from probing the Transformers' attention heads.
\end{enumerate}

% consistently attribute probing attention heads

@@ -68,19 +75,15 @@ \section{Contributions}
% Recent work of \textcite[][13--16]{grauerOptionTradeClassification2022} made significant progress in classification accuracy by proposing explicit overrides for order types and by combining multiple heuristics, thereby advancing the state-of-the-art performance in option trade classification. By this means, their approach enforces a more sophisticated decision boundary eventually leading to a more accurate classification. The fundamental constraint is, that overrides apply only to subsets of trades. Beyond heuristics, it remains open, if classifiers \emph{learned} on trade data can improve upon \emph{static} classification rules in terms of performance and robustness.





% \section{State-of-the-Art}

\section{Data}

on a large-scale dataset of option trades recorded at the ISE and previously studied in (Grauer). After a time-based train-validation-test split, we are left
% We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding [3], which has a shared sourcetarget vocabulary of about 37000 tokens.

We train
We perform the empirical analysis on two large-scale datasets of option trades recorded at the \gls{ISE} and \gls{CBOE}. Our sample construction follows \textcite[][]{grauerOptionTradeClassification2022}, which fosters comparability between both works.

We distinguish between three feature sets with increasing data requirement
After a time-based train-validation-test split (60--20--20), required by the \gls{ML} estimators, we are left with two test sets spanning November 2015 to May 2017 at the \gls{ISE} and November 2015 to October 2017 at the \gls{CBOE}, respectively. The test sets contain between 9.8 and 12.8 million labelled option trades. An additional, unlabelled training set of \gls{ISE} trades executed between October 2012 and October 2013 is reserved for semi-supervised learning.
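
A time-based 60--20--20 split can be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure used in the study: the function name, the `trade_date` field, and splitting by row counts after sorting (rather than at fixed calendar dates) are all hypothetical choices made for this example.

```python
from datetime import date


def time_based_split(trades, train_frac=0.6, val_frac=0.2):
    """Split trades chronologically into train, validation, and test sets.

    Sorting by date before slicing ensures that every training trade
    precedes every validation trade, which in turn precedes every test
    trade, so no future information leaks into earlier sets.
    """
    ordered = sorted(trades, key=lambda t: t["trade_date"])
    n = len(ordered)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Toy usage with ten hypothetical trades in shuffled order.
toy_trades = [
    {"trade_date": date(2015, 1, d), "id": d}
    for d in (5, 1, 9, 3, 7, 2, 10, 4, 8, 6)
]
train, val, test = time_based_split(toy_trades)
```

Unlike a random split, the test set here always lies at the end of the sample period, which mirrors how a classifier would be applied to trades it has not yet seen.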

To establish common ground with rule-based classification, we distinguish three feature sets with increasing data requirements and apply minimal feature engineering. The first set is based on the data requirements of tick/quote-based algorithms; the second on those of complex hybrid algorithms with additional dependencies on trade size data, such as the \gls{GSU} method; and the third includes option characteristics, like the option's $\Delta$.

\section{Methodology}

6 changes: 5 additions & 1 deletion reports/summary.tex
@@ -30,7 +30,9 @@
\usepackage{graphicx} % Allows to implement graphics.
\usepackage{subfig} % Enables graphs consisting of several figures.
\graphicspath{{./Graphs/}} % Tells LaTeX that the images are kept in a folder named Graphs under the directory of the main document.
\usepackage[hypcap=false]{caption} % Provides many ways to customize captions.

\usepackage{enumitem} % enumerate with letters https://tex.stackexchange.com/a/129960

% Mathematics
\usepackage{amscd,amsfonts,amsmath,amssymb,amsthm,amscd,bbm} % Extends the math set.
@@ -74,11 +76,13 @@
\newacronym{EMO}{EMO}{Ellis-Michaely-O’Hara}
\newacronym{FFN}{FFN}{feed-forward network}
\newacronym{GBM}{GBM}{gradient boosting machine}
\newacronym{GSU}{GSU}{Grauer-Schuster-Uhrig-Homburg}
\newacronym{ISE}{ISE}{International Securities Exchange}
\newacronym{LR}{LR}{Lee-Ready}
\newacronym[firstplural=long short-term memories (LSTMs)]{LSTM}{LSTM}{long short-term memory}
\newacronym{MAE}{MAE}{mean absolute error}
\newacronym{MSE}{MSE}{mean squared error}
\newacronym{ML}{ML}{machine learning}
\newacronym{RMSE}{RMSE}{root mean squared error}
\newacronym{RF}{RF}{random forest}
\newacronym{SSE}{SSE}{sum of squared errors}