diff --git a/reports/Content/main-summary.tex b/reports/Content/main-summary.tex
index 378148e8..4f4c5e57 100644
--- a/reports/Content/main-summary.tex
+++ b/reports/Content/main-summary.tex
@@ -1,18 +1,4 @@
-% The dominant sequence transduction models are based on complex recurrent or
-% convolutional neural networks that include an encoder and a decoder. The best
-% performing models also connect the encoder and decoder through an attention
-% mechanism. We propose a new simple network architecture, the Transformer,
-% based solely on attention mechanisms, dispensing with recurrence and convolutions
-% entirely. Experiments on two machine translation tasks show these models to
-% be superior in quality while being more parallelizable and requiring significantly
-% less time to train. Our model achieves 28.4 BLEU on the WMT 2014 Englishto-German translation task, improving over the existing best results, including
-% ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task,
-% our model establishes a new single-model state-of-the-art BLEU score of 41.8 after
-% training for 3.5 days on eight GPUs, a small fraction of the training costs of the
-% best models from the literature. We show that the Transformer generalizes well to
-% other tasks by applying it successfully to English constituency parsing both with
-% large and limited training data
 
 \section{Background and Motivation}
 
@@ -28,52 +14,13 @@ \section{Background and Motivation}
 
 \section{Contributions}
 
-% Thereby, our work addresses several addressed shortcomings.
-% TODO: by how much?
-% Our approaches outperform all rule-based approaches on International Securities Exchange (ISE) and Chicago Board Options Exchange (CBOE) data with comparable data requirements.
-
 Our contributions are three-fold:
 \begin{enumerate}[label=(\roman*),noitemsep]
-% Our We perform rigorous benchmarking.
-\item By employing gradient-boosted trees and transformers we are able to establish a new state-of-the-art in terms of classification accuracy.
-\item Our work is the first to consider both the supervised and the semi-supervised setting, where trades are partially-labelled.
-\item Through a feature importance analysis based on Shapley values, we can consistently attribute performance gains of rule-based and \gls{ML}-based classifiers to feature groups. We discover that both paradigms share common features, but \gls{ML}-based approaches more effectively exploit the data. % Additional insights are gained from probing the Transformers' attention heads.
-\end{enumerate}
-
-% consistently attribute probing attention heads
-
-
-% rendering irrelevant
-
-
-% We assess the performance
-% Wee apply and where only a small fraction of trades can be labelled and peform a regorous benchmarking.
-
-
-% In summary, machine learning has been applied successfully in the context of trade
-% classification. A summary is given in Appendix A.1. No previous work performs
-% machine learning-based classification in the options markets. Our work fills this gap
-% and models trade classification using machine learning to improve upon extant rules.
-
-
-
-% The dataset is split into three disjoint sets for training, validation, and testing. As in \textcite{ellisAccuracyTradeClassification2000} and \textcite{ronenMachineLearningTrade2022} we perform a classical train-test split, thereby maintaining the temporal ordering within the data.
-
-% So far research focuses on the None of which have been tested
-% hier ml based algos beschreiben
-
-
-% TODO: case of partially labelled dat
+
-% TODO: Hier überleitung ML einführen.
-% Envolving
-
-
-
-
-% Recent works
-
-% Recent work of \textcite[][13--16]{grauerOptionTradeClassification2022} made significant progress in classification accuracy by proposing explicit overrides for order types and by combining multiple heuristics, thereby advancing the state-of-the-art performance in option trade classification. By this means, their approach enforces a more sophisticated decision boundary eventually leading to a more accurate classification. The fundamental constraint is, that overrides apply only to subsets of trades. Beyond heuristics, it remains open, if classifiers \emph{learned} on trade data can improve upon \emph{static} classification rules in terms of performance and robustness.
+\item By employing gradient-boosted trees and Transformers, we establish a new state of the art in classification accuracy. We outperform existing approaches by (...) in accuracy with comparable data requirements and show that the model generalizes to other exchanges (...). Relative to the ubiquitous \gls{LR} algorithm, improvements range between (...) and (...).
+\item In practice, unlabelled trades are abundant, whereas true labels for trades are scarce. Our work is the first to consider both the supervised and the semi-supervised setting, where trades are only required to be partially labelled.
+\item Through a feature importance analysis based on Shapley values, we consistently attribute performance gains of rule-based and \gls{ML}-based classifiers to feature groups. We discover that both paradigms share features to a large extent, but \gls{ML}-based approaches exploit the data more effectively.
+\end{enumerate}
 
 \section{Data}
 
@@ -83,19 +30,18 @@ \section{Data}
 After a time-based train-validation-test split (60-20-20), required by the \gls{ML} estimators, we are left with two test sets, spanning November 2015 -- May 2017 at the \gls{ISE} and November 2015 -- October 2017 at the \gls{CBOE}, respectively. Each test set contains between 9.8 and 12.8 million labelled option trades. An additional unlabelled training set of \gls{ISE} trades executed between October 2012 and October 2013 is reserved for learning in the semi-supervised setting.
 
-To establish a common ground with rule-based classification, we distinguish three feature sets with increasing data requirements and employ minimal feature engineering. The first set is based on the data requirements of tick/quote-based algorithms, the second of hybrid algorithms with additional dependencies on trade size data, such as the \gls{GSU} method, and the third feature set includes option characteristics, like the option's $\Delta$ or underlying.
+To establish a common ground with rule-based classification, we distinguish three feature sets with increasing data requirements and employ minimal feature engineering. The first set is based on the data requirements of tick/quote-based algorithms; the second on those of hybrid algorithms with additional dependencies on trade size data, such as the \gls{GSU} method; and the third adds option characteristics, such as the option's $\Delta$ or the underlying.
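+
+A minimal sketch of the time-based split described above, assuming trades are held in a \texttt{pandas} frame with a timestamp column (the column name \texttt{trade\_ts} is hypothetical):
+\begin{verbatim}
+import pandas as pd
+
+def time_based_split(trades: pd.DataFrame, ts_col: str = "trade_ts"):
+    """60-20-20 train/validation/test split that preserves the
+    temporal ordering of the trades."""
+    trades = trades.sort_values(ts_col)
+    n = len(trades)
+    train = trades.iloc[: int(0.6 * n)]
+    val = trades.iloc[int(0.6 * n) : int(0.8 * n)]
+    test = trades.iloc[int(0.8 * n) :]
+    return train, val, test
+\end{verbatim}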
 
 \section{Methodology}
 
-We model trade classification using gradient-boosted trees \autocites[][]{friedmanGreedyFunctionApproximation2001}, a wide tree-based ensemble, and the FT-Transformer \autocite{gorishniyRevisitingDeepLearning2021}, a Transformer-based neural network architecture. We chose these approaches for their state-of-the-art performance in tabular modelling \autocites[][]{gorishniyRevisitingDeepLearning2021}[][]{grinsztajnWhyTreebasedModels2022} and their extendability to learn on partially-labelled trades. Additionally, Transformers offer some model interpretability through the Attention mechanism. An advantage we exploit later to derive insights into the decision process of Transformers.
+We model trade classification using gradient-boosted trees \autocites[][]{friedmanGreedyFunctionApproximation2001}, a wide tree-based ensemble, and the FT-Transformer \autocite{gorishniyRevisitingDeepLearning2021}, a Transformer-based neural network architecture. We choose these approaches for their state-of-the-art performance in tabular modelling \autocites[][]{gorishniyRevisitingDeepLearning2021}[][]{grinsztajnWhyTreebasedModels2022} and for their extensibility to learning on partially labelled trades. Additionally, Transformers offer \textit{some} model interpretability through the attention mechanism, an advantage we exploit later to derive insights into their decision process.
 
-As for being minimal intrusive
+As stated earlier, our goal is to extend the \gls{ML} classifiers to the semi-supervised setting to make use of the abundant unlabelled trade data. We couple gradient boosting with self-training \autocite{yarowskyUnsupervisedWordSense1995}, whereby confident predictions on unlabelled trades are iteratively added to the training set as pseudo-labels; a new classifier is then retrained on the labelled and pseudo-labelled trades (see the sketch below). Likewise, the Transformer is pre-trained on unlabelled trades with the replaced token detection objective of \textcite{clarkElectraPretrainingText2020} and later fine-tuned on labelled training instances. Conceptually, the network learns to detect randomly replaced tokens, i.e., features of a transaction. Both techniques aim to improve generalization performance.
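+
+A minimal sketch of the self-training loop, assuming a scikit-learn-style classifier exposing \texttt{fit}, \texttt{predict\_proba}, and \texttt{classes\_}; the confidence threshold and the number of rounds are hypothetical choices, not our tuned values:
+\begin{verbatim}
+import numpy as np
+
+def self_train(clf, X_lab, y_lab, X_unlab, threshold=0.95, rounds=5):
+    """Iteratively pseudo-label confident predictions on unlabelled
+    trades and refit the classifier on the enlarged training set."""
+    X, y = X_lab.copy(), y_lab.copy()
+    for _ in range(rounds):
+        clf.fit(X, y)
+        if len(X_unlab) == 0:
+            break
+        proba = clf.predict_proba(X_unlab)
+        confident = proba.max(axis=1) >= threshold
+        if not confident.any():
+            break  # no prediction clears the threshold
+        pseudo = clf.classes_[proba[confident].argmax(axis=1)]
+        X = np.vstack([X, X_unlab[confident]])
+        y = np.concatenate([y, pseudo])
+        X_unlab = X_unlab[~confident]
+    return clf
+\end{verbatim}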
 
-For a rich evaluation, we all rule-based approaches are modelled a classifier.\footnote{The implementation of our classifiers is publicaly available under \url{https://github.com/KarelZe/tclf}}
+Classical trade classification rules are implemented as a rule-based classifier, which allows us to construct arbitrary candidates for benchmarking and supports a richer evaluation of feature importances.\footnote{The implementation is publicly available under \url{https://pypi.org/project/tclf/}.}
 
-% Tuning
-To facilitate a fair comparison, we employ an exhaustive Bayesian search, to find a suitable hyperparameter configuration for each of our models. Classical
-rule have no hyperparameters per se. Akin to tuning the machine learning classifiers on the validation set, we select the classical benchmarks based on their validation performance. This is most rigorous, while preventing to overfit the test set.\footnote{All our experiments as well as the source code are publically available.}
+To facilitate a fair comparison, we run an exhaustive Bayesian search to find a suitable hyperparameter configuration for each of our models. Classical
+rules have no hyperparameters per se. Akin to tuning the machine learning classifiers on the validation set, we select the classical benchmarks based on their validation performance. This is most rigorous while preventing overfitting of the test set.\footnote{All of our source code and experiments are publicly available under \url{https://github.com/KarelZe/thesis}.}
 
 \section{Results}