Rework / complete chapter on feature set definition🧙 (#418)

KarelZe · Jun 27, 2023 · b9db779
1 parent b13ace9

Showing 17 changed files with 308 additions and 89 deletions.
diff --git a/references/obsidian/📑notes/🔚Discussion notes.md b/references/obsidian/📑notes/🔚Discussion notes.md
@@ -3,10 +3,20 @@
 - low accuracy for index options
 	- Study sources of misclassification. See, e.g., [[@savickasInferringDirectionOption2003]]
 	- The extent to which inaccurate trade classification biases empirical research depends on whether misclassifications occur randomly or systematically [[@theissenTestAccuracyLee2000]]. This document also contains ideas on how to study the impact of wrong classifications in stock markets. This might be different in option markets.
+	- “Spreads are portfolios of options of the same type (either only calls or only puts). Combinations are portfolios of options of different types. Traders can form these complex trades by individually buying the component options or by trading standard packages. The advantage of the latter approach is that the trader is subject to only one bid-ask spread, while buying the component options individually results in paying the bid-ask spread for each option. The market maker determines how to allocate the bid-ask spread among all options in a complex trade. Thus, not all (if any) of the component options necessarily trade at their quotes. Therefore, complex trades are highly likely to produce RQ and outside-quote trades. Furthermore, labeling complex trades as buys or sells is not straightforward. For example, a bull spread involves buying a call option and selling another call option with a higher strike price. Thus, a buy requires a sell, and it is not clear whether treating the two trades separately is appropriate. Index option trading involves many complex trades because taking covered positions in index options is not as easy (or possible) as in equity options. Frequently, the only alternatives to naked positions in index options are complex options. Therefore, one way to reduce the problem of complex trades is to exclude all index trades. As Table 1 indicates, this results in a significant increase in the classification precision of all methods, but loses roughly one quarter of the sample, which is unacceptable.” (Savickas and Wilson, 2003, pp. 898–899)
+	- Neither of the models can detect complex trades. Doing so would require attention across rows and columns, which we ruled out.
+	- “In contrast to Pan and Poteshman (2006), we use a unique data set from the International Securities Exchange (ISE), which contains the complete daily record of buy and sell activity in index options over a 12-year period, together with details on whether a transaction is involved in opening or closing an options position. These options are actively traded; indeed, on the ISE, the notional volume in index options is about one-fifth of the total notional volume in all individual stock options during our sample period.” (Chordia et al., 2021, p. 1)
+
 - low accuracy for trades outside the quotes
 	- see also [[@ellisAccuracyTradeClassification2000]] for trades inside and outside the spread
+	- “On the one hand, we would expect that the greater (smaller) the transaction price relative to the midspread, the more likely that the transaction is a buy (sell) and occurs on an uptick (a downtick), implying higher classification success for outside-quote trades, especially for large trades in which the trade initiator is willing to pay a premium for the execution of his large order.” ([[@savickasInferringDirectionOption2003]] p. 888)
+	- “On the other hand, however, the outside-quote trades may be the manifestation of stale quotes, which result in misclassification. Also, the effect of market makers’ hedging and rebalancing trades on the classification of outside-quote trades is unclear. Section IV.C contains a logit analysis of outside-quote trades.” ([[@savickasInferringDirectionOption2003]], p. 888)
 - high gains for OTM options and options with long maturity
 	- Accuracy is not the sole criterion. It depends on whether the error is systematic or not. Thus, we conduct an application study. See the reasoning in ([[@theissenTestAccuracyLee2000]])
+	- “Specifically, one of the most noticeable regularities is that smaller trades are classified more precisely. This is because these trades are more likely to be executed at quotes and are less prone to reversed-quote trading (partially due to the fact that many small trades are executed on RAES)” (Savickas and Wilson, 2003, p. 889)
+	- Moneyness levels: “Out-of-the-money options offer the highest leverage (exposure for a dollar invested) and thus are particularly attractive for informed investors. Consistent with this argument, the information price impact is decreasing and convex in absolute delta. Figure 3(D) shows that the impact decreases from 0.4% for out-of-the-money options to 0.15% for in-the-money options. Next, private information is often short-lived and is related to near-term events, and thus short-term options are better suited for informed investors in addition to providing higher leverage. Indeed, the price impact decreases by 0.12% if time-to-expiration decreases from 80 days to 20 days. Buyer-initiated trades have a higher price impact than sell trades, because these trades provide an opportunity to bet not only on future volatility but also on the underlying direction. These results are broadly consistent with Pan and Poteshman (2006), except that I do not find a significant difference between call and put options, perhaps because my sample consists of large stocks that are easy to sell short.” (Muravyev, 2016, p. 695)
+“Since time to maturity is inversely related to trade size, we observe greater classification errors for shorter maturity options.” (Savickas and Wilson, 2003, p. 889)
 - performance gap in classical rules
 - strong performance of neural networks / tree-based ensembles
 	- We identify missingness in the data as biasing the results of classical estimators downward. ML predictors are robust to this missingness, as they can handle missing values and potentially substitute for them.
@@ -17,6 +27,15 @@
 	- Finetune. Low cost of inference
 - which algorithm is now preferable? Do a Friedman rank test
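
A minimal sketch of the Friedman rank test mentioned in the note above, assuming an accuracy score is available for every algorithm on each of several test subsets. The function names and the accuracy figures are hypothetical and only illustrate the computation. The statistic below omits the tie correction and would be compared against a chi-square distribution with k - 1 degrees of freedom.

```python
def average_ranks(values):
    """Rank values in ascending order (1 = smallest); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def friedman_statistic(scores):
    """scores[b][a] is the score of algorithm a on block (test subset) b.

    Returns the Friedman chi-square statistic (no tie correction) and the
    mean rank per algorithm, where rank 1 is the best (highest) score.
    """
    n = len(scores)      # number of blocks
    k = len(scores[0])   # number of algorithms
    rank_sums = [0.0] * k
    for row in scores:
        # Negate the scores so that the highest accuracy receives rank 1.
        for a, rank in enumerate(average_ranks([-s for s in row])):
            rank_sums[a] += rank
    chi2 = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return chi2, [r / n for r in rank_sums]


# Made-up accuracies: columns are three hypothetical classifiers,
# rows are four hypothetical test subsets.
example_scores = [
    [0.74, 0.72, 0.65],
    [0.73, 0.71, 0.60],
    [0.75, 0.70, 0.66],
    [0.72, 0.71, 0.63],
]
chi2, mean_ranks = friedman_statistic(example_scores)
```

A lower mean rank indicates the better algorithm across subsets. For a p-value, a library routine such as `scipy.stats.friedmanchisquare` could be used instead of this hand-rolled statistic.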
 
+## Time-to-maturity
+- “Expiration dummies are particularly good instruments. Investors substitute expiring option positions with similar nonexpiring ones in the three-day window around the expiration day (every third Friday of a month). Because investors are short call and put equity options on average, the rollover creates unprecedentedly large selling pressure in the nonexpiring options. Option expirations create exogenous variation in order imbalance, and thus exogenous variation in market-maker inventories as investors open new positions to replace positions in expiring options. Volatility and returns of the underlying stocks change little around expiration. Thus, fundamentals and informed trading are not responsible for the order imbalance.” (Muravyev, 2016, p. 700)
+- “Order imbalance is extremely negative around option expiration because investors are rolling over their positions to nonexpiring options. The selling pressure is particularly large on the postexpiration Monday when the abnormal order imbalance reaches −24%.” (Muravyev, 2016, p. 701)
+
+## Quotes change after the trade
+“With respect to the intraday analysis, the interaction between trades and quotes is key to understanding how and why prices change. The literature identifies two reasons why quoted prices increase after a buyer-initiated trade. First, market-makers adjust upward their beliefs about fair value as the trade may contain private information (e.g., Glosten and Milgrom (1985)). Second, market-makers require compensation for allowing their inventory position to deviate from the desired level, and thus a risk-averse market-maker will accommodate a subsequent buy order only at a higher price (e.g., Stoll (1978)).” (Muravyev, 2016, p. 674)
+
+## Quotes NBBO / Exchange
+- “Condition (d) also serves another purpose. Since the trade price is equal to the NBBO price quoted by at least two exchanges, this condition resolves ambiguity about trade direction as further discussed in the Internet Appendix.” (Muravyev, 2016, p. 689)
 
 ## Algorithm
 2.3.7 How to Write the Discussion  Assessment of the results  Comparison of your own results with the results of other studies = Citation of already published literature!  Components  Principles, relationships, generalizations shown by the results = Discussion, not recapitulation of the results  Exceptions, lack of correlation, open points  Referring to published work: = Results and interpretations in agreement with or in contrast to your results  Our Recommendations: The writing of the chapter “Discussion” is the most difficult one. Compare your own data/results with the results from other already published papers (and cite them!). Outline the discussion part in a similar way to that in the Results section = consistency. Evaluate whether your results are in agreement with or in contrast to existing knowledge to date. You can describe why or where the differences occur, e.g. in methods, in sites, in special conditions, etc. Sometimes it is difficult to discuss results without repetition from the chapter “Results”. Then, there is the possibility to combine the “Results” and “Discussion” sections into one chapter. However, in your presentation you have to classify clearly which are your own results and which are taken from other studies. For beginners, it is often easier to separate these sections.

diff --git a/references/obsidian/📖chapters/🍕Application study.md b/references/obsidian/📖chapters/🍕Application study.md
@@ -37,8 +37,13 @@ The null hypothesis is that the location of medians in two independent samples a
 (🔥What can we see? How do the results compare?)
 
 
+- “During our sample period of 2004–2015, quoted half-spreads of options on stocks in the S&P 500 index averaged 13 cents per share and 8.6% of the option price. Dollar (percentage) spreads were considerably wider for well in-the-money (out-of-the-money) options.” (Muravyev and Pearson, 2020, p. 4973)
+- “Although the costs of options market making can help explain why options spreads should be higher than the spreads of their underlying stocks (Battalio and Schultz 2011), a second puzzle is that existing theories are unable to explain the observed patterns of spreads. For example, the high dollar spreads of in-the-money (ITM) options and the relation between spreads and moneyness cannot be explained by hedge rebalancing costs incurred by options market makers, because hedges of well ITM options rarely need to be rebalanced. Similarly, the pattern cannot be explained by difficult to hedge gamma and vega risks that options market makers bear when they hold inventories of options, because well ITM options are not exposed to these risks.” (Muravyev and Pearson, 2020, p. 4974)
 
-
+## Inside / Outside / At the Quote
+- “Options traders exploit this predictability in timing their executions. Executions at the ask price tend to occur when the estimate of the fair value (the expected future midpoint) is close to but less than the quoted ask price, and executions at the bid price tend to occur when it is close to but greater than the quoted bid price. Traders who exploit this predictability are able to take liquidity at low costs, as we explain next.” (Muravyev and Pearson, 2020, p. 4975)
+- “Why do option market makers not update quotes frequently? Even if liquidity providers are faster than most liquidity takers, if they are slower than only one they are at risk to get picked off. To protect against this risk, market-makers post wider spreads that do not have to be changed with every change in the option fair value. Foucault, Roell, and Sandas (2003) model the trade-off that dealers face between the cost of frequent quote revisions and the benefits of being picked off less frequently. It is also costly for option market makers to update quotes frequently because the options exchanges place caps on the number of quote updates and fine exchange members whose ratios of messages to executions is large. In addition, market frictions, such as minimum tick sizes, prevent market makers from continuously centering their quotes on the fair value. Finally, trades by execution timers incur a half-spread of about three cents, which exceeds market-makers’ marginal costs of executing trades. Thus, nontimers’ trades are highly profitable for market makers, while the spreads on timers’ trades appear to at least cover market makers’ marginal costs of trading. Thus, market makers can facilitate trading by cost sensitive investors by changing their quotes infrequently.” (Muravyev and Pearson, 2020, p. 4977)
+- “During our sample the overwhelming bulk of option trading was electronic, with market makers generally using auto-quoting algorithms and quotes and trades disseminated almost instantly to participants in both the option and equity markets. In contrast to the previous option market structure in which trading occurred on exchange floors, in the current market structure an option market maker on the exchange where trade occurs does not have any informational advantage relative to other market participants, including market makers on the equity exchanges. This helps explain our findings that option quotes do not contain information not already reflected in stock quotes.” (Muravyev et al., 2013, p. 261)
 
 
 “We repeated this analysis with our dataset from the Frankfurt Stock Exchange. The results are presented in columns 2 and 3 of Table 5. The bias is even more dramatic. The traditional spread estimate is, on average, about twice as large as the “true” spread. A Wilcoxon test rejects the null hypothesis of equal medians (p < 0.01). Despite the large differences, the correlation between the two spread estimates is very high (ρ= 0.96). The magnitude of the relative bias (i.e., the traditional spread estimate divided by the “true” spread) is strongly negatively related to the classification accuracy. The correlation is –0.84.” ([[@theissenTestAccuracyLee2000]], p. 12)
@@ -80,6 +85,8 @@ Savickas and Wilson 897 TABLE 5 Estimated Effective Spreads Average Sprd. Quote
 % TODO: read: Pinder, S. (2003). An empirical examination of the impact of market microstructure changes on the determinants of option bid–ask spreads. International Review of Financial Analysis, 12(5):563–577.
 
 
+“In addition, my results offer little help in answering why option bid-ask spreads are so large. This is one of the biggest puzzles in the options literature—existing theories of the option spread fail to explain its magnitude and shape (Muravyev and Pearson (2014)).” (Muravyev, 2016, p. 696)
+
 - [[@rosenthalModelingTradeDirection2012]] lists fields where trade classification is used and what the impact of wrongly classified trades is.
 - The extent to which inaccurate trade classification biases empirical research depends on whether misclassifications occur randomly or systematically [[@theissenTestAccuracyLee2000]].
 

diff --git a/references/obsidian/📖chapters/👨‍🍳Tain-Test-split.md b/references/obsidian/📖chapters/👨‍🍳Tain-Test-split.md
@@ -1,3 +1,16 @@
+Prime examples of auto-correlation between trades are market or limit orders that are split into smaller orders to encourage order execution. Also, informed traders tend to slice orders into smaller-sized trades to disguise their trading activity, as documented in ([[@anandStealthTradingOptions2007]]183). Order splitting leads to trades executed (almost) simultaneously with similar trade characteristics, which would be trivial to classify given the true label of a single transaction.
+
+
+
+“A floor broker seeking to execute a market order for a "large" number of shares will frequently split his order among the quotations of several competing market participants, such as other floor brokers and book orders. In this situation, successive sales are a consequence of the same trade, and take place on the same side of the market, but are recorded as separate transactions. This, in turn, implies positive serial correlation in transaction type” (Choi et al., 1988, p. 221)
+
+“Limit orders also can cause serial dependence in transaction type. Suppose the current bid and ask quotes from the dealer are Pb and Pa, respectively. Limit orders to sell (buy) whose prices are lower (higher) than or equal to Pb (Pa) are transacted at the market. All other limit orders remain in the dealer's book until there is a change in his quotations. However, a change in the dealer's quotation will result in transactions only on one side of the orders in the book. For example, if the "equilibrium" price increases (decreases), many of the limit orders to sell (buy) would be transacted at the same time. These transactions are recorded separately and would, therefore, induce serial correlation in transaction type.” (Choi et al., 1988, p. 221)
+
+“If these informed traders attempt to hide their information by splitting their trades into medium size trades, we should see medium size trades associated with higher price discovery in the dominant exchange and not in the other (non-dominant) exchanges. Underpinning our analysis is the intuition that an informed trader is likely to choose the options market venue (and option trade size) that best protects her ability to hide.” (Anand and Chakravarty, 2007, p. 183)
+
+"Orders might also be split by option series" ([[@anandStealthTradingOptions2007]])
+
+
 Prior classical works assess the performance of classical rules in-sample (cp. [[@ellisAccuracyTradeClassification2000]]541) or in an out-of-sample setting (cp. [[@grauerOptionTradeClassification2022]]7--9) and ([[@chakrabartyTradeClassificationAlgorithms2007]]3814--3815).  In the presence of tunable hyperparameters in machine learning algorithms, we separate the dataset into *three* disjoint sets. The training set is used to fit the classifier to the data. The validation set is dedicated to tuning the hyperparameters, and the test set is used for unbiased out-of-sample estimates. 
 
 Trades in the dataset are ordered by time of execution, and nearby trades exhibit strong auto-correlation. For example, subsequent trades on the same option series may share a similar trade price and quotes. This imposes constraints on the train-test split, which must ensure that minimal information leaks into the test set through serially-correlated features, which would otherwise lead to overestimated model performance. The violation of statistical independence rules out methods like $k$-fold cross-validation or random test splits, both of which assume samples to be i.i.d. ([[@lopezdepradoAdvancesFinancialMachine2018]] 104--105). We expect the previous research of ([[@ronenMachineLearningTrade2022]]14) to suffer from this problem, leading to biased results. In contrast, our work statically splits the data into subsets by time, which maintains the temporal ordering and eschews data leakage. This, however, limits the model's ability to leverage recent information for prediction beyond the training set's cut-off point. We do not explore dynamic training schemes, as they are practically intractable considering the number of model combinations and computational requirements of Transformers and gradient-boosted trees. In the absence of an update mechanism, our results can be interpreted as a lower bound.
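
The static, time-ordered split described above could be sketched as follows. The `Trade` record, the `chronological_split` helper, and the cut-off dates are illustrative assumptions and not the project's actual code. The point is only that every training trade precedes every validation trade, which in turn precedes every test trade.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Trade:
    """Hypothetical trade record; real data would carry quotes, sizes, etc."""
    executed_at: datetime
    price: float


def chronological_split(trades, train_end, val_end):
    """Split trades into three disjoint sets by execution time.

    train: executed_at <  train_end
    val:   train_end   <= executed_at < val_end
    test:  executed_at >= val_end
    No shuffling takes place, so the temporal ordering is preserved.
    """
    ordered = sorted(trades, key=lambda t: t.executed_at)
    train = [t for t in ordered if t.executed_at < train_end]
    val = [t for t in ordered if train_end <= t.executed_at < val_end]
    test = [t for t in ordered if t.executed_at >= val_end]
    return train, val, test


# Made-up trades on five different days, deliberately out of order.
trades = [Trade(datetime(2020, 1, day), 1.0 + day) for day in (5, 1, 9, 3, 7)]
train, val, test = chronological_split(
    trades,
    train_end=datetime(2020, 1, 4),  # assumed cut-off between train and validation
    val_end=datetime(2020, 1, 8),    # assumed cut-off between validation and test
)
```

Because the boundaries are fixed timestamps rather than random indices, trades that stem from the same split order tend to fall into the same subset, apart from those straddling a cut-off.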