Commit
Add draft for complex example
KarelZe committed Dec 26, 2023
1 parent 604ab24 commit 2ad294f
Showing 6 changed files with 202 additions and 40 deletions.
Binary file added docs/img/gsu.png
28 changes: 14 additions & 14 deletions docs/index.md
```bash
$ pip install .
Successfully installed tclf-0.0.0
```

## Supported Algorithms

- (Rev.) CLNV rule[^1]
- (Rev.) EMO rule[^2]
- (Rev.) LR algorithm[^6]
- (Rev.) Tick test[^5]
- Depth rule[^3]
- Quote rule[^4]
- Tradesize rule[^3]

## Minimal Example

Let's start simple: we classify all trades using the quote rule and fall back to random classification for any trades the quote rule cannot classify.

Create a `main.py` with:
```python title="main.py"
import numpy as np
import pandas as pd

# ... (rest of the example collapsed in this diff)
```

In this example, input data is available as a pd.DataFrame with columns conforming to our [naming conventions](https://karelze.github.io/tclf/naming_conventions/).

The parameter `layers=[("quote", "ex")]` sets the quote rule at the exchange level, and `strategy="random"` specifies the fallback strategy for unclassified trades. The true label `y` is not used for classification; it is accepted only for API consistency, by convention.
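The quote rule itself compares the trade price against the quote midpoint: trades above the midpoint are classified as buys, trades below as sells, and midpoint trades fall through to the fallback. Here is a minimal sketch of that logic in plain NumPy/pandas — illustrative only, not the library's implementation; the `1`/`-1` buy/sell labels and toy values are assumptions, while column names follow the naming conventions above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# toy trades with exchange-level quotes ("_ex" suffix per the naming conventions)
X = pd.DataFrame(
    {
        "trade_price": [1.5, 2.5, 2.0],
        "bid_ex": [1.0, 1.0, 1.0],
        "ask_ex": [3.0, 3.0, 3.0],
    }
)

# quote rule: above midpoint -> buy (1), below -> sell (-1), at midpoint -> NaN
mid = (X["bid_ex"] + X["ask_ex"]) / 2
y_pred = np.where(
    X["trade_price"] > mid, 1.0, np.where(X["trade_price"] < mid, -1.0, np.nan)
)

# strategy="random": assign unclassified (midpoint) trades a random side
unclassified = np.isnan(y_pred)
y_pred[unclassified] = rng.choice([-1.0, 1.0], size=unclassified.sum())
```

Here the first trade (below the midpoint of 2.0) becomes a sell, the second (above) a buy, and the midpoint trade is assigned randomly.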

## Advanced Example
Often, it is desirable to classify trades on both exchange-level and NBBO data. Data might also only be available as a NumPy array. Let's extend the previous example by applying the quote rule first at the exchange level, then at the NBBO, and classifying all remaining trades randomly.

```python title="main.py" hl_lines="6 16 17 20"
import numpy as np
from sklearn.metrics import accuracy_score

# ... (rest of the example collapsed in this diff)
acc = accuracy_score(y_true, clf.predict(X))
```
In this example, input data is available as NumPy arrays with both exchange (`"ex"`) and NBBO (`"best"`) data. We set the layers parameter to `layers=[("quote", "ex"), ("quote", "best")]` to classify trades first on the subset `"ex"` and remaining trades on the subset `"best"`. Additionally, we have to set `ClassicalClassifier(..., features=features)` to pass column information to the classifier.
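Conceptually, the layers form a cascade: each rule only sees trades that all previous layers left unclassified. A rough sketch of this fallthrough in plain NumPy (illustrative, with made-up quote values; not the library's actual code):

```python
import numpy as np


def quote_rule(trade_price, bid, ask):
    # above midpoint -> buy (1), below -> sell (-1), at midpoint -> NaN
    mid = (bid + ask) / 2
    return np.where(trade_price > mid, 1.0, np.where(trade_price < mid, -1.0, np.nan))


rng = np.random.default_rng(0)

trade_price = np.array([1.5, 2.0, 2.0])
bid_ex = np.array([1.0, 1.0, 1.0])
ask_ex = np.array([3.0, 3.0, 3.0])
bid_best = np.array([1.0, 1.0, 2.0])  # hypothetical NBBO quotes
ask_best = np.array([3.0, 3.0, 3.0])

# layer 1: ("quote", "ex")
y_pred = quote_rule(trade_price, bid_ex, ask_ex)

# layer 2: ("quote", "best"), applied only to still-unclassified trades
todo = np.isnan(y_pred)
y_pred[todo] = quote_rule(trade_price, bid_best, ask_best)[todo]

# fallback strategy="random" for anything left over
todo = np.isnan(y_pred)
y_pred[todo] = rng.choice([-1.0, 1.0], size=todo.sum())
```

The third trade sits at the exchange midpoint but below the NBBO midpoint, so the second layer catches it; the second trade sits at both midpoints and is classified randomly.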

Like before, column/feature names must follow our [naming conventions](https://karelze.github.io/tclf/naming_conventions/). For more practical examples, see our [examples section](https://karelze.github.io/tclf/option_trade_classification).

## Citation

4 changes: 0 additions & 4 deletions docs/faq.md → docs/nan_handling.md
## Frequently Asked Questions

**How are `NaN` values handled by `tclf`?**

We take care to treat `NaN` values correctly. If features relevant for classification like the trade price or quoted bid/ask prices are missing, no classification is performed and classification of the trade is deferred to the subsequent rule or fallback strategy.

Alternatively, you can provide imputed data. See [`sklearn.impute`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.impute) for details.
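As a sketch of the imputation route, here is a `SimpleImputer` with mean strategy applied to hypothetical quote columns; whether mean imputation is appropriate for your quote data is your call:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# hypothetical frame with missing quotes
X = pd.DataFrame(
    {
        "trade_price": [1.5, 2.5, 2.0],
        "bid_ex": [1.0, np.nan, 1.0],
        "ask_ex": [3.0, 3.0, np.nan],
    }
)

# replace missing values with the column mean before classification
imp = SimpleImputer(strategy="mean")
X_imputed = pd.DataFrame(imp.fit_transform(X), columns=X.columns)
print(X_imputed.isna().sum().sum())  # 0
```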
104 changes: 104 additions & 0 deletions docs/option_trade_classification.md

## Define Rules
This tutorial aims to reproduce plots from a working paper by Grauer et al.[^1], which achieves state-of-the-art performance in option trade classification. The authors recommend classifying option trades by:
> [...] our new trade size rule together with quote rules successively applied to NBBO and quotes on the trading venue. Quotes at the midpoint on both the NBBO and the exchange should be classified first with the depth rule and any remaining trades with the reverse tick test.

There's a lot going on. 🥵 To match the authors' description, we first set up `layers`. We use the subset `"ex"` for exchange-specific data, `"best"` for the NBBO, and `"all"` for inter-exchange level data. Identical to the paper, we perform random classification on unclassified trades, hence `strategy="random"`.
```python
from tclf.classical_classifier import ClassicalClassifier

layers = [
("trade_size", "ex"),
("quote", "best"),
("quote", "ex"),
("depth", "best"),
("depth", "ex"),
("rev_tick", "all"),
]
clf = ClassicalClassifier(layers=layers, strategy="random")
```

## Prepare Dataset

Next, we load our input data. I store my dataset of ISE trades as `parquet` files in a Google Cloud bucket and load them into a dataframe `X`.

```python
import gcsfs
import pandas as pd

fs = gcsfs.GCSFileSystem()

gcs_loc = fs.glob(
"gs://your_bucket/your_dir/*"
)
X = pd.read_parquet(gcs_loc, engine="pyarrow", filesystem=fs)
```

Once the dataset is loaded, we can prepare it: we save the true label and the timestamp of the trade, which are required for plotting, to a new dataframe named `X_meta`, and remove them from the original dataframe.
```python
features_meta = ["QUOTE_DATETIME", "buy_sell"]
X_meta = X[features_meta]
X = X.drop(columns=features_meta).rename(
{
"TRADE_PRICE": "trade_price",
"TRADE_SIZE": "trade_size",
"BEST_ASK": "ask_best",
"BEST_BID": "bid_best",
"buy_sell": "y_true",
},
axis=1,
)
X[["ask_size_best", "bid_size_best"]] = X[["ask_size_ex", "bid_size_ex"]]
```

## Plot Results

To estimate the accuracy over time, we first attach predictions to `X_meta` with `X_meta["y_pred"] = clf.fit(X).predict(X)`, then group by date and compute the accuracy for each group using [`sklearn.metrics.accuracy_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html).

```python
from sklearn.metrics import accuracy_score

df_plot = X_meta.groupby(X_meta.QUOTE_DATETIME.dt.date).apply(
lambda x: accuracy_score(x["y_true"], x["y_pred"]) * 100
)
```
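The grouping pattern can be sanity-checked on tiny synthetic data (hypothetical dates and labels):

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# two trading days with made-up true and predicted labels
X_meta = pd.DataFrame(
    {
        "QUOTE_DATETIME": pd.to_datetime(
            ["2015-05-04 09:30", "2015-05-04 10:00", "2015-05-05 09:30"]
        ),
        "y_true": [1, -1, 1],
        "y_pred": [1, 1, 1],
    }
)

# one accuracy value (in percent) per trading day
df_plot = X_meta.groupby(X_meta.QUOTE_DATETIME.dt.date).apply(
    lambda x: accuracy_score(x["y_true"], x["y_pred"]) * 100
)
print(df_plot)  # day one: 50.0, day two: 100.0
```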
We use [`matplotlib`](https://matplotlib.org/) to match the plots from the paper as closely as possible.


```python
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter
from matplotlib.ticker import PercentFormatter

plt.rcParams["font.family"] = "serif"

plt.figure(figsize=(9, 3))
plt.grid(True, axis="y")

# line plot
plt.plot(df_plot, color="tab:orange", linewidth=1.5, label="ISE")

# y-axis + x-axis
plt.ylim(0, 100)
plt.ylabel("Overall success rate")
ax = plt.gca()
ax.yaxis.set_major_formatter(PercentFormatter(100, decimals=0))
ax.xaxis.set_major_formatter(DateFormatter("%b-%y"))

# title + legend
plt.title(
"C: Performance of trade classification based on\n trade size rule + depth rule + reverse LR (NBBO,exchange)",
loc="left",
)
plt.legend(loc="lower left", frameon=False)

plt.show()
```

**Output:**

!["gsu"](./img/gsu.png)

[^1]: <div class="csl-entry">Grauer, C., Schuster, P., &amp; Uhrig-Homburg, M. (2023). <i>Option trade classification</i>. <a href="https://doi.org/10.2139/ssrn.4098475">https://doi.org/10.2139/ssrn.4098475</a></div>
9 changes: 7 additions & 2 deletions mkdocs.yml
edit_uri: ""
nav:
- Home: index.md
- API reference: reference.md
- Examples:
- Option trade classification: option_trade_classification.md
- More:
- Naming conventions: naming_conventions.md
- Handling of NaNs: nan_handling.md

markdown_extensions:
- toc:
- admonition
- codehilite
- extra
- pymdownx.details
- pymdownx.superfences
- pymdownx.superfences:
custom_fences:
- name: mermaid
97 changes: 77 additions & 20 deletions notebooks/gsu.ipynb
},
"outputs": [],
"source": [
"import gcsfs\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"from matplotlib.dates import DateFormatter\n",
"from matplotlib.ticker import PercentFormatter\n",
"from sklearn.metrics import accuracy_score\n",
"\n",
"from tclf.classical_classifier import ClassicalClassifier"
]
" \"ask_ex\",\n",
" \"BEST_ASK\",\n",
" \"BEST_BID\",\n",
" \"price_all_lag\",\n",
" \"price_all_lead\",\n",
" \"price_ex_lead\",\n",
"]\n",
"\n",
"features_size = [\n",
" \"TRADE_SIZE\",\n",
" \"bid_size_ex\",\n",
" \"ask_size_ex\"\n",
"]\n",
"features_size = [\"TRADE_SIZE\", \"bid_size_ex\", \"ask_size_ex\"]\n",
"\n",
"features_meta = [\"QUOTE_DATETIME\", \"buy_sell\"]\n",
"\n",
"metadata": {},
"outputs": [],
"source": [
"fs = gcsfs.GCSFileSystem()\n",
"\n",
"gcs_loc = fs.glob(\n",
" \"gs://thesis-bucket-option-trade-classification/data/preprocessed/matched_ise_quotes*\"\n",
")\n",
"X = pd.read_parquet(gcs_loc, engine=\"pyarrow\", columns=columns, filesystem=fs)\n",
"\n",
"X_meta = X[features_meta]\n",
"X = X.drop(columns=features_meta).rename(\n",
" {\n",
" \"TRADE_PRICE\": \"trade_price\",\n",
" \"TRADE_SIZE\": \"trade_size\",\n",
" \"BEST_ASK\": \"ask_best\",\n",
" \"BEST_BID\": \"bid_best\",\n",
" \"buy_sell\": \"y_true\",\n",
" },\n",
" axis=1,\n",
")\n",
"X[[\"ask_size_best\", \"bid_size_best\"]] = X[[\"ask_size_ex\", \"bid_size_ex\"]]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X.head()"
]
},
{
},
"outputs": [],
"source": [
"layers = [ # grauer (benchmark 2)\n",
" (\"trade_size\", \"ex\"),\n",
" (\"quote\", \"best\"),\n",
" (\"quote\", \"ex\"),\n",
" (\"depth\", \"best\"),\n",
" (\"depth\", \"ex\"),\n",
" (\"rev_tick\", \"all\"),\n",
"]\n",
"clf = ClassicalClassifier(layers=layers, strategy=\"random\")\n",
"\n",
"X_meta[\"y_pred\"] = clf.fit(X).predict(X)"
"source": [
"X_meta"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_plot = X_meta.groupby(X_meta.QUOTE_DATETIME.dt.date).apply(\n",
" lambda x: accuracy_score(x[\"y_true\"], x[\"y_pred\"])\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.rcParams[\"font.family\"] = \"serif\"\n",
"plt.figure(figsize=(9, 3))\n",
"plt.plot(df_plot * 100, color=\"tab:orange\", linewidth=1.5, label=\"ISE\")\n",
"plt.ylim(0, 100)\n",
"plt.ylabel(\"Overall success rate\")\n",
"ax = plt.gca()\n",
"ax.yaxis.set_major_formatter(PercentFormatter(100, decimals=0))\n",
"ax.xaxis.set_major_formatter(DateFormatter(\"%b-%y\"))\n",
"plt.title(\n",
" \"C: Performance of trade classification based on\\n trade size rule + depth rule + reverse LR (NBBO, exchange)\",\n",
" loc=\"left\",\n",
")\n",
"plt.grid(True, axis=\"y\")\n",
"plt.legend(loc=\"lower left\", frameon=False)\n",
"plt.show()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
