Commit
Add draft for complex example
KarelZe committed Dec 26, 2023
1 parent 604ab24 commit 2ad294f
Showing 6 changed files with 202 additions and 40 deletions.
Binary file added docs/img/gsu.png
28 changes: 14 additions & 14 deletions docs/index.md
```bash
$ pip install .
Successfully installed tclf-0.0.0
```

## Supported Algorithms

- (Rev.) CLNV rule[^1]
- (Rev.) EMO rule[^2]
- (Rev.) LR algorithm[^6]
- (Rev.) Tick test[^5]
- Depth rule[^3]
- Quote rule[^4]
- Tradesize rule[^3]

## Minimal Example

Let's start simple: we classify all trades using the quote rule and fall back to random classification for any trades the quote rule cannot classify.

Create a `main.py` with:
```python title="main.py"
import numpy as np
import pandas as pd

# ... (rest of the example collapsed in this diff)
```

In this example, input data is available as a pd.DataFrame with columns conforming to our [naming conventions](https://karelze.github.io/tclf/naming_conventions/).

The parameter `layers=[("quote", "ex")]` sets the quote rule at the exchange level, and `strategy="random"` specifies the fallback strategy for unclassified trades. The true label `y` is not used for classification; it is accepted only for API consistency, by convention.
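The quote rule itself compares the trade price against the quote midpoint: trades above the midpoint are classified as buys, trades below as sells, and midpoint trades fall through to the fallback. Here is a minimal sketch of that logic in plain NumPy/pandas — illustrative only, not the library's implementation; the `1`/`-1` buy/sell labels and toy values are assumptions, while column names follow the naming conventions above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# toy trades with exchange-level quotes ("_ex" suffix per the naming conventions)
X = pd.DataFrame(
    {
        "trade_price": [1.5, 2.5, 2.0],
        "bid_ex": [1.0, 1.0, 1.0],
        "ask_ex": [3.0, 3.0, 3.0],
    }
)

# quote rule: above midpoint -> buy (1), below -> sell (-1), at midpoint -> NaN
mid = (X["bid_ex"] + X["ask_ex"]) / 2
y_pred = np.where(
    X["trade_price"] > mid, 1.0, np.where(X["trade_price"] < mid, -1.0, np.nan)
)

# strategy="random": assign unclassified (midpoint) trades a random side
unclassified = np.isnan(y_pred)
y_pred[unclassified] = rng.choice([-1.0, 1.0], size=unclassified.sum())
```

Here the first trade (below the midpoint of 2.0) becomes a sell, the second (above) a buy, and the midpoint trade is assigned randomly.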

## Advanced Example
Often, it is desirable to classify trades on both exchange-level and NBBO data. Data might also only be available as a NumPy array. Let's extend the previous example by applying the quote rule first at the exchange level, then at the NBBO, and classifying all remaining trades randomly.

```python title="main.py" hl_lines="6 16 17 20"
import numpy as np
from sklearn.metrics import accuracy_score

# ... (rest of the example collapsed in this diff)
acc = accuracy_score(y_true, clf.predict(X))
```
In this example, input data is available as NumPy arrays with both exchange (`"ex"`) and NBBO (`"best"`) data. We set the layers parameter to `layers=[("quote", "ex"), ("quote", "best")]` to classify trades first on the subset `"ex"` and remaining trades on the subset `"best"`. Additionally, we have to set `ClassicalClassifier(..., features=features)` to pass column information to the classifier.
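Conceptually, the layers form a cascade: each rule only sees trades that all previous layers left unclassified. A rough sketch of this fallthrough in plain NumPy (illustrative, with made-up quote values; not the library's actual code):

```python
import numpy as np


def quote_rule(trade_price, bid, ask):
    # above midpoint -> buy (1), below -> sell (-1), at midpoint -> NaN
    mid = (bid + ask) / 2
    return np.where(trade_price > mid, 1.0, np.where(trade_price < mid, -1.0, np.nan))


rng = np.random.default_rng(0)

trade_price = np.array([1.5, 2.0, 2.0])
bid_ex = np.array([1.0, 1.0, 1.0])
ask_ex = np.array([3.0, 3.0, 3.0])
bid_best = np.array([1.0, 1.0, 2.0])  # hypothetical NBBO quotes
ask_best = np.array([3.0, 3.0, 3.0])

# layer 1: ("quote", "ex")
y_pred = quote_rule(trade_price, bid_ex, ask_ex)

# layer 2: ("quote", "best"), applied only to still-unclassified trades
todo = np.isnan(y_pred)
y_pred[todo] = quote_rule(trade_price, bid_best, ask_best)[todo]

# fallback strategy="random" for anything left over
todo = np.isnan(y_pred)
y_pred[todo] = rng.choice([-1.0, 1.0], size=todo.sum())
```

The third trade sits at the exchange midpoint but below the NBBO midpoint, so the second layer catches it; the second trade sits at both midpoints and is classified randomly.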

Like before, column/feature names must follow our [naming conventions](https://karelze.github.io/tclf/naming_conventions/). For more practical examples, see our [examples section](https://karelze.github.io/tclf/option_trade_classification).

## Citation

4 changes: 0 additions & 4 deletions docs/faq.md → docs/nan_handling.md
## Frequently Asked Questions

**How are `NaN` values handled by `tclf`?**

We take care to treat `NaN` values correctly. If features relevant for classification like the trade price or quoted bid/ask prices are missing, no classification is performed and classification of the trade is deferred to the subsequent rule or fallback strategy.

Alternatively, you can provide imputed data. See [`sklearn.impute`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.impute) for details.
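As a sketch of the imputation route, here is a `SimpleImputer` with mean strategy applied to hypothetical quote columns; whether mean imputation is appropriate for your quote data is your call:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# hypothetical frame with missing quotes
X = pd.DataFrame(
    {
        "trade_price": [1.5, 2.5, 2.0],
        "bid_ex": [1.0, np.nan, 1.0],
        "ask_ex": [3.0, 3.0, np.nan],
    }
)

# replace missing values with the column mean before classification
imp = SimpleImputer(strategy="mean")
X_imputed = pd.DataFrame(imp.fit_transform(X), columns=X.columns)
print(X_imputed.isna().sum().sum())  # 0
```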
104 changes: 104 additions & 0 deletions docs/option_trade_classification.md

## Define Rules
This tutorial aims to reproduce plots from a working paper by Grauer et al.[^1], which achieves state-of-the-art performance in option trade classification. The authors recommend classifying option trades by:
> [...] our new trade size rule together with quote rules successively applied to NBBO and quotes on the trading venue. Quotes at the midpoint on both the NBBO and the exchange should be classified first with the depth rule and any remaining trades with the reverse tick test.

There's a lot going on. 🥵 To match the authors' description, we first set up `layers`. We use the subset `"ex"` for exchange-specific data, `"best"` for the NBBO, and `"all"` for inter-exchange level data. Identical to the paper, we perform random classification on unclassified trades, hence `strategy="random"`.
```python
from tclf.classical_classifier import ClassicalClassifier

layers = [
("trade_size", "ex"),
("quote", "best"),
("quote", "ex"),
("depth", "best"),
("depth", "ex"),
("rev_tick", "all"),
]
clf = ClassicalClassifier(layers=layers, strategy="random")
```

## Prepare Dataset

Next, we load our input data. I store my dataset of ISE trades as `parquet` files in a Google Cloud bucket and load them into a dataframe `X`.

```python
import gcsfs
import pandas as pd

fs = gcsfs.GCSFileSystem()

gcs_loc = fs.glob(
"gs://your_bucket/your_dir/*"
)
X = pd.read_parquet(gcs_loc, engine="pyarrow", filesystem=fs)
```

Once the dataset is loaded, we can prepare it: we save the true label and the timestamp of the trade, which are required for plotting, to a new dataframe named `X_meta`, and remove them from the original dataframe.
```python
features_meta = ["QUOTE_DATETIME", "buy_sell"]
X_meta = X[features_meta]
X = X.drop(columns=features_meta).rename(
{
"TRADE_PRICE": "trade_price",
"TRADE_SIZE": "trade_size",
"BEST_ASK": "ask_best",
"BEST_BID": "bid_best",
"buy_sell": "y_true",
},
axis=1,
)
X[["ask_size_best", "bid_size_best"]] = X[["ask_size_ex", "bid_size_ex"]]
```

## Plot Results

To estimate the accuracy over time, we first attach predictions to `X_meta` with `X_meta["y_pred"] = clf.fit(X).predict(X)`, then group by date and compute the accuracy for each group using [`sklearn.metrics.accuracy_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html).

```python
from sklearn.metrics import accuracy_score

df_plot = X_meta.groupby(X_meta.QUOTE_DATETIME.dt.date).apply(
lambda x: accuracy_score(x["y_true"], x["y_pred"]) * 100
)
```
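The grouping pattern can be sanity-checked on tiny synthetic data (hypothetical dates and labels):

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# two trading days with made-up true and predicted labels
X_meta = pd.DataFrame(
    {
        "QUOTE_DATETIME": pd.to_datetime(
            ["2015-05-04 09:30", "2015-05-04 10:00", "2015-05-05 09:30"]
        ),
        "y_true": [1, -1, 1],
        "y_pred": [1, 1, 1],
    }
)

# one accuracy value (in percent) per trading day
df_plot = X_meta.groupby(X_meta.QUOTE_DATETIME.dt.date).apply(
    lambda x: accuracy_score(x["y_true"], x["y_pred"]) * 100
)
print(df_plot)  # day one: 50.0, day two: 100.0
```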
We use [`matplotlib`](https://matplotlib.org/) to match the plots from the paper as closely as possible.


```python
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter
from matplotlib.ticker import PercentFormatter

plt.rcParams["font.family"] = "serif"

plt.figure(figsize=(9, 3))
plt.grid(True, axis="y")

# line plot
plt.plot(df_plot, color="tab:orange", linewidth=1.5, label="ISE")

# y-axis + x-axis
plt.ylim(0, 100)
plt.ylabel("Overall success rate")
ax = plt.gca()
ax.yaxis.set_major_formatter(PercentFormatter(100, decimals=0))
ax.xaxis.set_major_formatter(DateFormatter("%b-%y"))

# title + legend
plt.title(
"C: Performance of trade classification based on\n trade size rule + depth rule + reverse LR (NBBO,exchange)",
loc="left",
)
plt.legend(loc="lower left", frameon=False)

plt.show()
```

**Output:**

!["gsu"](./img/gsu.png)

[^1]: <div class="csl-entry">Grauer, C., Schuster, P., &amp; Uhrig-Homburg, M. (2023). <i>Option trade classification</i>. <a href="https://doi.org/10.2139/ssrn.4098475">https://doi.org/10.2139/ssrn.4098475</a></div>
9 changes: 7 additions & 2 deletions mkdocs.yml
edit_uri: ""
nav:
- Home: index.md
- API reference: reference.md
- Examples:
- Option trade classification: option_trade_classification.md
- More:
- Naming conventions: naming_conventions.md
- Handling of NaNs: nan_handling.md

markdown_extensions:
- toc:
- admonition
- codehilite
- extra
- pymdownx.details
- pymdownx.superfences
- pymdownx.superfences:
custom_fences:
- name: mermaid
97 changes: 77 additions & 20 deletions notebooks/gsu.ipynb
},
"outputs": [],
"source": [
"import gcsfs\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"from matplotlib.dates import DateFormatter\n",
"from matplotlib.ticker import PercentFormatter\n",
"from sklearn.metrics import accuracy_score\n",
"\n",
"from tclf.classical_classifier import ClassicalClassifier"
]
" \"ask_ex\",\n",
" \"BEST_ASK\",\n",
" \"BEST_BID\",\n",
" \"price_all_lag\",\n",
" \"price_all_lead\",\n",
" \"price_ex_lead\",\n",
"]\n",
"\n",
"features_size = [\n",
" \"TRADE_SIZE\",\n",
" \"bid_size_ex\",\n",
" \"ask_size_ex\"\n",
"]\n",
"features_size = [\"TRADE_SIZE\", \"bid_size_ex\", \"ask_size_ex\"]\n",
"\n",
"features_meta = [\"QUOTE_DATETIME\", \"buy_sell\"]\n",
"\n",
"metadata": {},
"outputs": [],
"source": [
"fs = gcsfs.GCSFileSystem()\n",
"\n",
"gcs_loc = fs.glob(\n",
" \"gs://thesis-bucket-option-trade-classification/data/preprocessed/matched_ise_quotes*\"\n",
")\n",
"X = pd.read_parquet(gcs_loc, engine=\"pyarrow\", columns=columns, filesystem=fs)\n",
"\n",
"X_meta = X[features_meta]\n",
"X = X.drop(columns=features_meta).rename(\n",
" {\n",
" \"TRADE_PRICE\": \"trade_price\",\n",
" \"TRADE_SIZE\": \"trade_size\",\n",
" \"BEST_ASK\": \"ask_best\",\n",
" \"BEST_BID\": \"bid_best\",\n",
" \"buy_sell\": \"y_true\",\n",
" },\n",
" axis=1,\n",
")\n",
"X[[\"ask_size_best\", \"bid_size_best\"]] = X[[\"ask_size_ex\", \"bid_size_ex\"]]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X.head()"
]
},
{
},
"outputs": [],
"source": [
"layers = [ # grauer (benchmark 2)\n",
" (\"trade_size\", \"ex\"),\n",
" (\"quote\", \"best\"),\n",
" (\"quote\", \"ex\"),\n",
" (\"depth\", \"best\"),\n",
" (\"depth\", \"ex\"),\n",
" (\"rev_tick\", \"all\"),\n",
"]\n",
"clf = ClassicalClassifier(layers=layers, strategy=\"random\")\n",
"\n",
"X_meta[\"y_pred\"] = clf.fit(X).predict(X)"
"source": [
"X_meta"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_plot = X_meta.groupby(X_meta.QUOTE_DATETIME.dt.date).apply(\n",
" lambda x: accuracy_score(x[\"y_true\"], x[\"y_pred\"])\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.rcParams[\"font.family\"] = \"serif\"\n",
"plt.figure(figsize=(9, 3))\n",
"plt.plot(df_plot * 100, color=\"tab:orange\", linewidth=1.5, label=\"ISE\")\n",
"plt.ylim(0, 100)\n",
"plt.ylabel(\"Overall success rate\")\n",
"ax = plt.gca()\n",
"ax.yaxis.set_major_formatter(PercentFormatter(100, decimals=0))\n",
"ax.xaxis.set_major_formatter(DateFormatter(\"%b-%y\"))\n",
"plt.title(\n",
" \"C: Performance of trade classification based on\\n trade size rule + depth rule + reverse LR (NBBO, exchange)\",\n",
" loc=\"left\",\n",
")\n",
"plt.grid(True, axis=\"y\")\n",
"plt.legend(loc=\"lower left\", frameon=False)\n",
"plt.show()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
