Add checks if required columns are present 🙂 (#45)
* update readme for true label

* Populate naming conventions docu

* Add checks for required columns

* Extend documentation
KarelZe authored Dec 28, 2023
1 parent 3b7b2d4 commit 42127df
Showing 7 changed files with 119 additions and 20 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -16,7 +16,7 @@ The key features are:
```console
$ pip install .
---> 100%
Successfully installed tclf-0.0.0
Successfully installed tclf-0.0.1
```

## Supported Algorithms
@@ -62,7 +62,7 @@ $ python main.py
```
In this example, input data is available as a pd.DataFrame with columns conforming to our [naming conventions](https://karelze.github.io/tclf/naming_conventions/).

The parameter `layers=[("quote", "ex")]` sets the quote rule at the exchange level and `strategy="random"` specifies the fallback strategy for unclassified trades. The true label `y` is not used in classification and only for API consistency by convention.
The parameter `layers=[("quote", "ex")]` sets the quote rule at the exchange level and `strategy="random"` specifies the fallback strategy for unclassified trades.

## Advanced Example
Often it is desirable to classify on both exchange-level and NBBO data. Also, data might only be available as a NumPy array. So let's extend the previous example by applying the quote rule first at the exchange level, then at the NBBO, and classifying all remaining trades randomly.
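For orientation, here is a minimal sketch of such a layered setup. It is not the package's own advanced example: the values are made up, and the NumPy array is simply wrapped in a `pd.DataFrame` with column names following the [naming conventions](https://karelze.github.io/tclf/naming_conventions/) before fitting.

```python
import numpy as np
import pandas as pd

from tclf.classical_classifier import ClassicalClassifier

# Illustrative data: trade_price, ask_ex, bid_ex, ask_best, bid_best.
arr = np.array(
    [
        [2.0, 2.5, 1.5, 2.4, 1.6],
        [3.0, 2.9, 2.7, 3.1, 2.8],
        [1.0, 1.2, 0.8, 1.1, 0.9],
    ]
)

# One way to satisfy the naming conventions for array input: wrap the
# array in a DataFrame with conforming column names.
X = pd.DataFrame(arr, columns=["trade_price", "ask_ex", "bid_ex", "ask_best", "bid_best"])

clf = ClassicalClassifier(
    layers=[("quote", "ex"), ("quote", "best")],  # exchange level first, then NBBO
    strategy="random",  # remaining unclassified trades are assigned at random
)
y_pred = clf.fit(X).predict(X)
```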
4 changes: 2 additions & 2 deletions docs/index.md
@@ -16,7 +16,7 @@ The key features are:
```console
$ pip install .
---> 100%
Successfully installed tclf-0.0.0
Successfully installed tclf-0.0.1
```

## Supported Algorithms
@@ -62,7 +62,7 @@ $ python main.py
```
In this example, input data is available as a pd.DataFrame with columns conforming to our [naming conventions](https://karelze.github.io/tclf/naming_conventions/).

The parameter `layers=[("quote", "ex")]` sets the quote rule at the exchange level and `strategy="random"` specifies the fallback strategy for unclassified trades. The true label `y` is not used in classification and only for API consistency by convention.
The parameter `layers=[("quote", "ex")]` sets the quote rule at the exchange level and `strategy="random"` specifies the fallback strategy for unclassified trades.

## Advanced Example
Often it is desirable to classify on both exchange-level and NBBO data. Also, data might only be available as a NumPy array. So let's extend the previous example by applying the quote rule first at the exchange level, then at the NBBO, and classifying all remaining trades randomly.
19 changes: 19 additions & 0 deletions docs/naming_conventions.md
@@ -0,0 +1,19 @@
For `tclf` to work, we impose constraints on the column names. The following input is required by each rule. Data requirements are additive if multiple rules are applied.




| Rule | Layer Name | Columns |
|-----------------------------|------------------------|-------------------------------------------------------------------------------------------|
| No classification | `("nan","sub")` | None |
| Tick test | `("tick","sub")` | `trade_price`, `price_{sub}_lag` |
| Reverse tick Test | `("rev_tick","sub")` | `trade_price`, `price_{sub}_lead` |
| Quote Rule | `("quote","sub")` | `trade_price`, `ask_{sub}`, `bid_{sub}` |
| Lee-Ready Algorithm | `("lr","sub")` | `trade_price`, `price_{sub}_lag`, `ask_{sub}`, `bid_{sub}` |
| EMO Algorithm | `("emo","sub")` | `trade_price`, `price_{sub}_lag`, `ask_{sub}`, `bid_{sub}` |
| CLNV Rule | `("clnv","sub")` | `trade_price`, `price_{sub}_lag`, `ask_{sub}`, `bid_{sub}` |
| Reverse Lee-Ready Algorithm | `("rev_lr","sub")` | `trade_price`, `price_{sub}_lead`, `ask_{sub}`, `bid_{sub}` |
| Reverse EMO Algorithm | `("rev_emo","sub")` | `trade_price`, `price_{sub}_lead`, `ask_{sub}`, `bid_{sub}` |
| Reverse CLNV Rule | `("rev_clnv","sub")` | `trade_price`, `price_{sub}_lead`, `ask_{sub}`, `bid_{sub}` |
| Depth rule | `("depth","sub")` | `trade_price`, `ask_{sub}`, `bid_{sub}`, `ask_size_{sub}`, `bid_size_{sub}` |
| Trade size rule | `("trade_size","sub")` | `trade_size`, `ask_size_{sub}`, `bid_size_{sub}` |
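For illustration, here is a minimal sketch (values are made up) of a `pd.DataFrame` that satisfies `layers=[("quote", "ex"), ("rev_tick", "all")]`: the quote rule needs `trade_price`, `ask_ex`, and `bid_ex`, and the reverse tick test additionally needs `price_all_lead`.

```python
import pandas as pd

from tclf.classical_classifier import ClassicalClassifier

# Union of the column requirements of both layers.
X = pd.DataFrame(
    {
        "trade_price": [2.0, 3.0],
        "ask_ex": [2.5, 2.9],
        "bid_ex": [1.5, 2.7],
        "price_all_lead": [2.1, 2.8],
    }
)

clf = ClassicalClassifier(layers=[("quote", "ex"), ("rev_tick", "all")], strategy="random")
clf.fit(X)  # raises a ValueError if any required column is missing
```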
5 changes: 5 additions & 0 deletions docs/option_trade_classification.md
@@ -38,6 +38,11 @@ X = pd.read_parquet(gcs_loc, engine="pyarrow", filesystem=fs)
```
Unfortunately, the dataset does not yet follow the [naming conventions](https://karelze.github.io/tclf/naming_conventions/) and is missing columns required by `tclf`. We take care of this next.😅

```python
clf.fit(X)
>>> ValueError: Expected to find columns: ['ask_best', 'ask_size_best', 'bid_best', 'bid_size_best', 'trade_price', 'trade_size']. Check naming/presence of columns. See: https://karelze.github.io/tclf/naming_conventions/
```

The calculation of the [depth rule](https://github.com/KarelZe/tclf/blob/main/src/tclf/classical_classifier.py#L362C1-L363C1) requires the columns `ask_{subset}`, `bid_{subset}`, and `trade_price`, as well as `ask_size_{subset}`, `bid_size_{subset}` and `trade_size`. The columns `BEST_ASK`, `BEST_BID`, `TRADE_PRICE`, and `TRADE_SIZE` are renamed to match our naming conventions of `ask_{subset}`, `bid_{subset}`, `trade_price`, and `trade_size`.

As there is no `{ask/bid}_size_best` at the NBBO level (`subset="best"`), I copy the columns from the trading venue. This allows us to mimic the authors' decision to filter for mid-spread at the NBBO level, but classify by the trade size relative to the ask/bid size at the exchange.
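A sketch of that preparation could look as follows. Note that the exchange-level size columns referenced here (`ask_size_ex`, `bid_size_ex`) are assumptions, as the tutorial's actual snippet is not shown in this diff.

```python
# Rename the raw columns to match the naming conventions (illustrative,
# not the tutorial's exact code).
X = X.rename(
    columns={
        "TRADE_PRICE": "trade_price",
        "TRADE_SIZE": "trade_size",
        "BEST_ASK": "ask_best",
        "BEST_BID": "bid_best",
    }
)

# No ask/bid sizes exist at the NBBO level, so reuse the sizes quoted at
# the trading venue for the "best" subset.
X["ask_size_best"] = X["ask_size_ex"]
X["bid_size_best"] = X["bid_size_ex"]
```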
2 changes: 1 addition & 1 deletion src/tclf/__init__.py
@@ -1 +1 @@
__version__ = "0.0.2"
__version__ = "0.0.1"
90 changes: 75 additions & 15 deletions src/tclf/classical_classifier.py
@@ -12,7 +12,10 @@
import pandas as pd
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils import check_random_state
from sklearn.utils.validation import _check_sample_weight, check_is_fitted
from sklearn.utils.validation import (
_check_sample_weight,
check_is_fitted,
)

from tclf.types import ArrayLike, MatrixLike

@@ -122,7 +125,7 @@ def _tick(self, subset: str) -> npt.NDArray:
"""Classify a trade as a buy (sell) if its trade price is above (below) the closest different price of a previous trade.
Args:
subset (str): subset i. e., 'all' or 'ex'.
subset (str): subset i.e., 'all' or 'ex'.
Returns:
npt.NDArray: result of tick rule. Can be np.NaN.
@@ -139,7 +142,7 @@ def _rev_tick(self, subset: str) -> npt.NDArray:
"""Classify a trade as a sell (buy) if its trade price is below (above) the closest different price of a subsequent trade.
Args:
subset (str): subset i. e.,'all' or 'ex'.
subset (str): subset i.e.,'all' or 'ex'.
Returns:
npt.NDArray: result of reverse tick rule. Can be np.NaN.
@@ -156,7 +159,7 @@ def _quote(self, subset: str) -> npt.NDArray:
"""Classify a trade as a buy (sell) if its trade price is above (below) the midpoint of the bid and ask spread. Trades executed at the midspread are not classified.
Args:
subset (str): subset i. e., 'ex' or 'best'.
subset (str): subset i.e., 'ex' or 'best'.
Returns:
npt.NDArray: result of quote rule. Can be np.NaN.
@@ -175,7 +178,7 @@ def _lr(self, subset: str) -> npt.NDArray:
Adapted from Lee and Ready (1991).
Args:
subset (str): subset i. e., 'ex' or 'best'.
subset (str): subset i.e., 'ex' or 'best'.
Returns:
npt.ndarray: result of the lee and ready algorithm with tick rule.
@@ -190,7 +193,7 @@ def _rev_lr(self, subset: str) -> npt.NDArray:
Adapted from Lee and Ready (1991).
Args:
subset (str): subset i. e.,'ex' or 'best'.
subset (str): subset i.e.,'ex' or 'best'.
Returns:
npt.NDArray: result of the lee and ready algorithm with reverse tick
@@ -205,7 +208,7 @@ def _mid(self, subset: str) -> npt.NDArray:
Midpoint is calculated as the average of the bid and ask spread if the spread is positive. Otherwise, np.NaN is returned.
Args:
subset (str): subset i. e.,
subset (str): subset i.e.,
'ex' or 'best'
Returns:
npt.NDArray: midpoints. Can be np.NaN.
@@ -220,7 +223,7 @@ def _is_at_ask_xor_bid(self, subset: str) -> pd.Series:
"""Check if the trade price is at the ask xor bid.
Args:
subset (str): subset i. e.,
subset (str): subset i.e.,
'ex' or 'best'.
Returns:
@@ -236,7 +239,7 @@ def _is_at_upper_xor_lower_quantile(
"""Check if the trade price is at the ask xor bid.
Args:
subset (str): subset i. e., 'ex'.
subset (str): subset i.e., 'ex'.
quantiles (float, optional): percentage of quantiles. Defaults to 0.3.
Returns:
@@ -260,7 +263,7 @@ def _emo(self, subset: str) -> npt.NDArray:
Adapted from Ellis et al. (2000).
Args:
subset (Literal["ex", "best"]): subset i. e., 'ex' or 'best'.
subset (Literal["ex", "best"]): subset i.e., 'ex' or 'best'.
Returns:
npt.NDArray: result of the emo algorithm with tick rule. Can be
@@ -276,7 +279,7 @@ def _rev_emo(self, subset: str) -> npt.NDArray:
Adapted from Grauer et al. (2022).
Args:
subset (str): subset i. e., 'ex' or 'best'.
subset (str): subset i.e., 'ex' or 'best'.
Returns:
npt.NDArray: result of the emo algorithm with reverse tick rule.
@@ -298,7 +301,7 @@ def _clnv(self, subset: str) -> npt.NDArray:
Adapted from Chakrabarty et al. (2007).
Args:
subset (str): subset i. e.,'ex' or 'best'.
subset (str): subset i.e.,'ex' or 'best'.
Returns:
npt.NDArray: result of the emo algorithm with tick rule. Can be
@@ -322,7 +325,7 @@ def _rev_clnv(self, subset: str) -> npt.NDArray:
Similar to extension of emo algorithm proposed Grauer et al. (2022).
Args:
subset (str): subset i. e., 'ex' or 'best'.
subset (str): subset i.e., 'ex' or 'best'.
Returns:
npt.NDArray: result of the emo algorithm with tick rule. Can be
@@ -340,7 +343,7 @@ def _trade_size(self, subset: str) -> npt.NDArray:
Adapted from Grauer et al. (2022).
Args:
subset (str): subset i. e., 'ex' or 'best'.
subset (str): subset i.e., 'ex' or 'best'.
Returns:
npt.NDArray: result of the trade size rule. Can be np.NaN.
@@ -366,7 +369,7 @@ def _depth(self, subset: str) -> npt.NDArray:
Adapted from Grauer et al. (2022).
Args:
subset (str): subset i. e., 'ex' or 'best'.
subset (str): subset i.e., 'ex' or 'best'.
Returns:
npt.NDArray: result of depth rule. Can be np.NaN.
@@ -392,6 +395,60 @@ def _nan(self, subset: str) -> npt.NDArray:
"""
return np.full(shape=(self.X_.shape[0],), fill_value=np.nan)

def _validate_columns(self, found_cols: list[str]) -> None:
"""Validate if all required columns are present.
Args:
found_cols (list[str]): columns present in dataframe.
"""

def lookup_columns(func_str: str, sub: str) -> list[str]:
LR_LIKE = [
"trade_price",
f"price_{sub}_lag",
f"ask_{sub}",
f"bid_{sub}",
]
REV_LR_LIKE = [
"trade_price",
f"price_{sub}_lead",
f"ask_{sub}",
f"bid_{sub}",
]

LUT_REQUIRED_COLUMNS: dict[str, list[str]] = {
"nan": [],
"clnv": LR_LIKE,
"depth": [
"trade_price",
f"ask_{sub}",
f"bid_{sub}",
f"ask_size_{sub}",
f"bid_size_{sub}",
],
"emo": LR_LIKE,
"lr": LR_LIKE,
"quote": ["trade_price", f"ask_{sub}", f"bid_{sub}"],
"rev_clnv": REV_LR_LIKE,
"rev_emo": REV_LR_LIKE,
"rev_lr": REV_LR_LIKE,
"rev_tick": ["trade_price", f"price_{sub}_lead"],
"tick": ["trade_price", f"price_{sub}_lag"],
"trade_size": ["trade_size", f"ask_size_{sub}", f"bid_size_{sub}"],
}
return LUT_REQUIRED_COLUMNS[func_str]

required_cols_set = set()
for func_str, sub in self._layers:
func_col = lookup_columns(func_str, sub)
required_cols_set.update(func_col)

missing_cols = sorted(required_cols_set - set(found_cols))
if missing_cols:
raise ValueError(
f"Expected to find columns: {missing_cols}. Check naming/presenence of columns. See: https://karelze.github.io/tclf/naming_conventions/"
)

def fit(
self,
X: MatrixLike,
@@ -464,6 +521,9 @@ def fit(
f"expected one of {ALLOWED_FUNC_STR}."
)

columns = self.columns_
self._validate_columns(columns)

return self

def predict(self, X: MatrixLike) -> npt.NDArray:
15 changes: 15 additions & 0 deletions tests/test_classical_classifier.py
@@ -175,6 +175,21 @@ def test_invalid_func(self, x_train: pd.DataFrame) -> None:
with pytest.raises(ValueError, match=r"Unknown function string"):
classifier.fit(x_train)

def test_missing_columns(self, x_train: pd.DataFrame) -> None:
"""Test, if an error is raised, if required columns are missing.
An exception should be raised if required features are missing,
including the columns required for classification.
"""
classifier = ClassicalClassifier(
layers=[("tick", "all"), ("quote", "ex")], random_state=42
)
with pytest.raises(
ValueError,
match=r"Expected to find columns: ['ask_ex', 'bid_ex', 'price_all_lag']*",
):
classifier.fit(x_train[["trade_price", "trade_size"]])

def test_invalid_col_length(self, x_train: pd.DataFrame) -> None:
"""Test, if only valid column length can be passed.
