Add checks if required columns are present 🙂 (#45)
* update readme for true label

* Populate naming conventions docu

* Add checks for required columns

* Extend documentation
KarelZe authored Dec 28, 2023
1 parent 3b7b2d4 commit 42127df
Showing 7 changed files with 119 additions and 20 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -16,7 +16,7 @@ The key features are:
```console
$ pip install .
---> 100%
Successfully installed tclf-0.0.0
Successfully installed tclf-0.0.1
```

## Supported Algorithms
@@ -62,7 +62,7 @@ $ python main.py
```
In this example, input data is available as a pd.DataFrame with columns conforming to our [naming conventions](https://karelze.github.io/tclf/naming_conventions/).

The parameter `layers=[("quote", "ex")]` sets the quote rule at the exchange level and `strategy="random"` specifies the fallback strategy for unclassified trades. The true label `y` is not used in classification and only for API consistency by convention.
The parameter `layers=[("quote", "ex")]` sets the quote rule at the exchange level and `strategy="random"` specifies the fallback strategy for unclassified trades.

## Advanced Example
Often it is desirable to classify on both exchange-level and NBBO data. Also, data might only be available as a NumPy array. So let's extend the previous example by applying the quote rule first at the exchange level, then at the NBBO, and classifying all remaining trades randomly.
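For orientation, here is a minimal sketch of such a layered setup. It is not the package's own advanced example: the values are made up, and the NumPy array is simply wrapped in a `pd.DataFrame` with column names following the [naming conventions](https://karelze.github.io/tclf/naming_conventions/) before fitting.

```python
import numpy as np
import pandas as pd

from tclf.classical_classifier import ClassicalClassifier

# Illustrative data: trade_price, ask_ex, bid_ex, ask_best, bid_best.
arr = np.array(
    [
        [2.0, 2.5, 1.5, 2.4, 1.6],
        [3.0, 2.9, 2.7, 3.1, 2.8],
        [1.0, 1.2, 0.8, 1.1, 0.9],
    ]
)

# One way to satisfy the naming conventions for array input: wrap the
# array in a DataFrame with conforming column names.
X = pd.DataFrame(arr, columns=["trade_price", "ask_ex", "bid_ex", "ask_best", "bid_best"])

clf = ClassicalClassifier(
    layers=[("quote", "ex"), ("quote", "best")],  # exchange level first, then NBBO
    strategy="random",  # remaining unclassified trades are assigned at random
)
y_pred = clf.fit(X).predict(X)
```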
4 changes: 2 additions & 2 deletions docs/index.md
@@ -16,7 +16,7 @@ The key features are:
```console
$ pip install .
---> 100%
Successfully installed tclf-0.0.0
Successfully installed tclf-0.0.1
```

## Supported Algorithms
@@ -62,7 +62,7 @@ $ python main.py
```
In this example, input data is available as a pd.DataFrame with columns conforming to our [naming conventions](https://karelze.github.io/tclf/naming_conventions/).

The parameter `layers=[("quote", "ex")]` sets the quote rule at the exchange level and `strategy="random"` specifies the fallback strategy for unclassified trades. The true label `y` is not used in classification and only for API consistency by convention.
The parameter `layers=[("quote", "ex")]` sets the quote rule at the exchange level and `strategy="random"` specifies the fallback strategy for unclassified trades.

## Advanced Example
Often it is desirable to classify on both exchange-level and NBBO data. Also, data might only be available as a NumPy array. So let's extend the previous example by applying the quote rule first at the exchange level, then at the NBBO, and classifying all remaining trades randomly.
19 changes: 19 additions & 0 deletions docs/naming_conventions.md
@@ -0,0 +1,19 @@
For `tclf` to work, we impose constraints on the column names. The following input is required by each rule. Data requirements are additive if multiple rules are applied.




| Rule | Layer Name | Columns |
|-----------------------------|------------------------|-------------------------------------------------------------------------------------------|
| No classification | `("nan","sub")` | None |
| Tick test | `("tick","sub")` | `trade_price`, `price_{sub}_lag` |
| Reverse tick Test | `("rev_tick","sub")` | `trade_price`, `price_{sub}_lead` |
| Quote Rule | `("quote","sub")` | `trade_price`, `ask_{sub}`, `bid_{sub}` |
| Lee-Ready Algorithm | `("lr","sub")` | `trade_price`, `price_{sub}_lag`, `ask_{sub}`, `bid_{sub}` |
| EMO Algorithm | `("emo","sub")` | `trade_price`, `price_{sub}_lag`, `ask_{sub}`, `bid_{sub}` |
| CLNV Rule | `("clnv","sub")` | `trade_price`, `price_{sub}_lag`, `ask_{sub}`, `bid_{sub}` |
| Reverse Lee-Ready Algorithm | `("rev_lr","sub")` | `trade_price`, `price_{sub}_lead`, `ask_{sub}`, `bid_{sub}` |
| Reverse EMO Algorithm | `("rev_emo","sub")` | `trade_price`, `price_{sub}_lead`, `ask_{sub}`, `bid_{sub}` |
| Reverse CLNV Rule | `("rev_clnv","sub")` | `trade_price`, `price_{sub}_lead`, `ask_{sub}`, `bid_{sub}` |
| Depth rule | `("depth","sub")` | `trade_price`, `ask_{sub}`, `bid_{sub}`, `ask_size_{sub}`, `bid_size_{sub}` |
| Trade size rule | `("trade_size","sub")` | `trade_size`, `ask_size_{sub}`, `bid_size_{sub}` |
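For illustration, here is a minimal sketch (values are made up) of a `pd.DataFrame` that satisfies `layers=[("quote", "ex"), ("rev_tick", "all")]`: the quote rule needs `trade_price`, `ask_ex`, and `bid_ex`, and the reverse tick test additionally needs `price_all_lead`.

```python
import pandas as pd

from tclf.classical_classifier import ClassicalClassifier

# Union of the column requirements of both layers.
X = pd.DataFrame(
    {
        "trade_price": [2.0, 3.0],
        "ask_ex": [2.5, 2.9],
        "bid_ex": [1.5, 2.7],
        "price_all_lead": [2.1, 2.8],
    }
)

clf = ClassicalClassifier(layers=[("quote", "ex"), ("rev_tick", "all")], strategy="random")
clf.fit(X)  # raises a ValueError if any required column is missing
```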
5 changes: 5 additions & 0 deletions docs/option_trade_classification.md
@@ -38,6 +38,11 @@ X = pd.read_parquet(gcs_loc, engine="pyarrow", filesystem=fs)
```
Unfortunately, the dataset does not yet follow the [naming conventions](https://karelze.github.io/tclf/naming_conventions/) and is missing columns required by `tclf`. We take care of this next.😅

```python
clf.fit(X)
>>> ValueError: Expected to find columns: ['ask_best', 'ask_size_best', 'bid_best', 'bid_size_best', 'trade_price', 'trade_size']. Check naming/presence of columns. See: https://karelze.github.io/tclf/naming_conventions/
```

The calculation of the [depth rule](https://github.com/KarelZe/tclf/blob/main/src/tclf/classical_classifier.py#L362C1-L363C1) requires the columns `ask_{subset}`, `bid_{subset}`, and `trade_price`, as well as `ask_size_{subset}`, `bid_size_{subset}` and `trade_size`. The columns `BEST_ASK`, `BEST_BID`, `TRADE_PRICE`, and `TRADE_SIZE` are renamed to match our naming conventions of `ask_{subset}`, `bid_{subset}`, `trade_price`, and `trade_size`.

As there is no `{ask/bid}_size_best` at the NBBO level (`subset="best"`), I copy the columns from the trading venue. This allows us to mimic the authors' decision to filter for mid-spread at the NBBO level, but classify by the trade size relative to the ask/bid size at the exchange.
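A sketch of that preparation could look as follows. Note that the exchange-level size columns referenced here (`ask_size_ex`, `bid_size_ex`) are assumptions, as the tutorial's actual snippet is not shown in this diff.

```python
# Rename the raw columns to match the naming conventions (illustrative,
# not the tutorial's exact code).
X = X.rename(
    columns={
        "TRADE_PRICE": "trade_price",
        "TRADE_SIZE": "trade_size",
        "BEST_ASK": "ask_best",
        "BEST_BID": "bid_best",
    }
)

# No ask/bid sizes exist at the NBBO level, so reuse the sizes quoted at
# the trading venue for the "best" subset.
X["ask_size_best"] = X["ask_size_ex"]
X["bid_size_best"] = X["bid_size_ex"]
```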
2 changes: 1 addition & 1 deletion src/tclf/__init__.py
@@ -1 +1 @@
__version__ = "0.0.2"
__version__ = "0.0.1"
90 changes: 75 additions & 15 deletions src/tclf/classical_classifier.py
@@ -12,7 +12,10 @@
import pandas as pd
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils import check_random_state
from sklearn.utils.validation import _check_sample_weight, check_is_fitted
from sklearn.utils.validation import (
_check_sample_weight,
check_is_fitted,
)

from tclf.types import ArrayLike, MatrixLike

@@ -122,7 +125,7 @@ def _tick(self, subset: str) -> npt.NDArray:
"""Classify a trade as a buy (sell) if its trade price is above (below) the closest different price of a previous trade.
Args:
subset (str): subset i. e., 'all' or 'ex'.
subset (str): subset i.e., 'all' or 'ex'.
Returns:
npt.NDArray: result of tick rule. Can be np.NaN.
@@ -139,7 +142,7 @@ def _rev_tick(self, subset: str) -> npt.NDArray:
"""Classify a trade as a sell (buy) if its trade price is below (above) the closest different price of a subsequent trade.
Args:
subset (str): subset i. e.,'all' or 'ex'.
subset (str): subset i.e.,'all' or 'ex'.
Returns:
npt.NDArray: result of reverse tick rule. Can be np.NaN.
@@ -156,7 +159,7 @@ def _quote(self, subset: str) -> npt.NDArray:
"""Classify a trade as a buy (sell) if its trade price is above (below) the midpoint of the bid and ask spread. Trades executed at the midspread are not classified.
Args:
subset (str): subset i. e., 'ex' or 'best'.
subset (str): subset i.e., 'ex' or 'best'.
Returns:
npt.NDArray: result of quote rule. Can be np.NaN.
@@ -175,7 +178,7 @@ def _lr(self, subset: str) -> npt.NDArray:
Adapted from Lee and Ready (1991).
Args:
subset (str): subset i. e., 'ex' or 'best'.
subset (str): subset i.e., 'ex' or 'best'.
Returns:
npt.ndarray: result of the lee and ready algorithm with tick rule.
@@ -190,7 +193,7 @@ def _rev_lr(self, subset: str) -> npt.NDArray:
Adapted from Lee and Ready (1991).
Args:
subset (str): subset i. e.,'ex' or 'best'.
subset (str): subset i.e.,'ex' or 'best'.
Returns:
npt.NDArray: result of the lee and ready algorithm with reverse tick
@@ -205,7 +208,7 @@ def _mid(self, subset: str) -> npt.NDArray:
Midpoint is calculated as the average of the bid and ask spread if the spread is positive. Otherwise, np.NaN is returned.
Args:
subset (str): subset i. e.,
subset (str): subset i.e.,
'ex' or 'best'
Returns:
npt.NDArray: midpoints. Can be np.NaN.
@@ -220,7 +223,7 @@ def _is_at_ask_xor_bid(self, subset: str) -> pd.Series:
"""Check if the trade price is at the ask xor bid.
Args:
subset (str): subset i. e.,
subset (str): subset i.e.,
'ex' or 'best'.
Returns:
@@ -236,7 +239,7 @@ def _is_at_upper_xor_lower_quantile(
"""Check if the trade price is at the ask xor bid.
Args:
subset (str): subset i. e., 'ex'.
subset (str): subset i.e., 'ex'.
quantiles (float, optional): percentage of quantiles. Defaults to 0.3.
Returns:
@@ -260,7 +263,7 @@ def _emo(self, subset: str) -> npt.NDArray:
Adapted from Ellis et al. (2000).
Args:
subset (Literal["ex", "best"]): subset i. e., 'ex' or 'best'.
subset (Literal["ex", "best"]): subset i.e., 'ex' or 'best'.
Returns:
npt.NDArray: result of the emo algorithm with tick rule. Can be
@@ -276,7 +279,7 @@ def _rev_emo(self, subset: str) -> npt.NDArray:
Adapted from Grauer et al. (2022).
Args:
subset (str): subset i. e., 'ex' or 'best'.
subset (str): subset i.e., 'ex' or 'best'.
Returns:
npt.NDArray: result of the emo algorithm with reverse tick rule.
@@ -298,7 +301,7 @@ def _clnv(self, subset: str) -> npt.NDArray:
Adapted from Chakrabarty et al. (2007).
Args:
subset (str): subset i. e.,'ex' or 'best'.
subset (str): subset i.e.,'ex' or 'best'.
Returns:
npt.NDArray: result of the emo algorithm with tick rule. Can be
@@ -322,7 +325,7 @@ def _rev_clnv(self, subset: str) -> npt.NDArray:
Similar to extension of emo algorithm proposed Grauer et al. (2022).
Args:
subset (str): subset i. e., 'ex' or 'best'.
subset (str): subset i.e., 'ex' or 'best'.
Returns:
npt.NDArray: result of the emo algorithm with tick rule. Can be
@@ -340,7 +343,7 @@ def _trade_size(self, subset: str) -> npt.NDArray:
Adapted from Grauer et al. (2022).
Args:
subset (str): subset i. e., 'ex' or 'best'.
subset (str): subset i.e., 'ex' or 'best'.
Returns:
npt.NDArray: result of the trade size rule. Can be np.NaN.
@@ -366,7 +369,7 @@ def _depth(self, subset: str) -> npt.NDArray:
Adapted from Grauer et al. (2022).
Args:
subset (str): subset i. e., 'ex' or 'best'.
subset (str): subset i.e., 'ex' or 'best'.
Returns:
npt.NDArray: result of depth rule. Can be np.NaN.
@@ -392,6 +395,60 @@ def _nan(self, subset: str) -> npt.NDArray:
"""
return np.full(shape=(self.X_.shape[0],), fill_value=np.nan)

def _validate_columns(self, found_cols: list[str]) -> None:
"""Validate if all required columns are present.
Args:
found_cols (list[str]): columns present in dataframe.
"""

def lookup_columns(func_str: str, sub: str) -> list[str]:
LR_LIKE = [
"trade_price",
f"price_{sub}_lag",
f"ask_{sub}",
f"bid_{sub}",
]
REV_LR_LIKE = [
"trade_price",
f"price_{sub}_lead",
f"ask_{sub}",
f"bid_{sub}",
]

LUT_REQUIRED_COLUMNS: dict[str, list[str]] = {
"nan": [],
"clnv": LR_LIKE,
"depth": [
"trade_price",
f"ask_{sub}",
f"bid_{sub}",
f"ask_size_{sub}",
f"bid_size_{sub}",
],
"emo": LR_LIKE,
"lr": LR_LIKE,
"quote": ["trade_price", f"ask_{sub}", f"bid_{sub}"],
"rev_clnv": REV_LR_LIKE,
"rev_emo": REV_LR_LIKE,
"rev_lr": REV_LR_LIKE,
"rev_tick": ["trade_price", f"price_{sub}_lead"],
"tick": ["trade_price", f"price_{sub}_lag"],
"trade_size": ["trade_size", f"ask_size_{sub}", f"bid_size_{sub}"],
}
return LUT_REQUIRED_COLUMNS[func_str]

required_cols_set = set()
for func_str, sub in self._layers:
func_col = lookup_columns(func_str, sub)
required_cols_set.update(func_col)

missing_cols = sorted(required_cols_set - set(found_cols))
if missing_cols:
raise ValueError(
f"Expected to find columns: {missing_cols}. Check naming/presenence of columns. See: https://karelze.github.io/tclf/naming_conventions/"
)

def fit(
self,
X: MatrixLike,
@@ -464,6 +521,9 @@ def fit(
f"expected one of {ALLOWED_FUNC_STR}."
)

columns = self.columns_
self._validate_columns(columns)

return self

def predict(self, X: MatrixLike) -> npt.NDArray:
15 changes: 15 additions & 0 deletions tests/test_classical_classifier.py
@@ -175,6 +175,21 @@ def test_invalid_func(self, x_train: pd.DataFrame) -> None:
with pytest.raises(ValueError, match=r"Unknown function string"):
classifier.fit(x_train)

def test_missing_columns(self, x_train: pd.DataFrame) -> None:
"""Test, if an error is raised, if required columns are missing.
An exception should be raised if required features are missing,
including the columns required for classification.
"""
classifier = ClassicalClassifier(
layers=[("tick", "all"), ("quote", "ex")], random_state=42
)
with pytest.raises(
ValueError,
match=r"Expected to find columns: ['ask_ex', 'bid_ex', 'price_all_lag']*",
):
classifier.fit(x_train[["trade_price", "trade_size"]])

def test_invalid_col_length(self, x_train: pd.DataFrame) -> None:
"""Test, if only valid column length can be passed.
