diff --git a/docs/noj_book.automl.html b/docs/noj_book.automl.html index c370627..a1a6f4b 100644 --- a/docs/noj_book.automl.html +++ b/docs/noj_book.automl.html @@ -367,61 +367,61 @@
#uuid "2c62eb2d-1785-478f-b66d-c3ae7b64348e" {:model-data {:majority-class 1.0, :distinct-labels (0.0 1.0)}, :options {:model-type :metamorph.ml/dummy-classifier}, :id #uuid "9d5dcb45-9a55-4313-af63-b2d87b32802d", :feature-columns [:sex :pclass :embarked], :target-columns [:survived], :target-categorical-maps {:survived #tech.v3.dataset.categorical.CategoricalMap{:lookup-table {"no" 0, "yes" 1}, :src-column :survived, :result-datatype :float64}}, :scicloj.metamorph.ml/unsupervised? nil}
}
+:metamorph/mode :fit
#uuid "81782be5-2536-4bdd-8f81-49ec3b6326f7" {:model-data {:majority-class 1.0, :distinct-labels (0.0 1.0)}, :options {:model-type :metamorph.ml/dummy-classifier}, :id #uuid "b1cd14fb-804d-499d-b8a1-aa7d160ebfaf", :feature-columns [:sex :pclass :embarked], :target-columns [:survived], :target-categorical-maps {:survived #tech.v3.dataset.categorical.CategoricalMap{:lookup-table {"no" 0, "yes" 1}, :src-column :survived, :result-datatype :float64}}, :scicloj.metamorph.ml/unsupervised? nil}
}
The ctx contains a lot of information, so I show only its top-level keys:
(keys ctx-after-train)
:metamorph/data
(:metamorph/mode
- "2c62eb2d-1785-478f-b66d-c3ae7b64348e") #uuid
This context map holds the “data”, the “mode”, and a UUID for each operation (this pipeline has only one).
{:model-data {:majority-class 1.0, :distinct-labels (0.0 1.0)},
:options {:model-type :metamorph.ml/dummy-classifier},
- :id #uuid "9d5dcb45-9a55-4313-af63-b2d87b32802d",
+ :id #uuid "b1cd14fb-804d-499d-b8a1-aa7d160ebfaf",
:feature-columns [:sex :pclass :embarked],
:target-columns [:survived],
:target-categorical-maps
@@ -690,7 +690,7 @@
:metamorph/data
(:metamorph/mode
- "2c62eb2d-1785-478f-b66d-c3ae7b64348e") #uuid
+"81782be5-2536-4bdd-8f81-49ec3b6326f7") #uuid
For the dummy-model we do not see a trained-model, but it “communicates” the majority class from the training data so that it can be used for prediction.
In that sense the dummy-model has “learned” the majority class from its training data.
So we can get the prediction result out of the ctx:
@@ -723,7 +723,7 @@
:metamorph/data
(:metamorph/mode
- "97a96cdc-a3ab-400d-bbff-2b50917e2b12") #uuid
To show the power of pipelines, I start with the simplest possible pipeline and then expand on it.
We can already chain train and test with the usual functions:
@@ -1663,20 +1663,20 @@[:sex :pclass]
+[:sex :pclass :embarked]
{:model-type :sklearn.classification/random-forest-classifier}
+{:model-type :sklearn.classification/logistic-regression}
[:sex :pclass :embarked]
+[:sex :pclass]
{:model-type :sklearn.classification/logistic-regression}
+{:model-type :sklearn.classification/random-forest-classifier}
Note the slider control and the tooltips.
Here is an example with an actual correlation matrix.
diff --git a/docs/search.json b/docs/search.json index 395d54f..e4de628 100644 --- a/docs/search.json +++ b/docs/search.json @@ -24,7 +24,7 @@ "href": "noj_book.underlying_libraries.html", "title": "2 Underlying libraries", "section": "", - "text": "Noj consists of the following libraries:\n\nTablecloth - dataset processing on top of TMD\ntcutils - utility functions for Tablecloth datasets - 🛠 early stage\ntech.ml.dataset (TMD) - high-perfrormance table processing\ntmd-parquet - TMD bindings for Parquet format\ndtype-next - high-performance array-programming\nKindly - datavis standard\nFastmath - math & stats - alpha stage of version 3\nHanamicloth - easy layered graphics - 🛠 alpha version - should stabilize soon\nHanami - interactive datavis\nmetamorph.ml - machine learning platform\nscicloj.ml.tribuo - Tribuo machine learning models - see known issues ❗\nsome Tribuo modules added by default: general-linear and tree ensembles for regression/classification\nlibpython-clj - Python bindings\nkind-pyplot - Python plotting\nClojisR - R bindings\n\n\nsource: notebooks/noj_book/underlying_libraries.clj", + "text": "Noj consists of the following libraries:\n\nTablecloth - dataset processing on top of TMD\ntcutils - utility functions for Tablecloth datasets - 🛠 early stage\ntech.ml.dataset (TMD) - high-perfrormance table processing\ntmd-parquet - TMD bindings for Parquet format\ndtype-next - high-performance array-programming\nKindly - datavis standard\nFastmath - math & stats - alpha stage of version 3\nHanamicloth - easy layered graphics - 🛠 alpha version - should stabilize soon\nHanami - interactive datavis\nmetamorph.ml - machine learning platform\nscicloj.ml.tribuo - Tribuo machine learning models\nscicloj.ml.smile - Smile (v 2.6) machine learning models\nsklearn-clj -\nsome Tribuo modules added by default: general-linear and tree ensembles for regression/classification\nlibpython-clj - Python bindings\nkind-pyplot - Python plotting\nClojisR - R bindings\nsame-ish - 
approximate comparisons - useful for notebook testability\n\n\nsource: notebooks/noj_book/underlying_libraries.clj", "crumbs": [ "Overview", "2 Underlying libraries" @@ -200,7 +200,7 @@ "href": "noj_book.automl.html#the-metamorph-pipeline-abstraction", "title": "8 AutoML using metamorph pipelines", "section": "", - "text": "(require '[scicloj.metamorph.ml :as ml]\n '[scicloj.metamorph.core :as mm]\n '[tablecloth.api :as tc])\n\n\n\n(def titanic ml-basic/numeric-titanic-data)\n\n\n\n(def splits (first (tc/split->seq titanic)))\n\n\n(def train-ds (:train splits))\n\n\n(def test-ds (:test splits))\n\n\n\n\n(def my-pipeline\n (mm/pipeline\n (ml/model {:model-type :metamorph.ml/dummy-classifier})))\n\n\n\nmy-pipeline\n\n\n#function[clojure.core/partial/fn--5908]\n\n\n\n\n\n(def ctx-after-train\n (my-pipeline {:metamorph/data train-ds\n :metamorph/mode :fit}))\n\n\nctx-after-train\n\n{\n\n\n\n\n\n\n\n\n:metamorph/data\n\n\n\nGroup: 0 [711 4]:\n\n\n\n:sex\n:pclass\n:embarked\n:survived\n\n\n\n\n0.0\n2.0\n0.0\n0.0\n\n\n1.0\n1.0\n0.0\n1.0\n\n\n0.0\n3.0\n0.0\n0.0\n\n\n0.0\n3.0\n2.0\n0.0\n\n\n0.0\n3.0\n0.0\n0.0\n\n\n0.0\n3.0\n2.0\n0.0\n\n\n1.0\n3.0\n1.0\n0.0\n\n\n1.0\n1.0\n2.0\n1.0\n\n\n1.0\n3.0\n0.0\n0.0\n\n\n0.0\n2.0\n0.0\n0.0\n\n\n...\n...\n...\n...\n\n\n0.0\n3.0\n0.0\n1.0\n\n\n1.0\n1.0\n2.0\n1.0\n\n\n0.0\n2.0\n0.0\n0.0\n\n\n0.0\n3.0\n0.0\n1.0\n\n\n1.0\n1.0\n2.0\n1.0\n\n\n0.0\n3.0\n0.0\n0.0\n\n\n0.0\n2.0\n0.0\n0.0\n\n\n0.0\n3.0\n0.0\n0.0\n\n\n1.0\n1.0\n2.0\n1.0\n\n\n1.0\n3.0\n1.0\n1.0\n\n\n1.0\n1.0\n0.0\n1.0\n\n\n\n\n\n\n\n\n:metamorph/mode :fit#uuid \"2c62eb2d-1785-478f-b66d-c3ae7b64348e\" {:model-data {:majority-class 1.0, :distinct-labels (0.0 1.0)}, :options {:model-type :metamorph.ml/dummy-classifier}, :id #uuid \"9d5dcb45-9a55-4313-af63-b2d87b32802d\", :feature-columns [:sex :pclass :embarked], :target-columns [:survived], :target-categorical-maps {:survived #tech.v3.dataset.categorical.CategoricalMap{:lookup-table {\"no\" 0, \"yes\" 1}, :src-column :survived, 
:result-datatype :float64}}, :scicloj.metamorph.ml/unsupervised? nil}}\n\n\n(keys ctx-after-train)\n\n\n(:metamorph/data\n :metamorph/mode\n #uuid \"2c62eb2d-1785-478f-b66d-c3ae7b64348e\")\n\n\n\n(vals ctx-after-train)\n\n(Group: 0 [711 4]:\n\n\n\n:sex\n:pclass\n:embarked\n:survived\n\n\n\n\n0.0\n2.0\n0.0\n0.0\n\n\n1.0\n1.0\n0.0\n1.0\n\n\n0.0\n3.0\n0.0\n0.0\n\n\n0.0\n3.0\n2.0\n0.0\n\n\n0.0\n3.0\n0.0\n0.0\n\n\n0.0\n3.0\n2.0\n0.0\n\n\n1.0\n3.0\n1.0\n0.0\n\n\n1.0\n1.0\n2.0\n1.0\n\n\n1.0\n3.0\n0.0\n0.0\n\n\n0.0\n2.0\n0.0\n0.0\n\n\n...\n...\n...\n...\n\n\n0.0\n3.0\n0.0\n1.0\n\n\n1.0\n1.0\n2.0\n1.0\n\n\n0.0\n2.0\n0.0\n0.0\n\n\n0.0\n3.0\n0.0\n1.0\n\n\n1.0\n1.0\n2.0\n1.0\n\n\n0.0\n3.0\n0.0\n0.0\n\n\n0.0\n2.0\n0.0\n0.0\n\n\n0.0\n3.0\n0.0\n0.0\n\n\n1.0\n1.0\n2.0\n1.0\n\n\n1.0\n3.0\n1.0\n1.0\n\n\n1.0\n1.0\n0.0\n1.0\n\n\n\n:fit\n{:model-data {:majority-class 1.0, :distinct-labels (0.0 1.0)},\n :options {:model-type :metamorph.ml/dummy-classifier},\n :id #uuid \"9d5dcb45-9a55-4313-af63-b2d87b32802d\",\n :feature-columns [:sex :pclass :embarked],\n :target-columns [:survived],\n :target-categorical-maps\n {:survived\n {:lookup-table {\"no\" 0, \"yes\" 1},\n :src-column :survived,\n :result-datatype :float64}},\n :scicloj.metamorph.ml/unsupervised? 
nil}\n)\n\n\n\n(def ctx-after-predict\n (my-pipeline (assoc ctx-after-train\n :metamorph/mode :transform\n :metamorph/data test-ds)))\n\n\n(keys ctx-after-predict)\n\n\n(:metamorph/data\n :metamorph/mode\n #uuid \"2c62eb2d-1785-478f-b66d-c3ae7b64348e\")\n\n\n\n\n(-> ctx-after-predict :metamorph/data :survived)\n\n\n#tech.v3.dataset.column<float64>[178]\n:survived\n[1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000...]", + "text": "(require '[scicloj.metamorph.ml :as ml]\n '[scicloj.metamorph.core :as mm]\n '[tablecloth.api :as tc])\n\n\n\n(def titanic ml-basic/numeric-titanic-data)\n\n\n\n(def splits (first (tc/split->seq titanic)))\n\n\n(def train-ds (:train splits))\n\n\n(def test-ds (:test splits))\n\n\n\n\n(def my-pipeline\n (mm/pipeline\n (ml/model {:model-type :metamorph.ml/dummy-classifier})))\n\n\n\nmy-pipeline\n\n\n#function[clojure.core/partial/fn--5908]\n\n\n\n\n\n(def ctx-after-train\n (my-pipeline {:metamorph/data train-ds\n :metamorph/mode :fit}))\n\n\nctx-after-train\n\n{\n\n\n\n\n\n\n\n\n:metamorph/data\n\n\n\nGroup: 0 [711 4]:\n\n\n\n:sex\n:pclass\n:embarked\n:survived\n\n\n\n\n0.0\n3.0\n1.0\n0.0\n\n\n0.0\n2.0\n2.0\n1.0\n\n\n0.0\n1.0\n0.0\n0.0\n\n\n1.0\n3.0\n0.0\n1.0\n\n\n0.0\n3.0\n1.0\n0.0\n\n\n0.0\n3.0\n0.0\n0.0\n\n\n0.0\n1.0\n2.0\n1.0\n\n\n1.0\n3.0\n1.0\n1.0\n\n\n0.0\n3.0\n0.0\n0.0\n\n\n0.0\n3.0\n0.0\n0.0\n\n\n...\n...\n...\n...\n\n\n0.0\n1.0\n0.0\n0.0\n\n\n0.0\n3.0\n0.0\n1.0\n\n\n1.0\n2.0\n0.0\n0.0\n\n\n1.0\n2.0\n0.0\n1.0\n\n\n0.0\n3.0\n0.0\n0.0\n\n\n0.0\n3.0\n2.0\n0.0\n\n\n1.0\n1.0\n0.0\n1.0\n\n\n0.0\n3.0\n0.0\n0.0\n\n\n0.0\n3.0\n0.0\n0.0\n\n\n0.0\n3.0\n0.0\n0.0\n\n\n0.0\n1.0\n0.0\n1.0\n\n\n\n\n\n\n\n\n:metamorph/mode :fit#uuid \"81782be5-2536-4bdd-8f81-49ec3b6326f7\" {:model-data {:majority-class 1.0, :distinct-labels (0.0 1.0)}, :options {:model-type :metamorph.ml/dummy-classifier}, :id #uuid \"b1cd14fb-804d-499d-b8a1-aa7d160ebfaf\", :feature-columns 
[:sex :pclass :embarked], :target-columns [:survived], :target-categorical-maps {:survived #tech.v3.dataset.categorical.CategoricalMap{:lookup-table {\"no\" 0, \"yes\" 1}, :src-column :survived, :result-datatype :float64}}, :scicloj.metamorph.ml/unsupervised? nil}}\n\n\n(keys ctx-after-train)\n\n\n(:metamorph/data\n :metamorph/mode\n #uuid \"81782be5-2536-4bdd-8f81-49ec3b6326f7\")\n\n\n\n(vals ctx-after-train)\n\n(Group: 0 [711 4]:\n\n\n\n:sex\n:pclass\n:embarked\n:survived\n\n\n\n\n0.0\n3.0\n1.0\n0.0\n\n\n0.0\n2.0\n2.0\n1.0\n\n\n0.0\n1.0\n0.0\n0.0\n\n\n1.0\n3.0\n0.0\n1.0\n\n\n0.0\n3.0\n1.0\n0.0\n\n\n0.0\n3.0\n0.0\n0.0\n\n\n0.0\n1.0\n2.0\n1.0\n\n\n1.0\n3.0\n1.0\n1.0\n\n\n0.0\n3.0\n0.0\n0.0\n\n\n0.0\n3.0\n0.0\n0.0\n\n\n...\n...\n...\n...\n\n\n0.0\n1.0\n0.0\n0.0\n\n\n0.0\n3.0\n0.0\n1.0\n\n\n1.0\n2.0\n0.0\n0.0\n\n\n1.0\n2.0\n0.0\n1.0\n\n\n0.0\n3.0\n0.0\n0.0\n\n\n0.0\n3.0\n2.0\n0.0\n\n\n1.0\n1.0\n0.0\n1.0\n\n\n0.0\n3.0\n0.0\n0.0\n\n\n0.0\n3.0\n0.0\n0.0\n\n\n0.0\n3.0\n0.0\n0.0\n\n\n0.0\n1.0\n0.0\n1.0\n\n\n\n:fit\n{:model-data {:majority-class 1.0, :distinct-labels (0.0 1.0)},\n :options {:model-type :metamorph.ml/dummy-classifier},\n :id #uuid \"b1cd14fb-804d-499d-b8a1-aa7d160ebfaf\",\n :feature-columns [:sex :pclass :embarked],\n :target-columns [:survived],\n :target-categorical-maps\n {:survived\n {:lookup-table {\"no\" 0, \"yes\" 1},\n :src-column :survived,\n :result-datatype :float64}},\n :scicloj.metamorph.ml/unsupervised? 
nil}\n)\n\n\n\n(def ctx-after-predict\n (my-pipeline (assoc ctx-after-train\n :metamorph/mode :transform\n :metamorph/data test-ds)))\n\n\n(keys ctx-after-predict)\n\n\n(:metamorph/data\n :metamorph/mode\n #uuid \"81782be5-2536-4bdd-8f81-49ec3b6326f7\")\n\n\n\n\n(-> ctx-after-predict :metamorph/data :survived)\n\n\n#tech.v3.dataset.column<float64>[178]\n:survived\n[1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000...]", "crumbs": [ "Tutorials", "8 AutoML using metamorph pipelines" @@ -211,7 +211,7 @@ "href": "noj_book.automl.html#use-metamorph-pipelines-to-do-model-training-with-higher-level-api", "title": "8 AutoML using metamorph pipelines", "section": "8.2 Use metamorph pipelines to do model training with higher level API", - "text": "8.2 Use metamorph pipelines to do model training with higher level API\nAs user of metamorph.ml we do not need to deal with this low-level details of how metamorph works, we have convenience functions which hide this.\nThe following code will do the same as train, but return a context object, which contains the trained model, so it will execute the pipeline, and not only create it.\nIt uses a convenience function mm/fit which generates compliant context maps internally and executes the pipeline as well.\nThe ctx acts a collector of everything “learned” during :fit, mainly the trained model, but it could be as well other information learned from the data during :fit and to be applied at :transform .\n\n(def train-ctx\n (mm/fit titanic\n (ml/model {:model-type :metamorph.ml/dummy-classifier})))\n\n(The dummy-classifier model does not have a lot of state, so there is little to see)\n\n(keys train-ctx)\n\n\n(:metamorph/data\n :metamorph/mode\n #uuid \"97a96cdc-a3ab-400d-bbff-2b50917e2b12\")\n\nTo show the power of pipelines, I start with doing the simplest possible pipeline, and expand then on it.\nWe can already chain train and test with usual 
functions:\n\n(->>\n (ml/train train-ds {:model-type :metamorph.ml/dummy-classifier})\n (ml/predict test-ds)\n :survived)\n\n\n#tech.v3.dataset.column<float64>[178]\n:survived\n[1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000...]\n\nthe same with pipelines\n\n(def pipeline\n (mm/pipeline (ml/model {:model-type :metamorph.ml/dummy-classifier})))\n\n\n(->>\n (mm/fit-pipe train-ds pipeline)\n (mm/transform-pipe test-ds pipeline)\n :metamorph/data :survived)\n\n\n#tech.v3.dataset.column<float64>[178]\n:survived\n[1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000...]", + "text": "8.2 Use metamorph pipelines to do model training with higher level API\nAs user of metamorph.ml we do not need to deal with this low-level details of how metamorph works, we have convenience functions which hide this.\nThe following code will do the same as train, but return a context object, which contains the trained model, so it will execute the pipeline, and not only create it.\nIt uses a convenience function mm/fit which generates compliant context maps internally and executes the pipeline as well.\nThe ctx acts a collector of everything “learned” during :fit, mainly the trained model, but it could be as well other information learned from the data during :fit and to be applied at :transform .\n\n(def train-ctx\n (mm/fit titanic\n (ml/model {:model-type :metamorph.ml/dummy-classifier})))\n\n(The dummy-classifier model does not have a lot of state, so there is little to see)\n\n(keys train-ctx)\n\n\n(:metamorph/data\n :metamorph/mode\n #uuid \"39b8b911-7c11-4ce5-a3bd-0dbf5ead1963\")\n\nTo show the power of pipelines, I start with doing the simplest possible pipeline, and expand then on it.\nWe can already chain train and test with usual functions:\n\n(->>\n (ml/train train-ds {:model-type 
:metamorph.ml/dummy-classifier})\n (ml/predict test-ds)\n :survived)\n\n\n#tech.v3.dataset.column<float64>[178]\n:survived\n[1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000...]\n\nthe same with pipelines\n\n(def pipeline\n (mm/pipeline (ml/model {:model-type :metamorph.ml/dummy-classifier})))\n\n\n(->>\n (mm/fit-pipe train-ds pipeline)\n (mm/transform-pipe test-ds pipeline)\n :metamorph/data :survived)\n\n\n#tech.v3.dataset.column<float64>[178]\n:survived\n[1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000...]", "crumbs": [ "Tutorials", "8 AutoML using metamorph pipelines" @@ -244,7 +244,7 @@ "href": "noj_book.automl.html#finding-the-best-model-automatically", "title": "8 AutoML using metamorph pipelines", "section": "8.5 Finding the best model automatically", - "text": "8.5 Finding the best model automatically\nThe advantage of the pipelines is even more visible, if we want to have configurable pipelines, and do a grid search to find optimal settings.\nthe following will find the best model across:\n\n4 different model classes\n6 different selections of used features\nk-cross validate this with different test / train splits\n\n\n(defn make-pipe-fn [model-spec features]\n (mm/pipeline\n ;; store the used features in ctx, so we can retrieve them at the end\n (fn [ctx]\n (assoc ctx :used-features features))\n (mm/lift tc/select-columns (conj features :survived))\n {:metamorph/id :model} (ml/model model-spec)))\n\nCreate a 5-K cross validation split of the data:\n\n(def titanic-k-fold (tc/split->seq ml-basic/numeric-titanic-data :kfold {:seed 12345}))\n\n\n(-> titanic-k-fold count)\n\n\n5\n\nThe list of the model types we want to try:\n\n(def models [{ :model-type :xgboost/classification\n :round 10}\n {:model-type :sklearn.classification/decision-tree-classifier}\n {:model-type 
:sklearn.classification/logistic-regression}\n {:model-type :sklearn.classification/random-forest-classifier}\n {:model-type :metamorph.ml/dummy-classifier}\n {:model-type :scicloj.ml.tribuo/classification\n :tribuo-components [{:name \"logistic\"\n :type \"org.tribuo.classification.sgd.linear.LinearSGDTrainer\"}]\n :tribuo-trainer-name \"logistic\"}\n {:model-type :scicloj.ml.tribuo/classification\n :tribuo-components [{:name \"random-forest\"\n :type \"org.tribuo.classification.dtree.CARTClassificationTrainer\"\n :properties {:maxDepth \"8\"\n :useRandomSplitPoints \"false\"\n :fractionFeaturesInSplit \"0.5\"}}]\n :tribuo-trainer-name \"random-forest\"}])\n\nThis uses models from Smile and Tribuo, but could be any metamorph.ml compliant model ( library sklearn-clj wraps all python sklearn models, for example)\nThe list of feature combinations to try for each model:\n\n(def feature-combinations\n [[:sex :pclass :embarked]\n [:sex]\n [:pclass :embarked]\n [:embarked]\n [:sex :embarked]\n [:sex :pclass]])\n\ngenerate 24 pipeline functions:\n\n(def pipe-fns\n (for [model models\n feature-combination feature-combinations]\n (make-pipe-fn model feature-combination)))\n\n\n(count pipe-fns)\n\n\n42\n\nExecute all pipelines for all splits in the cross-validations and return best model by classification-accuracy\n\n(def evaluation-results\n (ml/evaluate-pipelines\n pipe-fns\n titanic-k-fold\n loss/classification-accuracy\n :accuracy))\n\nBy default it returns the best mode only\n\n(make-results-ds evaluation-results)\n\n\n_unnamed [1 3]:\n\n\n\n\n\n\n\n\n:used-features\n:mean-accuracy\n:options\n\n\n\n\n[:sex :pclass :embarked]\n0.81107726\n{:model-type :scicloj.ml.tribuo/classification,\n\n\n\n\n:tribuo-components\n\n\n\n\n[{:name random-forest,\n\n\n\n\n:type org.tribuo.classification.dtree.CARTClassificationTrainer,\n\n\n\n\n:properties\n\n\n\n\n{:maxDepth 8,\n\n\n\n\n:useRandomSplitPoints false,\n\n\n\n\n:fractionFeaturesInSplit 0.5}}],\n\n\n\n\n:tribuo-trainer-name 
random-forest}\n\n\n\n\nThe key observation is here, that the metamorph pipelines allow to not only grid-search over the model hyper-parameters, but as well over arbitrary pipeline variations, like which features to include. Both get handled in the same way.\nWe can get all results as well:\n\n(def evaluation-results-all\n (ml/evaluate-pipelines\n pipe-fns\n titanic-k-fold\n loss/classification-accuracy\n :accuracy\n {:map-fn :map\n :return-best-crossvalidation-only false\n :return-best-pipeline-only false}))\n\nIn total it creates and evaluates 4 models * 6 feature configurations * 5 CV = 120 models\n\n(-> evaluation-results-all flatten count)\n\n\n210\n\nWe can find the best as well by hand, it’s the first from the list, when sorted by accuracy.\n\n(-> (make-results-ds evaluation-results-all)\n (tc/unique-by)\n (tc/order-by [:mean-accuracy] :desc)\n (tc/head 20)\n (kind/table))\n\n\n\n\n\n\n\n\n\n\n\nused-features\nmean-accuracy\noptions\n\n\n\n\n\n[:sex :pclass :embarked]\n\n0.8110772551260077\n\n{:model-type :sklearn.classification/random-forest-classifier}\n\n\n\n\n[:sex :pclass :embarked]\n\n0.8110772551260077\n\n{:model-type :sklearn.classification/decision-tree-classifier}\n\n\n\n\n[:sex :pclass :embarked]\n\n0.8110772551260077\n\n{:model-type :xgboost/classification, :round 10}\n\n\n\n\n[:sex :pclass :embarked]\n\n0.8110772551260077\n\n{:model-type :scicloj.ml.tribuo/classification,\n :tribuo-components\n [{:name \"random-forest\",\n :type \"org.tribuo.classification.dtree.CARTClassificationTrainer\",\n :properties\n {:maxDepth \"8\",\n :useRandomSplitPoints \"false\",\n :fractionFeaturesInSplit \"0.5\"}}],\n :tribuo-trainer-name \"random-forest\"}\n\n\n\n\n[:sex :pclass]\n\n0.7863327620135847\n\n{:model-type :scicloj.ml.tribuo/classification,\n :tribuo-components\n [{:name \"logistic\",\n :type \"org.tribuo.classification.sgd.linear.LinearSGDTrainer\"}],\n :tribuo-trainer-name \"logistic\"}\n\n\n\n\n[:sex :embarked]\n\n0.7863327620135847\n\n{:model-type 
:scicloj.ml.tribuo/classification,\n :tribuo-components\n [{:name \"logistic\",\n :type \"org.tribuo.classification.sgd.linear.LinearSGDTrainer\"}],\n :tribuo-trainer-name \"logistic\"}\n\n\n\n\n[:sex]\n\n0.7863327620135847\n\n{:model-type :scicloj.ml.tribuo/classification,\n :tribuo-components\n [{:name \"logistic\",\n :type \"org.tribuo.classification.sgd.linear.LinearSGDTrainer\"}],\n :tribuo-trainer-name \"logistic\"}\n\n\n\n\n[:sex :embarked]\n\n0.7863327620135847\n\n{:model-type :sklearn.classification/random-forest-classifier}\n\n\n\n\n[:sex]\n\n0.7863327620135847\n\n{:model-type :sklearn.classification/random-forest-classifier}\n\n\n\n\n[:sex :pclass]\n\n0.7863327620135847\n\n{:model-type :sklearn.classification/logistic-regression}\n\n\n\n\n[:sex :embarked]\n\n0.7863327620135847\n\n{:model-type :sklearn.classification/logistic-regression}\n\n\n\n\n[:sex]\n\n0.7863327620135847\n\n{:model-type :sklearn.classification/logistic-regression}\n\n\n\n\n[:sex :embarked]\n\n0.7863327620135847\n\n{:model-type :sklearn.classification/decision-tree-classifier}\n\n\n\n\n[:sex]\n\n0.7863327620135847\n\n{:model-type :xgboost/classification, :round 10}\n\n\n\n\n[:sex :embarked]\n\n0.7863327620135847\n\n{:model-type :xgboost/classification, :round 10}\n\n\n\n\n[:sex]\n\n0.7863327620135847\n\n{:model-type :sklearn.classification/decision-tree-classifier}\n\n\n\n\n[:sex]\n\n0.7863327620135847\n\n{:model-type :scicloj.ml.tribuo/classification,\n :tribuo-components\n [{:name \"random-forest\",\n :type \"org.tribuo.classification.dtree.CARTClassificationTrainer\",\n :properties\n {:maxDepth \"8\",\n :useRandomSplitPoints \"false\",\n :fractionFeaturesInSplit \"0.5\"}}],\n :tribuo-trainer-name \"random-forest\"}\n\n\n\n\n[:sex :pclass :embarked]\n\n0.7852091665079668\n\n{:model-type :scicloj.ml.tribuo/classification,\n :tribuo-components\n [{:name \"logistic\",\n :type \"org.tribuo.classification.sgd.linear.LinearSGDTrainer\"}],\n :tribuo-trainer-name \"logistic\"}\n\n\n\n\n[:sex 
:pclass]\n\n0.7762267504602298\n\n{:model-type :sklearn.classification/random-forest-classifier}\n\n\n\n\n[:sex :pclass :embarked]\n\n0.7750777629657843\n\n{:model-type :sklearn.classification/logistic-regression}", + "text": "8.5 Finding the best model automatically\nThe advantage of the pipelines is even more visible, if we want to have configurable pipelines, and do a grid search to find optimal settings.\nthe following will find the best model across:\n\n4 different model classes\n6 different selections of used features\nk-cross validate this with different test / train splits\n\n\n(defn make-pipe-fn [model-spec features]\n (mm/pipeline\n ;; store the used features in ctx, so we can retrieve them at the end\n (fn [ctx]\n (assoc ctx :used-features features))\n (mm/lift tc/select-columns (conj features :survived))\n {:metamorph/id :model} (ml/model model-spec)))\n\nCreate a 5-K cross validation split of the data:\n\n(def titanic-k-fold (tc/split->seq ml-basic/numeric-titanic-data :kfold {:seed 12345}))\n\n\n(-> titanic-k-fold count)\n\n\n5\n\nThe list of the model types we want to try:\n\n(def models [{ :model-type :xgboost/classification\n :round 10}\n {:model-type :sklearn.classification/decision-tree-classifier}\n {:model-type :sklearn.classification/logistic-regression}\n {:model-type :sklearn.classification/random-forest-classifier}\n {:model-type :metamorph.ml/dummy-classifier}\n {:model-type :scicloj.ml.tribuo/classification\n :tribuo-components [{:name \"logistic\"\n :type \"org.tribuo.classification.sgd.linear.LinearSGDTrainer\"}]\n :tribuo-trainer-name \"logistic\"}\n {:model-type :scicloj.ml.tribuo/classification\n :tribuo-components [{:name \"random-forest\"\n :type \"org.tribuo.classification.dtree.CARTClassificationTrainer\"\n :properties {:maxDepth \"8\"\n :useRandomSplitPoints \"false\"\n :fractionFeaturesInSplit \"0.5\"}}]\n :tribuo-trainer-name \"random-forest\"}])\n\nThis uses models from Smile and Tribuo, but could be any metamorph.ml 
compliant model ( library sklearn-clj wraps all python sklearn models, for example)\nThe list of feature combinations to try for each model:\n\n(def feature-combinations\n [[:sex :pclass :embarked]\n [:sex]\n [:pclass :embarked]\n [:embarked]\n [:sex :embarked]\n [:sex :pclass]])\n\ngenerate 24 pipeline functions:\n\n(def pipe-fns\n (for [model models\n feature-combination feature-combinations]\n (make-pipe-fn model feature-combination)))\n\n\n(count pipe-fns)\n\n\n42\n\nExecute all pipelines for all splits in the cross-validations and return best model by classification-accuracy\n\n(def evaluation-results\n (ml/evaluate-pipelines\n pipe-fns\n titanic-k-fold\n loss/classification-accuracy\n :accuracy))\n\nBy default it returns the best mode only\n\n(make-results-ds evaluation-results)\n\n\n_unnamed [1 3]:\n\n\n\n\n\n\n\n\n:used-features\n:mean-accuracy\n:options\n\n\n\n\n[:sex :pclass :embarked]\n0.81107726\n{:model-type :scicloj.ml.tribuo/classification,\n\n\n\n\n:tribuo-components\n\n\n\n\n[{:name random-forest,\n\n\n\n\n:type org.tribuo.classification.dtree.CARTClassificationTrainer,\n\n\n\n\n:properties\n\n\n\n\n{:maxDepth 8,\n\n\n\n\n:useRandomSplitPoints false,\n\n\n\n\n:fractionFeaturesInSplit 0.5}}],\n\n\n\n\n:tribuo-trainer-name random-forest}\n\n\n\n\nThe key observation is here, that the metamorph pipelines allow to not only grid-search over the model hyper-parameters, but as well over arbitrary pipeline variations, like which features to include. 
Both get handled in the same way.\nWe can get all results as well:\n\n(def evaluation-results-all\n (ml/evaluate-pipelines\n pipe-fns\n titanic-k-fold\n loss/classification-accuracy\n :accuracy\n {:map-fn :map\n :return-best-crossvalidation-only false\n :return-best-pipeline-only false}))\n\nIn total it creates and evaluates 4 models * 6 feature configurations * 5 CV = 120 models\n\n(-> evaluation-results-all flatten count)\n\n\n210\n\nWe can find the best as well by hand, it’s the first from the list, when sorted by accuracy.\n\n(-> (make-results-ds evaluation-results-all)\n (tc/unique-by)\n (tc/order-by [:mean-accuracy] :desc)\n (tc/head 20)\n (kind/table))\n\n\n\n\n\n\n\n\n\n\n\nused-features\nmean-accuracy\noptions\n\n\n\n\n\n[:sex :pclass :embarked]\n\n0.8110772551260077\n\n{:model-type :sklearn.classification/random-forest-classifier}\n\n\n\n\n[:sex :pclass :embarked]\n\n0.8110772551260077\n\n{:model-type :sklearn.classification/decision-tree-classifier}\n\n\n\n\n[:sex :pclass :embarked]\n\n0.8110772551260077\n\n{:model-type :xgboost/classification, :round 10}\n\n\n\n\n[:sex :pclass :embarked]\n\n0.8110772551260077\n\n{:model-type :scicloj.ml.tribuo/classification,\n :tribuo-components\n [{:name \"random-forest\",\n :type \"org.tribuo.classification.dtree.CARTClassificationTrainer\",\n :properties\n {:maxDepth \"8\",\n :useRandomSplitPoints \"false\",\n :fractionFeaturesInSplit \"0.5\"}}],\n :tribuo-trainer-name \"random-forest\"}\n\n\n\n\n[:sex :pclass]\n\n0.7863327620135847\n\n{:model-type :scicloj.ml.tribuo/classification,\n :tribuo-components\n [{:name \"logistic\",\n :type \"org.tribuo.classification.sgd.linear.LinearSGDTrainer\"}],\n :tribuo-trainer-name \"logistic\"}\n\n\n\n\n[:sex :embarked]\n\n0.7863327620135847\n\n{:model-type :scicloj.ml.tribuo/classification,\n :tribuo-components\n [{:name \"logistic\",\n :type \"org.tribuo.classification.sgd.linear.LinearSGDTrainer\"}],\n :tribuo-trainer-name 
\"logistic\"}\n\n\n\n\n[:sex]\n\n0.7863327620135847\n\n{:model-type :scicloj.ml.tribuo/classification,\n :tribuo-components\n [{:name \"logistic\",\n :type \"org.tribuo.classification.sgd.linear.LinearSGDTrainer\"}],\n :tribuo-trainer-name \"logistic\"}\n\n\n\n\n[:sex :embarked]\n\n0.7863327620135847\n\n{:model-type :sklearn.classification/random-forest-classifier}\n\n\n\n\n[:sex]\n\n0.7863327620135847\n\n{:model-type :sklearn.classification/random-forest-classifier}\n\n\n\n\n[:sex :pclass]\n\n0.7863327620135847\n\n{:model-type :sklearn.classification/logistic-regression}\n\n\n\n\n[:sex :embarked]\n\n0.7863327620135847\n\n{:model-type :sklearn.classification/logistic-regression}\n\n\n\n\n[:sex]\n\n0.7863327620135847\n\n{:model-type :sklearn.classification/logistic-regression}\n\n\n\n\n[:sex :embarked]\n\n0.7863327620135847\n\n{:model-type :sklearn.classification/decision-tree-classifier}\n\n\n\n\n[:sex]\n\n0.7863327620135847\n\n{:model-type :xgboost/classification, :round 10}\n\n\n\n\n[:sex :embarked]\n\n0.7863327620135847\n\n{:model-type :xgboost/classification, :round 10}\n\n\n\n\n[:sex]\n\n0.7863327620135847\n\n{:model-type :sklearn.classification/decision-tree-classifier}\n\n\n\n\n[:sex]\n\n0.7863327620135847\n\n{:model-type :scicloj.ml.tribuo/classification,\n :tribuo-components\n [{:name \"random-forest\",\n :type \"org.tribuo.classification.dtree.CARTClassificationTrainer\",\n :properties\n {:maxDepth \"8\",\n :useRandomSplitPoints \"false\",\n :fractionFeaturesInSplit \"0.5\"}}],\n :tribuo-trainer-name \"random-forest\"}\n\n\n\n\n[:sex :pclass :embarked]\n\n0.7852091665079668\n\n{:model-type :scicloj.ml.tribuo/classification,\n :tribuo-components\n [{:name \"logistic\",\n :type \"org.tribuo.classification.sgd.linear.LinearSGDTrainer\"}],\n :tribuo-trainer-name \"logistic\"}\n\n\n\n\n[:sex :pclass :embarked]\n\n0.7750777629657843\n\n{:model-type :sklearn.classification/logistic-regression}\n\n\n\n\n[:sex :pclass]\n\n0.773973211451787\n\n{:model-type 
:sklearn.classification/random-forest-classifier}", "crumbs": [ "Tutorials", "8 AutoML using metamorph pipelines"