:PROPERTIES:
:ID: b8194cd8-57bc-4f4a-9862-baa8d5599033
:END:
#+title: k-Nearest Neighbors
#+filetags: :ml:ai:

An instance of a [[id:f8ed9d28-324b-4657-84e4-29cf735a782f][non-parametric learning algorithm]]: it doesn't distill the training data into parameters; instead, the data itself is retained as part of the algorithm.

* Basics

When fed a new feature vector, the algorithm assigns it to the nearest existing group subject to a closeness criterion, summarized as:
- fetch the k nearest existing feature vectors to the new vector
- for classification, place the new vector in the class holding the majority among the k fetched neighbors
- for regression, tag the new vector with the average of the k fetched neighbors' numerical labels

The closeness criterion most commonly used is the L2 norm (Euclidean distance).
A popular alternative is [[id:2ec4a33e-479d-466b-b2b1-0a5925c0222c][cosine similarity]], when you'd like to capture the notion of an angle between two vectors.
Some other criteria that can be considered: Chebyshev distance, Mahalanobis distance and Hamming distance.

The hyperparameters of the algorithm can then be defined as the choice of the nearness criterion and the value of k.
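
The steps above can be sketched in code. This is a minimal sketch, not a robust implementation; =euclidean-distance=, =k-nearest= and =knn-classify= are hypothetical helper names, not from any library.

#+begin_src lisp
;; Minimal sketch of k-NN classification with the Euclidean criterion.
(defun euclidean-distance (a b)
  (sqrt (reduce #'+ (mapcar #'(lambda (x y) (expt (- x y) 2)) a b))))

(defun k-nearest (k new-vector examples)
  ;; examples is a list of (feature-vector . label) pairs
  (subseq (sort (copy-list examples) #'<
                :key #'(lambda (example)
                         (euclidean-distance (car example) new-vector)))
          0 k))

(defun knn-classify (k new-vector examples)
  ;; majority vote over the labels of the k nearest neighbors
  (let ((labels (mapcar #'cdr (k-nearest k new-vector examples))))
    (reduce #'(lambda (best label)
                (if (> (count label labels) (count best labels))
                    label
                    best))
            labels)))
#+end_src

For regression, the majority vote in =knn-classify= would be replaced with an average of the neighbors' numerical labels.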
:PROPERTIES:
:ID: 5ca10a46-d9b8-4a6b-8aab-34ec17d55049
:END:
#+title: Feature Engineering
#+filetags: :ml:ai:

- preparing the dataset to be used by the learning algorithm
- the goal is to convert the data into features with high predictive power, and to make them usable in the first place

Some common feature engineering processes are:
** One Hot Encoding
- converting a categorical feature into separate boolean indicator features
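As a sketch (=one-hot= and the category list are illustrative, not from a library):

#+begin_src lisp
;; Sketch: map a categorical value to a vector of boolean indicators,
;; one per known category.
(defun one-hot (value categories)
  (mapcar #'(lambda (category) (if (equal category value) 1 0))
          categories))

;; (one-hot "red" '("red" "green" "blue")) => (1 0 0)
#+end_src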
** Binning (Bucketing)
- converting a continuous feature into multiple exclusive boolean buckets (based on value ranges)
- for instance: 0 to 10, 10 to 20, and so on
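A minimal sketch of the bucket assignment, assuming a list of ascending upper bounds (=bin-index= is a hypothetical name):

#+begin_src lisp
;; Sketch: place a continuous value into one of several exclusive
;; buckets given a list of ascending upper bounds.
(defun bin-index (value upper-bounds)
  ;; index of the first bucket whose upper bound exceeds value;
  ;; values past the last bound fall into a final overflow bucket
  (or (position-if #'(lambda (upper) (< value upper)) upper-bounds)
      (length upper-bounds)))

;; (bin-index 15 '(10 20 30)) => 1
#+end_src

The resulting index can then be one-hot encoded into the exclusive boolean buckets described above.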
** Normalization
- converting varying numerical ranges into a standard range (-1 to 1, or 0 to 1)
- aids learning algorithms computationally (avoids precision and overflow issues)

#+begin_src lisp
;; min-max scaling into the 0 to 1 range
(defun normalize (numerical-data-vector)
  (let* ((min (reduce #'min numerical-data-vector))
         (max (reduce #'max numerical-data-vector))
         (span (- max min)))
    (mapcar #'(lambda (feature)
                (/ (- feature min)
                   span))
            numerical-data-vector)))
#+end_src

** Standardization
- aka z-score normalization
- rescaling features so that they have the properties of a standard [[id:2f44701c-e3e4-4b02-a899-e91e747db41a][normal distribution]] (zero mean, unit variance)

#+begin_src lisp
;; rescale to zero mean and unit variance
(defun standardize (numerical-data-vector)
  (let* ((n (length numerical-data-vector))
         (mu (/ (reduce #'+ numerical-data-vector) n))
         (sigma (sqrt (/ (reduce #'+ (mapcar #'(lambda (x)
                                                 (expt (- x mu) 2))
                                             numerical-data-vector))
                         n))))
    (mapcar #'(lambda (feature)
                (/ (- feature mu)
                   sigma))
            numerical-data-vector)))
#+end_src

** Dealing with Missing Features
Possible approaches:
- removing examples with missing features
- using a learning algorithm that can deal with missing data
- data imputation techniques
** Data Imputation Techniques
- replace the missing value with the mean, median or a similar statistic
- use a value outside the normal range to signal imputation (-1 in a normal 2 to 5 range, for instance)
- use a value chosen according to the range rather than a statistic (0 for a -1 to 1 range, for instance)

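The first technique can be sketched as follows, assuming missing entries are represented as =nil= (=impute-mean= is an illustrative name):

#+begin_src lisp
;; Sketch: mean imputation over a single feature column,
;; with missing entries represented as nil.
(defun impute-mean (data-vector)
  (let* ((present (remove nil data-vector))
         (mu (/ (reduce #'+ present) (length present))))
    (mapcar #'(lambda (x) (or x mu)) data-vector)))

;; (impute-mean '(2 nil 4)) => (2 3 4)
#+end_src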
A more advanced approach is modelling the imputation as a regression problem before proceeding with the actual task: all the other features are used to predict the missing feature.

In the case of a large dataset, one can introduce an extra indicator feature to signify missing data and then fill in a value of choice.

- test more than one technique and proceed with whichever suits best

:PROPERTIES:
:ID: 2f44701c-e3e4-4b02-a899-e91e747db41a
:END:
#+title: Normal Distribution
#+filetags: :tbp:math:

:PROPERTIES:
:ID: c3e62ed9-31d6-4ceb-ad82-c4d0e9b48c77
:END:
#+title: Algorithm Selection
#+filetags: :ml:ai:

* Factors to consider