afternoon 0x2165
rajp152k committed Sep 28, 2023
1 parent 143b487 commit 19c6de8
Showing 10 changed files with 164 additions and 9 deletions.
2 changes: 0 additions & 2 deletions Content/20230712131112-blogging.org
@@ -126,7 +126,6 @@ An index of all that I write about, published, work in progress and prospective.
- there's the objective past (questionable what's objective) (an event that's one and done) and then there's your perspective about it (that is a stream that you carry on for life (I'm not accounting for forgetting right now: if you journal, blog, or create any sort of sensible content, it is easily accessible (example: I could remember what stage of life I was in based on the book I was reading then; once I saw the book, all memories fell into context like dominoes))).
- you can only alter your perspective about the past in the present moment of the stream but never change your previous thoughts in the stream.
- There is no mental time machine allowing you to manipulate your past memories: you simply partially overwrite stuff but never alter its state in the past...

- thinking of how I could structure these posts : the future also plays a part in making decisions - there is some level of certainty associated with the future if you chalk out your actions and have realistic expectations of them.
- being in a definite caloric deficit while following a recordable resistance training protocol will yield results within a certain definite range around your expectations when you set out with the goal.
** Prospective
@@ -137,5 +136,4 @@ An index of all that I write about, published, work in progress and prospective.
- meditative walks
- mental games -> structuring an article mentally given a writing prompt is a pretty complex and satisfying mental game
- if you're a physics aficionado like I am, consider observing your surroundings and coming up with mental mathematical models to represent reality.

*** Why text is awesome (for logicians) and semantically discrete images are what you should limit yourself to
15 changes: 15 additions & 0 deletions Content/20230721111610-cosine_similarity.org
@@ -7,3 +7,18 @@
- is a measure of closeness of two vectors.
- a common use case is lifting a collection of real life [[id:b8178e96-18bd-43da-915b-11909971a316][datum]] objects into a dense vector space and being able to comment on their semantic closeness/farness using the notions of vector similarity.

#+begin_src lisp
(defun dot-product (vec-a vec-b)
  "Sum of element-wise products of two equal-length lists of numbers."
  (assert (= (length vec-a) (length vec-b)))
  (reduce #'+ (mapcar #'* vec-a vec-b) :initial-value 0))

(defun l2-norm (vec)
  "Euclidean (L2) norm of a list of numbers."
  (sqrt (reduce #'+ (mapcar #'(lambda (x) (* x x)) vec) :initial-value 0)))

(defun cosine-similarity (vec-a vec-b)
  "Cosine of the angle between VEC-A and VEC-B."
  (/ (dot-product vec-a vec-b)
     (* (l2-norm vec-a)
        (l2-norm vec-b))))
#+end_src
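
A quick sanity check at the REPL, using the list-based representation above (illustrative values):

#+begin_src lisp
(cosine-similarity '(1 0) '(0 1))     ;; => 0.0 for orthogonal vectors
(cosine-similarity '(1 2 3) '(2 4 6)) ;; => ~1.0 for parallel vectors
#+end_src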
11 changes: 10 additions & 1 deletion Content/20230911114632-the100pagemlbook.org
@@ -5,7 +5,7 @@
#+filetags: :book:ml:ai:

The parent reference sentinel for this book is rooted as [[id:523db378-6e64-41a3-8890-ad782c67b5e9][The Hundred Page Machine Learning Book]] under the major machine learning node.
I populate this node with the intention to index into other major nodes of the field and fill in some holes that are generic and require a book for end-to-end coverage, because tending to them in a non-project-oriented scenario isn't worth the time.

* Introduction
** What is ML?
@@ -117,3 +117,12 @@ I populate this node with the intention to index into other major nodes of the field
** [[id:91729987-32db-482a-bc1b-91469579413b][Logistic Regression]]
** [[id:a2c424a5-d412-496c-abcb-1fd216548a02][Decision Trees]]
** [[id:b8194cd8-57bc-4f4a-9862-baa8d5599033][k-Nearest Neighbors]]
* Anatomy of a Learning Algorithm
Any learning algorithm is centered around certain basics:
- A [[id:d99d5a5f-93fc-4f3b-b72e-ea59037956f9][loss function]]
- an [[id:7b9be887-8c39-4a37-8217-f0e21a6cb64e][optimization]] procedure, split into a:
  - criterion, inspired by the loss function
  - routine, that finds a solution to the optimization criterion
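
As an illustrative sketch (not from the book), here's a mean-squared-error criterion for a hypothetical one-parameter model y = w * x; an optimization routine (gradient descent, for instance) would then search for the w minimizing it:

#+begin_src lisp
;; mean squared error of the model y = w * x over the paired lists XS and YS
(defun mse-loss (w xs ys)
  (/ (reduce #'+ (mapcar #'(lambda (x y) (expt (- y (* w x)) 2)) xs ys))
     (length xs)))

;; (mse-loss 2 '(1 2 3) '(2 4 6)) ;; => 0, a perfect fit at w = 2
#+end_src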
* Basic Practice
** [[id:5ca10a46-d9b8-4a6b-8aab-34ec17d55049][Feature Engineering]]
** [[id:c3e62ed9-31d6-4ceb-ad82-c4d0e9b48c77][Algorithm Selection]]
13 changes: 13 additions & 0 deletions Content/20230911123345-clustering.org
@@ -4,3 +4,16 @@
#+title: Clustering
#+filetags: :tbp:ml:ai:


To be populated ...

* Loss

The nature of a clustering can be evaluated via multiple perspectives and their combinations. This leads to several loss functions that are loosely based on two major criteria:
- larger inter-cluster distance is better
- smaller intra-cluster diameter is better

Some advanced metrics may even consider the shape of the clusters, but I won't be exploring that here.

check out the [[https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation][scikit-learn clustering docs]] to learn more about evaluating clustering performance.
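
A rough sketch of the two criteria above, assuming clusters are non-empty lists of points and points are lists of numbers (these helper names are mine, not scikit-learn's):

#+begin_src lisp
(defun euclidean-distance (a b)
  (sqrt (reduce #'+ (mapcar #'(lambda (x y) (expt (- x y) 2)) a b))))

;; diameter: the largest pairwise distance within a cluster
(defun intra-cluster-diameter (cluster)
  (let ((diameter 0))
    (loop for (p . rest) on cluster
          do (loop for q in rest
                   do (setf diameter (max diameter (euclidean-distance p q)))))
    diameter))

;; separation: the smallest pairwise distance across two clusters
(defun inter-cluster-distance (cluster-a cluster-b)
  (loop for p in cluster-a
        minimize (loop for q in cluster-b
                       minimize (euclidean-distance p q))))
#+end_src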

3 changes: 3 additions & 0 deletions Content/20230911161621-optimization.org
@@ -4,3 +4,6 @@
#+title: Optimization
#+filetags: :tbp:math:

A branch of math dealing with optimizing a certain criterion (see [[id:d99d5a5f-93fc-4f3b-b72e-ea59037956f9][loss]]) by choosing corresponding optimal parameters.

The most generic and popular optimization algorithm might be (Stochastic) [[id:a4761c32-806d-4a7f-ba18-27136a3de1fc][Gradient Descent]].
21 changes: 16 additions & 5 deletions Content/20230912162836-k_nearest_neighbors.org
@@ -1,6 +1,17 @@
:PROPERTIES:
:ID: b8194cd8-57bc-4f4a-9862-baa8d5599033
:END:
#+title: k-Nearest Neighbors
#+filetags: :ml:ai:

An instance of a [[id:f8ed9d28-324b-4657-84e4-29cf735a782f][non-parametric learning algorithm]]. It doesn't distill the training data into parameters; instead, the data is retained as part of the algorithm.

* Basics

When fed a new feature vector, it's assigned to the nearest existing group subject to a closeness criterion, summarized as:
- fetch the k nearest existing feature vectors to the new vector
- for classification, assign the new vector to the class holding the majority among the k fetches
- for regression, average the numerical labels of the k fetches to tag the new vector

The closeness criterion most commonly used is the L2-norm (euclidean distance).
A popular alternative is [[id:2ec4a33e-479d-466b-b2b1-0a5925c0222c][cosine similarity]] when you'd like to capture the notion of an angle between two vectors.
Some other criteria that can be considered: Chebyshev distance, Mahalanobis distance and Hamming distance.

The hyperparameters of the algorithm can then be defined as the choice of the nearness criterion and the value of k.
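
A rough sketch of the classification case, assuming list-based feature vectors, the L2 norm as the closeness criterion, and training examples stored as (vector . label) pairs:

#+begin_src lisp
(defun euclidean-distance (a b)
  (sqrt (reduce #'+ (mapcar #'(lambda (x y) (expt (- x y) 2)) a b))))

;; the k training examples closest to QUERY
(defun k-nearest (k examples query)
  (subseq (sort (copy-list examples) #'<
                :key #'(lambda (example) (euclidean-distance (car example) query)))
          0 k))

;; majority label among the k nearest examples
(defun knn-classify (k examples query)
  (let ((votes (mapcar #'cdr (k-nearest k examples query))))
    (car (first (sort (mapcar #'(lambda (label) (cons label (count label votes :test #'equal)))
                              (remove-duplicates votes :test #'equal))
                      #'> :key #'cdr)))))

;; (knn-classify 3 '(((0 0) . :a) ((1 0) . :a) ((5 5) . :b) ((6 5) . :b)) '(0 1)) ;; => :A
#+end_src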

34 changes: 33 additions & 1 deletion Content/20230914153411-gradient_descent.org
@@ -2,5 +2,37 @@
:ID: a4761c32-806d-4a7f-ba18-27136a3de1fc
:END:
#+title: Gradient Descent
#+filetags: :ml:ai:

- an iterative [[id:7b9be887-8c39-4a37-8217-f0e21a6cb64e][optimization]] algorithm used to minimize a function (see [[id:d99d5a5f-93fc-4f3b-b72e-ea59037956f9][loss]]).
- speaking briefly:
1. we start at a random point on the parameter-space vs loss contour
2. then we step down the hyper-hill, trying to avoid getting stuck in local hyper-troughs, repeating until we reach a satisfactory hyper-valley and can report convergence
   - note that we actually can't see the hyper-hill and need to calculate the loss every time we step somewhere, akin to hiking in the dark.
3. the parameter-space step size is controllable via a hyper-parameter -> the learning rate

- when working with a convex optimization criterion, we're sure to find the global minimum, whereas with more complex contours we may have to settle for a good-enough local one.
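
A minimal sketch of the idea in one dimension, using a numerical gradient so no calculus machinery is needed (the function, starting point and hyper-parameter values are illustrative):

#+begin_src lisp
;; central-difference estimate of f'(x)
(defun numerical-gradient (f x &optional (eps 1e-4))
  (/ (- (funcall f (+ x eps)) (funcall f (- x eps)))
     (* 2 eps)))

;; repeatedly step against the gradient, scaled by the learning rate
(defun gradient-descent (f x0 &key (learning-rate 0.1) (steps 100))
  (let ((x x0))
    (dotimes (i steps x)
      (decf x (* learning-rate (numerical-gradient f x))))))

;; (gradient-descent #'(lambda (x) (expt (- x 3) 2)) 0.0) ;; => approaches 3.0
#+end_src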


** Improvements

*** Stochastic Gradient Descent (SGD)
:PROPERTIES:
:ID: e419c0a9-9753-48f1-82c4-f2004cc2e29c
:END:
Computing the actual loss (and its gradient) over all of the training data can be very slow; doing so over stochastically selected smaller batches instead leads to the idea of stochastic gradient descent.
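
A sketch of the batching idea, with a hypothetical GRADIENT-FN that maps a single training example to a gradient for the current parameter (both kept as plain numbers here for simplicity):

#+begin_src lisp
;; sample BATCH-SIZE examples uniformly at random (with replacement)
(defun random-batch (data batch-size)
  (loop repeat batch-size
        collect (nth (random (length data)) data)))

;; one SGD step: average the per-example gradients over a mini-batch
(defun sgd-step (parameter gradient-fn data &key (batch-size 32) (learning-rate 0.01))
  (let* ((batch (random-batch data batch-size))
         (avg-gradient (/ (reduce #'+ (mapcar gradient-fn batch))
                          (length batch))))
    (- parameter (* learning-rate avg-gradient))))
#+end_src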
*** Adagrad
This scales the learning rate individually for each parameter (ADAptive GRADient descent) according to the history of gradients.
- as a consequence, the effective learning rate ends up smaller for parameters with a history of large gradients and larger for those with a history of small ones.
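
A sketch of a single per-parameter Adagrad update (names and defaults are illustrative); the accumulated squared gradient must be carried across steps, hence the second return value:

#+begin_src lisp
(defun adagrad-step (parameter gradient accumulated-square
                     &key (learning-rate 0.01) (epsilon 1e-8))
  ;; the running sum of squared gradients shrinks the effective learning rate
  (let ((new-accumulated (+ accumulated-square (* gradient gradient))))
    (values (- parameter (* (/ learning-rate (+ (sqrt new-accumulated) epsilon))
                            gradient))
            new-accumulated)))
#+end_src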
*** Momentum
Accelerates SGD by retaining a sense of past gradients to impart some inertia to the optimization process
- helps deal with oscillations and move more meaningfully
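
A sketch of a single momentum update; the velocity is the decaying memory of past gradients and is carried across steps:

#+begin_src lisp
(defun momentum-step (parameter gradient velocity
                      &key (learning-rate 0.01) (beta 0.9))
  ;; the velocity accumulates past gradients, damping step-to-step oscillations
  (let ((new-velocity (- (* beta velocity) (* learning-rate gradient))))
    (values (+ parameter new-velocity) new-velocity)))
#+end_src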

*** RMSProp (Root Mean Square Propagation):
RMSProp (like Adagrad) adapts the learning rate for each parameter based on the past gradients, helping to stabilize and speed up the training process.
- note that RMSProp is an improvement over Adagrad and deals with its diminishing learning rate issue.
- read more at [[https://en.wikipedia.org/wiki/Stochastic_gradient_descent][this Wikipedia page]]
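
A sketch of a single RMSProp update; an exponential moving average of squared gradients replaces Adagrad's ever-growing sum, so the effective learning rate no longer decays monotonically:

#+begin_src lisp
(defun rmsprop-step (parameter gradient mean-square
                     &key (learning-rate 0.001) (decay 0.9) (epsilon 1e-8))
  ;; exponential moving average of squared gradients
  (let ((new-mean-square (+ (* decay mean-square)
                            (* (- 1 decay) gradient gradient))))
    (values (- parameter (* (/ learning-rate (+ (sqrt new-mean-square) epsilon))
                            gradient))
            new-mean-square)))
#+end_src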

*** Adam (Adaptive Moment Estimation):
Adam combines the benefits of momentum and RMSProp, using both past gradients and their magnitudes to adjust learning rates, making it a versatile and efficient optimization algorithm.
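
A sketch of a single Adam update, combining the first moment (momentum) with a second moment (RMSProp-style scaling) and bias-correcting both; STEP is assumed to start at 1:

#+begin_src lisp
(defun adam-step (parameter gradient m v step
                  &key (learning-rate 0.001) (beta1 0.9) (beta2 0.999) (epsilon 1e-8))
  (let* ((new-m (+ (* beta1 m) (* (- 1 beta1) gradient)))          ;; first moment
         (new-v (+ (* beta2 v) (* (- 1 beta2) gradient gradient))) ;; second moment
         (m-hat (/ new-m (- 1 (expt beta1 step))))                 ;; bias correction
         (v-hat (/ new-v (- 1 (expt beta2 step)))))
    (values (- parameter (* learning-rate (/ m-hat (+ (sqrt v-hat) epsilon))))
            new-m new-v)))
#+end_src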

60 changes: 60 additions & 0 deletions Content/20230928154934-feature_engineering.org
@@ -0,0 +1,60 @@
:PROPERTIES:
:ID: 5ca10a46-d9b8-4a6b-8aab-34ec17d55049
:END:
#+title: Feature Engineering
#+filetags: :ml:ai:

- preparing the dataset to be used by the learning algorithm
- the goal is to convert the data into features with high predictive power and make them usable in the first place

Some common feature engineering processes are:
** One Hot Encoding
- converting categoricals into separate booleans
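
A hypothetical sketch, assuming the full set of categories is known up front:

#+begin_src lisp
;; expand a categorical value into one boolean-ish flag per known category
(defun one-hot (value categories)
  (mapcar #'(lambda (category) (if (equal value category) 1 0)) categories))

;; (one-hot "green" '("red" "green" "blue")) ;; => (0 1 0)
#+end_src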
** Binning (Bucketing)
- converting a continuous feature into multiple exclusive boolean buckets (based on value ranges)
- 0 to 10, 10 to 20, and so on... , for instance.
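
A minimal sketch with fixed-width buckets (the width is an illustrative choice):

#+begin_src lisp
;; map a continuous value to the index of its bucket
(defun bin-index (value &key (bucket-width 10))
  (floor value bucket-width))

;; (bin-index 27) ;; => 2, i.e. the 20-to-30 bucket
#+end_src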
** Normalization
- converting varying numerical ranges into a standard range (-1 to 1 or 0 to 1).
- aids learning algorithms computationally (avoids precision and overflow discrepancies)

#+begin_src lisp
(defun normalize (numerical-data-vector)
  "Rescale a list of numbers into the 0 to 1 range (min-max normalization)."
  (let* ((min (reduce #'min numerical-data-vector))
         (max (reduce #'max numerical-data-vector))
         (span (- max min)))
    (mapcar #'(lambda (feature)
                (/ (- feature min)
                   span))
            numerical-data-vector)))
#+end_src

** Standardization
- aka z-score normalization
- rescaling features so that they have the properties of a standard [[id:2f44701c-e3e4-4b02-a899-e91e747db41a][normal distribution]] (zero mean, unit variance)

#+begin_src lisp
(defun mean (vec) (/ (reduce #'+ vec) (length vec)))

(defun variance (vec)
  (let ((mu (mean vec)))
    (/ (reduce #'+ (mapcar #'(lambda (x) (expt (- x mu) 2)) vec))
       (length vec))))

(defun standardize (numerical-data-vector)
  "Rescale a list of numbers to zero mean and unit variance (z-scores)."
  (let* ((mu (mean numerical-data-vector))
         (sigma (sqrt (variance numerical-data-vector))))
    (mapcar #'(lambda (feature)
                (/ (- feature mu)
                   sigma))
            numerical-data-vector)))
#+end_src

** Dealing with Missing Features
Possible approaches:
- removing examples with missing features
- using a learning algorithm that can deal with missing data
- data imputation techniques
** Data Imputation Techniques
- replace by mean, median or other similar statistic
- something outside the normal range to indicate imputation (-1 in a normal 2-5 range for instance)
- a value chosen according to the range rather than a statistic (0 for a -1 to 1 range, for instance)

A more advanced approach is modelling the imputation as a regression problem before proceeding with the actual task. In this case all the other features are used to predict the missing feature.

In cases of a large dataset, one can introduce an extra indicator feature to signify missing data and then place a value of choice.

- test more than one technique and proceed with what suits best
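
A sketch of the mean-replacement approach, assuming missing entries are marked with nil:

#+begin_src lisp
(defun impute-with-mean (feature-column)
  (let* ((present (remove nil feature-column))
         (mu (/ (reduce #'+ present) (length present))))
    (mapcar #'(lambda (x) (or x mu)) feature-column)))

;; (impute-with-mean '(2 nil 4)) ;; => (2 3 4)
#+end_src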

6 changes: 6 additions & 0 deletions Content/20230928155802-normal_distribution.org
@@ -0,0 +1,6 @@
:PROPERTIES:
:ID: 2f44701c-e3e4-4b02-a899-e91e747db41a
:END:
#+title: Normal Distribution
#+filetags: :tbp:math:

8 changes: 8 additions & 0 deletions Content/20230928161331-algorithm_selection.org
@@ -0,0 +1,8 @@
:PROPERTIES:
:ID: c3e62ed9-31d6-4ceb-ad82-c4d0e9b48c77
:END:
#+title: Algorithm Selection
#+filetags: :ml:ai:

* Factors to consider
