Matrix Calculus for Machine Learning and Beyond #11919

guevara · 2024-12-09T07:48:00Z

Matrix Calculus for Machine Learning and Beyond

https://ift.tt/CLslUX7

Selected lecture notes are available.

Lecture 1

Outline

Part 1: Overview, applications, and motivation.

Part 2: Rethinking derivatives as linear operators: f(x + dx) - f(x) = df = f′(x)[dx]. Here f′ is the linear operator that gives the change df in the output from a “tiny” change dx in the inputs, to first order in dx (i.e. dropping higher-order terms). When we have a scalar function f(x) ∈ ℝ of vector inputs x ∈ ℝⁿ, then this gives us a “row vector” f′(x) since f′(x)dx is a scalar, which we interpret as the transpose of the gradient ∇f (which we call a “column” vector), i.e. df = (∇f) ⋅ dx = (∇f)ᵀdx. When we have a vector function f(x) ∈ ℝᵐ of vector inputs x ∈ ℝⁿ, then f′(x) is a linear operator that takes n inputs to m outputs, which we can think of as an m × n matrix called the Jacobian matrix (typically covered only superficially in 18.02 Multivariable Calculus).
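
The first-order relation df ≈ (∇f)ᵀdx can be checked numerically. The sketch below assumes NumPy and uses an illustrative function f(x) = xᵀAx that is not taken from the lecture; the names f, grad_f, and rel_err are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))

def f(x):
    # Scalar function of a vector input: f(x) = xᵀ A x
    return x @ A @ x

def grad_f(x):
    # df = dxᵀ A x + xᵀ A dx = ((A + Aᵀ) x)ᵀ dx, so ∇f = (A + Aᵀ) x
    return (A + A.T) @ x

x = rng.standard_normal(n)
dx = 1e-6 * rng.standard_normal(n)   # a "tiny" change in the inputs

df_exact = f(x + dx) - f(x)          # actual change in the output
df_linear = grad_f(x) @ dx           # f′(x)[dx] = (∇f)ᵀ dx

# Error of the linearization, measured relative to the scale ‖∇f‖ ‖dx‖
g = grad_f(x)
rel_err = abs(df_exact - df_linear) / (np.linalg.norm(g) * np.linalg.norm(dx))
print(f"df = {df_exact:.6e}, f'(x)[dx] = {df_linear:.6e}, relative error = {rel_err:.1e}")
```

The leftover error is the dropped higher-order term dxᵀA dx, so it should shrink in proportion to the size of dx.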

Lecture Notes

Lecture 2

Outline

Part 1: Continued discussing derivatives as linear operators, starting with Jacobian matrices. Reviewed the sum rule d(f + g) = df + dg, the product rule d(fg) = (df)g + f(dg), and the chain rule for f(x) = g(h(x)) (f′(x) = g′(h(x))h′(x), where this is a composition of two linear operations, performing h′ then g′; note that g′h′ ≠ h′g′!). For functions from vectors to vectors, the chain rule is simply the product of Jacobians. Moreover, as soon as you compose 3 or more functions, it can make a huge difference whether you multiply the Jacobians from left to right (“reverse-mode”, or “backpropagation”, or “adjoint differentiation”) or right to left (“forward-mode”). Showed, for example, that if you have many inputs but a single output (as is common in machine learning and other types of optimization problems), it is vastly more efficient to multiply left-to-right than right-to-left, and such “backpropagation algorithms” are a key factor in the practicality of large-scale optimization. Finally, began talking about functions in more general vector spaces, such as functions with matrix inputs and/or outputs. For example, considered f(A) = A³, giving d(A³) = f′(A)[dA] = A²(dA) + A(dA)A + (dA)A² (≠ 3A²dA!), and f(A) = A⁻¹, giving d(A⁻¹) = -A⁻¹(dA)A⁻¹.
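
The effect of multiplication order can be illustrated with three random stand-in Jacobians for a chain with n inputs and a single output. This is a rough sketch assuming NumPy. The matrices J1, J2, J3 and the helper matmul_cost are hypothetical, and the cost model simply counts scalar multiply-adds.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
J1 = rng.standard_normal((n, n))   # Jacobian of the innermost function (n inputs, n outputs)
J2 = rng.standard_normal((n, n))   # Jacobian of the middle function
J3 = rng.standard_normal((1, n))   # Jacobian of the outermost function (single output)

# Left to right ("reverse mode"): only row-vector times matrix products
reverse = (J3 @ J2) @ J1
# Right to left ("forward mode"): starts with a full matrix-matrix product
forward = J3 @ (J2 @ J1)

def matmul_cost(m, k, p):
    # Scalar multiply-adds needed for an (m × k) times (k × p) product
    return m * k * p

reverse_cost = matmul_cost(1, n, n) + matmul_cost(1, n, n)   # about 2n²
forward_cost = matmul_cost(n, n, n) + matmul_cost(1, n, n)   # about n³

print("same result:", np.allclose(reverse, forward))
print("cost ratio forward/reverse:", forward_cost / reverse_cost)
```

Both orders give the same 1 × n row vector (the transposed gradient), but under this cost model the right-to-left order is more expensive by a factor of roughly n/2.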

Part 2: Began going into more detail on matrix-valued functions, and their relationship to the “Jacobian matrix” picture. Converting f′(A) to a conventional “Jacobian matrix” in such cases requires converting matrices A into column vectors vec(A), a process called “vectorization” of the matrix (by a common convention: simply stacking the columns of the matrix). Linear operators like f′(A)[dA] = A dA + dA A (the derivative of f(A) = A²) can then be expressed as “ordinary” matrices via Kronecker products.
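
As a concrete sketch of the vectorization step, the operator dA ↦ A dA + dA A can be written as an n² × n² matrix using the standard identity vec(BXC) = (Cᵀ ⊗ B)vec(X) for column-stacking vec. The code assumes NumPy; the helper name vec and the variable jacobian are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
A = rng.standard_normal((n, n))
dA = rng.standard_normal((n, n))
I = np.eye(n)

def vec(M):
    # Stack the columns of M into one long vector (column-major order)
    return M.reshape(-1, order="F")

# With vec(B X C) = (Cᵀ ⊗ B) vec(X):
#   vec(A dA) = (I ⊗ A) vec(dA)
#   vec(dA A) = (Aᵀ ⊗ I) vec(dA)
jacobian = np.kron(I, A) + np.kron(A.T, I)   # n² × n² "ordinary" matrix

lhs = vec(A @ dA + dA @ A)    # the linear operator applied directly
rhs = jacobian @ vec(dA)      # the same operator as a matrix-vector product
print("match:", np.allclose(lhs, rhs))
```

Note that NumPy stores arrays row-major by default, so the order="F" argument is what makes vec agree with the column-stacking convention.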

Lecture Notes

Lecture 3

Outline

Part 1: Continued from Lecture 2: matrix functions, Jacobians, vectorizations, and Kronecker products. More examples of matrix functions, including LU factorization and 2 × 2 eigenproblems.

Part 2: Finite-difference methods: viewing f(x + δx) - f(x) as an approximation for f′(x)δx on a computer. This is extremely useful as a quick check of a hand-derived derivative (hand derivations are very error-prone for complicated functions), and can also be used as a replacement for analytical derivatives in a pinch. Analyzed two sources of error: truncation error (from the non-infinitesimal δx) and roundoff error (from the finite precision of computer arithmetic).
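
A finite-difference check of this kind might look like the following sketch, which compares f(A + dA) - f(A) against the Lecture 2 formula for d(A³) over a range of step sizes. It assumes NumPy; the random test matrices, the chosen step scales, and the names f, fprime, and errors are illustrative choices rather than anything from the course notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
D = rng.standard_normal((4, 4))      # fixed perturbation direction

def f(M):
    return M @ M @ M                 # f(A) = A³

def fprime(M, dM):
    # Hand-derived derivative: d(A³) = A²(dA) + A(dA)A + (dA)A²
    return M @ M @ dM + M @ dM @ M + dM @ M @ M

errors = {}
for s in [1e-2, 1e-5, 1e-8, 1e-11, 1e-14]:
    dA = s * D
    diff = f(A + dA) - f(A)          # finite difference
    exact = fprime(A, dA)            # linearization f′(A)[dA]
    errors[s] = np.linalg.norm(diff - exact) / np.linalg.norm(exact)

for s, err in errors.items():
    print(f"step scale {s:.0e}: relative error {err:.1e}")
```

For large steps the error is dominated by truncation (the dropped higher-order terms), while for very small steps it is dominated by roundoff in the subtraction, so the error is typically smallest at an intermediate step size.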

Lecture Notes

Lecture 4

Outline

Part 0: To begin with, spent a few minutes talking about the last few sections of the Finite Difference (Jupyter notebook) from last lecture: higher-order finite-difference rules and finite differences in higher dimensions (e.g. for gradients).

Part 1: Generalizing gradients to scalar functions f(x) for x in arbitrary vector spaces x ∈ V. The key thing is that we need not just a vector space, but an inner product x ⋅ y (a “dot product”, also denoted ⟨x,y⟩ or ⟨x|y⟩); V is then formally called a Hilbert space. Then, for any scalar function, since df = f’(x)[dx] is a linear operator mapping dx ∈ V to scalars df ∈ ℝ (a “linear form”), it turns out that it must be a dot product of dx with “something”, and we call that “something” the gradient! That is, once we define a dot product, then for any scalar function f(x) we can define ∇f by f’(x)[dx] = ∇f ⋅ dx. So ∇f is always something with the same “shape” as x (the steepest-ascent direction).

Defined the most obvious inner product of m × n matrices: the Frobenius inner product A ⋅ B = sum(A .* B) = trace(AᵀB) = vec(A)ᵀvec(B), the sum of the products of the matrix entries (here .* denotes the elementwise product). This also gives us the “Frobenius norm” via ‖A‖² = A ⋅ A = trace(AᵀA) = ‖vec(A)‖², so that ‖A‖ is the square root of the sum of the squares of the entries. Using this, we can now take the derivatives of various scalar functions of matrices, e.g. we considered:

f(A) = ‖A‖ ⟹ ∇f = A/‖A‖
f(A) = xᵀAy ⟹ ∇f = xyᵀ (for constant x, y)
f(A) = det(A) ⟹ ∇f = det(A)(A⁻¹)ᵀ = adjugate(A)ᵀ: we will prove this later
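
These three gradients can be sanity-checked against the defining relation df = ∇f ⋅ dA under the Frobenius inner product. The sketch below assumes NumPy; the fixed test matrix A, the vectors x and y, and the helper name frob are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])       # invertible test matrix (det = 18)
x = np.array([1.0, 2.0, 3.0])
y = np.array([-1.0, 0.0, 2.0])
dA = 1e-7 * rng.standard_normal((3, 3))

def frob(X, Y):
    # Frobenius inner product: sum of elementwise products = trace(XᵀY)
    return np.sum(X * Y)

# Each entry pairs a scalar function f(A) with its claimed gradient at A
cases = {
    "norm":  (lambda M: np.linalg.norm(M), A / np.linalg.norm(A)),
    "x'Ay":  (lambda M: x @ M @ y, np.outer(x, y)),
    "det":   (lambda M: np.linalg.det(M), np.linalg.det(A) * np.linalg.inv(A).T),
}

errors = {}
for name, (f, grad) in cases.items():
    df = f(A + dA) - f(A)             # actual change in f
    predicted = frob(grad, dA)        # ∇f ⋅ dA
    errors[name] = abs(df - predicted) / (np.linalg.norm(grad) * np.linalg.norm(dA))
    print(f"{name}: df = {df:.6e}, grad . dA = {predicted:.6e}")
```

In each case the gradient has the same shape as A, which is what makes the elementwise sum in frob well defined.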

Part 2: Applications of derivatives to multivariate root-finding and optimization. A key fact enabling large-scale optimization, i.e. min f(x) where f is a scalar function of many parameters x, is that computing ∇f has about the same cost as f, using what is variously called “reverse-mode” or “adjoint” or “backpropagation” differentiation algorithms, which essentially boil down to evaluating the chain rule left to right. Went through a few examples of this, oriented more at engineering/physics optimization (and “topology optimization”).
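
For the root-finding application, the summary does not say which algorithm was presented. As an assumption, the sketch below uses Newton's method, a standard derivative-based choice that repeatedly solves the linearized system f(x) + f′(x)δx = 0. The two-equation test system and the function names are hypothetical, and NumPy is assumed.

```python
import numpy as np

def f(x):
    # Illustrative system: circle of radius 2 intersected with the line x0 = x1
    return np.array([x[0]**2 + x[1]**2 - 4.0, x[0] - x[1]])

def jacobian(x):
    # f′(x) as a 2 × 2 Jacobian matrix
    return np.array([[2 * x[0], 2 * x[1]],
                     [1.0, -1.0]])

x = np.array([1.0, 2.0])              # initial guess
for iteration in range(20):
    # Linearize: f(x - step) ≈ f(x) - f′(x) step = 0, so solve f′(x) step = f(x)
    step = np.linalg.solve(jacobian(x), f(x))
    x = x - step
    if np.linalg.norm(step) < 1e-12:
        break

print("root:", x, "residual norm:", np.linalg.norm(f(x)))
```

From this starting point the iteration converges to the root with both coordinates equal to √2. In general Newton's method needs a reasonable initial guess and an invertible Jacobian along the way.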

guevara commented Dec 9, 2024

Lecture 1

Outline

Lecture Notes

Further Readings

Lecture 2

Outline

Lecture Notes

Further Readings

Lecture 3

Outline

Lecture Notes

Further Readings

Lecture 4

Outline

Lecture Notes

Further Readings (Part 2)

Lecture 5

Lecture Notes

Further Readings (Part 1)

Further Readings (Part 2)

Further Readings (Part 3)

Lecture 6

Lecture Notes

Further Readings (Part 1)

Further Readings (Part 2)

Lecture 7

Lecture Notes

Further Readings (Part 1)

Further Readings (Part 2)

Lecture 8

Lecture Notes

Further Readings (Part 1)

Further Readings (Part 2)

Further Readings (Part 3)
