diff --git a/docs/advanced/cpp_integration.html b/docs/advanced/cpp_integration.html index 1c7bd99..cd5c6d2 100644 --- a/docs/advanced/cpp_integration.html +++ b/docs/advanced/cpp_integration.html @@ -65,7 +65,7 @@ - + @@ -222,14 +222,20 @@ @@ -272,55 +278,39 @@ @@ -345,7 +351,7 @@

Parameters

Returns:

-

np.narray Randomly generated matrix of shape size with entries in { -1, 1 }.

+

: Randomly generated matrix of shape size with entries in { -1, 1 }.

diff --git a/docs/reference/trace.xtrace.html b/docs/reference/trace.xtrace.html index 2ee8f39..286f2eb 100644 --- a/docs/reference/trace.xtrace.html +++ b/docs/reference/trace.xtrace.html @@ -181,6 +181,12 @@ The Lanczos Method + + diff --git a/docs/search.json b/docs/search.json index a402d11..f13d9b3 100644 --- a/docs/search.json +++ b/docs/search.json @@ -71,7 +71,7 @@ "href": "reference/random.normal.html", "title": "random.normal", "section": "", - "text": "random.normal(size, rng='splitmix64', seed=-1, dtype=np.float32)\nGenerates random vectors from the standard normal distribution.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsize\nint or tuple\nOutput shape to generate.\nrequired\n\n\nrng\nstr = “splitmix64”\nRandom number generator to use.\n'splitmix64'\n\n\nseed\nint = -1\nSeed for the generator. Use -1 to for random (non-deterministic) behavior.\n-1\n\n\ndtype\ndtype = float32\nFloating point dtype for the output. Must be float32 or float64.\nnp.float32\n\n\n\n\n\n\nnp.narray Randomly generated matrix of shape size with entries in { -1, 1 }.", + "text": "random.normal(size, rng='splitmix64', seed=-1, dtype=np.float32)\nGenerates random vectors from the standard normal distribution.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsize\nint or tuple\nOutput shape to generate.\nrequired\n\n\nrng\nstr = “splitmix64”\nRandom number generator to use.\n'splitmix64'\n\n\nseed\nint = -1\nSeed for the generator. Use -1 to for random (non-deterministic) behavior.\n-1\n\n\ndtype\ndtype = float32\nFloating point dtype for the output. Must be float32 or float64.\nnp.float32\n\n\n\n\n\n\n: Randomly generated matrix of shape size with entries in { -1, 1 }.", "crumbs": [ "API Reference", "Random", @@ -584,12 +584,12 @@ { "objectID": "theory/lanczos.html#pseudocode", "href": "theory/lanczos.html#pseudocode", - "title": "The Lanczos method", + "title": "SLQ Trace guide", "section": "Pseudocode", - "text": "Pseudocode\nThere are several ways to implement the Lanczos method, some of which are “better” than others. Below is pseudocode equivalent to Paige’s A27 variant, which has been shown to have a variety of attractive properties.\nThere are many extensions that modify the Lanczos method to make it more robust, more computationally efficient, etc., though many of these have non-trivial implications on the space and time complexities.", + "text": "Pseudocode\nTo clarify that that means, here’s an abstract presentation of the generic SLQ procedure:\n\n\n\\begin{algorithm} \\caption{Stochastic Lanczos Quadrature} \\begin{algorithmic} \\Input Symmetric operator ($A \\in \\mathbb{R}^{n \\times n}$) \\Require Number of queries ($n_v$), Degree of quadrature ($k$) \\Function{SLQ}{$A$, $n_v$, $k$} \\State $\\Gamma \\gets 0$ \\For{$j = 1, 2, \\dots, n_v$} \\State $v_i \\sim \\mathcal{D}$ where $\\mathcal{D}$ satisfies $\\mathbb{E}(v v^\\top) = I$ \\State $T^{(j)}(\\alpha, \\beta)$ $\\gets$ $\\mathrm{Lanczos}(A,v_j,k+1)$ \\State $[\\Theta, Y] \\gets \\mathrm{eigh\\_tridiag}(T^{(j)}(\\alpha, \\beta))$ \\State $\\tau_i \\gets \\langle e_1, y_i \\rangle$ \\State < Do something with the node/weight pairs $(\\theta_i, \\tau_i^2)$ > \\EndFor \\EndFunction \\end{algorithmic} \\end{algorithm}", "crumbs": [ "Theory", - "Lanczos" + "The Lanczos Method" ] }, { @@ -882,7 +882,7 @@ { "objectID": "theory/lanczos.html#complexity-analysis", "href": "theory/lanczos.html#complexity-analysis", - "title": "The Lanczos method", + "title": "SLQ Trace guide", "section": "Complexity analysis", "text": "Complexity analysis\nElegant and as theoretically founded as the Lanczos method may be, is it efficient in practice?\nLet’s start by establishing a baseline on its complexity:\n\nTheorem 1 (Parlett 1994) Given a symmetric rank-r matrix A \\in \\mathbb{R}^{n \\times n} whose operator x \\mapsto A x requires O(\\eta) time and O(\\nu) space, the Lanczos method computes \\Lambda(A) in O(\\max\\{\\eta, n\\}\\cdot r) time and O(\\max\\{\\nu, n\\}) space, when computation is done in exact arithmetic\n\nAs its clear from the theorem, if we specialize it such that r = n and \\eta = \\nu = n, then the Lanczos method requires just O(n^2) time and O(n) space to execute. In other words, the Lanczos method drops both the time and space complexity4 of obtaining spectral information by order of magnitude over similar eigen-algorithms that decompose A directly.\nTo see why this is true, note that a symmetric tridiagonal matrix is fully characterized by its diagonal and subdiagonal terms, which requires just O(n) space. If we assume that v \\mapsto Av \\sim O(n), then carrying out the recurrence clearly takes at most O(n^2) time, since there are most n such vectors \\{q_i\\}_{i=1}^n to generate!\nNow, if we need to store all of Y or Q explicitly, we clearly need O(n^2) space to do so. However, if we only need the eigen-values \\Lambda(A) (and not their eigen-vectors U), then we may execute the recurrence keeping at most three vectors \\{q_{j-1}, q_{j}, q_{j+1}\\} in memory at any given time. Since each of these is O(n) is size, the claim of O(n) space is justified!", "crumbs": [ @@ -957,7 +957,7 @@ "href": "theory/intro.html", "title": "Introduction to trace estimation with primate", "section": "", - "text": "primate contains a variety of methods for estimating quantities from matrix functions. One popular such quantity is the trace of f(A), for some generic f: [a,b] \\to \\mathbb{R} defined on the spectrum of A: \\mathrm{tr}(f(A)) \\triangleq \\mathrm{tr}(U f(\\Lambda) U^{\\intercal}), \\quad \\quad f : [a,b] \\to \\mathbb{R}\nMany quantities of common interest can be expressed in as traces of matrix functions with the right spectral function specialization. A few such specialization include:\n\\begin{align*}\nf &= \\mathrm{id} \\quad &\\Longleftrightarrow& \\quad &\\mathrm{tr}(A) \\\\\nf &= f^{-1} \\quad &\\Longleftrightarrow& \\quad &\\mathrm{tr}(A^{-1}) \\\\\nf &= \\log \\quad &\\Longleftrightarrow& \\quad &\\mathrm{logdet}(A) \\\\\nf &= \\mathrm{exp} \\quad &\\Longleftrightarrow& \\quad &\\mathrm{exp}(A) \\\\\nf &= \\mathrm{sign} \\quad &\\Longleftrightarrow& \\quad &\\mathrm{rank}(A) \\\\\n&\\vdots \\quad &\\hphantom{\\Longleftrightarrow}& \\quad & \\vdots &\n\\end{align*}\nIn this introduction, the basics of randomized trace estimation are introduced, with a focus on matrix-free algorithms how the readily extend to estimating the traces of matrix functions.", + "text": "primate contains a variety of methods for estimating quantities from matrix functions. One popular such quantity is the trace of f(A), for some generic f: [a,b] \\to \\mathbb{R} defined on the spectrum of A: \\mathrm{tr}(f(A)) \\triangleq \\mathrm{tr}(U f(\\Lambda) U^{\\intercal}), \\quad \\quad f : [a,b] \\to \\mathbb{R}\nMany quantities of common interest can be expressed in as traces of matrix functions with the right choice of f. A few such function specializations include:\n\\begin{align*}\nf &= \\mathrm{id} \\quad &\\Longleftrightarrow& \\quad &\\mathrm{tr}(A) \\\\\nf &= f^{-1} \\quad &\\Longleftrightarrow& \\quad &\\mathrm{tr}(A^{-1}) \\\\\nf &= \\log \\quad &\\Longleftrightarrow& \\quad &\\mathrm{logdet}(A) \\\\\nf &= \\mathrm{exp} \\quad &\\Longleftrightarrow& \\quad &\\mathrm{exp}(A) \\\\\nf &= \\mathrm{sign} \\quad &\\Longleftrightarrow& \\quad &\\mathrm{rank}(A) \\\\\n&\\vdots \\quad &\\hphantom{\\Longleftrightarrow}& \\quad & \\vdots &\n\\end{align*}\nIn this introduction, the basics of randomized trace estimation are introduced, with a focus on how to use matrix-free algorithms estimate traces of matrix functions.", "crumbs": [ "Theory", "Introduction" @@ -979,7 +979,7 @@ "href": "theory/intro.html#the-implicit-trace-estimation-problem", "title": "Introduction to trace estimation with primate", "section": "The implicit trace estimation problem", - "text": "The implicit trace estimation problem\nThe implicit trace estimation problem can be stated as follows:\n\nGiven access to a square matrix A \\in \\mathbb{R}^{n \\times n} via its matrix-vector product operator x \\mapsto Ax, estimate its trace \\mathrm{tr}(A) = \\sum\\limits_{i=1}^n A_{ii}\n\nObserve that although method (3) fits squarely in this category, its pretty inefficient from a computational standpoint—after all, most of the entries of e_i are zero, thus most inner product computations do not contribute to the trace estimate at all.\nOne idea, attributed to A. Girard, is to replace e_i with random zero-mean vectors, e.g. random sign vectors v \\in \\{-1, +1\\}^{n} or random normal vectors v \\sim \\mathcal{N}(\\mu=0, \\sigma=1):\n\\mathtt{tr}(A) \\approx \\frac{1}{n_v}\\sum\\limits_{i=1}^{n_v} v_i^\\top A v_i \nIt was shown more formally later by M.F. Huchinson that if these random vectors v \\in \\mathbb{R}^n satisfy \\mathbb{E}[vv^T] = I then this approach indeed forms an unbiased estimator of \\mathrm{tr}(A). To see this, its sufficient to combine the linearity of expectation, the cyclic-property of the trace function, and the aforementioned isotropy conditions: \\mathtt{tr}(A) = \\mathtt{tr}(A I) = \\mathtt{tr}(A \\mathbb{E}[v v^T]) = \\mathbb{E}[\\mathtt{tr}(Avv^T)] = \\mathbb{E}[\\mathtt{tr}(v^T A v)] = \\mathbb{E}[v^T A v]\nNaturally we expect the approximation to gradually improve as more vectors are sampled, converging as n_v \\to \\infty. Let’s see if we can check this: we start by collecting a few sample quadratic forms\n\ndef isotropic(n):\n yield from [np.random.choice([-1,+1], size=A.shape[1]) for i in range(n)]\nestimates = np.array([v @ A @ v for v in isotropic(500)])\n\nPlotting the estimator formed by averaging the first k samples along with it’s error, we get the following figures for k = 50 and k = 500, respectively:\n\n\n/var/folders/0l/b3dbb2_d2bb4y3wbbfk0wt_80000gn/T/ipykernel_34141/588126951.py:32: UserWarning:\n\n\nYou are attempting to set `plot.legend.label_text_font_size` on a plot that has zero legends added, this will have no effect.\n\nBefore legend properties can be set, you must add a Legend explicitly, or call a glyph method with a legend parameter set.\n\n\n/var/folders/0l/b3dbb2_d2bb4y3wbbfk0wt_80000gn/T/ipykernel_34141/588126951.py:33: UserWarning:\n\n\nYou are attempting to set `plot.legend.label_height` on a plot that has zero legends added, this will have no effect.\n\nBefore legend properties can be set, you must add a Legend explicitly, or call a glyph method with a legend parameter set.\n\n\n/var/folders/0l/b3dbb2_d2bb4y3wbbfk0wt_80000gn/T/ipykernel_34141/588126951.py:34: UserWarning:\n\n\nYou are attempting to set `plot.legend.glyph_height` on a plot that has zero legends added, this will have no effect.\n\nBefore legend properties can be set, you must add a Legend explicitly, or call a glyph method with a legend parameter set.\n\n\n/var/folders/0l/b3dbb2_d2bb4y3wbbfk0wt_80000gn/T/ipykernel_34141/588126951.py:35: UserWarning:\n\n\nYou are attempting to set `plot.legend.spacing` on a plot that has zero legends added, this will have no effect.\n\nBefore legend properties can be set, you must add a Legend explicitly, or call a glyph method with a legend parameter set.\n\n\n/var/folders/0l/b3dbb2_d2bb4y3wbbfk0wt_80000gn/T/ipykernel_34141/588126951.py:37: UserWarning:\n\n\nYou are attempting to set `plot.legend.orientation` on a plot that has zero legends added, this will have no effect.\n\nBefore legend properties can be set, you must add a Legend explicitly, or call a glyph method with a legend parameter set.\n\n\n/var/folders/0l/b3dbb2_d2bb4y3wbbfk0wt_80000gn/T/ipykernel_34141/588126951.py:38: UserWarning:\n\n\nYou are attempting to set `plot.legend.background_fill_alpha` on a plot that has zero legends added, this will have no effect.\n\nBefore legend properties can be set, you must add a Legend explicitly, or call a glyph method with a legend parameter set.\n\n\n\n\n\n \n\n\n\n\n\nSure enough, the estimator quickly gets within an error rate of about 1-5% away from the actual trace, getting generally closer as more samples are collected. On the other hand, improvement is not gaurenteed, and securing additional precision beyond < 1% becomes increasingly difficult due to the variance of the sample estimates.", + "text": "The implicit trace estimation problem\nThe implicit trace estimation problem can be stated as follows:\n\nGiven access to a square matrix A \\in \\mathbb{R}^{n \\times n} via its matrix-vector product operator x \\mapsto Ax, estimate its trace \\mathrm{tr}(A) = \\sum\\limits_{i=1}^n A_{ii}\n\nThough method (3) may be used to solve the implicit trace estimation problem exactly, its pretty inefficient from a computational standpoint—after all, most of the entries of e_i are zero, thus most inner product computations do not contribute to the trace estimate at all.\nOne idea, attributed to A. Girard, is to replace e_i with random zero-mean vectors, e.g. random sign vectors v \\in \\{-1, +1\\}^{n} or random normal vectors v \\sim \\mathcal{N}(\\mu=0, \\sigma=1):\n\\mathtt{tr}(A) \\approx \\frac{1}{n_v}\\sum\\limits_{i=1}^{n_v} v_i^\\top A v_i \nIt was shown more formally later by M.F. Huchinson that if these random vectors v \\in \\mathbb{R}^n satisfy \\mathbb{E}[vv^T] = I then this approach indeed forms an unbiased estimator of \\mathrm{tr}(A). To see this, its sufficient to combine the linearity of expectation, the cyclic-property of the trace function, and the aforementioned isotropy conditions: \\mathtt{tr}(A) = \\mathtt{tr}(A I) = \\mathtt{tr}(A \\mathbb{E}[v v^T]) = \\mathbb{E}[\\mathtt{tr}(Avv^T)] = \\mathbb{E}[\\mathtt{tr}(v^T A v)] = \\mathbb{E}[v^T A v]\nNaturally we expect the approximation to gradually improve as more vectors are sampled, converging as n_v \\to \\infty. Let’s see if we can check this: we start by collecting a few sample quadratic forms\n\ndef isotropic(n):\n yield from [np.random.choice([-1,+1], size=A.shape[1]) for i in range(n)]\nestimates = np.array([v @ A @ v for v in isotropic(500)])\n\nPlotting both the estimator formed by averaging the first k samples along with its absolute error, we get the following figures for k = 50 and k = 500, respectively:\n\n\n\n \n\n\n\n\n\nSure enough, the estimator quickly gets within an error rate of about 1-5% away from the actual trace, getting generally closer as more samples are collected. On the other hand, improvement is not gaurenteed, and securing additional precision much less than 1% becomes increasingly difficult due to the randomness and variance of the sample estimates.", "crumbs": [ "Theory", "Introduction" @@ -1001,7 +1001,7 @@ "href": "theory/intro.html#ex-logdet-computation", "title": "Introduction to trace estimation with primate", "section": "Ex: Logdet computation", - "text": "Ex: Logdet computation\nThe basic way to approximate the trace of a matrix function is to simply set fun to the name the spectral function. For example, to approximate the log-determinant:\n\nfrom primate.trace import hutch\n\n## Log-determinant\nlogdet_test = hutch(A, fun=\"log\", maxiter=25)\nlogdet_true = np.sum(np.log(np.linalg.eigvalsh(A)))\n\nprint(f\"Logdet (exact): {logdet_true}\")\nprint(f\"Logdet (approx): {logdet_test}\")\n\nLogdet (exact): -148.3218440234385\nLogdet (approx): -150.71530622226584\n\n\nEven using n / 6 matvecs, we get a decent approximation. But how good is this estimate?\nTo get a slightly better idea, you can set verbose=True:\n\nest = hutch(A, fun=\"log\", maxiter=25, verbose=True)\n\nGirard-Hutchinson estimator (fun=log, deg=20, quad=fttr)\nEst: -145.469 +/- 7.37 (95% CI), CV: 3%, Evals: 25 [R]\n\n\nThe first line of the statement contains fixed about the estimator, including the type of estimator (Girard-Hutchinson), the function applied to the spectrum (log), the degree of the Krylov expansion (20), and the quadrature method used (fttr).\nThe second line prints the runtime information about the samples, such as the final trace estimate (``) 2. Its margin of error 2. The coefficient of variation (aka the relative std. deviation) 3. The number of samples used + their distribution prefix (‘R’ for rademacher)\n\nest = hutch(A, fun=\"log\", maxiter=100, plot=True)", + "text": "Ex: Logdet computation\nThe basic way to approximate the trace of a matrix function is to simply set fun to the name the spectral function. For example, to approximate the log-determinant:\n\nfrom primate.trace import hutch\n\n## Log-determinant\nlogdet_test = hutch(A, fun=\"log\", maxiter=25)\nlogdet_true = np.sum(np.log(np.linalg.eigvalsh(A)))\n\nprint(f\"Logdet (exact): {logdet_true}\")\nprint(f\"Logdet (approx): {logdet_test}\")\n\nLogdet (exact): -148.3218440234385\nLogdet (approx): -148.3090874345191\n\n\nEven using n / 6 matvecs, we get a decent approximation. But how good is this estimate?\nTo get a slightly better idea, you can set verbose=True:\n\nest = hutch(A, fun=\"log\", maxiter=25, seed=5, verbose=True)\n\nGirard-Hutchinson estimator (fun=log, deg=20, quad=fttr)\nEst: -146.830 +/- 5.59 (95% CI), CV: 2%, Evals: 25 [R] (seed: 5)\n\n\nThe first line of the statement contains fixed about the estimator, including the type of estimator (Girard-Hutchinson), the function applied to the spectrum (log), the degree of the Krylov expansion (20), and the quadrature method used (fttr).\nThe second line prints the runtime information about the samples, such as the final trace estimate, its margin of error and coefficient of variation (aka the relative std. deviation), and the number of samples used + their distribution prefix (‘R’ for rademacher).\nAnother way to get an idea of how well the estimator is converging is to pass the plot=True option. Here, we pass the same seed for reproducibility with the above call.\n\nest = hutch(A, fun=\"log\", maxiter=100, seed=5, plot=True)\n\n\n \n\n\n\n\n\nSimilar information reported by the verbose=True flag is plotted, now cumulatively applied to every sample. The left plot shows the sample point estimates alongside yellow bands denoting the margin of error associated with the 95% confidence interval (CI) at every sample index. The right plot show convergence information, e.g. the running coefficient of variation is shown on the top-right and an upper bound on the absolute error is derived from the CI is shown on the bottom right.\nWhile the margin of error is typically a valid, conservative estimate of estimators error, in some situations these bounds may not be valid as they are based solely on the CLT implications for normally distributed random variables. In particular, there is an inherent amount of error associated with the Lanczos-based quadrature approximations that is not taken into account in the error bound, which may invalidate the corresponding CI’s.", "crumbs": [ "Theory", "Introduction" @@ -1071,7 +1071,7 @@ "href": "reference/random.normal.html#returns", "title": "random.normal", "section": "", - "text": "np.narray Randomly generated matrix of shape size with entries in { -1, 1 }.", + "text": ": Randomly generated matrix of shape size with entries in { -1, 1 }.", "crumbs": [ "API Reference", "Random", @@ -1095,7 +1095,7 @@ "href": "reference/operator.matrix_function.html", "title": "operator.matrix_function", "section": "", - "text": "operator.matrix_function(A, fun='identity', deg=20, rtol=None, orth=0, **kwargs)\nConstructs a LinearOperator approximating the action of the matrix function fun(A).\nThis function uses the Lanczos quadrature method to approximate the action of matrix function: v \\mapsto f(A)v, \\quad f(A) = U f(\\lambda) U^T The resulting operator may be used in conjunction with other trace estimation methods, such as the hutch or xtrace estimators.\nThe weights of the quadrature may be computed using either the Golub-Welsch (GW) or Forward Three Term Recurrence algorithms (FTTR) (see the quad parameter). For a description of the other parameters, see the Lanczos function.\n\n\n\n\n\n\nNote\n\n\n\nTo compute the weights of the quadrature, the GW computation uses implicit symmetric QR steps with Wilkinson shifts, while the FTTR algorithm uses the explicit expression for orthogonal polynomials. While both require O(\\mathrm{deg}^2) time to execute, the former requires O(\\mathrm{deg}^2) space but is highly accurate, while the latter uses only O(1) space at the cost of stability. If deg is large and accuracy is not a priority, the fttr method is preferred.\n\n\n\n\nA : ndarray, sparray, or LinearOperator real symmetric operator. fun : str or Callable, default=“identity” real-valued function defined on the spectrum of A. deg : int, default=20 Degree of the Krylov expansion. rtol : float, default=1e-8 Relative tolerance to consider two Lanczos vectors are numerically orthogonal. orth: int, default=0 Number of additional Lanczos vectors to orthogonalize against when building the Krylov basis. kwargs : dict, optional additional key-values to parameterize the chosen function ‘fun’.\n\n\n\noperator : LinearOperator Operator approximating the action of fun on the spectrum of A", + "text": "operator.matrix_function(A, fun='identity', deg=20, rtol=None, orth=0, **kwargs)\nConstructs a LinearOperator approximating the action of the matrix function fun(A).\nThis function uses the Lanczos quadrature method to approximate the action of matrix function: v \\mapsto f(A)v, \\quad f(A) = U f(\\lambda) U^T The resulting operator may be used in conjunction with other trace and vector-based estimation methods, such as hutch or xtrace.\nThe weights of the quadrature may be computed using either the Golub-Welsch (GW) or Forward Three Term Recurrence algorithms (FTTR) (see the quad parameter). For a description of the other parameters, see the Lanczos function.\n\n\n\n\n\n\nNote\n\n\n\nTo compute the weights of the quadrature, the GW computation uses implicit symmetric QR steps with Wilkinson shifts, while the FTTR algorithm uses the explicit expression for orthogonal polynomials. While both require O(\\mathrm{deg}^2) time to execute, the former requires O(\\mathrm{deg}^2) space but is highly accurate, while the latter uses only O(1) space at the cost of stability. If deg is large, fttr is preferred.\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nscipy.sparse.linalg.LinearOperator\na LinearOperator approximating the action of fun on the spectrum of A", "crumbs": [ "API Reference", "Matrix Function" @@ -1117,7 +1117,7 @@ "href": "reference/operator.matrix_function.html#returns", "title": "operator.matrix_function", "section": "", - "text": "operator : LinearOperator Operator approximating the action of fun on the spectrum of A", + "text": "Type\nDescription\n\n\n\n\nscipy.sparse.linalg.LinearOperator\na LinearOperator approximating the action of fun on the spectrum of A", "crumbs": [ "API Reference", "Matrix Function" @@ -1206,5 +1206,202 @@ "API Reference", "Hutch" ] + }, + { + "objectID": "basic/performance.html", + "href": "basic/performance.html", + "title": "Performance", + "section": "", + "text": "primate provides a variety of efficient algorithms for estimating quantities derived from matrix functions. These algorithms are largely implemented in C++ to minimize overhead, and for some computational problems primate can out-perform the standard algorithms for estimating spectral quantities by several orders of magnitude. Nonetheless, there are some performance-related caveats to be aware of.\n\n\n\nfrom scipy.linalg import toeplitz\nfrom primate.trace import hutch \n\nc = np.random.uniform(size=100, low=0, high=1)\nT = toeplitz(c)\n\n# np.sum(np.reciprocal(np.linalg.eigvalsh(T)))\nnp.sum(np.abs(np.linalg.eigvalsh(T)))\nest, info = hutch(T, fun=np.abs, maxiter=200, pdf='rademacher', rng=\"mt\", deg=20, seed=-1, plot=True)\n\nest, info = hutch(T, fun=np.abs, maxiter=800, pdf='rademacher', rng=\"mt\", quad=\"golub_welsch\", deg=200, seed=-1, plot=True)\nest, info = hutch(C, fun=np.abs, maxiter=800, pdf='rademacher', rng=\"mt\", quad=\"golub_welsch\", deg=200, seed=-1, plot=True)\n\n\n\nnp.sum(np.linalg.svd(T)[1])\n\nZ = np.zeros(T.shape)\nC = np.block([[Z, T], [T.T, Z]])\nhutch(C, fun=np.abs, maxiter=200, pdf='rademacher', rng=\"mt\", quad=\"golub_welsch\", deg=200, seed=-1, plot=False)\n\n\nhutch(T, fun=np.abs, maxiter=200, pdf='rademacher', deg=50, seed=-1, plot=False)\n\np = info['figure']\n\nfrom primate.trace import xtrace\nfrom primate.operator import matrix_function\nM = matrix_function(T, \"abs\")\nxtrace(M)\n\ns = info['samples']\nv = np.random.choice([-1, 1], size=T.shape[0])\nT @ v\n\ns[np.abs(s) <= np.linalg.norm(T)].mean()\n\n\nfrom primate.diagonalize import lanczos\nfrom scipy.linalg import eigvalsh_tridiagonal\na, b = lanczos(T, deg=499, orth=150)\nnp.sum(np.abs(eigvalsh_tridiagonal(a,b)))\n\n\nimport timeit \ntimeit.timeit(lambda: hutch(A, maxiter=20, deg=5, fun=\"log\", quad=\"fttr\"), number = 1000)\ntimeit.timeit(lambda: np.sum(np.log(np.linalg.eigvalsh(A))), number = 1000)" + }, + { + "objectID": "theory/intro.html#footnotes", + "href": "theory/intro.html#footnotes", + "title": "Introduction to trace estimation with primate", + "section": "Footnotes", + "text": "Footnotes\n\n\nTechnically, we could solve a system of equations via either Gaussian elimination or, when A is rank-deficient, via least squares: (A + \\epsilon I)^{-1}v \\simeq u \\Leftrightarrow \\mathrm{argmin}_{u \\in \\mathbb{R}^n} \\lVert v - (A + \\epsilon I)u \\rVert^2. Then again, this could be just as expensive—if not more so—than just executing the Lanczos iteration!↩︎", + "crumbs": [ + "Theory", + "Introduction" + ] + }, + { + "objectID": "theory/lanczos.html#but-how-do-i-code-it-up", + "href": "theory/lanczos.html#but-how-do-i-code-it-up", + "title": "SLQ Trace guide", + "section": "… but how do I code it up?", + "text": "… but how do I code it up?\nTo clarify that that means, here’s an abstract presentation of the generic SLQ procedure:\n\n\n\\begin{algorithm} \\caption{Stochastic Lanczos Quadrature} \\begin{algorithmic} \\Input Symmetric operator ($A \\in \\mathbb{R}^{n \\times n}$) \\Require Number of queries ($n_v$), Degree of quadrature ($k$) \\Function{SLQ}{$A$, $n_v$, $k$} \\State $\\Gamma \\gets 0$ \\For{$j = 1, 2, \\dots, n_v$} \\State $v_i \\sim \\mathcal{D}$ where $\\mathcal{D}$ satisfies $\\mathbb{E}(v v^\\top) = I$ \\State $T^{(j)}(\\alpha, \\beta)$ $\\gets$ $\\mathrm{Lanczos}(A,v_j,k+1)$ \\State $[\\Theta, Y] \\gets \\mathrm{eigh\\_tridiag}(T^{(j)}(\\alpha, \\beta))$ \\State $\\tau_i \\gets \\langle e_1, y_i \\rangle$ \\State < Do something with the node/weight pairs $(\\theta_i, \\tau_i^2)$ > \\EndFor \\EndFunction \\end{algorithmic} \\end{algorithm}", + "crumbs": [ + "Theory", + "The Lanczos Method" + ] + }, + { + "objectID": "theory/lanczos.html#beating-the-bound", + "href": "theory/lanczos.html#beating-the-bound", + "title": "SLQ Trace guide", + "section": "Beating the bound", + "text": "Beating the bound\nElegant and as theoretically founded as the Lanczos method may be, is it efficient in practice?\nLet’s start by establishing a baseline on its complexity:\n\nTheorem 1 (Parlett 1994) Given a symmetric rank-r matrix A \\in \\mathbb{R}^{n \\times n} whose operator x \\mapsto A x requires O(\\eta) time and O(\\nu) space, the Lanczos method computes \\Lambda(A) in O(\\max\\{\\eta, n\\}\\cdot r) time and O(\\max\\{\\nu, n\\}) space, when computation is done in exact arithmetic\n\nAs its clear from the theorem, if we specialize it such that r = n and \\eta = \\nu = n, then the Lanczos method requires just O(n^2) time and O(n) space to execute. In other words, the Lanczos method drops both the time and space complexity4 of obtaining spectral information by order of magnitude over similar eigen-algorithms that decompose A directly.\nTo see why this is true, note that a symmetric tridiagonal matrix is fully characterized by its diagonal and subdiagonal terms, which requires just O(n) space. If we assume that v \\mapsto Av \\sim O(n), then carrying out the recurrence clearly takes at most O(n^2) time, since there are most n such vectors \\{q_i\\}_{i=1}^n to generate!\nNow, if we need to store all of Y or Q explicitly, we clearly need O(n^2) space to do so. However, if we only need the eigen-values \\Lambda(A) (and not their eigen-vectors U), then we may execute the recurrence keeping at most three vectors \\{q_{j-1}, q_{j}, q_{j+1}\\} in memory at any given time. Since each of these is O(n) is size, the claim of O(n) space is justified!", + "crumbs": [ + "Theory", + "The Lanczos Method" + ] + }, + { + "objectID": "theory/intro.html#basics-trace-computation", + "href": "theory/intro.html#basics-trace-computation", + "title": "Introduction to trace estimation with primate", + "section": "Basics: Trace computation", + "text": "Basics: Trace computation\n\nSuppose we wanted to compute the trace of some matrix A. By the very definition of the trace, we have a variety of means to do it: \\mathrm{tr}(A) \\triangleq \\sum\\limits_{i=1}^n A_{ii} = \\sum\\limits_{i=1}^n \\lambda_i = \\sum\\limits_{i=1}^n e_i^T A e_i \nwhere, in the right-most sum, we’re using e_i to represent the ith identity vector: A_{ii} = e_i^T A e_i, \\quad e_i = [0, \\dots, 0, \\underbrace{1}_{i}, 0, \\dots, 0] \nLet’s see some code. First, we start with a simple (random) positive-definite matrix A \\in \\mathbb{R}^{n \\times n}.\n\nfrom primate.random import symmetric\nA = symmetric(150, pd = True)\n\nBy default, symmetric normalizes A such that the eigenvalues are uniformly distributed in the interval [0, 1]. For reference, the spectrum of A looks as follows:\n\n\n\n \n\n\n\n\n\nNow, using numpy, we can verify all of these trace definitions are indeed equivalent:\n\n\nimport numpy as np\neye = lambda i: np.ravel(np.eye(1, 150, k=i))\nprint(f\"Direct: {np.sum(A.diagonal()):.8f}\")\nprint(f\"Eigen: {np.sum(np.linalg.eigvalsh(A)):.8f}\")\nprint(f\"Matvec: {sum([eye(i) @ A @ eye(i) for i in range(150)]):.8f}\")\n\nDirect: 75.69739746\nEigen: 75.69739746\nMatvec: 75.69739746\n\n\nWhile (1) trivially yields \\mathrm{tr}(A) in (optimal) O(n) time, there exist situations where the diagonal entries of A are not available—particularly in the large-scale regime. In contrast, though both (2) and (3) are less efficient, they are nonetheless viable alternatives to obtain \\mathrm{tr}(A).", + "crumbs": [ + "Theory", + "Introduction" + ] + }, + { + "objectID": "theory/intro.html#implicit-trace-estimation", + "href": "theory/intro.html#implicit-trace-estimation", + "title": "Introduction to trace estimation with primate", + "section": "Implicit trace estimation", + "text": "Implicit trace estimation\nThe implicit trace estimation problem can be stated as follows:\n\nGiven access to a square matrix A \\in \\mathbb{R}^{n \\times n} via its matrix-vector product operator x \\mapsto Ax, estimate its trace \\mathrm{tr}(A) = \\sum\\limits_{i=1}^n A_{ii}\n\nNote we no longer assume we have direct access to the diagonal here, leaving us with the Eigen or Matvec methods. The Eigen method is expensive, and though the Matvec method solves the problem exactly, its pretty inefficient from a computational standpoint—most of the entries of e_i are zero, thus most inner product computations do not contribute to the trace estimate at all.\nOne idea, attributed to A. Girard, is to replace e_i with random zero-mean vectors, e.g. random sign vectors v \\in \\{-1, +1\\}^{n} or random normal vectors v \\sim \\mathcal{N}(\\mu=0, \\sigma=1):\n\\mathtt{tr}(A) \\approx \\frac{1}{n_v}\\sum\\limits_{i=1}^{n_v} v_i^\\top A v_i \nIt was shown more formally later by M.F. Huchinson that if these random vectors v \\in \\mathbb{R}^n satisfy \\mathbb{E}[vv^T] = I then this approach indeed forms an unbiased estimator of \\mathrm{tr}(A). To see this, its sufficient to combine the linearity of expectation, the cyclic-property of the trace function, and the aforementioned isotropy conditions: \\mathtt{tr}(A) = \\mathtt{tr}(A I) = \\mathtt{tr}(A \\mathbb{E}[v v^T]) = \\mathbb{E}[\\mathtt{tr}(Avv^T)] = \\mathbb{E}[\\mathtt{tr}(v^T A v)] = \\mathbb{E}[v^T A v]\nNaturally we expect the approximation to gradually improve as more vectors are sampled, converging as n_v \\to \\infty. Let’s see if we can check this: we start by collecting a few sample quadratic forms\n\ndef isotropic(n):\n yield from [np.random.choice([-1,+1], size=A.shape[1]) for i in range(n)]\nestimates = np.array([v @ A @ v for v in isotropic(500)])\n\nPlotting both the estimator formed by averaging the first k samples along with its absolute error, we get the following figures for k = 50 and k = 500, respectively:\n\n\n\n \n\n\n\n\n\nSure enough, the estimator quickly gets within an error rate of about 1-5% away from the actual trace, getting generally closer as more samples are collected. On the other hand, improvement is not gaurenteed, and securing additional precision much less than 1% becomes increasingly difficult due to the randomness and variance of the sample estimates.", + "crumbs": [ + "Theory", + "Introduction" + ] + }, + { + "objectID": "theory/lanczos.html#beating-the-complexity-bounds", + "href": "theory/lanczos.html#beating-the-complexity-bounds", + "title": "The Lanczos method", + "section": "Beating the complexity bounds", + "text": "Beating the complexity bounds\nElegant and as theoretically founded as the Lanczos method may be, is it efficient in practice?\nLet’s start by establishing a baseline on its complexity:\n\nTheorem 1 (Parlett 1994) Given a symmetric rank-r matrix A \\in \\mathbb{R}^{n \\times n} whose operator x \\mapsto A x requires O(\\eta) time and O(\\nu) space, the Lanczos method computes \\Lambda(A) in O(\\max\\{\\eta, n\\}\\cdot r) time and O(\\max\\{\\nu, n\\}) space, when computation is done in exact arithmetic\n\nAs its clear from the theorem, if we specialize it such that r = n and \\eta = \\nu = n, then the Lanczos method requires just O(n^2) time and O(n) space to execute. In other words, the Lanczos method drops both the time and space complexity4 of obtaining spectral information by order of magnitude over similar eigen-algorithms that decompose A directly.\nTo see why this is true, note that a symmetric tridiagonal matrix is fully characterized by its diagonal and subdiagonal terms, which requires just O(n) space. If we assume that v \\mapsto Av \\sim O(n), then carrying out the recurrence clearly takes at most O(n^2) time, since there are most n such vectors \\{q_i\\}_{i=1}^n to generate!\nNow, if we need to store all of Y or Q explicitly, we clearly need O(n^2) space to do so. However, if we only need the eigen-values \\Lambda(A) (and not their eigen-vectors U), then we may execute the recurrence keeping at most three vectors \\{q_{j-1}, q_{j}, q_{j+1}\\} in memory at any given time. Since each of these is O(n) is size, the claim of O(n) space is justified!", + "crumbs": [ + "Theory", + "The Lanczos Method" + ] + }, + { + "objectID": "theory/intro.html#example-logdet-computation", + "href": "theory/intro.html#example-logdet-computation", + "title": "Introduction to trace estimation with primate", + "section": "Example: Logdet computation", + "text": "Example: Logdet computation\nPutting all of the theory above together, a simple way to estimate the trace of a matrix function is to average Lanczos quadrature estimates:\n\\mathrm{tr}(f(A)) \\approx \\frac{n}{n_v} \\sum\\limits_{i=1}^{n_v} e_1^T f(T_m) e_1 = \\frac{n}{n_v} \\sum\\limits_{j=1}^{n_v} \\left ( \\sum\\limits_{i=1}^m \\tau_i^{(j)} f(\\theta_i)^{(j)} \\right )\nTo enable this kind of estimator in hutch, simply set the fun argument to either the name of a built-in function or an arbitrary Callable. For example, to approximate the log-determinant:\n\n## Log-determinant\nlogdet_test = hutch(A, fun=\"log\", maxiter=25)\nlogdet_true = np.sum(np.log(np.linalg.eigvalsh(A)))\n\nprint(f\"Logdet (exact): {logdet_true}\")\nprint(f\"Logdet (approx): {logdet_test}\")\n\nLogdet (exact): -148.3218440234385\nLogdet (approx): -148.71526016046562\n\n\nHere we see that using n / 6 evaluatons of e_1^T f(T_m) e_1, we get a decent approximation of logdet(A). To get a better idea of the quality of this estimate, you can set verbose=True:\n\nest = hutch(A, fun=\"log\", maxiter=25, seed=5, verbose=True)\n\nGirard-Hutchinson estimator (fun=log, deg=20, quad=fttr)\nEst: -143.207 +/- 5.40 (95% CI), CV: 2%, Evals: 25 [R] (seed: 5)\n\n\nThe first line of the statement contains fixed about the estimator, including the type of estimator (Girard-Hutchinson), the function applied to the spectrum (log), the degree of the Lanczos quadrature approximation (20), and the quadrature method used (fttr).\nThe second line prints the runtime information about the samples, such as the final trace estimate, its margin of error and coefficient of variation (aka the relative std. deviation), and the number of samples used + their distribution prefix (‘R’ for rademacher).\nAnother way to get an idea of how well the estimator is converging is to pass the plot=True option. Here, we pass the same seed for reproducibility with the above call.\n\nest = hutch(A, fun=\"log\", maxiter=100, seed=5, plot=True)\n\n\n \n\n\n\n\n\nSimilar information as reported by the verbose=True flag is shown cumulatively applied to every sample. The left plot shows the sample point estimates alongside yellow bands denoting the margin of error associated with the 95% confidence interval (CI) at every sample index. The right plot show convergence information, e.g. the running coefficient of variation is shown on the top-right and an upper bound on the absolute error is derived from the CI is shown on the bottom right.\nWhile the margin of error is typically a valid, conservative estimate of estimators error, in some situations these bounds may not be valid as they are based solely on the CLT implications for normally distributed random variables. In particular, there is an inherent amount of error associated with the Lanczos-based quadrature approximations that is not taken into account in the error bound, which may invalidate the corresponding CI’s.", + "crumbs": [ + "Theory", + "Introduction" + ] + }, + { + "objectID": "theory/intro.html#implementation", + "href": "theory/intro.html#implementation", + "title": "Introduction to trace estimation with primate", + "section": "Implementation", + "text": "Implementation\nRather than manually averaging samples, trace estimates can be obtained via the hutch function:\n\nfrom primate.trace import hutch\nprint(f\"Trace estimate: {hutch(A, maxiter=50)}\")\n\nTrace estimate: 76.03071528930039\n\n\nThough its estimation is identical to averaging quadratic forms v^T A v of isotropic vectors as above, hutch comes with a few extra features to simplify its usage. In particular, all v \\mapsto v^T A v evaluations are done in parallel, there are a variety of isotropic distributions and random number generators (RNGs) to choose from, and the estimator may be suppleid various convergence criteria to allow for early-stopping; see the hutch documentation page for more details.", + "crumbs": [ + "Theory", + "Introduction" + ] + }, + { + "objectID": "theory/intro.html#estimator-implementation", + "href": "theory/intro.html#estimator-implementation", + "title": "Introduction to trace estimation with primate", + "section": "Estimator Implementation", + "text": "Estimator Implementation\nRather than manually averaging samples, trace estimates can be obtained via the hutch function:\n\nfrom primate.trace import hutch\nprint(f\"Trace estimate: {hutch(A, maxiter=50)}\")\n\nTrace estimate: 76.32787158595767\n\n\nThough its estimation is identical to averaging quadratic forms v^T A v of isotropic vectors as above, hutch comes with a few extra features to simplify its usage. In particular, all v \\mapsto v^T A v evaluations are done in parallel, there are a variety of isotropic distributions and random number generators (RNGs) to choose from, and the estimator may be suppleid various convergence criteria to allow for early-stopping; see the hutch documentation page for more details.", + "crumbs": [ + "Theory", + "Introduction" + ] + }, + { + "objectID": "theory/intro.html#stochastic-lanczos-quadrature", + "href": "theory/intro.html#stochastic-lanczos-quadrature", + "title": "Introduction to trace estimation with primate", + "section": "Stochastic Lanczos Quadrature", + "text": "Stochastic Lanczos Quadrature\nWhile there are some real-world applications for estimating \\mathrm{tr}(A), in many setting the trace of a given matrix is not a very informative quantity. On the other hand, as shown in the beginning, there are many situations where we might be interested in estimating the trace of a matrix function f(A). It is natural to consider extending the Hutchinson estimator above with something like:\n \\mathrm{tr}(f(A)) \\approx \\frac{n}{n_v} \\sum\\limits_{i=1}^{n_v} v_i^T f(A) v_i \nOf course, the remaining difficult lies in computing quadratic forms v \\mapsto v^T f(A) v efficiently. The approach taken by Ubaru, Saad, and Seghouane (2017) is to notice that sums of these quantities can transformed into a Riemann-Stieltjes integral:\n v^T f(A) v = v^T U f(\\Lambda) U^T v = \\sum\\limits_{i}^n f(\\lambda_i) \\mu_i^2 = \\int\\limits_{a}^b f(t) \\mu(t)\nwhere scalars \\mu_i \\in \\mathbb{R} constitute a cumulative, piecewise constant function \\mu : \\mathbb{R} \\to \\mathbb{R}_+:\n\n\\mu_i = (U^T v)_i, \\quad \\mu(t) = \\begin{cases}\n0, & \\text{if } t < a = \\lambda_1 \\\\\n\\sum_{j=1}^{i-1} \\mu_j^2, & \\text{if } \\lambda_{i-1} \\leq t < \\lambda_i, i = 2, \\dots, n \\\\\n\\sum_{j=1}^n \\mu_j^2, & \\text{if } b = \\lambda_n \\leq t\n\\end{cases}\n\nAs with any definite integral, one would ideally like to approximate its value via some quadrature rule. An exemplary such approximation is the m-point Gaussian quadrature rule:\n \\int\\limits_{a}^b f(t) \\mu(t) \\approx \\sum\\limits_{k=1}^m \\omega_k f(\\eta_k) \nwith weights \\{\\omega_k\\} and nodes \\{ \\eta_k\\}. On one hand, Gaussian Quadrature (GQ) is just one of many quadrature rules that could be used to approximate v^T A v. On the other hand, GQ is often preferred over other rules due to its exactness on polynomials up to degree 2m - 1.\nOf course, both the weights and nodes of GQ rule are unknown for the integral above. This is in strong contrast to the typical setting, wherein \\omega is assumed uniform or otherwise known ahead-of-time. A surprising and non-trivial fact is that both the weights \\{\\omega_k\\} and nodes \\{ \\eta_k\\} of the m-point GQ above are directly computable the degree m matrix T_m produced by the Lanczos method:\n Q_m^T A Q_m = T_m = Y \\Theta Y^T \\quad \\Leftrightarrow \\quad (\\eta_i, \\omega_i) = (\\theta_i, \\tau_i), \\; \\tau_i = (e_1^T y_i)^2 \nIn other words, the Ritz values \\{\\theta_i\\} of T_m are exactly the nodes of GQ quadrature rule applied to \\mu(t), while the first components of the Ritz vectors \\tau_i (squared) are exactly the weights. For more information about this non-trivial fact, see Golub and Meurant (2009).", + "crumbs": [ + "Theory", + "Introduction" + ] + }, + { + "objectID": "theory/slq_pseudo.html", + "href": "theory/slq_pseudo.html", + "title": "SLQ Trace guide", + "section": "", + "text": "To clarify that that means, here’s an abstract presentation of the generic SLQ procedure:\n\n\n\\begin{algorithm} \\caption{Stochastic Lanczos Quadrature} \\begin{algorithmic} \\Input Symmetric operator ($A \\in \\mathbb{R}^{n \\times n}$) \\Require Number of queries ($n_v$), Degree of quadrature ($k$) \\Function{SLQ}{$A$, $n_v$, $k$} \\State $\\Gamma \\gets 0$ \\For{$j = 1, 2, \\dots, n_v$} \\State $v_i \\sim \\mathcal{D}$ where $\\mathcal{D}$ satisfies $\\mathbb{E}(v v^\\top) = I$ \\State $T^{(j)}(\\alpha, \\beta)$ $\\gets$ $\\mathrm{Lanczos}(A,v_j,k+1)$ \\State $[\\Theta, Y] \\gets \\mathrm{eigh\\_tridiag}(T^{(j)}(\\alpha, \\beta))$ \\State $\\tau_i \\gets \\langle e_1, y_i \\rangle$ \\State < Do something with the node/weight pairs $(\\theta_i, \\tau_i^2)$ > \\EndFor \\EndFunction \\end{algorithmic} \\end{algorithm}" + }, + { + "objectID": "theory/matrix_functions.html", + "href": "theory/matrix_functions.html", + "title": "Matrix function estimation with primate", + "section": "", + "text": "In the introduction, the basics of implicit trace estimation were introduced, wherein it shown both in theory and with code using primate how to estimate the trace of matrix functions:\nf(A) = U f(\\Lambda) U^T, \\quad f: [a,b] \\to \\mathbb{R}\nIn particular, the introduction briefly covered how the Lanczos method is intimately connected to Gaussian Quadrature, and how this connection enables fast randomized trace estimation of f(A).\nIn this post, I’ll cover how to approximate the action v \\mapsto f(A)v for any function f with primate with its matrix_function API, and how to compose this functionality with other trace estimators.", + "crumbs": [ + "Theory", + "Matrix functions" + ] + }, + { + "objectID": "basic/todo.html", + "href": "basic/todo.html", + "title": "primate", + "section": "", + "text": "# a, b = 0.8, 2\n# x = np.random.uniform(low=0, high=10, size=40)\n# eps = np.random.normal(loc=0, scale=1.0, size=40)\n# y = (a * x + b) + eps\n\n# p = figure(width=350, height=200)\n# p.scatter(x, y)\n# show(p)\n\n# X = np.c_[x,y]\n# from scipy.linalg import lstsq\n# from scipy.optimize import least_squares, leastsq, minimize, minimize_scalar\n# def L(beta: tuple):\n# b, a = beta\n# return np.linalg.norm(y - a * x + b)**2\n\n# def L(beta: tuple):\n# b, a = beta\n# return (y - (a * x + b))\n\n\n\n# res = minimize_scalar(L, x0=(0,1))\n# b_opt, a_opt = res.x\n# L([a_opt,b_opt])\n\n# res = least_squares(L, x0=(0,1))\n# b_opt, a_opt = res.x\n# f_opt = lambda x: x * a_opt + b_opt \n# p = figure(width=350, height=200)\n# p.scatter(x, y)\n# p.line(x=[0, 10], y=[b_opt, f_opt(10)])\n# show(p)\n\n# ## Normal equations...\n# XI = np.c_[np.ones(X.shape[0]), X]\n\n# c, b_opt, a_opt = (np.linalg.inv((XI.T @ XI)) @ XI.T) @ y\n# f_opt = lambda x: x * a_opt + b_opt \n# p = figure(width=350, height=200)\n# p.scatter(x, y)\n# p.line(x=[0, 10], y=[b_opt, f_opt(10)])\n# show(p)" + }, + { + "objectID": "theory/matrix_functions.html#establishing-a-baseline", + "href": "theory/matrix_functions.html#establishing-a-baseline", + "title": "Matrix function estimation with primate", + "section": "Establishing a baseline", + "text": "Establishing a baseline\nprimate allows the user to construct a LinearOperator directly from an existing matrix-like object A and callable fun through its matrix_function API.\n\nAs a baseline example, consider the action that adds an \\epsilon amount of mass to the diagonal of A:\nv \\mapsto (A + \\epsilon I)v\nFor any fixed \\epsilon, imitating this matrix action can be done in three different ways:\n\nObtain u = Av and then add u \\gets u + \\epsilon \\cdot v\nForm (A + \\epsilon I) and explicitly carry out the multiplication\nMultiply v by f(A) induced by f(\\lambda) = \\lambda + \\epsilon\n\nLets see what the code to accomplish this using (3) looks like:\n\nfrom primate.random import symmetric\nfrom primate.operator import matrix_function\n\n## Random matrix + input vector\nA = symmetric(150, pd = True)\nv = np.random.uniform(size=A.shape[1])\n\n## Ground truth v |-> f(A)v\nLambda, U = np.linalg.eigh(A)\nv_truth = (U @ np.diag(Lambda + 0.10) @ U.T) @ v\n\n## Lanczos approximation\nM = matrix_function(A, fun = lambda x: x + 0.10)\nv_test = np.ravel(M @ v)\n\nvn, vn_approx = np.linalg.norm(v_truth), np.linalg.norm(v_test)\nprint(f\"f(A)v norm: {vn:.6f}\")\nprint(f\"Approx norm: {vn_approx:.6f}\")\nprint(f\"Max diff: {np.max(np.abs(v_truth - np.ravel(v_test))):.6e}\")\nprint(f\"cosine sim: {np.dot(v_test, v_truth) / (vn * vn_approx):6e}\")\n\nf(A)v norm: 4.392660\nApprox norm: 4.392660\nMax diff: 5.218048e-15\ncosine sim: 1.000000e+00\n\n\nObserve M matches the ground truth v \\mapsto (A + \\epsilon I)v. In this way, one benefit of using matrix_function is that it allows one to approximate f(A) by thinking only about what is happening at the spectral level (as opposed to the matrix level). Of course, this example is a bit non-convincing as there are simpler ways of achieving the same result as shown by (1) and (2), for example:\n\nnp.allclose(A @ v + 0.10 * v, v_truth)\n\nTrue\n\n\nBaseline established.", + "crumbs": [ + "Theory", + "Matrix functions" + ] + }, + { + "objectID": "theory/matrix_functions.html#when-v-mapsto-fav-is-not-known", + "href": "theory/matrix_functions.html#when-v-mapsto-fav-is-not-known", + "title": "Matrix function estimation with primate", + "section": "When v \\mapsto f(A)v is not known", + "text": "When v \\mapsto f(A)v is not known\nOn the other hand, there are plenty of situations where one doesn’t have access to these simpler expressions. For example, consider the map:\nv \\mapsto (A + \\epsilon I)^{-1} v\nSuch expressions pop up in a variety of settings, such as in Tikhonov regularization, in Schatten-norm estimation Ubaru, Saad, and Seghouane (2017), in the Cholesky factorization of PSD matrices, and so on. Unlike the previous setting, we cannot readily access v \\mapsto f(A)v unless we explicitly compute the full spectral decomposition of A or the inverse of A, both of which are expensive to obtain.\n\n## Alternative: v_truth = np.linalg.inv((A + 0.10 * np.eye(A.shape[0]))) @ v\nv_truth = (U @ np.diag(np.reciprocal(Lambda + 0.10)) @ U.T) @ v\n\n## Lanczos approximation\nM = matrix_function(A, fun = lambda x: 1.0 / (x + 0.10))\nv_test = np.ravel(M @ v)\n\nvn, vn_approx = np.linalg.norm(v_truth), np.linalg.norm(v_test)\nprint(f\"f(A)v norm: {vn:.6f}\")\nprint(f\"Approx norm: {vn_approx:.6f}\")\nprint(f\"Max diff: {np.max(np.abs(v_truth - np.ravel(v_test))):.6e}\")\nprint(f\"cosine sim: {np.dot(v_test, v_truth) / (vn * vn_approx):6e}\")\n\nf(A)v norm: 32.982583\nApprox norm: 32.982583\nMax diff: 1.626085e-05\ncosine sim: 1.000000e+00\n\n\nThere is a larger degree of error compared to the base as evidenced by the \\lVert \\cdot \\rVert_\\infty-normed difference between v_truth and v_test, however this is to be expected, as in general approximating the action v \\mapsto A^{-1} v will always be more difficult that v \\mapsto A v, even if A is well-conditioned.", + "crumbs": [ + "Theory", + "Matrix functions" + ] + }, + { + "objectID": "theory/matrix_functions.html#section", + "href": "theory/matrix_functions.html#section", + "title": "Matrix function estimation with primate", + "section": "", + "text": "Unfortunately, by its very definition, if A \\in \\mathbb{R}^{n \\times n} then obtaining f(A) \\in \\mathbb{R}^{n \\times n} explicitly can be very expensive. One well-known approach to sidestep this difficulty is to use the degree-k Lanczos expansion to iteratively approximate the action x \\mapsto f(A)x:\n Q^T A Q = T \\quad \\Leftrightarrow \\quad f(A)x \\approx \\lVert x \\rVert \\cdot Q f(T) e_1 \nDespite its rich theory, most results about the Lanczos method only hold in exact arithmetic; historically, practitioners have been hesitant about using the Lanczos method for eigen-estimation application due to its well-known instability with respect to roundoff and cancellation errors. Such errors typically manifest as loss of orthogonality between Lanczos vectors, which can cause a host of issues related to eigen- estimations, such as muddled convergence and spurious Ritz values. While there exist fixes to these issues, such fixes typically involve difficult to implement convergence heuristics or expensive re-orthogonalization techniques.\nOn the other hand, there is mounting empirical evidence (e.g. Ghorbani, Krishnan, and Xiao (2019)) that suggests using the (plain) Lanczos method in randomized setting is often more than sufficient for several eigenvalue-related estimation tasks. In fact, it’s been recently shown[^2] that the using the Lanczos method to approximate the action x \\mapsto f(A)x has the following guarantee:\n\\|f(\\mathbf{A}) \\mathbf{x}-\\mathbf{y}\\| \\leq 2\\|\\mathbf{x}\\| \\cdot \\min _{\\substack{\\text { polynomial } p \\\\ \\text { with degree }<k}}\\left(\\max _{x \\in\\left[\\lambda_{\\min }(\\mathbf{A}), \\lambda_{\\max }(\\mathbf{A})\\right]}|f(x)-p(x)|\\right)\nIn other words, up to a factor of 2, the error \\|f(\\mathbf{A}) \\mathbf{x}-\\mathbf{y}\\| is bounded by the uniform error of the best polynomial approximation to f with degree < k. For general matrix functions, this implies that finite-precision Lanczos essentially matches strongest known exact arithmetic bounds.", + "crumbs": [ + "Theory", + "Matrix functions" + ] + }, + { + "objectID": "theory/matrix_functions.html#matrix-function-approximation", + "href": "theory/matrix_functions.html#matrix-function-approximation", + "title": "Matrix function estimation with primate", + "section": "Matrix function approximation", + "text": "Matrix function approximation\nIf A \\in \\mathbb{R}^{n \\times n} and n is large, obtaining f(A) \\in \\mathbb{R}^{n \\times n} explicitly can be very expensive. One way to sidestep this difficulty is to approximate v \\mapsto f(A)v using the degree-k Lanczos expansion:\n Q^T A Q = T \\quad \\Leftrightarrow \\quad f(A)v \\approx \\lVert x \\rVert \\cdot Q f(T) e_1 \nIt’s been shown by (Musco, Musco, and Sidford 2018) that this approximation has the following guarantee:\n\\|f(\\mathbf{A}) \\mathbf{x}-\\mathbf{y}\\| \\leq 2\\|\\mathbf{x}\\| \\cdot \\min _{\\substack{\\text { polynomial } p \\\\ \\text { with degree }<k}}\\left(\\max _{x \\in\\left[\\lambda_{\\min }(\\mathbf{A}), \\lambda_{\\max }(\\mathbf{A})\\right]}|f(x)-p(x)|\\right)\nIn other words, up to a factor of 2, the error \\|f(\\mathbf{A}) \\mathbf{x}-\\mathbf{y}\\| is bounded by the uniform error of the best polynomial approximation to f with degree < k. For general matrix functions, this implies that finite-precision Lanczos essentially matches strongest known exact arithmetic bounds.\nThis suggests an idea: can we convert any matrix-like object into a LinearOperator that transparently converts matrix-vector products Av into products with matrix-functions f(A)v?", + "crumbs": [ + "Theory", + "Matrix functions" + ] + }, + { + "objectID": "theory/matrix_functions.html#back-to-trace-estimation", + "href": "theory/matrix_functions.html#back-to-trace-estimation", + "title": "Matrix function estimation with primate", + "section": "Back to trace estimation", + "text": "Back to trace estimation\nThere are several use-cases wherein one might be interested in the output f(A)v itself—see Musco, Musco, and Sidford (2018) and references within for some examples. One use-case is the extension of existing matvec-dependent implicit trace estimators to matrix-function trace estimators.\n\nfrom primate.trace import hutch, xtrace\nM = matrix_function(A, fun=\"log\")\nprint(f\"Logdet exact: {np.sum(np.log(np.linalg.eigvalsh(A))):6e}\")\nprint(f\"Logdet Hutch: {hutch(A, fun='log'):6e}\")\nprint(f\"Logdet XTrace: {xtrace(M):6e}\")\n\nLogdet exact: -1.483218e+02\nLogdet Hutch: -1.477567e+02\nLogdet XTrace: -1.483155e+02\n\n\nAs with the hutch estimators applied to matrix functions, note that the action v \\mapto f(A)v is subject to the approximation errors studied by Musco, Musco, and Sidford (2018), making such extensions limited functions that are well-approximated by Lanczos.", + "crumbs": [ + "Theory", + "Matrix functions" + ] } ] \ No newline at end of file diff --git a/docs/src/_quarto.yml b/docs/src/_quarto.yml index e4aca51..fe6d55d 100644 --- a/docs/src/_quarto.yml +++ b/docs/src/_quarto.yml @@ -42,11 +42,10 @@ website: contents: - text: Introduction href: theory/intro.qmd + - text: Matrix functions + href: theory/matrix_functions.qmd - text: The Lanczos Method href: theory/lanczos.qmd - # - text: SLQ - # href: theory/slq.qmd - - text: - section: Advanced contents: - href: advanced/cpp_integration.qmd diff --git a/docs/src/basic/performance.qmd b/docs/src/basic/performance.qmd index cb1c425..aed563c 100644 --- a/docs/src/basic/performance.qmd +++ b/docs/src/basic/performance.qmd @@ -1,13 +1,15 @@ --- title: "Performance" +execute: + eval: false --- `primate` provides a variety of efficient algorithms for estimating quantities derived from matrix functions. These algorithms are largely implemented in C++ to minimize overhead, and for some computational problems `primate` can out-perform the standard algorithms for estimating spectral quantities by several orders of magnitude. Nonetheless, there are some performance-related caveats to be aware of. -The main + + -To illustrate this, we give an example below using Toeplitz matrices. ```{python} from scipy.linalg import toeplitz from primate.trace import hutch diff --git a/docs/src/basic/todo.qmd b/docs/src/basic/todo.qmd new file mode 100644 index 0000000..3cf3761 --- /dev/null +++ b/docs/src/basic/todo.qmd @@ -0,0 +1,46 @@ + +```{python} +# a, b = 0.8, 2 +# x = np.random.uniform(low=0, high=10, size=40) +# eps = np.random.normal(loc=0, scale=1.0, size=40) +# y = (a * x + b) + eps + +# p = figure(width=350, height=200) +# p.scatter(x, y) +# show(p) + +# X = np.c_[x,y] +# from scipy.linalg import lstsq +# from scipy.optimize import least_squares, leastsq, minimize, minimize_scalar +# def L(beta: tuple): +# b, a = beta +# return np.linalg.norm(y - a * x + b)**2 + +# def L(beta: tuple): +# b, a = beta +# return (y - (a * x + b)) + + + +# res = minimize_scalar(L, x0=(0,1)) +# b_opt, a_opt = res.x +# L([a_opt,b_opt]) + +# res = least_squares(L, x0=(0,1)) +# b_opt, a_opt = res.x +# f_opt = lambda x: x * a_opt + b_opt +# p = figure(width=350, height=200) +# p.scatter(x, y) +# p.line(x=[0, 10], y=[b_opt, f_opt(10)]) +# show(p) + +# ## Normal equations... +# XI = np.c_[np.ones(X.shape[0]), X] + +# c, b_opt, a_opt = (np.linalg.inv((XI.T @ XI)) @ XI.T) @ y +# f_opt = lambda x: x * a_opt + b_opt +# p = figure(width=350, height=200) +# p.scatter(x, y) +# p.line(x=[0, 10], y=[b_opt, f_opt(10)]) +# show(p) +``` \ No newline at end of file diff --git a/docs/src/reference/operator.matrix_function.qmd b/docs/src/reference/operator.matrix_function.qmd index 45a7cde..ce46746 100644 --- a/docs/src/reference/operator.matrix_function.qmd +++ b/docs/src/reference/operator.matrix_function.qmd @@ -6,8 +6,8 @@ Constructs a `LinearOperator` approximating the action of the matrix function `f This function uses the Lanczos quadrature method to approximate the action of matrix function: $$ v \mapsto f(A)v, \quad f(A) = U f(\lambda) U^T $$ -The resulting operator may be used in conjunction with other trace estimation methods, such as -the `hutch` or `xtrace` estimators. +The resulting operator may be used in conjunction with other trace and vector-based estimation methods, +such as `hutch` or `xtrace`. The weights of the quadrature may be computed using either the Golub-Welsch (GW) or Forward Three Term Recurrence algorithms (FTTR) (see the `quad` parameter). For a description @@ -17,29 +17,11 @@ of the other parameters, see the Lanczos function. To compute the weights of the quadrature, the GW computation uses implicit symmetric QR steps with Wilkinson shifts, while the FTTR algorithm uses the explicit expression for orthogonal polynomials. While both require $O(\mathrm{deg}^2)$ time to execute, the former requires $O(\mathrm{deg}^2)$ space but is highly accurate, while the latter uses only $O(1)$ space at the cost of stability. -If `deg` is large and accuracy is not a priority, the `fttr` method is preferred. +If `deg` is large, `fttr` is preferred. ::: +## Returns - -## Parameters: - - A : ndarray, sparray, or LinearOperator - real symmetric operator. - fun : str or Callable, default="identity" - real-valued function defined on the spectrum of `A`. - deg : int, default=20 - Degree of the Krylov expansion. - rtol : float, default=1e-8 - Relative tolerance to consider two Lanczos vectors are numerically orthogonal. - orth: int, default=0 - Number of additional Lanczos vectors to orthogonalize against when building the Krylov basis. - kwargs : dict, optional - additional key-values to parameterize the chosen function 'fun'. - - - -## Returns: - - operator : LinearOperator - Operator approximating the action of `fun` on the spectrum of `A` \ No newline at end of file +| Type | Description | +|------------------------------------|-----------------------------------------------------------------------------| +| scipy.sparse.linalg.LinearOperator | a `LinearOperator` approximating the action of `fun` on the spectrum of `A` | \ No newline at end of file diff --git a/docs/src/reference/random.normal.qmd b/docs/src/reference/random.normal.qmd index 669caa0..b95570c 100644 --- a/docs/src/reference/random.normal.qmd +++ b/docs/src/reference/random.normal.qmd @@ -15,5 +15,5 @@ Generates random vectors from the standard normal distribution. ## Returns: -np.narray +: Randomly generated matrix of shape `size` with entries in { -1, 1 }. \ No newline at end of file diff --git a/docs/src/references.bib b/docs/src/references.bib index e69de29..90ed16f 100644 --- a/docs/src/references.bib +++ b/docs/src/references.bib @@ -0,0 +1,34 @@ + +@inproceedings{ghorbani2019investigation, + title={An investigation into neural net optimization via hessian eigenvalue density}, + author={Ghorbani, Behrooz and Krishnan, Shankar and Xiao, Ying}, + booktitle={International Conference on Machine Learning}, + pages={2232--2241}, + year={2019}, + organization={PMLR} +} +@article{ubaru2017fast, + title={Fast estimation of approximate matrix ranks using spectral densities}, + author={Ubaru, Shashanka and Saad, Yousef and Seghouane, Abd-Krim}, + journal={Neural computation}, + volume={29}, + number={5}, + pages={1317--1351}, + year={2017}, + publisher={MIT Press} +} +@inproceedings{musco2018stability, + title={Stability of the Lanczos method for matrix function approximation}, + author={Musco, Cameron and Musco, Christopher and Sidford, Aaron}, + booktitle={Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms}, + pages={1605--1624}, + year={2018}, + organization={SIAM} +} +@book{golub2009matrices, + title={Matrices, moments and quadrature with applications}, + author={Golub, Gene H and Meurant, G{\'e}rard}, + volume={30}, + year={2009}, + publisher={Princeton University Press} +} \ No newline at end of file diff --git a/docs/src/theory/intro.qmd b/docs/src/theory/intro.qmd index a3933c6..189ce41 100644 --- a/docs/src/theory/intro.qmd +++ b/docs/src/theory/intro.qmd @@ -1,11 +1,12 @@ --- title: "Introduction to trace estimation with `primate`" +bibliography: ../references.bib --- `primate` contains a variety of methods for estimating quantities from _matrix functions_. One popular such quantity is the [trace](https://en.wikipedia.org/wiki/Trace_(linear_algebra)) of $f(A)$, for some generic $f: [a,b] \to \mathbb{R}$ defined on the spectrum of $A$: $$ \mathrm{tr}(f(A)) \triangleq \mathrm{tr}(U f(\Lambda) U^{\intercal}), \quad \quad f : [a,b] \to \mathbb{R}$$ -Many quantities of common interest can be expressed in as traces of matrix functions with the right spectral function specialization. A few such specialization include: +Many quantities of common interest can be expressed in as traces of matrix functions with the right choice of $f$. A few such function specializations include: $$ \begin{align*} @@ -18,7 +19,7 @@ f &= \mathrm{sign} \quad &\Longleftrightarrow& \quad &\mathrm{rank}(A) \\ \end{align*} $$ -In this introduction, the basics of randomized trace estimation are introduced, with a focus on matrix-free algorithms how the readily extend to estimating the traces of _matrix functions_. +In this introduction, the basics of randomized trace estimation are introduced, with a focus on how to use matrix-free algorithms estimate traces of _matrix functions_. @@ -33,7 +34,7 @@ import numpy as np np.random.seed(1234) ``` -## Basics: Trace estimation +## Basics: Trace computation @@ -78,15 +79,15 @@ print(f"Eigen: {np.sum(np.linalg.eigvalsh(A)):.8f}") print(f"Matvec: {sum([eye(i) @ A @ eye(i) for i in range(150)]):.8f}") ``` -While (1) trivially yields $\mathrm{tr}(A)$ in (optimal) $O(n)$ time, there exist situations where the diagonal entries of `A` are not available---particularly in the large-scale regime. In contrast, though both (2) and (3) are less efficient, they are nonetheless viable altenrnatives to obtain $\mathrm{tr}(A)$. +While (1) trivially yields $\mathrm{tr}(A)$ in (optimal) $O(n)$ time, there exist situations where the diagonal entries of `A` are not available---particularly in the large-scale regime. In contrast, though both (2) and (3) are less efficient, they are nonetheless viable alternatives to obtain $\mathrm{tr}(A)$. -## The implicit trace estimation problem +## _Implicit_ trace estimation The _implicit_ trace estimation problem can be stated as follows: > Given access to a square matrix $A \in \mathbb{R}^{n \times n}$ via its matrix-vector product operator $x \mapsto Ax$, estimate its trace $\mathrm{tr}(A) = \sum\limits_{i=1}^n A_{ii}$ -Observe that although method (3) fits squarely in this category, its pretty inefficient from a computational standpoint---after all, most of the entries of $e_i$ are zero, thus most inner product computations do not contribute to the trace estimate at all. +Note we no longer assume we have direct access to the diagonal here, leaving us with the `Eigen` or `Matvec` methods. The `Eigen` method is expensive, and though the `Matvec` method solves the problem exactly, its pretty inefficient from a computational standpoint---most of the entries of $e_i$ are zero, thus most inner product computations do not contribute to the trace estimate at all. One idea, attributed to [A. Girard](https://link.springer.com/article/10.1007/BF01395775), is to replace $e_i$ with random _zero-mean_ vectors, e.g. random sign vectors $v \in \{-1, +1\}^{n}$ or random normal vectors $v \sim \mathcal{N}(\mu=0, \sigma=1)$: @@ -103,7 +104,7 @@ def isotropic(n): estimates = np.array([v @ A @ v for v in isotropic(500)]) ``` -Plotting the estimator formed by averaging the first $k$ samples along with it's error, we get the following figures for $k = 50$ and $k = 500$, respectively: +Plotting both the estimator formed by averaging the first $k$ samples along with its absolute error, we get the following figures for $k = 50$ and $k = 500$, respectively: ```{python} #| echo: false #| output: true @@ -152,8 +153,18 @@ for fig in [p,p2]: show(row(column(p,q), column(p2,q2))) ``` -Sure enough, the estimator quickly gets within an error rate of about 1-5% away from the actual trace, getting generally closer as more samples are collected. On the other hand, improvement is not gaurenteed, and securing additional precision beyond < 1% becomes increasingly difficult due to the variance of the sample estimates. +Sure enough, the estimator quickly gets within an error rate of about 1-5% away from the actual trace, getting generally closer as more samples are collected. On the other hand, improvement is not gaurenteed, and securing additional precision much less than 1% becomes increasingly difficult due to the randomness and variance of the sample estimates. +## Estimator Implementation + +Rather than manually averaging samples, trace estimates can be obtained via the `hutch` function: + +```{python} +from primate.trace import hutch +print(f"Trace estimate: {hutch(A, maxiter=50)}") +``` + +Though its estimation is identical to averaging quadratic forms $v^T A v$ of isotropic vectors as above, `hutch` comes with a few extra features to simplify its usage. In particular, all $v \mapsto v^T A v$ evaluations are done in parallel, there are a variety of isotropic distributions and random number generators (RNGs) to choose from, and the estimator may be suppleid various convergence criteria to allow for early-stopping; see the [hutch documentation page](../reference/trace.hutch.qmd) for more details. -# Matrix functions - -While there are some real-world applications for estimating $\mathrm{tr}(A)$, in many setting the trace of a given matrix is not a very informative quantity. On the other hand, there are innumerable applications that rely on spectral information. +## Stochastic Lanczos Quadrature +While there are some real-world applications for estimating $\mathrm{tr}(A)$, in many setting the trace of a given matrix is not a very informative quantity. On the other hand, as shown in the beginning, there are many situations where we might be interested in estimating the trace of a _matrix function_ $f(A)$. It is natural to consider extending the Hutchinson estimator above with something like: -```{python} -# from primate.operator import matrix_function +$$ \mathrm{tr}(f(A)) \approx \frac{n}{n_v} \sum\limits_{i=1}^{n_v} v_i^T f(A) v_i $$ -# a, b = 0.8, 2 -# x = np.random.uniform(low=0, high=10, size=40) -# eps = np.random.normal(loc=0, scale=1.0, size=40) -# y = (a * x + b) + eps +Of course, the remaining difficult lies in computing quadratic forms $v \mapsto v^T f(A) v$ efficiently. The approach taken by @ubaru2017fast is to notice that sums of these quantities can transformed into a _Riemann-Stieltjes integral_: -# p = figure(width=350, height=200) -# p.scatter(x, y) -# show(p) +$$ v^T f(A) v = v^T U f(\Lambda) U^T v = \sum\limits_{i}^n f(\lambda_i) \mu_i^2 = \int\limits_{a}^b f(t) \mu(t)$$ -# X = np.c_[x,y] -# from scipy.linalg import lstsq -# from scipy.optimize import least_squares, leastsq, minimize, minimize_scalar -# def L(beta: tuple): -# b, a = beta -# return np.linalg.norm(y - a * x + b)**2 +where scalars $\mu_i \in \mathbb{R}$ constitute a cumulative, piecewise constant function $\mu : \mathbb{R} \to \mathbb{R}_+$: -# def L(beta: tuple): -# b, a = beta -# return (y - (a * x + b)) +$$ +\mu_i = (U^T v)_i, \quad \mu(t) = \begin{cases} +0, & \text{if } t < a = \lambda_1 \\ +\sum_{j=1}^{i-1} \mu_j^2, & \text{if } \lambda_{i-1} \leq t < \lambda_i, i = 2, \dots, n \\ +\sum_{j=1}^n \mu_j^2, & \text{if } b = \lambda_n \leq t +\end{cases} +$$ +As with any definite integral, one would ideally like to approximate its value via some [quadrature rule](https://en.wikipedia.org/wiki/Numerical_integration#Methods_for_one-dimensional_integrals). An exemplary such approximation is the $m$-point Gaussian quadrature rule: +$$ \int\limits_{a}^b f(t) \mu(t) \approx \sum\limits_{k=1}^m \omega_k f(\eta_k) $$ -# res = minimize_scalar(L, x0=(0,1)) -# b_opt, a_opt = res.x -# L([a_opt,b_opt]) +with weights $\{\omega_k\}$ and nodes $\{ \eta_k\}$. On one hand, Gaussian Quadrature (GQ) is just one of many quadrature rules that could be used to approximate $v^T A v$. On the other hand, GQ is often preferred over other rules due to its _exactness_ on polynomials up to degree $2m - 1$. -# res = least_squares(L, x0=(0,1)) -# b_opt, a_opt = res.x -# f_opt = lambda x: x * a_opt + b_opt -# p = figure(width=350, height=200) -# p.scatter(x, y) -# p.line(x=[0, 10], y=[b_opt, f_opt(10)]) -# show(p) +Of course, both the weights and nodes of GQ rule are unknown for the integral above. This is in strong contrast to the typical setting, wherein $\omega$ is assumed uniform or otherwise known ahead-of-time. A surprising and non-trivial fact is that both the weights $\{\omega_k\}$ and nodes $\{ \eta_k\}$ of the $m$-point GQ above are directly computable the degree $m$ matrix $T_m$ produced by the Lanczos method: -# ## Normal equations... -# XI = np.c_[np.ones(X.shape[0]), X] +$$ Q_m^T A Q_m = T_m = Y \Theta Y^T \quad \Leftrightarrow \quad (\eta_i, \omega_i) = (\theta_i, \tau_i), \; \tau_i = (e_1^T y_i)^2 $$ -# c, b_opt, a_opt = (np.linalg.inv((XI.T @ XI)) @ XI.T) @ y -# f_opt = lambda x: x * a_opt + b_opt -# p = figure(width=350, height=200) -# p.scatter(x, y) -# p.line(x=[0, 10], y=[b_opt, f_opt(10)]) -# show(p) +In other words, the Ritz values $\{\theta_i\}$ of $T_m$ are exactly the nodes of GQ quadrature rule applied to $\mu(t)$, while the first components of the Ritz vectors $\tau_i$ (squared) are exactly the weights. For more information about this non-trivial fact, see @golub2009matrices. -``` +## Example: Logdet computation +Putting all of the theory above together, a simple way to estimate the trace of a matrix function is to average Lanczos quadrature estimates: -## Ex: Logdet computation +$$\mathrm{tr}(f(A)) \approx \frac{n}{n_v} \sum\limits_{i=1}^{n_v} e_1^T f(T_m) e_1 = \frac{n}{n_v} \sum\limits_{j=1}^{n_v} \left ( \sum\limits_{i=1}^m \tau_i^{(j)} f(\theta_i)^{(j)} \right )$$ -The basic way to approximate the trace of a matrix function is to simply set `fun` to the name the spectral function. For example, to approximate the log-determinant: +To enable this kind of estimator in `hutch`, simply set the `fun` argument to either the name of a built-in function or an arbitrary Callable. For example, to approximate the log-determinant: ```{python} -from primate.trace import hutch - ## Log-determinant logdet_test = hutch(A, fun="log", maxiter=25) logdet_true = np.sum(np.log(np.linalg.eigvalsh(A))) @@ -239,22 +229,26 @@ print(f"Logdet (exact): {logdet_true}") print(f"Logdet (approx): {logdet_test}") ``` -Even using $n / 6$ matvecs, we get a decent approximation. But how good is this estimate? - -To get a slightly better idea, you can set `verbose=True`: +Here we see that using $n / 6$ evaluatons of $e_1^T f(T_m) e_1$, we get a decent approximation of `logdet(A)`. To get a better idea of the quality of this estimate, you can set `verbose=True`: ```{python} -est = hutch(A, fun="log", maxiter=25, verbose=True) +est = hutch(A, fun="log", maxiter=25, seed=5, verbose=True) ``` -The first line of the statement contains fixed about the estimator, including the type of estimator (`Girard-Hutchinson`), the function applied to the spectrum (`log`), the degree of the Krylov expansion (`20`), and the quadrature method used (`fttr`). +The first line of the statement contains fixed about the estimator, including the type of estimator (`Girard-Hutchinson`), the function applied to the spectrum (`log`), the degree of the Lanczos quadrature approximation (`20`), and the quadrature method used (`fttr`). -The second line prints the runtime information about the samples, such as the final trace estimate (``) -2. Its [margin of error](https://en.wikipedia.org/wiki/Margin_of_error) -2. The [coefficient of variation](https://en.wikipedia.org/wiki/Coefficient_of_variation) (aka the relative std. deviation) -3. The number of samples used + their distribution prefix ('R' for rademacher) +The second line prints the runtime information about the samples, such as the final trace estimate, its [margin of error](https://en.wikipedia.org/wiki/Margin_of_error) and [coefficient of variation](https://en.wikipedia.org/wiki/Coefficient_of_variation) (aka the relative std. deviation), and the number of samples used + their distribution prefix ('R' for rademacher). + +Another way to get an idea of how well the estimator is converging is to pass the `plot=True` option. Here, we pass the same `seed` for reproducibility with the above call. ```{python} -est = hutch(A, fun="log", maxiter=100, plot=True) +est = hutch(A, fun="log", maxiter=100, seed=5, plot=True) ``` +Similar information as reported by the `verbose=True` flag is shown cumulatively applied to every sample. The left plot shows the sample point estimates alongside yellow bands denoting the margin of error associated with the 95% confidence interval (CI) at every sample index. The right plot show convergence information, e.g. the running coefficient of variation is shown on the top-right and an upper bound on the absolute error is derived from the CI is shown on the bottom right. + +While the margin of error is typically a valid, conservative estimate of estimators error, in some situations these bounds may not be valid as they are based solely on the [CLT](https://en.wikipedia.org/wiki/Central_limit_theorem) implications for normally distributed random variables. In particular, there is an inherent amount of error associated with the Lanczos-based quadrature approximations that is not taken into account in the error bound, which may invalidate the corresponding CI's. + + + +[^1]: Technically, we could solve a system of equations via either Gaussian elimination or, when $A$ is rank-deficient, via least squares: $(A + \epsilon I)^{-1}v \simeq u \Leftrightarrow \mathrm{argmin}_{u \in \mathbb{R}^n} \lVert v - (A + \epsilon I)u \rVert^2$. Then again, this could be just as expensive---if not more so---than just executing the Lanczos iteration! \ No newline at end of file diff --git a/docs/src/theory/lanczos.qmd b/docs/src/theory/lanczos.qmd index 18438ee..9c4ec3e 100644 --- a/docs/src/theory/lanczos.qmd +++ b/docs/src/theory/lanczos.qmd @@ -108,7 +108,7 @@ $$ -## Complexity analysis +## Beating the complexity bounds Elegant and as theoretically founded as the Lanczos method may be, is it efficient in practice? @@ -128,6 +128,11 @@ To see why this is true, note that a symmetric tridiagonal matrix is fully chara Now, if we need to store all of $Y$ or $Q$ explicitly, we clearly need $O(n^2)$ space to do so. However, if we only need the eigen-_values_ $\Lambda(A)$ (and not their eigen-vectors $U$), then we may execute the recurrence keeping at most three vectors $\{q_{j-1}, q_{j}, q_{j+1}\}$ in memory at any given time. Since each of these is $O(n)$ is size, the claim of $O(n)$ space is justified! + + + + +## Establishing a baseline + +`primate` allows the user to construct a `LinearOperator` directly from an existing matrix-like object `A` and callable `fun` through its `matrix_function` API. + + + +As a baseline example, consider the action that adds an $\epsilon$ amount of mass to the diagonal of $A$: + +$$v \mapsto (A + \epsilon I)v$$ + +For any fixed $\epsilon$, imitating this matrix action can be done in three different ways: + + 1. Obtain $u = Av$ and then add $u \gets u + \epsilon \cdot v$ + 2. Form $(A + \epsilon I)$ and explicitly carry out the multiplication + 3. Multiply $v$ by $f(A)$ induced by $f(\lambda) = \lambda + \epsilon$ + +Lets see what the code to accomplish this using (3) looks like: +```{python} +from primate.random import symmetric +from primate.operator import matrix_function + +## Random matrix + input vector +A = symmetric(150, pd = True) +v = np.random.uniform(size=A.shape[1]) + +## Ground truth v |-> f(A)v +Lambda, U = np.linalg.eigh(A) +v_truth = (U @ np.diag(Lambda + 0.10) @ U.T) @ v + +## Lanczos approximation +M = matrix_function(A, fun = lambda x: x + 0.10) +v_test = np.ravel(M @ v) + +vn, vn_approx = np.linalg.norm(v_truth), np.linalg.norm(v_test) +print(f"f(A)v norm: {vn:.6f}") +print(f"Approx norm: {vn_approx:.6f}") +print(f"Max diff: {np.max(np.abs(v_truth - np.ravel(v_test))):.6e}") +print(f"cosine sim: {np.dot(v_test, v_truth) / (vn * vn_approx):6e}") +``` + +Observe $M$ matches the ground truth $v \mapsto (A + \epsilon I)v$. In this way, one benefit of using `matrix_function` is that it allows one to approximate $f(A)$ by thinking only about what is happening at the spectral level (as opposed to the matrix level). Of course, this example is a bit non-convincing as there are simpler ways of achieving the same result as shown by (1) and (2), for example: + +```{python} +np.allclose(A @ v + 0.10 * v, v_truth) +``` + +Baseline established. + +## When $v \mapsto f(A)v$ is not known + +On the other hand, there are plenty of situations where one _doesn't have access to these simpler expressions_. For example, consider the map: + +$$v \mapsto (A + \epsilon I)^{-1} v$$ + +Such expressions pop up in a variety of settings, such as in [Tikhonov regularization](https://en.wikipedia.org/wiki/Ridge_regression), in Schatten-norm estimation @ubaru2017fast, in the [Cholesky factorization of PSD matrices](https://scikit-sparse.readthedocs.io/en/latest/cholmod.html#sksparse.cholmod.Factor.cholesky_AAt_inplace), and so on. Unlike the previous setting, we cannot readily access $v \mapsto f(A)v$ unless we explicitly compute the full spectral decomposition of $A$ or the inverse of $A$, both of which are expensive to obtain. + +```{python} +## Alternative: v_truth = np.linalg.inv((A + 0.10 * np.eye(A.shape[0]))) @ v +v_truth = (U @ np.diag(np.reciprocal(Lambda + 0.10)) @ U.T) @ v + +## Lanczos approximation +M = matrix_function(A, fun = lambda x: 1.0 / (x + 0.10)) +v_test = np.ravel(M @ v) + +vn, vn_approx = np.linalg.norm(v_truth), np.linalg.norm(v_test) +print(f"f(A)v norm: {vn:.6f}") +print(f"Approx norm: {vn_approx:.6f}") +print(f"Max diff: {np.max(np.abs(v_truth - np.ravel(v_test))):.6e}") +print(f"cosine sim: {np.dot(v_test, v_truth) / (vn * vn_approx):6e}") +``` + +There is a larger degree of error compared to the base as evidenced by the $\lVert \cdot \rVert_\infty$-normed difference between `v_truth` and `v_test`, however this is to be expected, as in general approximating the action $v \mapsto A^{-1} v$ will always be more difficult that $v \mapsto A v$, even if $A$ is well-conditioned. + +## Back to trace estimation + +There are several use-cases wherein one might be interested in the output $f(A)v$ itself---see @musco2018stability and references within for some examples. One use-case is the extension of existing `matvec`-dependent implicit trace estimators to matrix-function trace estimators. + +```{python} +from primate.trace import hutch, xtrace +M = matrix_function(A, fun="log") +print(f"Logdet exact: {np.sum(np.log(np.linalg.eigvalsh(A))):6e}") +print(f"Logdet Hutch: {hutch(A, fun='log'):6e}") +print(f"Logdet XTrace: {xtrace(M):6e}") +``` + +As with the hutch estimators applied to matrix functions, note that the action $v \mapto f(A)v$ is subject to the approximation errors studied by @musco2018stability, making such extensions limited functions that are well-approximated by Lanczos. + diff --git a/docs/src/theory/slq.qmd b/docs/src/theory/slq_pseudo.qmd similarity index 100% rename from docs/src/theory/slq.qmd rename to docs/src/theory/slq_pseudo.qmd diff --git a/docs/theory/intro.html b/docs/theory/intro.html index e2c1159..1c2bd6d 100644 --- a/docs/theory/intro.html +++ b/docs/theory/intro.html @@ -54,7 +54,27 @@ @media screen { pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; } } - +/* CSS for citations */ +div.csl-bib-body { } +div.csl-entry { + clear: both; + margin-bottom: 0em; +} +.hanging-indent div.csl-entry { + margin-left:2em; + text-indent:-2em; +} +div.csl-left-margin { + min-width:2em; + float:left; +} +div.csl-right-inline { + margin-left:2em; + padding-left:1em; +} +div.csl-indent { + margin-left: 2em; +} @@ -234,6 +254,12 @@ The Lanczos Method + + @@ -334,12 +360,11 @@

On this page

@@ -366,7 +391,7 @@

Introduction to trace estimation with primate

primate contains a variety of methods for estimating quantities from matrix functions. One popular such quantity is the trace of f(A), for some generic f: [a,b] \to \mathbb{R} defined on the spectrum of A: \mathrm{tr}(f(A)) \triangleq \mathrm{tr}(U f(\Lambda) U^{\intercal}), \quad \quad f : [a,b] \to \mathbb{R}

-

Many quantities of common interest can be expressed in as traces of matrix functions with the right spectral function specialization. A few such specialization include:

+

Many quantities of common interest can be expressed in as traces of matrix functions with the right choice of f. A few such function specializations include:

\begin{align*} f &= \mathrm{id} \quad &\Longleftrightarrow& \quad &\mathrm{tr}(A) \\ @@ -377,9 +402,9 @@

Introduction to trace estimation with primate

-

In this introduction, the basics of randomized trace estimation are introduced, with a focus on matrix-free algorithms how the readily extend to estimating the traces of matrix functions.

+

In this introduction, the basics of randomized trace estimation are introduced, with a focus on how to use matrix-free algorithms estimate traces of matrix functions.

-
+
-
-

Basics: Trace estimation

+
+

Basics: Trace computation

Suppose we wanted to compute the trace of some matrix A. By the very definition of the trace, we have a variety of means to do it: \mathrm{tr}(A) \triangleq \sum\limits_{i=1}^n A_{ii} = \sum\limits_{i=1}^n \lambda_i = \sum\limits_{i=1}^n e_i^T A e_i

where, in the right-most sum, we’re using e_i to represent the ith identity vector: A_{ii} = e_i^T A e_i, \quad e_i = [0, \dots, 0, \underbrace{1}_{i}, 0, \dots, 0]

Let’s see some code. First, we start with a simple (random) positive-definite matrix A \in \mathbb{R}^{n \times n}.

-
+
from primate.random import symmetric
 A = symmetric(150, pd = True)

By default, symmetric normalizes A such that the eigenvalues are uniformly distributed in the interval [0, 1]. For reference, the spectrum of A looks as follows:

-
+
-
+
-

Sure enough, the estimator quickly gets within an error rate of about 1-5% away from the actual trace, getting generally closer as more samples are collected. On the other hand, improvement is not gaurenteed, and securing additional precision beyond < 1% becomes increasingly difficult due to the variance of the sample estimates.

+

Sure enough, the estimator quickly gets within an error rate of about 1-5% away from the actual trace, getting generally closer as more samples are collected. On the other hand, improvement is not gaurenteed, and securing additional precision much less than 1% becomes increasingly difficult due to the randomness and variance of the sample estimates.

+
+
+

Estimator Implementation

+

Rather than manually averaging samples, trace estimates can be obtained via the hutch function:

+
+
from primate.trace import hutch
+print(f"Trace estimate: {hutch(A, maxiter=50)}")
+
+
Trace estimate: 76.32787158595767
+
+
+

Though its estimation is identical to averaging quadratic forms v^T A v of isotropic vectors as above, hutch comes with a few extra features to simplify its usage. In particular, all v \mapsto v^T A v evaluations are done in parallel, there are a variety of isotropic distributions and random number generators (RNGs) to choose from, and the estimator may be suppleid various convergence criteria to allow for early-stopping; see the hutch documentation page for more details.

-
-

Matrix functions

-

While there are some real-world applications for estimating \mathrm{tr}(A), in many setting the trace of a given matrix is not a very informative quantity. On the other hand, there are innumerable applications that rely on spectral information.

-
-
# from primate.operator import matrix_function
-
-# a, b = 0.8, 2
-# x = np.random.uniform(low=0, high=10, size=40)
-# eps = np.random.normal(loc=0, scale=1.0, size=40)
-# y = (a * x + b) + eps
-
-# p = figure(width=350, height=200)
-# p.scatter(x, y)
-# show(p)
-
-# X = np.c_[x,y]
-# from scipy.linalg import lstsq
-# from scipy.optimize import least_squares, leastsq, minimize, minimize_scalar
-# def L(beta: tuple):
-#   b, a = beta
-#   return np.linalg.norm(y - a * x + b)**2
-
-# def L(beta: tuple):
-#   b, a = beta
-#   return (y - (a * x + b))
-
-
-
-# res = minimize_scalar(L, x0=(0,1))
-# b_opt, a_opt = res.x
-# L([a_opt,b_opt])
-
-# res = least_squares(L, x0=(0,1))
-# b_opt, a_opt = res.x
-# f_opt = lambda x: x * a_opt + b_opt 
-# p = figure(width=350, height=200)
-# p.scatter(x, y)
-# p.line(x=[0, 10], y=[b_opt, f_opt(10)])
-# show(p)
-
-# ## Normal equations...
-# XI = np.c_[np.ones(X.shape[0]), X]
-
-# c, b_opt, a_opt = (np.linalg.inv((XI.T @ XI)) @ XI.T) @ y
-# f_opt = lambda x: x * a_opt + b_opt 
-# p = figure(width=350, height=200)
-# p.scatter(x, y)
-# p.line(x=[0, 10], y=[b_opt, f_opt(10)])
-# show(p)
-
-
-

Ex: Logdet computation

-

The basic way to approximate the trace of a matrix function is to simply set fun to the name the spectral function. For example, to approximate the log-determinant:

-
-
from primate.trace import hutch
-
-## Log-determinant
-logdet_test = hutch(A, fun="log", maxiter=25)
-logdet_true = np.sum(np.log(np.linalg.eigvalsh(A)))
-
-print(f"Logdet (exact):  {logdet_true}")
-print(f"Logdet (approx): {logdet_test}")
+
+

Stochastic Lanczos Quadrature

+

While there are some real-world applications for estimating \mathrm{tr}(A), in many setting the trace of a given matrix is not a very informative quantity. On the other hand, as shown in the beginning, there are many situations where we might be interested in estimating the trace of a matrix function f(A). It is natural to consider extending the Hutchinson estimator above with something like:

+

\mathrm{tr}(f(A)) \approx \frac{n}{n_v} \sum\limits_{i=1}^{n_v} v_i^T f(A) v_i

+

Of course, the remaining difficult lies in computing quadratic forms v \mapsto v^T f(A) v efficiently. The approach taken by Ubaru, Saad, and Seghouane (2017) is to notice that sums of these quantities can transformed into a Riemann-Stieltjes integral:

+

v^T f(A) v = v^T U f(\Lambda) U^T v = \sum\limits_{i}^n f(\lambda_i) \mu_i^2 = \int\limits_{a}^b f(t) \mu(t)

+

where scalars \mu_i \in \mathbb{R} constitute a cumulative, piecewise constant function \mu : \mathbb{R} \to \mathbb{R}_+:

+

+\mu_i = (U^T v)_i, \quad \mu(t) = \begin{cases} +0, & \text{if } t < a = \lambda_1 \\ +\sum_{j=1}^{i-1} \mu_j^2, & \text{if } \lambda_{i-1} \leq t < \lambda_i, i = 2, \dots, n \\ +\sum_{j=1}^n \mu_j^2, & \text{if } b = \lambda_n \leq t +\end{cases} +

+

As with any definite integral, one would ideally like to approximate its value via some quadrature rule. An exemplary such approximation is the m-point Gaussian quadrature rule:

+

\int\limits_{a}^b f(t) \mu(t) \approx \sum\limits_{k=1}^m \omega_k f(\eta_k)

+

with weights \{\omega_k\} and nodes \{ \eta_k\}. On one hand, Gaussian Quadrature (GQ) is just one of many quadrature rules that could be used to approximate v^T A v. On the other hand, GQ is often preferred over other rules due to its exactness on polynomials up to degree 2m - 1.

+

Of course, both the weights and nodes of GQ rule are unknown for the integral above. This is in strong contrast to the typical setting, wherein \omega is assumed uniform or otherwise known ahead-of-time. A surprising and non-trivial fact is that both the weights \{\omega_k\} and nodes \{ \eta_k\} of the m-point GQ above are directly computable the degree m matrix T_m produced by the Lanczos method:

+

Q_m^T A Q_m = T_m = Y \Theta Y^T \quad \Leftrightarrow \quad (\eta_i, \omega_i) = (\theta_i, \tau_i), \; \tau_i = (e_1^T y_i)^2

+

In other words, the Ritz values \{\theta_i\} of T_m are exactly the nodes of GQ quadrature rule applied to \mu(t), while the first components of the Ritz vectors \tau_i (squared) are exactly the weights. For more information about this non-trivial fact, see Golub and Meurant (2009).

+
+
+

Example: Logdet computation

+

Putting all of the theory above together, a simple way to estimate the trace of a matrix function is to average Lanczos quadrature estimates:

+

\mathrm{tr}(f(A)) \approx \frac{n}{n_v} \sum\limits_{i=1}^{n_v} e_1^T f(T_m) e_1 = \frac{n}{n_v} \sum\limits_{j=1}^{n_v} \left ( \sum\limits_{i=1}^m \tau_i^{(j)} f(\theta_i)^{(j)} \right )

+

To enable this kind of estimator in hutch, simply set the fun argument to either the name of a built-in function or an arbitrary Callable. For example, to approximate the log-determinant:

+
+
## Log-determinant
+logdet_test = hutch(A, fun="log", maxiter=25)
+logdet_true = np.sum(np.log(np.linalg.eigvalsh(A)))
+
+print(f"Logdet (exact):  {logdet_true}")
+print(f"Logdet (approx): {logdet_test}")
Logdet (exact):  -148.3218440234385
-Logdet (approx): -150.71530622226584
+Logdet (approx): -148.71526016046562
-

Even using n / 6 matvecs, we get a decent approximation. But how good is this estimate?

-

To get a slightly better idea, you can set verbose=True:

-
-
est = hutch(A, fun="log", maxiter=25, verbose=True)
+

Here we see that using n / 6 evaluatons of e_1^T f(T_m) e_1, we get a decent approximation of logdet(A). To get a better idea of the quality of this estimate, you can set verbose=True:

+
+
est = hutch(A, fun="log", maxiter=25, seed=5, verbose=True)
Girard-Hutchinson estimator (fun=log, deg=20, quad=fttr)
-Est: -145.469 +/- 7.37 (95% CI), CV: 3%, Evals: 25 [R]
+Est: -143.207 +/- 5.40 (95% CI), CV: 2%, Evals: 25 [R] (seed: 5)
-

The first line of the statement contains fixed about the estimator, including the type of estimator (Girard-Hutchinson), the function applied to the spectrum (log), the degree of the Krylov expansion (20), and the quadrature method used (fttr).

-

The second line prints the runtime information about the samples, such as the final trace estimate (``) 2. Its margin of error 2. The coefficient of variation (aka the relative std. deviation) 3. The number of samples used + their distribution prefix (‘R’ for rademacher)

-
-
est = hutch(A, fun="log", maxiter=100, plot=True)
+

The first line of the statement contains fixed about the estimator, including the type of estimator (Girard-Hutchinson), the function applied to the spectrum (log), the degree of the Lanczos quadrature approximation (20), and the quadrature method used (fttr).

+

The second line prints the runtime information about the samples, such as the final trace estimate, its margin of error and coefficient of variation (aka the relative std. deviation), and the number of samples used + their distribution prefix (‘R’ for rademacher).

+

Another way to get an idea of how well the estimator is converging is to pass the plot=True option. Here, we pass the same seed for reproducibility with the above call.

+
+
est = hutch(A, fun="log", maxiter=100, seed=5, plot=True)
-
+
- - -M.quad(np.ones(M.shape[0])) -``` --> - - - - -
- +

References

+
+Golub, Gene H, and Gérard Meurant. 2009. Matrices, Moments and Quadrature with Applications. Vol. 30. Princeton University Press. +
+
+Ubaru, Shashanka, Yousef Saad, and Abd-Krim Seghouane. 2017. “Fast Estimation of Approximate Matrix Ranks Using Spectral Densities.” Neural Computation 29 (5): 1317–51. +
+
- + @@ -197,6 +197,12 @@ The Lanczos Method
+ + @@ -300,7 +306,7 @@

On this page

  • Lanczos on a bumper sticker
  • The “iteration” part
  • Wait, isn’t T arbitrary?
  • -
  • Complexity analysis
  • +
  • Beating the complexity bounds
  • @@ -392,8 +398,8 @@

    Wait, isn’t …this is actually a QR factorization, which is essentially unique! Indeed, the Implicit Q Theorem asserts that if an upper Hessenburg matrix T \in \mathbb{R}^{n \times n} has only positive elements on its first subdiagonal and there exists an orthogonal matrix Q such that Q^T A Q = T, then Q and T are uniquely determined3 by (A, q_1).

    -
    -

    Complexity analysis

    +
    +

    Beating the complexity bounds

    Elegant and as theoretically founded as the Lanczos method may be, is it efficient in practice?

    Let’s start by establishing a baseline on its complexity:

    @@ -402,6 +408,9 @@

    Complexity analysisAs its clear from the theorem, if we specialize it such that r = n and \eta = \nu = n, then the Lanczos method requires just O(n^2) time and O(n) space to execute. In other words, the Lanczos method drops both the time and space complexity4 of obtaining spectral information by order of magnitude over similar eigen-algorithms that decompose A directly.

    To see why this is true, note that a symmetric tridiagonal matrix is fully characterized by its diagonal and subdiagonal terms, which requires just O(n) space. If we assume that v \mapsto Av \sim O(n), then carrying out the recurrence clearly takes at most O(n^2) time, since there are most n such vectors \{q_i\}_{i=1}^n to generate!

    Now, if we need to store all of Y or Q explicitly, we clearly need O(n^2) space to do so. However, if we only need the eigen-values \Lambda(A) (and not their eigen-vectors U), then we may execute the recurrence keeping at most three vectors \{q_{j-1}, q_{j}, q_{j+1}\} in memory at any given time. Since each of these is O(n) is size, the claim of O(n) space is justified!

    + +
    + + +
    + +
    +
    +
    +

    + + + + +
    + + +
    + + + +
    + +
    +
    +

    Matrix function estimation with primate

    +
    + + + +
    + + + + +
    + + + +
    + + +
    +
    + +
    +
    +

    In the introduction, the basics of implicit trace estimation were introduced, wherein it shown both in theory and with code using primate how to estimate the trace of matrix functions:

    +

    f(A) = U f(\Lambda) U^T, \quad f: [a,b] \to \mathbb{R}

    +

    In particular, the introduction briefly covered how the Lanczos method is intimately connected to Gaussian Quadrature, and how this connection enables fast randomized trace estimation of f(A).

    +

    In this post, I’ll cover how to approximate the action v \mapsto f(A)v for any function f with primate with its matrix_function API, and how to compose this functionality with other trace estimators.

    +
    +

    Matrix function approximation

    +

    If A \in \mathbb{R}^{n \times n} and n is large, obtaining f(A) \in \mathbb{R}^{n \times n} explicitly can be very expensive. One way to sidestep this difficulty is to approximate v \mapsto f(A)v using the degree-k Lanczos expansion:

    +

    Q^T A Q = T \quad \Leftrightarrow \quad f(A)v \approx \lVert x \rVert \cdot Q f(T) e_1

    +

    It’s been shown by (Musco, Musco, and Sidford 2018) that this approximation has the following guarantee:

    +

    \|f(\mathbf{A}) \mathbf{x}-\mathbf{y}\| \leq 2\|\mathbf{x}\| \cdot \min _{\substack{\text { polynomial } p \\ \text { with degree }<k}}\left(\max _{x \in\left[\lambda_{\min }(\mathbf{A}), \lambda_{\max }(\mathbf{A})\right]}|f(x)-p(x)|\right)

    +

    In other words, up to a factor of 2, the error \|f(\mathbf{A}) \mathbf{x}-\mathbf{y}\| is bounded by the uniform error of the best polynomial approximation to f with degree < k. For general matrix functions, this implies that finite-precision Lanczos essentially matches strongest known exact arithmetic bounds.

    +

    This suggests an idea: can we convert any matrix-like object into a LinearOperator that transparently converts matrix-vector products Av into products with matrix-functions f(A)v?

    + +
    +
    +

    Establishing a baseline

    +

    primate allows the user to construct a LinearOperator directly from an existing matrix-like object A and callable fun through its matrix_function API.

    + +

    As a baseline example, consider the action that adds an \epsilon amount of mass to the diagonal of A:

    +

    v \mapsto (A + \epsilon I)v

    +

    For any fixed \epsilon, imitating this matrix action can be done in three different ways:

    +
      +
    1. Obtain u = Av and then add u \gets u + \epsilon \cdot v
    2. +
    3. Form (A + \epsilon I) and explicitly carry out the multiplication
    4. +
    5. Multiply v by f(A) induced by f(\lambda) = \lambda + \epsilon
    6. +
    +

    Lets see what the code to accomplish this using (3) looks like:

    +
    +
    from primate.random import symmetric
    +from primate.operator import matrix_function
    +
    +## Random matrix + input vector
    +A = symmetric(150, pd = True)
    +v = np.random.uniform(size=A.shape[1])
    +
    +## Ground truth v |-> f(A)v
    +Lambda, U = np.linalg.eigh(A)
    +v_truth = (U @ np.diag(Lambda + 0.10) @ U.T) @ v
    +
    +## Lanczos approximation
    +M = matrix_function(A, fun = lambda x: x + 0.10)
    +v_test = np.ravel(M @ v)
    +
    +vn, vn_approx = np.linalg.norm(v_truth), np.linalg.norm(v_test)
    +print(f"f(A)v norm:  {vn:.6f}")
    +print(f"Approx norm: {vn_approx:.6f}")
    +print(f"Max diff:    {np.max(np.abs(v_truth - np.ravel(v_test))):.6e}")
    +print(f"cosine sim:  {np.dot(v_test, v_truth) / (vn * vn_approx):6e}")
    +
    +
    f(A)v norm:  4.392660
    +Approx norm: 4.392660
    +Max diff:    5.218048e-15
    +cosine sim:  1.000000e+00
    +
    +
    +

    Observe M matches the ground truth v \mapsto (A + \epsilon I)v. In this way, one benefit of using matrix_function is that it allows one to approximate f(A) by thinking only about what is happening at the spectral level (as opposed to the matrix level). Of course, this example is a bit non-convincing as there are simpler ways of achieving the same result as shown by (1) and (2), for example:

    +
    +
    np.allclose(A @ v + 0.10 * v, v_truth)
    +
    +
    True
    +
    +
    +

    Baseline established.

    +
    +
    +

    When v \mapsto f(A)v is not known

    +

    On the other hand, there are plenty of situations where one doesn’t have access to these simpler expressions. For example, consider the map:

    +

    v \mapsto (A + \epsilon I)^{-1} v

    +

    Such expressions pop up in a variety of settings, such as in Tikhonov regularization, in Schatten-norm estimation Ubaru, Saad, and Seghouane (2017), in the Cholesky factorization of PSD matrices, and so on. Unlike the previous setting, we cannot readily access v \mapsto f(A)v unless we explicitly compute the full spectral decomposition of A or the inverse of A, both of which are expensive to obtain.

    +
    +
    ## Alternative: v_truth = np.linalg.inv((A + 0.10 * np.eye(A.shape[0]))) @ v
    +v_truth = (U @ np.diag(np.reciprocal(Lambda + 0.10)) @ U.T) @ v
    +
    +## Lanczos approximation
    +M = matrix_function(A, fun = lambda x: 1.0 / (x + 0.10))
    +v_test = np.ravel(M @ v)
    +
    +vn, vn_approx = np.linalg.norm(v_truth), np.linalg.norm(v_test)
    +print(f"f(A)v norm:  {vn:.6f}")
    +print(f"Approx norm: {vn_approx:.6f}")
    +print(f"Max diff:    {np.max(np.abs(v_truth - np.ravel(v_test))):.6e}")
    +print(f"cosine sim:  {np.dot(v_test, v_truth) / (vn * vn_approx):6e}")
    +
    +
    f(A)v norm:  32.982583
    +Approx norm: 32.982583
    +Max diff:    1.626085e-05
    +cosine sim:  1.000000e+00
    +
    +
    +

    There is a larger degree of error compared to the base as evidenced by the \lVert \cdot \rVert_\infty-normed difference between v_truth and v_test, however this is to be expected, as in general approximating the action v \mapsto A^{-1} v will always be more difficult that v \mapsto A v, even if A is well-conditioned.

    +
    +
    +

    Back to trace estimation

    +

    There are several use-cases wherein one might be interested in the output f(A)v itself—see Musco, Musco, and Sidford (2018) and references within for some examples. One use-case is the extension of existing matvec-dependent implicit trace estimators to matrix-function trace estimators.

    +
    +
    from primate.trace import hutch, xtrace
    +M = matrix_function(A, fun="log")
    +print(f"Logdet exact:  {np.sum(np.log(np.linalg.eigvalsh(A))):6e}")
    +print(f"Logdet Hutch:  {hutch(A, fun='log'):6e}")
    +print(f"Logdet XTrace: {xtrace(M):6e}")
    +
    +
    Logdet exact:  -1.483218e+02
    +Logdet Hutch:  -1.477567e+02
    +Logdet XTrace: -1.483155e+02
    +
    +
    +

    As with the hutch estimators applied to matrix functions, note that the action v \mapto f(A)v is subject to the approximation errors studied by Musco, Musco, and Sidford (2018), making such extensions limited functions that are well-approximated by Lanczos.

    + + + +
    + +

    References

    +
    +Musco, Cameron, Christopher Musco, and Aaron Sidford. 2018. “Stability of the Lanczos Method for Matrix Function Approximation.” In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, 1605–24. SIAM. +
    +
    +Ubaru, Shashanka, Yousef Saad, and Abd-Krim Seghouane. 2017. “Fast Estimation of Approximate Matrix Ranks Using Spectral Densities.” Neural Computation 29 (5): 1317–51. +
    +
    + + +
    +
    + +
    + + + + + + + \ No newline at end of file diff --git a/docs/theory/slq_pseudo.html b/docs/theory/slq_pseudo.html new file mode 100644 index 0000000..dcef9a2 --- /dev/null +++ b/docs/theory/slq_pseudo.html @@ -0,0 +1,919 @@ + + + + + + + + + +primate - SLQ Trace guide + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    +
    + + +
    + +
    + + +
    + + + +
    + +
    +
    +

    SLQ Trace guide

    +
    + + + +
    + + + + +
    + + + +
    + + +

    To clarify that that means, here’s an abstract presentation of the generic SLQ procedure:

    +
    +
    +\begin{algorithm} \caption{Stochastic Lanczos Quadrature} \begin{algorithmic} \Input Symmetric operator ($A \in \mathbb{R}^{n \times n}$) \Require Number of queries ($n_v$), Degree of quadrature ($k$) \Function{SLQ}{$A$, $n_v$, $k$} \State $\Gamma \gets 0$ \For{$j = 1, 2, \dots, n_v$} \State $v_i \sim \mathcal{D}$ where $\mathcal{D}$ satisfies $\mathbb{E}(v v^\top) = I$ \State $T^{(j)}(\alpha, \beta)$ $\gets$ $\mathrm{Lanczos}(A,v_j,k+1)$ \State $[\Theta, Y] \gets \mathrm{eigh\_tridiag}(T^{(j)}(\alpha, \beta))$ \State $\tau_i \gets \langle e_1, y_i \rangle$ \State < Do something with the node/weight pairs $(\theta_i, \tau_i^2)$ > \EndFor \EndFunction \end{algorithmic} \end{algorithm} +
    +
    + + + +
    + +
    +
    + +
    + + + + + + + \ No newline at end of file diff --git a/src/primate/operator.py b/src/primate/operator.py index c5fd1c8..9182d99 100644 --- a/src/primate/operator.py +++ b/src/primate/operator.py @@ -18,9 +18,8 @@ def matrix_function( """Constructs a `LinearOperator` approximating the action of the matrix function `fun(A)`. This function uses the Lanczos quadrature method to approximate the action of matrix function: - $$ v \mapsto f(A)v, \quad f(A) = U f(\lambda) U^T $$ - The resulting operator may be used in conjunction with other trace and vector-based estimation methods, - such as the `hutch` or `xtrace` estimators. + $$ v \\mapsto f(A)v, \\quad f(A) = U f(\\lambda) U^T $$ + The resulting operator may be used in conjunction with other estimation methods, such as `hutch` or `xtrace`. The weights of the quadrature may be computed using either the Golub-Welsch (GW) or Forward Three Term Recurrence algorithms (FTTR) (see the `quad` parameter). For a description @@ -29,7 +28,7 @@ def matrix_function( :::{.callout-note} To compute the weights of the quadrature, the GW computation uses implicit symmetric QR steps with Wilkinson shifts, while the FTTR algorithm uses the explicit expression for orthogonal polynomials. While both require $O(\\mathrm{deg}^2)$ time to execute, - the former requires $O(\mathrm{deg}^2)$ space but is highly accurate, while the latter uses only $O(1)$ space at the cost of stability. + the former requires $O(\\mathrm{deg}^2)$ space but is highly accurate, while the latter uses only $O(1)$ space at the cost of stability. If `deg` is large, `fttr` is preferred. ::: diff --git a/src/primate/trace.py b/src/primate/trace.py index 231c430..e7f940e 100644 --- a/src/primate/trace.py +++ b/src/primate/trace.py @@ -36,7 +36,7 @@ def hutch( This function uses up to `maxiter` random isotropic vectors to estimate of the trace of $f(A)$, where: $$\\mathrm{tr}(f(A)) = \\mathrm{tr}(U f(\\Lambda) U^T) = \\sum\\limits_{i=1}^n f(\\lambda_i) $$ - The estimator is obtained by averaging quadratic forms $v \mapsto v^T f(A)v$, rescaling as necessary. + The estimator is obtained by averaging quadratic forms $v \\mapsto v^T f(A)v$, rescaling as necessary. This estimator may be used to quickly approximate of a variety of quantities, such as the trace inverse, the log-determinant, the numerical rank, etc. See the [online documentation](https://peekxc.github.io/primate/) for more details. diff --git a/tests/test_lanczos.py b/tests/test_lanczos.py index 98e0fcd..10c6746 100644 --- a/tests/test_lanczos.py +++ b/tests/test_lanczos.py @@ -138,41 +138,36 @@ def test_quadrature_methods(): quad2 = np.sum(_lanczos.quadrature(a, b, n, 1).prod(axis=1)) assert np.isclose(quad1, quad2, atol=quad1*0.05) -def test_quadrature_toeplitz(): - from primate.diagonalize import lanczos, _lanczos - from scipy.linalg import toeplitz - np.random.seed(1234) - - A = toeplitz(np.arange(1,8)).astype(np.float32) - v0 = np.random.uniform(size=A.shape[1]) - n = A.shape[1] - a, b = lanczos(A, deg=n) - a, b = a, np.append([0], b) - quad1 = np.sum(_lanczos.quadrature(a, b, n, 0).prod(axis=1)) - quad2 = np.sum(_lanczos.quadrature(a, b, n, 1).prod(axis=1)) - assert np.isclose(quad1, quad2, atol=np.abs(quad1)*0.50) - - ## ground truth - ew, ev = np.linalg.eigh(A) - mu_0 = np.sum(np.abs(ew)) - mu_0 * np.ravel(ev[0,:])**2 - - np.array([orth_poly(ew[0], i, mu_0, a, b)**2 for i in range(n)]) - _lanczos.quadrature(a, b, n, 0)[:,1] - - # np.linalg.norm([orth_poly(ew[0], i, mu_0, a, b) for i in range(n)]) - ## - # - +# def test_quadrature_toeplitz(): +# from primate.diagonalize import lanczos, _lanczos +# from scipy.linalg import toeplitz +# np.random.seed(1234) + +# A = toeplitz(np.arange(1,8)).astype(np.float32) +# v0 = np.random.uniform(size=A.shape[1]) +# n = A.shape[1] +# a, b = lanczos(A, deg=n) +# a, b = a, np.append([0], b) +# quad1 = np.sum(_lanczos.quadrature(a, b, n, 0).prod(axis=1)) +# quad2 = np.sum(_lanczos.quadrature(a, b, n, 1).prod(axis=1)) +# assert np.isclose(quad1, quad2, atol=np.abs(quad1)*0.50) + +# ## ground truth +# ew, ev = np.linalg.eigh(A) +# mu_0 = np.sum(np.abs(ew)) +# mu_0 * np.ravel(ev[0,:])**2 +# np.array([orth_poly(ew[0], i, mu_0, a, b)**2 for i in range(n)]) +# _lanczos.quadrature(a, b, n, 0)[:,1] +# np.linalg.norm([orth_poly(ew[0], i, mu_0, a, b) for i in range(n)]) # ## to mimick # arr = np.array([orth_poly(ew[0], i, mu_0, a, b)**2 for i in range(n)]) - n = 50 - c = np.random.uniform(size=n, low=0, high=1) - A = toeplitz(c) - v0 = np.random.uniform(size=A.shape[1]) - a, b = lanczos(A, deg=n) - a, b = a, np.append([0], b) - quad1 = np.sum(_lanczos.quadrature(a, b, n, 0).prod(axis=1)) - quad2 = np.sum(_lanczos.quadrature(a, b, n, 1).prod(axis=1)) - assert np.isclose(quad1, quad2, atol=quad1*0.50) + # n = 50 + # c = np.random.uniform(size=n, low=0, high=1) + # A = toeplitz(c) + # v0 = np.random.uniform(size=A.shape[1]) + # a, b = lanczos(A, deg=n) + # a, b = a, np.append([0], b) + # quad1 = np.sum(_lanczos.quadrature(a, b, n, 0).prod(axis=1)) + # quad2 = np.sum(_lanczos.quadrature(a, b, n, 1).prod(axis=1)) + # assert np.isclose(quad1, quad2, atol=quad1*0.50)