Kernel Regularized Least Squares (KRLS) is a kernel-based, complexity-penalized method developed by Hainmueller and Hazlett (2013) and designed to minimize parametric assumptions while maintaining interpretive clarity. Here, we introduce bigKRLS, an updated version of the original KRLS R package with algorithmic and implementation improvements designed to optimize speed and memory usage. These improvements allow users to straightforwardly estimate pairwise regression models with KRLS once N > 2500. bigKRLS has been available on CRAN since April 15, 2017. You may also be interested in our working paper, which has been accepted by Political Analysis and which demonstrates the utility of bigKRLS by analyzing the 2016 US presidential election. Our replication materials can be found on Dataverse, and our GitHub repo contains examples too.
- C++ integration. We re-implement most major computations in the model in C++ via Rcpp and RcppArmadillo. These changes produce up to a 50% runtime decrease compared to the original R implementation.
- Leaner algorithm. Because of the Tikhonov regularization and parameter tuning strategies used in KRLS, the method of estimation is inherently memory-heavy (O(N^2)), making memory savings important even in small- and medium-sized applications. We develop and implement a new marginal effects algorithm, which reduces peak memory usage by approximately an order of magnitude, and cut the number of computations needed to find the regularization parameter in half.
- Improved memory management. Most data objects in R perform poorly in memory-intensive applications. We use a series of packages in the bigmemory environment to ease this constraint, allowing our implementation to handle larger datasets more smoothly.
- Parallel processing. In addition to the single-core algorithmic improvements, parallel processing obtains the pointwise marginal effects substantially faster.
- Interactive data visualization. We've designed an R Shiny app that allows bigKRLS users to easily share results with collaborators or more general audiences. Simply call shiny.bigKRLS().
- Honest p values. bigKRLS now computes p values that reflect both the regularization process and the number of predictors. For details on how the effective sample size is calculated as well as other options, see help(summary.bigKRLS).
out <- bigKRLS(y, X)
out$Neffective
summary(out)
- Cross-validation, including K-fold cross-validation. crossvalidate.bigKRLS performs CV and stores a number of in-sample and out-of-sample statistics, as well as metadata documenting how the data were split and the bigmemory file structure (if applicable).
cv <- crossvalidate.bigKRLS(y, X, seed = 2017, ptesting = 20)
kcv <- crossvalidate.bigKRLS(y, X, seed = 2017, Kfolds = 5)
See vignette("bigKRLS_basics") for details.
- Eigentruncation. bigKRLS now supports two types of eigentruncation to decrease runtime.
out <- bigKRLS(y, X, eigtrunc = 0.001) # defaults to 0.001 if N > 3000 and 0 otherwise
out <- bigKRLS(y, X, Neig = 100) # only compute 100 vecs and vals (defaults to Neig = nrow(X))
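The eigentruncation options above can be illustrated with a small base-R sketch. It does not call bigKRLS. It assumes the standard KRLS setup (a Gaussian kernel on standardized predictors with bandwidth equal to the number of columns, and coefficients obtained from the eigendecomposition of the kernel matrix), and it assumes that eigtrunc keeps only eigenvalues at least that fraction of the largest one. The value of lambda is fixed arbitrarily rather than tuned, and the helper krls_coefs is hypothetical.

```r
set.seed(1)
N <- 200
P <- 3
X <- scale(matrix(rnorm(N * P), N, P))                       # standardized predictors
y <- as.numeric(scale(sin(X[, 1]) + X[, 2]^2 + rnorm(N, sd = 0.1)))

# N x N Gaussian kernel: this matrix is the source of the O(N^2) memory cost.
K <- exp(-as.matrix(dist(X))^2 / P)

eig <- eigen(K, symmetric = TRUE)
lambda <- 0.5                                                # illustrative, not tuned

# Hypothetical helper: solves (K + lambda * I) c = y using the supplied eigenpairs.
krls_coefs <- function(vectors, values, y, lambda) {
  vectors %*% (crossprod(vectors, y) / (values + lambda))
}

# Full solution: all N eigenvectors and eigenvalues.
c_full <- krls_coefs(eig$vectors, eig$values, y, lambda)

# Truncated solution: drop eigenvalues below 0.001 times the largest (assumed rule).
keep <- eig$values >= 0.001 * max(eig$values)
c_trunc <- krls_coefs(eig$vectors[, keep, drop = FALSE], eig$values[keep], y, lambda)

cat("eigenvalues kept:", sum(keep), "of", N, "\n")
cat("largest change in fitted values:", max(abs(K %*% c_full - K %*% c_trunc)), "\n")
```

Fewer retained eigenvectors mean smaller matrix products when solving for the coefficients, which is where the runtime savings would come from in this sketch.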
bigKRLS requires a series of packages--notably bigmemory, Rcpp, and RcppArmadillo--current versions of which require up-to-date versions of R and its compilers (RStudio, if used, must be current as well). To install the latest stable version from CRAN:
install.packages("bigKRLS")
To install the GitHub version, use standard devtools syntax:
install.packages("devtools")
library(devtools)
install_github('rdrr1990/bigKRLS')
New users may wish to see our installation notes for specifics.
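Before installing, it can help to confirm which R version is running and whether the dependencies named above are already present. The following is a minimal sketch using only base R; it makes no claim about which versions bigKRLS actually requires.

```r
# Packages named above as notable dependencies.
deps <- c("bigmemory", "Rcpp", "RcppArmadillo")

# TRUE where the package can be loaded from the current library paths.
installed <- vapply(deps, requireNamespace, logical(1), quietly = TRUE)

cat("R version:", as.character(getRversion()), "\n")
if (all(installed)) {
  cat("All listed dependencies are installed.\n")
} else {
  cat("Missing:", paste(deps[!installed], collapse = ", "), "\n")
}
```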
For details on syntax, load the library and then open our vignette:
library(bigKRLS)
vignette("bigKRLS_basics")
Because of the quadratic memory requirement, users working on a typical laptop (8-16 gigabytes of RAM) may wish to start at N = 2,500 or 5,000, particularly if the number of x variables is large. Once you have a sense of how bigKRLS runs on your system, you may wish to estimate only a subset of the marginal effects at N = 10,000-15,000 by setting bigKRLS(..., which.derivatives = c(1, 3, 5)) for the marginal effects of the first, third, and fifth x variables.
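To get a rough feel for the quadratic memory requirement, the sketch below computes the size of a single N x N matrix of doubles at the sample sizes mentioned above. It is back-of-the-envelope arithmetic only: the helper kernel_gb is hypothetical, and actual peak usage will be higher because estimation holds more than one such matrix at a time.

```r
# Size in gigabytes of one N x N matrix of 8-byte doubles (hypothetical helper).
kernel_gb <- function(N, bytes_per_double = 8) {
  N^2 * bytes_per_double / 1e9
}

sizes <- c(2500, 5000, 10000, 15000)
print(data.frame(N = sizes, GB_per_matrix = kernel_gb(sizes)))
```

Doubling N quadruples the size of each such matrix, which is why starting small and scaling up is the safer approach on a laptop.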
Code released under GPL (>= 2).