Algorithms
Hadi & Simonoff (1993)
LinRegOutliers.HS93.hs93 — Function
hs93(setting; alpha = 0.05, basicsubsetindices = nothing)
Perform the Hadi & Simonoff (1993) algorithm for the given regression setting.
Arguments
- setting::RegressionSetting: RegressionSetting object with a formula and dataset.
- alpha::Float64: Optional argument of the probability of rejecting the null hypothesis.
- basicsubsetindices::Array{Int, 1}: Initial basic subset. By default, the algorithm creates an initial set of clean observations.
Description
Performs a forward search by selecting and enlarging an initial clean subset of observations, iterating until the scaled residuals exceed a threshold.
Output
["outliers"]
: Array of indices of outliers["t"]
: Threshold, specifically, calculated quantile of a Student-T distribution["d"]
: Internal and external scaled residuals. - `["betas"]: Vector of estimated regression coefficients.
- `["converged"]: Boolean value indicating whether the algorithm converged or not.
Examples
julia> reg0001 = createRegressionSetting(@formula(calls ~ year), phones);
julia> hs93(reg0001)
Dict{Any,Any} with 3 entries:
"outliers" => [14, 15, 16, 17, 18, 19, 20, 21]
"t" => -3.59263
"d" => [2.04474, 1.14495, -0.0633255, 0.0632934, -0.354349, -0.766818, -1.06862, -1.47638, -0.7…
- "converged"=> true
References
Hadi, Ali S., and Jeffrey S. Simonoff. "Procedures for the identification of multiple outliers in linear models." Journal of the American Statistical Association 88.424 (1993): 1264-1272.
Kianifard & Swallow (1989)
LinRegOutliers.KS89.ks89 — Function
ks89(setting; alpha = 0.05)
Perform the Kianifard & Swallow (1989) algorithm for the given regression setting.
Arguments
- setting::RegressionSetting: RegressionSetting object with a formula and dataset.
- alpha::Float64: Optional argument of the probability of rejecting the null hypothesis.
Description
The algorithm starts with a clean subset of observations. This initial set is then enlarged using recursive residuals. When the calculated statistic exceeds a threshold, the search terminates.
Output
["outliers]
: Array of indices of outliers.["betas"]
: Vector of regression coefficients.
Examples
julia> reg0001 = createRegressionSetting(@formula(stackloss ~ airflow + watertemp + acidcond), stackloss)
julia> ks89(reg0001)
Dict{String, Vector} with 2 entries:
"betas" => [-42.4531, 0.956605, 0.555571, -0.108766]
- "outliers" => [4, 21]
References
Kianifard, Farid, and William H. Swallow. "Using recursive residuals, calculated on adaptively-ordered observations, to identify outliers in linear regression." Biometrics (1989): 571-585.
Sebert & Montgomery & Rollier (1998)
LinRegOutliers.SMR98.smr98 — Function
smr98(setting)
Perform the Sebert, Montgomery and Rollier (1998) algorithm for the given regression setting.
Arguments
- setting::RegressionSetting: RegressionSetting object with a formula and dataset.
Description
The algorithm starts with an ordinary least squares estimation for a given model and data. Residuals and fitted responses are calculated using the estimated model. A hierarchical clustering analysis is applied using standardized residuals and standardized fitted responses. The cluster tree is cut using a threshold, e.g. the Mojena criterion, as suggested by the authors. Subtrees with a relatively small number of observations are expected to be clusters of outliers.
Output
["outliers"]
: Array of indices of outliers.["betas"]
: Vector of regression coefficients.
Examples
julia> reg0001 = createRegressionSetting(@formula(calls ~ year), phones);
julia> smr98(reg0001)
Dict{String, Vector} with 2 entries:
"betas" => [-55.4519, 1.15692]
- "outliers" => [15, 16, 17, 18, 19, 20, 21, 22, 23, 24]
References
Sebert, David M., Douglas C. Montgomery, and Dwayne A. Rollier. "A clustering algorithm for identifying multiple outliers in linear regression." Computational statistics & data analysis 27.4 (1998): 461-484.
Least Median of Squares
LinRegOutliers.LMS.lms — Function
lms(setting; iters = nothing, crit = 2.5)
Perform Least Median of Squares regression estimator with random sampling.
Arguments
- setting::RegressionSetting: A regression setting object.
- iters::Int: Number of random samples.
- crit::Float64: Critical value for standardized residuals.
Description
The LMS (Least Median of Squares) estimator is highly robust, with a 50% breakdown point. The algorithm searches for the regression coefficients that minimize the hth ordered squared residual, where h is Int(floor((n + 1.0) / 2.0)).
Output
["stdres"]
: Array of standardized residuals["S"]
: Standard error of regression["outliers"]
: Array of indices of outliers["objective"]
: LMS objective value["betas"]
: Estimated regression coefficients["crit"]
: Threshold value.
Examples
julia> reg = createRegressionSetting(@formula(calls ~ year), phones);
julia> lms(reg)
Dict{Any,Any} with 6 entries:
"outliers" => [14, 15, 16, 17, 18, 19, 20, 21]
"objective" => 0.515348
"betas" => [-56.1972, 1.1581]
- "crit" => 2.5
References
Rousseeuw, Peter J. "Least median of squares regression." Journal of the American statistical association 79.388 (1984): 871-880.
Least Trimmed Squares
LinRegOutliers.LTS.lts — Function
lts(setting; iters = nothing, crit = 2.5, earlystop = true)
Perform the Fast-LTS (Least Trimmed Squares) algorithm for a given regression setting.
Arguments
- setting::RegressionSetting: RegressionSetting object with a formula and dataset.
- iters::Int: Number of iterations.
- crit::Float64: Critical value.
- earlystop::Bool: Early stop if the best objective does not change in iters / 2 iterations.
Description
The algorithm searches for estimates of the regression parameters that minimize the sum of the first h ordered squared residuals, where h is Int(floor((n + p + 1.0) / 2.0)). Specifically, this implementation uses the Fast-LTS algorithm, in which concentration steps enlarge a basic subset into a clean subset of size h.
Output
["betas"]
: Estimated regression coefficients["S"]
: Standard error of regression["hsubset"]
: Best subset of clean observation of size h.["outliers"]
: Array of indices of outliers["scaled.residuals"]
: Array of scaled residuals["objective"]
: LTS objective value.
Examples
julia> reg = createRegressionSetting(@formula(calls ~ year), phones);
julia> lts(reg)
Dict{Any,Any} with 6 entries:
"betas" => [-56.5219, 1.16488]
"hsubset" => [11, 10, 5, 6, 23, 12, 13, 9, 24, 7, 3, 4, 8]
"outliers" => [14, 15, 16, 17, 18, 19, 20, 21]
"scaled.residuals" => [2.41447, 1.63472, 0.584504, 0.61617, 0.197052, -0.222066, -0.551027, -0.970146, -0.397538, -0.185558 … …
- "objective" => 3.43133
References
Rousseeuw, Peter J., and Katrien Van Driessen. "An algorithm for positive-breakdown regression based on concentration steps." Data Analysis. Springer, Berlin, Heidelberg, 2000. 335-346.
Minimum Volume Ellipsoid (MVE)
LinRegOutliers.MVE.mve — Function
mve(data; alpha = 0.01)
Performs the Minimum Volume Ellipsoid algorithm for a robust covariance matrix.
Arguments
- data::DataFrame: Multivariate data.
- alpha::Float64: Probability for quantiles of the Chi-Squared statistic.
Description
mve searches for a robust location vector and a robust scale matrix, e.g. a covariance matrix. The method also reports a usable diagnostic measure, Mahalanobis distances, which are calculated using the robust counterparts instead of the mean vector and the usual covariance matrix. These Mahalanobis distances are directly comparable with quantiles of a Chi-Squared distribution with p degrees of freedom.
Output
["goal"]
: Objective value["best.subset"]
: Indices of best h-subset of observations["robust.location"]
: Vector of robust location measures["robust.covariance"]
: Robust covariance matrix["squared.mahalanobis"]
: Array of Mahalanobis distances calculated using robust location and scale measures.["chisq.crit"]
: Chisquare quantile used in threshold["alpha"]
: Probability used in calculating the Chisquare quantile, e.g chisq.crit
["outliers"]
: Array of indices of outliers.
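There is no usage example in this docstring. A minimal sketch, assuming the bundled hbk dataset is available as a DataFrame (the column selection and key access below are illustrative, not from the original documentation):
julia> using LinRegOutliers
julia> result = mve(hbk[:, ["x1", "x2", "x3"]]);
julia> result["outliers"]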
References
Van Aelst, Stefan, and Peter Rousseeuw. "Minimum volume ellipsoid." Wiley Interdisciplinary Reviews: Computational Statistics 1.1 (2009): 71-82.
MVE & LTS Plot
LinRegOutliers.MVELTSPlot.mveltsplot — Function
mveltsplot(setting; alpha = 0.05, showplot = true)
Generate the MVE-LTS plot for visual detection of regression outliers.
Arguments
- setting::RegressionSetting: A regression setting object.
- alpha::Float64: Probability for quantiles of the Chi-Squared statistic.
- showplot::Bool: Whether a plot is shown or only the statistics are returned.
Description
This method combines lts and mve. The regression residuals and robust distances obtained from lts and mve are used to generate a plot. Although this is a visual method, drawing a plot is not strictly necessary. The algorithm divides the residuals-distances space into 4 parts: one for clean observations, one for vertical outliers (y-space outliers), one for bad leverage points (x-space outliers), and one for good leverage points (observations far from the rest of the data in both x and y space).
Output
["plot"]
: Generated plot object["robust.distances"]
: Robust Mahalanobis distances ["scaled.residuals"]
: Scaled residuals of an lts
estimate["chi.squared"]
: Quantile of Chi-Squared distribution ["regular.points"]
: Array of indices of clean observations["outlier.points"]
: Array of indices of y-space outliers (vertical outliers)["leverage.points"]
: Array of indices of x-space outliers (bad leverage points)["outlier.and.leverage.points"]
: Array of indices of xy-space outliers (good leverage points)
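The docstring gives no usage example. A minimal sketch, assuming the Plots package is loaded (required per the Dependencies note below) and using the phones setting as elsewhere in these docs; showplot = false returns only the statistics:
julia> using LinRegOutliers, Plots
julia> reg = createRegressionSetting(@formula(calls ~ year), phones);
julia> result = mveltsplot(reg, showplot = false);
julia> result["outlier.points"]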
References
Van Aelst, Stefan, and Peter Rousseeuw. "Minimum volume ellipsoid." Wiley Interdisciplinary Reviews: Computational Statistics 1.1 (2009): 71-82.
Dependencies
This method is enabled when the Plots package is installed and loaded.
Billor & Chatterjee & Hadi (2006)
LinRegOutliers.BCH.bch — Function
bch(setting; alpha = 0.05, maxiter = 1000, epsilon = 0.000001)
Perform the Billor & Chatterjee & Hadi (2006) algorithm for the given regression setting.
Arguments
- setting::RegressionSetting: RegressionSetting object with a formula and dataset.
- alpha::Float64: Optional argument of the probability of rejecting the null hypothesis.
- maxiter::Int: Maximum number of iterations for calculating the iteratively weighted least squares estimates.
- epsilon::Float64: Accuracy for determining convergence.
Description
The algorithm initially constructs a basic subset. This basic subset is then used to generate initial weights for an iteratively weighted least squares estimation. Regression coefficients obtained in this stage are robust regression estimates. Squared normalized distances and squared normalized residuals are used in bchplot, which provides a visual way to investigate outliers and their properties.
Output
["betas"]
: Final estimate of regression coefficients ["squared.normalized.robust.distances"]
: ["weights"]
: Final weights used in calculation of WLS estimates ["outliers"]
: Array of indices of outliers["squared.normalized.residuals"]
: Array of squared normalized residuals["residuals"]
: Array of regression residuals["basic.subset"]
: Array of indices of basic subset.
Examples
julia> reg = createRegressionSetting(@formula(calls ~ year), phones);
julia> bch(reg)
Dict{Any,Any} with 7 entries:
"betas" => [-55.9205, 1.15572]
"squared.normalized.robust.distances" => [0.104671, 0.0865052, 0.0700692, 0.0553633, 0.0423875, 0.03…
"outliers" => [1, 14, 15, 16, 17, 18, 19, 20, 21]
"squared.normalized.residuals" => [5.53742e-5, 2.42977e-5, 2.36066e-6, 2.77706e-6, 1.07985e-7…
"residuals" => [2.5348, 1.67908, 0.523367, 0.567651, 0.111936, -0.343779, …
-"basic.subset" => [1, 2, 3, 4, 5, 6, 7, 8, 9, 10 … 15, 16, 17, 18, 19, 20, …
References
Billor, Nedret, Samprit Chatterjee, and Ali S. Hadi. "A re-weighted least squares method for robust regression estimation." American journal of mathematical and management sciences 26.3-4 (2006): 229-252.
Pena & Yohai (1995)
LinRegOutliers.PY95.py95 — Function
py95(setting)
Perform the Pena & Yohai (1995) algorithm for the given regression setting.
Arguments
- setting::RegressionSetting: RegressionSetting object with a formula and dataset.
Description
The algorithm starts by constructing an influence matrix using results of an ordinary least squares estimate for a given model and data. In the second stage, the eigen structure of the influence matrix is examined for detecting suspected subsets of data.
Output
["outliers"]
: Array of indices of outliers["suspected.sets"]
: Arrays of indices of observations for corresponding eigen value of the influence matrix.["betas]
: Vector of estimated regression coefficients using the clean observations.
Examples
julia> reg0001 = createRegressionSetting(@formula(y ~ x1 + x2 + x3), hbk);
julia> py95(reg0001)
Dict{Any,Any} with 2 entries:
"outliers" => [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
- "suspected.sets" => Set([[14, 13], [43, 54, 24, 38, 22], [6, 10], [14, 7, 8, 3, 10, 2, 5, 6, 1, 9, 4…
References
Peña, Daniel, and Victor J. Yohai. "The detection of influential subsets in linear regression by using an influence matrix." Journal of the Royal Statistical Society: Series B (Methodological) 57.1 (1995): 145-156.
Satman (2013)
LinRegOutliers.Satman2013.satman2013 — Function
satman2013(setting)
Perform the Satman (2013) algorithm for the given regression setting.
Arguments
- setting::RegressionSetting: RegressionSetting object with a formula and dataset.
Description
The algorithm constructs a fast and robust covariance matrix to calculate robust Mahalanobis distances. These distances are then used to construct weights for later use in a weighted least squares estimation. In the last stage, C-steps are iterated on the basic subset found in the previous stages.
Output
["outliers"]
: Array of indices of outliers.["betas"]
: Array of estimated regression coefficients.["residuals"]
: Array of residuals.
Examples
julia> reg0001 = createRegressionSetting(@formula(y ~ x1 + x2 + x3), hbk);
julia> satman2013(reg0001)
Dict{Any,Any} with 3 entries:
"outliers" => [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 47]
"betas" => ...
"residuals" => ...
References
Satman, Mehmet Hakan. "A new algorithm for detecting outliers in linear regression." International Journal of statistics and Probability 2.3 (2013): 101.
Satman (2015)
LinRegOutliers.Satman2015.satman2015 — Function
satman2015(setting)
Perform the Satman (2015) algorithm for the given regression setting.
Arguments
- setting::RegressionSetting: RegressionSetting object with a formula and dataset.
Description
The algorithm starts by sorting the design matrix using the non-dominated sorting algorithm. An initial basic subset is then constructed using the ranks obtained in the previous stage. After many C-steps, observations with high standardized residuals are reported as outliers.
Output
["outliers]
": Array of indices of outliers.[betas]
: Array of regression coefficients.[residuals]
: Array of residuals.[standardized_residuals]
: Array of standardized residuals.
Examples
julia> reg0001 = createRegressionSetting(@formula(y ~ x1 + x2 + x3), hbk);
julia> satman2015(reg0001)
Dict{Any,Any} with 1 entry:
"outliers" => [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 14, 47]
References
Satman, Mehmet Hakan. "Fast online detection of outliers using least-trimmed squares regression with non-dominated sorting based initial subsets." International Journal of Advanced Statistics and Probability 3.1 (2015): 53.
Setan & Halim & Mohd (2000)
LinRegOutliers.ASM2000.asm2000 — Function
asm2000(setting)
Perform the Setan, Halim and Mohd (2000) algorithm for the given regression setting.
Arguments
- setting::RegressionSetting: RegressionSetting object with a formula and dataset.
Description
The algorithm performs a Least Trimmed Squares (LTS) estimate and yields standardized residual - fitted response pairs. A single linkage clustering algorithm is performed on these pairs. As in smr98, the cluster tree is cut using the Mojena criterion. Subtrees with a relatively small number of observations are declared to be outliers.
Output
["outliers"]
: Vector of indices of outliers.["betas"]
: Vector of regression coefficients.
Examples
julia> reg0001 = createRegressionSetting(@formula(calls ~ year), phones);
julia> asm2000(reg0001)
Dict{Any, Any} with 2 entries:
"betas" => [-63.4816, 1.30406]
- "outliers" => [15, 16, 17, 18, 19, 20]
References
Robiah Adnan, Mohd Nor Mohamad, & Halim Setan (2001). Identifying multiple outliers in linear regression: robust fit and clustering approach. Proceedings of the Malaysian Science and Technology Congress 2000: Symposium C, Vol VI, (p. 400). Malaysia: Confederation of Scientific and Technological Associations in Malaysia COSTAM.
Least Absolute Deviations (LAD)
LinRegOutliers.LAD.lad — Function
lad(setting; exact = true)
Perform Least Absolute Deviations regression for a given regression setting.
Arguments
- setting::RegressionSetting: RegressionSetting object with a formula and dataset.
- exact::Bool: If true, use exact LAD regression. If false, estimate LAD regression parameters using GA. Default is true.
Description
The LAD estimator searches for the regression parameter estimates that minimize the sum of absolute residuals. The optimization problem is

Min z = u1(-) + u1(+) + u2(-) + u2(+) + ... + un(-) + un(+)

Subject to:
    y1 - beta0 - beta1 * x1 + u1(-) - u1(+) = 0
    y2 - beta0 - beta1 * x2 + u2(-) - u2(+) = 0
    ...
    yn - beta0 - beta1 * xn + un(-) - un(+) = 0

where
    ui(-), ui(+) >= 0 for i = 1, 2, ..., n
    beta0, beta1 in R
    n: number of observations
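For readers who want the linear program spelled out in code, the following is a minimal sketch of the same formulation using JuMP with the GLPK solver. Both packages are assumptions here and are not part of LinRegOutliers' documented interface; lad builds and returns its own model in ["model"].

using JuMP, GLPK

function lad_lp(x::Vector{Float64}, y::Vector{Float64})
    n = length(y)
    model = Model(GLPK.Optimizer)
    @variable(model, uplus[1:n] >= 0)    # ui(+): positive residual parts
    @variable(model, uminus[1:n] >= 0)   # ui(-): negative residual parts
    @variable(model, beta0)              # intercept
    @variable(model, beta1)              # slope
    # yi - beta0 - beta1 * xi + ui(-) - ui(+) = 0 for each observation
    @constraint(model, [i = 1:n], y[i] - beta0 - beta1 * x[i] + uminus[i] - uplus[i] == 0)
    # minimize the sum of absolute residuals
    @objective(model, Min, sum(uplus) + sum(uminus))
    optimize!(model)
    return value(beta0), value(beta1)
end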
Output
["betas"]
: Estimated regression coefficients["residuals"]
: Regression residuals["model"]
: Linear Programming Model
Examples
julia> reg0001 = createRegressionSetting(@formula(calls ~ year), phones);
julia> lad(reg0001)
Dict{Any,Any} with 2 entries:
"betas" => [-57.3269, 1.19155]
"residuals" => [2.14958, 1.25803, 0.0664872, 0.0749413, -0.416605, -0.90815, -1.2997, -1.79124,…
lad(X, y, exact = true)
Perform Least Absolute Deviations regression for a given design matrix and response vector.
Arguments
- X::AbstractMatrix{Float64}: Design matrix of the linear model.
- y::AbstractVector{Float64}: Response vector of the linear model.
- exact::Bool: If true, use exact LAD regression. If false, estimate LAD regression parameters using GA. Default is true.
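This method has no example in the docstring. A minimal sketch, assuming the phones dataset; the explicit intercept column and the Float64 conversions are illustrative assumptions:
julia> X = hcat(ones(length(phones.calls)), Float64.(phones.year));
julia> y = Float64.(phones.calls);
julia> lad(X, y)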
Least Trimmed Absolute Deviations (LTA)
LinRegOutliers.LTA.lta — Function
lta(setting; exact = false, earlystop = true)
Perform the Hawkins & Olive (1999) algorithm (Least Trimmed Absolute Deviations) for the given regression setting.
Arguments
- setting::RegressionSetting: RegressionSetting object with a formula and dataset.
- exact::Bool: Consider all possible subsets of p or not, where p is the number of regression parameters.
- earlystop::Bool: Early stop if the best objective does not change in (number of remaining iters) / 5 iterations.
Description
lta is a trimmed version of lad in which the sum of the first h absolute residuals is minimized, where h is Int(floor((n + p + 1.0) / 2.0)).
Output
["betas"]
: Estimated regression coefficients["objective]
: Objective value
Examples
julia> reg0001 = createRegressionSetting(@formula(calls ~ year), phones);
julia> lta(reg0001)
Dict{Any,Any} with 2 entries:
"betas" => [-55.5, 1.15]
julia> lta(reg0001, exact = true)
Dict{Any,Any} with 2 entries:
"betas" => [-55.5, 1.15]
- "objective" => 5.7
References
Hawkins, Douglas M., and David Olive. "Applications and algorithms for least trimmed sum of absolute deviations regression." Computational Statistics & Data Analysis 32.2 (1999): 119-134.
lta(X, y; exact = false)
Perform the Hawkins & Olive (1999) algorithm (Least Trimmed Absolute Deviations) for the given design matrix and response vector.
Arguments
- X::AbstractMatrix{Float64}: Design matrix of the linear regression model.
- y::AbstractVector{Float64}: Response vector of the linear regression model.
- exact::Bool: Consider all possible subsets of p or not, where p is the number of regression parameters.
- earlystop::Bool: Early stop if the best objective does not change in (number of remaining iters) / 5 iterations.
References
Hawkins, Douglas M., and David Olive. "Applications and algorithms for least trimmed sum of absolute deviations regression." Computational Statistics & Data Analysis 32.2 (1999): 119-134.
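As with the matrix form of lad above, a minimal usage sketch (the intercept column and conversions are illustrative assumptions, not from the original docs):
julia> X = hcat(ones(length(phones.calls)), Float64.(phones.year));
julia> y = Float64.(phones.calls);
julia> lta(X, y)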
Hadi (1992)
LinRegOutliers.Hadi92.hadi1992 — Function
hadi1992(multivariateData)
Perform the Hadi (1992) algorithm for given multivariate data.
Arguments
- multivariateData::AbstractMatrix{Float64}: Multivariate data.
Description
The algorithm starts with an initial subset and enlarges it to obtain robust covariance matrix and location estimates.
Output
["outliers"]
: Array of indices of outliers["critical.chi.squared"]
: Threshold value for determining being an outlier["rth.robust.distance"]
: rth robust distance, where (r+1)th robust distance is the first one that exceeds the threshold.
Examples
julia> multidata = hcat(hbk.x1, hbk.x2, hbk.x3);
julia> hadi1992(multidata)
Dict{Any,Any} with 3 entries:
"outliers" => [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
"critical.chi.squared" => 7.81473
- "rth.robust.distance" => 5.04541
References
Hadi, Ali S. "Identifying multiple outliers in multivariate data." Journal of the Royal Statistical Society: Series B (Methodological) 54.3 (1992): 761-771.
Marchette & Solka (2003) Data Images
LinRegOutliers.DataImage.dataimage — Function
dataimage(dataMatrix; distance = :mahalanobis)
Generate the Marchette & Solka (2003) data image for a given data matrix.
Arguments
- dataMatrix::AbstractVector{Float64}: Data matrix with dimensions n x p, where n is the number of observations and p is the number of variables.
- distance::Symbol: Optional argument for the distance function.
Notes
distance is :mahalanobis by default, for Mahalanobis distances. Use dataimage(mat, distance = :euclidean) for Euclidean distances.
Examples
julia> x1 = hbk[:,"x1"];
julia> x2 = hbk[:,"x2"];
julia> x3 = hbk[:,"x3"];
julia> mat = hcat(x1, x2, x3);
julia> di = dataimage(mat, distance = :euclidean)
julia> Plots.plot(di)
References
Marchette, David J., and Jeffrey L. Solka. "Using data images for outlier detection." Computational Statistics & Data Analysis 43.4 (2003): 541-552.
Dependencies
This method is enabled when the Plots package is installed and loaded.
Satman's GA based LTS estimation (2012)
LinRegOutliers.GALTS.galts — Function
galts(setting)
Perform the Satman (2012) algorithm for estimating LTS coefficients.
Arguments
- setting: A regression setting object.
Description
The algorithm performs a genetic search for estimating LTS coefficients using C-Steps.
Output
["betas"]
: Robust regression coefficients["best.subset"]
: Clean subset of h observations, where h is an integer greater than n / 2. The default value of h is Int(floor((n + p + 1.0) / 2.0))
.["objective"]
: Objective value
Examples
julia> reg = createRegressionSetting(@formula(calls ~ year), phones);
julia> galts(reg)
Dict{Any,Any} with 3 entries:
"betas" => [-56.5219, 1.16488]
"best.subset" => [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 23, 24]
- "objective" => 3.43133
References
Satman, M. Hakan. "A genetic algorithm based modification on the lts algorithm for large data sets." Communications in Statistics-Simulation and Computation 41.5 (2012): 644-652.
Fischler & Bolles (1981) RANSAC Algorithm
LinRegOutliers.Ransac.ransac — Function
ransac(setting; t, w=0.5, m=0, k=0, d=0, confidence=0.99)
Run the RANSAC (1981) algorithm for the given regression setting.
Arguments
- setting::RegressionSetting: RegressionSetting object with a formula and a dataset.
- t::Float64: The threshold distance of a sample point to the regression hyperplane to determine if it fits the model well.
- w::Float64: The probability of a sample point being an inlier, default=0.5.
- m::Int: The number of points to sample to estimate the model parameters in each iteration. If set to 0, defaults to picking p points, which is the minimum required.
- k::Int: The number of iterations to run. If set to 0, it is calculated according to the formula given in the paper, based on the outlier probability and the sample set size.
- d::Int: The number of close data points required to accept the model. Defaults to the number of data points multiplied by the inlier ratio.
- confidence::Float64: Required to determine the optimum number of iterations if k is not specified.
Output
["outliers"]
: Array of indices of outliers.
Examples
julia> df = DataFrame(y=[0,1,2,3,3,4,10], x=[0,1,2,2,3,4,2])
julia> reg = createRegressionSetting(@formula(y ~ x), df)
julia> ransac(reg, t=0.8, w=0.85)
Dict{String,Array{Int64,1}} with 1 entry:
- "outliers" => [7]
References
Martin A. Fischler & Robert C. Bolles (June 1981). "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography" Comm. ACM. 24 (6): 381–395.
Minimum Covariance Determinant Estimator (MCD)
LinRegOutliers.MVE.mcd — Function
mcd(data; alpha = 0.01)
Performs the Minimum Covariance Determinant algorithm for a robust covariance matrix.
Arguments
- data::DataFrame: Multivariate data.
- alpha::Float64: Probability for quantiles of the Chi-Squared statistic.
Description
mcd searches for a robust location vector and a robust scale matrix, e.g. a covariance matrix. The method also reports a usable diagnostic measure, Mahalanobis distances, which are calculated using the robust counterparts instead of the mean vector and the usual covariance matrix. These Mahalanobis distances are directly comparable with quantiles of a Chi-Squared distribution with p degrees of freedom.
Output
["goal"]
: Objective value["best.subset"]
: Indices of best h-subset of observations["robust.location"]
: Vector of robust location measures["robust.covariance"]
: Robust covariance matrix["squared.mahalanobis"]
: Array of Mahalanobis distances calculated using robust location and scale measures.["chisq.crit"]
: Chisquare quantile used in threshold["alpha"]
: Probability used in calculating the Chisquare quantile, e.g chisq.crit
["outliers"]
: Array of indices of outliers.
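No usage example is given in this docstring. A minimal sketch, analogous to the mve example above and assuming the hbk DataFrame is available (an illustrative assumption):
julia> using LinRegOutliers
julia> result = mcd(hbk[:, ["x1", "x2", "x3"]]);
julia> result["outliers"]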
Notes
The algorithm is implemented using concentration steps as described in the reference paper; however, details about the number of iterations differ slightly.
References
Rousseeuw, Peter J., and Katrien Van Driessen. "A fast algorithm for the minimum covariance determinant estimator." Technometrics 41.3 (1999): 212-223.
Imon (2005) Algorithm
LinRegOutliers.Imon2005.imon2005 — Function
imon2005(setting)
Perform the Imon (2005) algorithm for a given regression setting.
Arguments
- setting::RegressionSetting: A regression setting.
Description
The algorithm estimates the GDFFITS diagnostic, which is an extension of the well-known regression diagnostic DFFITS. Unlike the original, which was used for detecting single outliers, GDFFITS is used for detecting multiple outliers.
Output
["crit"]
: The critical value used["gdffits"]
: Array of GDFFITS diagnostic calculated for observations["outliers"]
: Array of indices of outliers.["betas"]
: Vector of regression coefficients.
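No usage example is given in this docstring. A minimal sketch using the stackloss setting shown elsewhere in these docs:
julia> reg = createRegressionSetting(@formula(stackloss ~ airflow + watertemp + acidcond), stackloss);
julia> result = imon2005(reg);
julia> result["outliers"]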
Notes
The implementation uses LTS rather than LMS as suggested in the paper.
References
A. H. M. Rahmatullah Imon (2005) Identifying multiple influential observations in linear regression, Journal of Applied Statistics, 32:9, 929-946, DOI: 10.1080/02664760500163599
Barratt & Angeris & Boyd (2020) CCF algorithm
LinRegOutliers.CCF.ccf — Function
ccf(setting; starting_lambdas = nothing)
Perform signed gradient descent for clipped convex functions for a given regression setting.
Arguments
- setting::RegressionSetting: RegressionSetting object with a formula and dataset.
- starting_lambdas::AbstractVector{Float64}: Starting values of the weighting parameters used by signed gradient descent.
- alpha::Float64: Loss at which a point is labeled as an outlier (points with loss ≥ alpha will be called outliers).
- max_iter::Int64: Maximum number of iterations to run signed gradient descent.
- beta::Float64: Step size parameter.
- tol::Float64: Tolerance below which convergence is declared.
Output
["betas"]
: Robust regression coefficients[""outliers"]
: Array of indices of outliers[""lambdas"]
: Lambda coefficients estimated in each iteration [""residuals"]
: Regression residuals.
Examples
julia> reg0001 = createRegressionSetting(@formula(calls ~ year), phones);
julia> ccf(reg0001)
Dict{Any,Any} with 4 entries:
"betas" => [-63.4816, 1.30406]
"outliers" => [15, 16, 17, 18, 19, 20]
"lambdas" => [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0 … 2.77556e-17, 2.77556e-17, 0…
"residuals" => [-2.67878, -1.67473, -0.37067, -0.266613, 0.337444, 0.941501, 1.44556, 2.04962, 1…
References
Barratt, S., Angeris, G. & Boyd, S. Minimizing a sum of clipped convex functions. Optim Lett 14, 2443–2459 (2020). https://doi.org/10.1007/s11590-020-01565-4
sourceccf(X, y; starting_lambdas = nothing)
Perform signed gradient descent for clipped convex functions for a given regression setting.
Arguments
X::AbstractMatrix{Float64}
: Design matrix of the linear model.y::AbstractVector{Float64}
: Response vector of the linear model.starting_lambdas::AbstractVector{Float64}
: Starting values of weighting parameters used by signed gradient descent.alpha::Float64
: Loss at which a point is labeled as an outlier. If unspecified, will be chosen as p*mean(residuals.^2), where residuals are OLS residuals.p::Float64
: Points that have squared OLS residual greater than p times the mean squared OLS residual are considered outliers.max_iter::Int64
: Maximum number of iterations to run signed gradient descent.beta::Float64
: Step size parameter.tol::Float64
: Tolerance below which convergence is declared.
Output
["betas"]
: Robust regression coefficients[""outliers"]
: Array of indices of outliers[""lambdas"]
: Lambda coefficients estimated in each iteration [""residuals"]
: Regression residuals.
References
Barratt, S., Angeris, G. & Boyd, S. Minimizing a sum of clipped convex functions. Optim Lett 14, 2443–2459 (2020). https://doi.org/10.1007/s11590-020-01565-4
sourceAtkinson (1994) Forward Search Algorithm
LinRegOutliers.Atkinson94.atkinson94
— Function atkinson94(setting, iters, crit)
Runs the Atkinson94 algorithm to detect outliers using the LMS method.
Arguments
setting::RegressionSetting
: A regression setting object.iters::Int
: Number of random samples.crit::Float64
: Critical value for residuals
Description
The algorithm randomly selects initial basic subsets and applies a very robust method, e.g., lms,
to enlarge the basic subset. In each iteration of the forward search, the best objective value and parameter estimates are stored. These values are also used in Atkinson's Stalactite Plot for a visual investigation of outliers. See atkinsonstalactiteplot
.
Output
["optimum_index"]
: The iteration number in which the minimum objective is obtained["residuals_matrix"]
: Matrix of residuals obtained in each iteration["outliers"]
: Array of indices of detected outliers["objective"]
: Minimum objective value["coef"]
: Estimated regression coefficients["crit"]
: Critical value given by the user.
Examples
julia> reg = createRegressionSetting(@formula(stackloss ~ airflow + watertemp + acidcond), stackloss)
julia> atkinson94(reg)
Dict{Any,Any} with 6 entries:
"optimum_index" => 10
"objective" => 0.799134
"coef" => [-38.3133, 0.745659, 0.432794, 0.0104587]
"crit" => 3.0
References
Atkinson, Anthony C. "Fast very robust methods for the detection of multiple outliers." Journal of the American Statistical Association 89.428 (1994): 1329-1339.
sourceBACON Algorithm (Billor & Hadi & Velleman (2000))
LinRegOutliers.Bacon.bacon
— Function bacon(setting, m, method, alpha)
Run the BACON algorithm to detect outliers on regression data.
Arguments:
setting
: RegressionSetting object with a formula and a dataset.m
: The number of elements to be included in the initial subset.method
: The distance method to use for selecting the points for initial subsetalpha
: The quantile used for cutoff
Description
The BACON (Blocked Adaptive Computationally efficient Outlier Nominators) algorithm, defined in the citation below, has several versions, e.g., BACON for multivariate data and BACON for regression. Since the design matrix of a regression model is multivariate data, BACON for multivariate data is performed in the early stages of the algorithm. After a clean subset of observations is selected, a forward search is applied. Observations with high studentized residuals are reported as outliers.
Output
["outliers"]
: Array of indices of outliers.["betas"]
: Array of estimated coefficients.
Examples
julia> reg = createRegressionSetting(@formula(stackloss ~ airflow + watertemp + acidcond), stackloss)
julia> bacon(reg, m=12)
Dict{String, Vector} with 2 entries:
"betas" => [-37.6525, 0.797686, 0.57734, -0.0670602]
- "outliers" => [1, 3, 4, 21]
References
Billor, Nedret, Ali S. Hadi, and Paul F. Velleman. "BACON: blocked adaptive computationally efficient outlier nominators." Computational statistics & data analysis 34.3 (2000): 279-298.
sourceHadi (1994) Algorithm
LinRegOutliers.Hadi94.hadi1994
— Functionhadi1994(multivariateData)
Perform the Hadi (1994) algorithm for given multivariate data.
Arguments
multivariateData::AbstractMatrix{Float64}
: Multivariate data.
Description
The algorithm starts with an initial subset and enlarges it to obtain robust covariance matrix and location estimates. This algorithm is an extension of hadi1992
.
Output
["outliers"]
: Array of indices of outliers["critical.chi.squared"]
: Threshold value for determining being an outlier["rth.robust.distance"]
: rth robust distance, where (r+1)th robust distance is the first one that exceeds the threshold.
Examples
julia> multidata = hcat(hbk.x1, hbk.x2, hbk.x3);
julia> hadi1994(multidata)
Dict{Any,Any} with 3 entries:
"outliers" => [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
"critical.chi.squared" => 7.81473
- "rth.robust.distance" => 5.04541
References
Hadi, Ali S. "A modification of a method for the detection of outliers in multivariate samples." Journal of the Royal Statistical Society: Series B (Methodological) 56.2 (1994): 393-396.
sourceChatterjee & Mächler (1997)
LinRegOutliers.CM97.cm97
— Functioncm97(setting; maxiter = 1000)
Perform the Chatterjee and Mächler (1997) algorithm for the given regression setting.
Arguments
setting::RegressionSetting
: RegressionSetting object with a formula and dataset.
Description
The algorithm performs an iteratively weighted least squares estimation to obtain robust regression coefficients.
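The sketch below shows a generic iteratively reweighted least squares loop for illustration only; the residual-based weights are an assumption and differ from the specific weighting scheme of Chatterjee & Mächler (1997) that cm97 implements.
using LinearAlgebra, Statistics

function irls_sketch(X, y; maxiter = 1000, tol = 1e-8)
    beta = X \ y                                   # start from the OLS fit
    for _ in 1:maxiter
        r = y - X * beta
        s = max(median(abs.(r)) / 0.6745, eps())   # robust scale of the residuals
        w = 1.0 ./ max.(abs.(r) ./ s, 1.0)         # downweight large residuals (assumed scheme)
        W = Diagonal(w)
        betanew = (X' * W * X) \ (X' * W * y)      # weighted least squares step
        norm(betanew - beta) < tol && return betanew
        beta = betanew
    end
    return beta
end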
Output
["betas"]
: Robust regression coefficients["iterations"]
: Number of iterations performed["converged"]
: true if the algorithm converges, otherwise, false.
Examples
julia> myreg = createRegressionSetting(@formula(stackloss ~ airflow + watertemp + acidcond), stackloss)
julia> result = cm97(myreg)
Dict{String,Any} with 3 entries:
"betas" => [-37.0007, 0.839285, 0.632333, -0.113208]
"iterations" => 22
- "converged" => true
References
Chatterjee, Samprit, and Martin Mächler. "Robust regression: A weighted least squares approach." Communications in Statistics-Theory and Methods 26.6 (1997): 1381-1394.
sourceQuantile Regression
LinRegOutliers.QuantileRegression.quantileregression
— Functionquantileregression(setting; tau = 0.5)
Perform Quantile Regression for a given regression setting (multiple linear regression).
Arguments
setting::RegressionSetting
: RegressionSetting object with a formula and dataset.tau::Float64
: Quantile level. Default is 0.5.
Description
The Quantile Regression estimator searches for the regression parameter estimates that minimize the objective of the linear program

Min z = (1 - tau) (u1(-) + u2(-) + ... + un(-)) + tau (u1(+) + u2(+) + ... + un(+))

Subject to:
y1 - beta0 - beta1 * x1 + u1(-) - u1(+) = 0
y2 - beta0 - beta1 * x2 + u2(-) - u2(+) = 0
.
.
.
yn - beta0 - beta1 * xn + un(-) - un(+) = 0

where ui(-), ui(+) >= 0 for i = 1, 2, ..., n, beta0, beta1 in R, n is the number of observations, and the model is y = beta0 + beta1 * x + u.
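Solving this linear program is equivalent to minimizing the pinball (quantile) loss of the residuals. The short sketch below evaluates that objective for candidate coefficients; it is an illustration, not the package's LP-based implementation.
# Positive residuals are weighted by tau, negative residuals by (1 - tau).
pinball(u, tau) = u >= 0 ? tau * u : (tau - 1) * u

quantile_objective(X, y, beta, tau) = sum(pinball.(y .- X * beta, tau))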
Output
["betas"]
: Estimated regression coefficients["residuals"]
: Regression residuals["model"]
: Linear Programming Model
Examples
julia> reg0001 = createRegressionSetting(@formula(calls ~ year), phones);
julia> quantileregression(reg0001)
sourcequantileregression(X, y, tau = 0.5)
Estimates the parameters of a linear regression using the Quantile Regression estimator for a given design matrix and response vector.
Arguments
X::AbstractMatrix{Float64}
: Design matrix of the linear model.y::AbstractVector{Float64}
: Response vector of the linear model.tau::Float64
: Quantile level. Default is 0.5.
Examples
julia> income = [420.157651, 541.411707, 901.157457, 639.080229, 750.875606];
julia> foodexp = [255.839425, 310.958667, 485.680014, 402.997356, 495.560775];
julia> n = length(income)
julia> X = hcat(ones(Float64, n), income)
julia> result = quantileregression(X, foodexp, tau = 0.25)
sourceTheil-Sen estimator for multiple regression
LinRegOutliers.TheilSen.theilsen
— Functiontheilsen(setting, m, nsamples = 5000)
Theil-Sen estimator for multiple regression.
Arguments
setting::RegressionSetting
: RegressionSetting object with a formula and dataset.m::Int
: Number of observations to be used in each iteration. This number must be in the range [p, n], where p is the number of regressors and n is the number of observations.nsamples::Int
: Number of m-samples. Default is 5000.
Description
The function starts with a regression formula and a dataset. The number of observations to be used in each iteration is specified by the user. The function then randomly selects m observations from the dataset and performs an ordinary least squares estimation. The estimated coefficients are saved. The process is repeated until nsamples regressions have been estimated. The multivariate median of the estimated coefficients is then calculated; in this case, the multivariate median is the point that minimizes the sum of distances to all the estimated coefficients. The Hooke & Jeeves algorithm is used to solve this optimization problem.
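A minimal sketch of this subsampling idea (not the package implementation); for simplicity it combines the subsample fits with a coordinate-wise median instead of the spatial median that theilsen computes via Hooke & Jeeves.
using Random, Statistics

function theilsen_sketch(X::AbstractMatrix, y::AbstractVector, m::Int; nsamples::Int = 5000)
    n, p = size(X)
    betas = Matrix{Float64}(undef, nsamples, p)
    for i in 1:nsamples
        idx = randperm(n)[1:m]            # random m-subset of observations
        betas[i, :] = X[idx, :] \ y[idx]  # OLS fit on the subset
    end
    return vec(median(betas, dims = 1))   # combine the nsamples estimates
end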
References
Dang, X., Peng, H., Wang, X., & Zhang, H. (2008). Theil-sen estimators in a multiple linear regression model. Olemiss Edu.
sourceDeepest Regression Estimator
LinRegOutliers.DeepestRegression.deepestregression
— Functiondeepestregression(setting; maxit = 1000)
Estimate Deepest Regression parameters.
Arguments
setting::RegressionSetting
: RegressionSetting object with a formula and dataset.maxit
: Maximum number of iterations
Description
Estimates Deepest Regression Estimator coefficients.
References
Van Aelst S., Rousseeuw P.J., Hubert M., Struyf A. (2002). The deepest regression method. Journal of Multivariate Analysis, 81, 138-166.
Output
betas
: Vector of regression coefficients estimated.
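A usage sketch, assuming the stackloss regression setting used in the other examples on this page.
julia> setting = createRegressionSetting(@formula(stackloss ~ airflow + watertemp + acidcond), stackloss);

julia> result = deepestregression(setting)   # estimated coefficients, as documented under Output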
source