[Bug]: the result of Lasso learner is different from others #149
-
Describe the bug

Hi, the DML package is really useful for me and I am using it to conduct my master thesis. I have tried LightGBM/RF/XGBoost/Lasso as learners. The results of LightGBM/RF/XGBoost are similar, but the results of Lasso are rather different. The following is a part of the results. Can you help me with this issue?

Minimum reproducible code snippet

```r
LassoFormula = xnames[1]
for (name in xnames[-1]) {
  LassoFormula = ...
}
formula(LassoFormula)  # create the formula
################################
LassoDMLLasso = function(yname) {
  data_dml_flex = DoubleMLData$new(model_data, ...)
}
```

Expected Result

I think the results of different learners should be similar.

Actual Result
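As additional context for the snippet above, here is a minimal sketch of a comparable lasso setup with the R DoubleML package. The names model_data, xnames, yname and dname are placeholders, and the choice of the mlr3 learner regr.cv_glmnet with lambda.min is illustrative rather than the exact configuration used here:

```r
library(DoubleML)
library(mlr3)
library(mlr3learners)

# model_data is assumed to be a data.table; yname / dname / xnames stand in
# for the outcome, treatment and covariate columns
data_dml = DoubleMLData$new(model_data,
                            y_col = yname,
                            d_cols = dname,
                            x_cols = xnames)

# Cross-validated lasso for both nuisance functions
ml_l = lrn("regr.cv_glmnet", s = "lambda.min")
ml_m = lrn("regr.cv_glmnet", s = "lambda.min")

dml_plr = DoubleMLPLR$new(data_dml, ml_l, ml_m, n_folds = 3)
dml_plr$fit()
dml_plr$summary()
```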
Versions
-
Hello @victoriasunsun,

thank you for opening this issue/discussion. We moved your issue to the discussions because we believe that it's not really concerning a bug. I guess it rather depends on the learners' performance. In our example notebooks, e.g., on the 401(k) example, the performance of lasso is comparable to the other learners (which does not necessarily have to be the case).

When applying double machine learning for causal inference, ML methods are used to approximate potentially high-dimensional and/or complex nuisance functions. When you obtain different results with different ML methods, it is a good idea to check the first-stage predictions. If the different ML methods have a similar prediction quality in the first stage and the estimates for the causal parameters are still very different, this would be problematic.

It's a bit hard to really see what's going wrong in your example. IMO there's no guarantee that all learners lead to the same results.
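To make such a first-stage check concrete, here is a minimal sketch on the 401(k) data shipped with the package. It assumes the store_predictions argument of fit() and that the stored predictions are accessible under the entry ml_l; the exact argument and entry names may differ across package versions, so treat this as a sketch rather than exact API usage:

```r
library(DoubleML)
library(mlr3)
library(mlr3learners)

data_dml = fetch_401k(return_type = "DoubleMLData")

# Fit the partially linear model with a given nuisance learner and keep
# the cross-fitted first-stage predictions
fit_with = function(learner_name) {
  obj = DoubleMLPLR$new(data_dml, lrn(learner_name), lrn(learner_name),
                        n_folds = 3)
  obj$fit(store_predictions = TRUE)
  obj
}

plr_lasso  = fit_with("regr.cv_glmnet")
plr_forest = fit_with("regr.ranger")

# Out-of-fold prediction quality for the outcome nuisance E[Y | X]:
# similar RMSEs but very different coefficient estimates would be a warning sign
rmse = function(y, yhat) sqrt(mean((y - yhat)^2))
y = data_dml$data[[data_dml$y_col]]
c(lasso  = rmse(y, plr_lasso$predictions$ml_l),
  forest = rmse(y, plr_forest$predictions$ml_l))
```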
The tree-based methods are more flexible in terms of fitting the nuisance components due to their nonlinearity, whereas lasso is based on a linear model. Also, an interacted model used in lasso, like the one in your code, does not necessarily achieve the same flexibility as the tree-based methods, because the generated interactions might not match the structure of the trees in random forests etc. To be more specific: maybe the effect of some covariates is nonlinear or involves interactions that the lasso specification does not capture.

The performance of the learners might also depend on the choice of the parameters. Note that for the cross-fitting the sample is split into three folds, so each nuisance model is trained on only a part of the data. And of course, the ratio of p (number of variables) and n (number of observations) plays a role, too. How many natural and constructed covariates do you use?

Have you tried out a standard linear regression and logistic regression for estimation of the nuisance parts? It might be interesting to see whether the performance is similar to lasso, which might happen in case p is not big compared to n. Also, you may want to use the option for storing the first-stage predictions, which makes it easier to evaluate them.
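As a rough illustration of that comparison, here is a minimal sketch on the 401(k) data. The learner names regr.lm and regr.cv_glmnet come from mlr3learners and are used here for both nuisance parts as an assumption; with a binary treatment, a classification learner such as classif.log_reg or classif.cv_glmnet could be used for the treatment nuisance instead:

```r
library(DoubleML)
library(mlr3)
library(mlr3learners)

data_dml = fetch_401k(return_type = "DoubleMLData")

# Unpenalized linear regression for both nuisance parts
plr_ols = DoubleMLPLR$new(data_dml, lrn("regr.lm"), lrn("regr.lm"),
                          n_folds = 3)
plr_ols$fit()

# Cross-validated lasso for both nuisance parts
plr_lasso = DoubleMLPLR$new(data_dml,
                            lrn("regr.cv_glmnet", s = "lambda.min"),
                            lrn("regr.cv_glmnet", s = "lambda.min"),
                            n_folds = 3)
plr_lasso$fit()

# If p is small relative to n, the two coefficient estimates should be close
plr_ols$summary()
plr_lasso$summary()
```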
I hope this helps you a bit in finding out what's going on in your application. Let us know if you gain some insights and want to share some lessons learned...

Thanks again and best,
Philipp