
check_outliers paper #544

Merged: 44 commits into easystats:main on Mar 31, 2023

Conversation

@rempsyc (Member) commented Jan 18, 2023

Draft of the check_outliers paper (re: https://github.com/orgs/easystats/teams/core-team/discussions/5).

Looking forward to any and all feedback and contributions @easystats/core-team!

PS: reminder to use [skip ci] in your pushes (don't keep forgetting like me!!).

@rempsyc (Member, Author) commented Jan 18, 2023

This is the link to the PDF (to see the current look): https://github.com/easystats/performance/blob/fcec6306987475b250ffb063b1c8e6e2ba3bbf48/papers/Mathematics/paper.pdf

This is the link to the rmarkdown file (the one to edit for any changes or comments): https://github.com/easystats/performance/blob/fcec6306987475b250ffb063b1c8e6e2ba3bbf48/papers/Mathematics/paper.Rmd

@mattansb (Member):

I've added some edits to reflect my own POV on the topic - specifically, that outlier detection tools are only part of the picture: a tool to be used by humans to "flag" suspect outliers, while the final decision should align with domain knowledge. Happy to smooth those points out some more (:

I also know very little about the maths of these methods - but did you know you can just select-copy formulas from Wikipedia and they paste as latex??

@codecov-commenter commented Jan 22, 2023

Codecov Report

Merging #544 (02b4445) into main (9793e40) will not change coverage.
The diff coverage is n/a.

❗ Current head 02b4445 differs from pull request most recent head f84823d. Consider uploading reports for the commit f84823d to get more accurate results


@@           Coverage Diff           @@
##             main     #544   +/-   ##
=======================================
  Coverage   48.32%   48.32%           
=======================================
  Files          84       84           
  Lines        5507     5507           
=======================================
  Hits         2661     2661           
  Misses       2846     2846           


@rempsyc (Member, Author) commented Jan 23, 2023

@mattansb @IndrajeetPatil thanks for your changes!

> I also know very little about the maths of these methods - but did you know you can just select-copy formulas from Wikipedia and they paste as latex??

Cool! Didn't know!

> Should there be equations? The title says "accessible" introduction to outlier detection methods, but if we don't do a good job of explaining these equations, we might compromise on accessibility.

Very good point. I'm not a fan of maths myself but just tried adding some because the journal is called Mathematics. @DominiqueMakowski WDYT? Should we remove all equations or explain them at length?

@rempsyc rempsyc marked this pull request as draft January 25, 2023 22:26
@rempsyc (Member, Author) commented Feb 5, 2023

@easystats/core-team I have integrated the earlier comments for the outliers paper. A reminder that there are less than 3 weeks before the submission deadline, so now is a great time for contributions from anyone who hasn't contributed yet. I would like to keep the last week before submission to finalize any remaining details (e.g., writing the abstract if no one else has).

Here are a few "easy" things you can still contribute to (beyond expertise-based feedback, of course):

1. Writing the abstract
2. I feel like one weakness of the paper is that we rely almost exclusively on the two key Leys/Lakens papers. Perhaps adding a bit of citation variety would make the paper stronger.
3. Chip in on whether to keep or remove the formulas
4. Flesh out a proper paragraph based on @mattansb's comment on model-based outliers below:

> Something something... what is leverage... why we should care if a few observations have (relatively) strong leverage (answer: they are suspect of biasing our estimates!).

@rempsyc (Member, Author) commented Feb 5, 2023

@bwiernik, it would also be great if you could go over the paper if you have the time. Here are a few lingering stats questions beyond my competence level:

1. I know you prefer model-based outlier detection methods over multivariate methods (and I now do as well). Do we have any references to back this, or is it just common sense/opinion? How would you write a justification for this in the paper?
2. I write that the model-based influence plot represents the Cook's distance. Is that wrong? Is it just leverage/influential observations? Then how should I describe it in the caption? What is the relationship to Cook's distance if that was the selected method?
3. For z-score methods, what's our justification for using a threshold of 3.09 instead of 3, as suggested by Leys et al. (2019)? Anything we can cite to support this discrepancy or justify it? Or should we change it to align with recommendations?
4. Leys et al. (2019) write,

> "Do not use the mean or variance as indicators but the MAD for univariate outliers, with a cutoff of 3 (for more information see Leys et al. 2013), or the MCD75 (breakdown point = 0.25) (or the MCD50 (breakdown point = 0.5) if you suspect the presence of more than 25% of outlying values)."

We use a threshold of stats::qchisq(p = 1 - 0.001, df = ncol(x)) (18.467 in the example in the paper). So how do these thresholds compare? Should we align our threshold with this recommendation, or if not, how do we justify ours?
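For context, here is a stdlib-only Python sketch of how that chi-squared cutoff behaves (not part of the paper's R code; the df = 4 below is an assumption, chosen because it reproduces the 18.467 mentioned above). R's stats::qchisq can be approximated with the Wilson-Hilferty transformation:

```python
from statistics import NormalDist

def approx_qchisq(p, df):
    """Wilson-Hilferty approximation to the chi-squared quantile function."""
    z = NormalDist().inv_cdf(p)  # standard normal quantile for p
    return df * (1 - 2 / (9 * df) + z * (2 / (9 * df)) ** 0.5) ** 3

# df = 4 is assumed here; the approximation lands within ~1.5% of the
# exact qchisq(0.999, 4) = 18.467 used in the paper's example.
print(round(approx_qchisq(0.999, 4), 2))
```

The approximation is only for intuition; the exact value comes from R's qchisq (or scipy.stats.chi2.ppf in Python).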

@bwiernik (Contributor) commented Feb 5, 2023

I'll take a look tonight

@bwiernik (Contributor) commented Feb 5, 2023

For 3, it's the critical value for p < .001 (i.e., qnorm(.999)). That is consistent with all of our other thresholds. That also seems to be what the "3" in Leys et al. is an approximation of. I suggest being consistent across indices and using the critical value for .999 (so 3.09023…)
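As a quick sanity check (a Python sketch using only the standard library, equivalent to the R call qnorm(.999) mentioned above), the critical value works out as described:

```python
from statistics import NormalDist

# One-sided critical value for p < .001, i.e. the 99.9th percentile
# of the standard normal distribution (R's qnorm(.999)).
z_crit = NormalDist().inv_cdf(0.999)
print(round(z_crit, 5))  # → 3.09023
```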

@rempsyc (Member, Author) commented Feb 7, 2023

For 1, seems like Leys (the people we keep citing) actually explicitly recommends robust Mahalanobis (MCD) over Cook's distance in one of their papers I have missed:

> Cook's distance and leverage methods
>
> Among outlier detection methods, Cook's distance and leverage are less common than the basic Mahalanobis distance, but still used. Cook's distance estimates the variations in regression coefficients after removing each observation, one by one (Cook, 1977). Therefore, as soon as there is more than one outlying value, the remaining outliers influence the estimators. As for the leverage method, it provides the same information as the Mahalanobis distance (Cohen et al., 2003): It is based on the study of residuals and their distance from the mean vector (e.g. Thode, 2002), which are computed using mean and variance, still polluted by outliers. This is why we recommend to use robust procedures to estimate the position μ and the scatter matrix Σ. The aim of robust methods is to estimate the location μ and the scatter matrix Σ even though the data has been contaminated. We introduce in the remaining of the text the robust method called Minimum Covariance Determinant.

https://doi.org/10.1016/j.jesp.2017.09.011
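For reference (a standard textbook definition, not taken from the quoted paper), the Cook's distance being discussed is, for observation $i$:

```latex
D_i = \frac{\sum_{j=1}^{n} \left( \hat{y}_j - \hat{y}_{j(i)} \right)^2}{p \, \hat{\sigma}^2}
```

where $\hat{y}_{j(i)}$ is the fitted value for observation $j$ when the model is refit without observation $i$, $p$ is the number of model parameters, and $\hat{\sigma}^2$ the mean squared error. This makes Leys et al.'s objection concrete: each $\hat{y}_{j(i)}$ still comes from a fit contaminated by any remaining outliers.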

So... do we have robust model-based outlier detection methods? In the paper I was going to suggest always prioritizing model-based methods, but given the above, what does that change?

The paper below, based on a simulation comparing Cook to Mahalanobis, also does not recommend Cook:

http://article.sapub.org/10.5923.j.ajms.20150501.06.html

@bwiernik (Contributor) commented Feb 9, 2023

I pretty strongly disagree with them. If there are a bunch of outliers you're concerned about then you should use a robust regression method. The idea of results-agnostic outlier evaluation is pretty silly to me.

The overarching philosophy I think is good to communicate is that unusual events happen, and we shouldn't arbitrarily discard them. We should expect to see some "outliers" in a decently sized sample. Approaches like Mahalanobis D + heuristic tend to flag too many "outliers". If you are concerned about outliers, use a robust model like T regression or quantile regression.
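bwiernik's point can be illustrated with a toy sketch (Python, stdlib only; Huber M-estimation is used here as a simple stand-in for the Student-t or quantile regressions he mentions, and none of this is the easystats implementation): a single gross outlier drags the ordinary least-squares slope, while downweighting large residuals recovers the underlying trend without discarding the point.

```python
from statistics import median

def fit_wls(x, y, w):
    """Closed-form weighted least squares for a line y = a + b*x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b  # intercept, slope

def huber_line(x, y, k=1.345, iters=50):
    """Iteratively reweighted least squares with Huber weights."""
    w = [1.0] * len(x)
    for _ in range(iters):
        a, b = fit_wls(x, y, w)
        r = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        m = median(r)
        # Robust residual scale: 1.4826 * MAD (floored to avoid division by zero)
        s = max(1.4826 * median([abs(ri - m) for ri in r]), 1e-9)
        # Weight 1 for small standardized residuals, shrinking for large ones
        w = [1.0 if abs(ri / s) <= k else k / abs(ri / s) for ri in r]
    return fit_wls(x, y, w)

x = list(range(10))
y = [2 * xi + 0.1 * ((-1) ** xi) for xi in x]  # true slope 2, tiny noise
y[9] = 100                                     # one gross outlier

a_ols, b_ols = fit_wls(x, y, [1.0] * len(x))
a_rob, b_rob = huber_line(x, y)
print(round(b_ols, 2), round(b_rob, 2))  # OLS slope is inflated; robust ≈ 2
```

The outlier is never deleted; its weight just shrinks toward zero as its standardized residual grows, which is the "use a robust model instead of discarding" philosophy in miniature.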

@mattansb (Member) commented Feb 9, 2023

💯% agree with @bwiernik (which was also the point I was trying to get across in the suggestions/edits I made).

@rempsyc (Member, Author) commented Feb 9, 2023

Ok. Good. Very good actually because this disagreement could make for an even more interesting, meaningful, and unique point of contribution.

@mattansb (Member) replied with a GIF (LetThemGIF).

@rempsyc (Member, Author) commented Feb 11, 2023

Still from Leys et al. (2018): in the intro, they're actually calling Cook and leverage (and so, indirectly, easystats) questionable methods 🤔

> A survey made in the same journals as those used by Leys et al. (2013), namely the Journal of Personality and Social Psychology (JPSP) and Psychological Science (PS), revealed that few researchers seem to mind about multivariate outliers. [...] From these 24 papers, nine used the basic Mahalanobis distance, five used another criterion (leverage using Student-t residuals or Cook's distance), and ten did not provide any information about the detection strategy. [...] This means that for over 97.5% of this type of multivariate analyses, either researchers did not search for multivariate outliers or they did not report any information about it. The 16 [15*?] other teams looked for multivariate outliers, but either with a questionable method or without providing information about the method.

Ok, the war is on


@rempsyc (Member, Author) commented Feb 12, 2023

Ok, I have made some new changes to the paper. I am by no means an expert on statistics, and even less so on statistical outlier detection methods. So your feedback is important to make sure I'm not inadvertently writing bs :-)

@easystats/core-team this is also the last week to make contributions to the paper, as next weekend I would like to wrap it up so we have a few days to review the final version before submission. Thanks!

@rempsyc feel free to accept/reject ;)
Apologies the changelog looks a bit like a mess 🙈
@rempsyc (Member, Author) commented Feb 20, 2023

We haven't posted it as a preprint though. Should I have done so? It kind of is here on GitHub 😬 But maybe it's not too late to formally do it now if you think that's relevant/useful?

@rempsyc rempsyc removed the High priority 🏃 This issue should be addressed soon label Feb 25, 2023
@rempsyc (Member, Author) commented Feb 25, 2023

Manuscript update: under review.


@IndrajeetPatil (Member):

Now that we have submitted this paper, shouldn't we (squash and) merge the PR?

@rempsyc (Member, Author) commented Mar 19, 2023

Reviewer 1:

  • In this paper, the authors have shown how to search for outliers using the check_outliers() function of the {performance} package while following current good practices. This contribution will help researchers engage in good search practices while providing an outlier detection experience. They cover univariate, multivariate, and model-based statistical outlier detection methods, their recommended threshold, standard output, and plotting methods. The present paper represents the subject of check the outliers, An accessible introduction to identifying statistical outliers in R with easystats. The study design and methods appropriate for the research question. The results presented clearly and accurately. The authors logically explain the findings. I highly recommend the publication of this paper in Mathematics Journal.

@rempsyc (Member, Author) commented Mar 19, 2023

Reviewer 2:

 The authors propose the illustration of an R library for detecting outliers. The paper is quite clear, however, the following remarks are provided to improve the article

  • The word accessible in the title should be removed.
  • The authors should clarify whether the functions implemented in the library were implemented only for continuous variables or other types of variables are also considered.
  • The author's reference to the Normal distribution on page 2, line 31 is unclear. They should clarify or delete this sentence.
  • The reference to the mathematical score is unclear on page 2, line 36. I think this reference should be removed.
  • A critical review on the analysis of outliers is proposed in this article which must be cited

Riani, M., & Atkinson, A. C. (2020). Robust regression methods in machine learning. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 10(2), e1359.

  • Page 3, lines 101-102: It is unclear what the authors mean when referring to BCA, a technique used in non-parametric bootstrapping, but the paper does not mention this.
  • It is necessary to change the x-axis labels of all the figures as the numbers overlap; they should be written in a smaller font or otherwise eliminate some of them since they are not readable. I am talking about the figures referring to the histogram of the outliers.
  • Page seven, line 146, it is better to write association instead of relationship.
  • Page 7, line 154, the authors should specify what they mean by compatible regression models.
  • Page 7, code of the first chunk, the authors use the iqr method, but this is not discussed above. They must add details of all the methods they show in the output.
  • Page 9, line 247, it is difficult to follow the discussion for those unfamiliar with the R datawizard package. What does it do? How does the proposed function tie in with those in the package?
  • Page 10, line 272, what does " Registration " mean?
  • Check the reference style because it is not uniform

@rempsyc (Member, Author) commented Mar 19, 2023

Reviewer 3:

  • The paper outlines techniques for processing data with inhomogeneities, such as outliers. However, the style of the article does not correspond to scientific research. The work is educational and auxiliary in nature and is more in line with educational publications, including popular scientific ones.

@rempsyc (Member, Author) commented Mar 20, 2023

> shouldn't we (squash and) merge the PR?

I think we can wait until we get the final accepted version, after integrating all the reviewer feedback, etc.

@rempsyc (Member, Author) commented Mar 25, 2023

Editor comments:

  • Less convinced as a scientific research is the major concern. I cannot find a solid contribution for the statistical methodologies, new packages, or sound applications in this study. Almost all the obtained conclusions are known. The readers would be happy to see more convincing results from good real examples from the real world if only the well known statistical methods and packages are used in this study. The tutorial-like results using the examples in R are too weak for an academic paper. It is contributive if the authors can include real examples in different areas to show the power of the used package and present the insight findings, not just show how to use the packages. Then, resubmit the revision as a new paper for review.

@rempsyc (Member, Author) commented Mar 26, 2023

Reviewer 1: All good.


Reviewer 2:

> A critical review on the analysis of outliers is proposed in this article which must be cited:
> Riani, M., & Atkinson, A. C. (2020). Robust regression methods in machine learning. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 10(2), e1359.

This reference does not seem to exist, neither by name nor other info. Like it should be volume 10, issue 2 here, but it's not there. Even the e1359 refers to a completely different article. Anything else we should cite @bwiernik?

> Page 3, lines 101-102: It is unclear what the authors mean when referring to BCA, a technique used in non-parametric bootstrapping, but the paper does not mention this.

We say those methods are documented in the help page, so I think we can ignore this? I don’t think we’re going to explain BCA here…

> It is necessary to change the x-axis labels of all the figures as the numbers overlap; they should be written in a smaller font or otherwise eliminate some of them since they are not readable. I am talking about the figures referring to the histogram of the outliers.

WIP easystats/see#262

> Page 10, line 272, what does "Registration" mean?

Ok I know this was a mathematical journal so reviewers may not be familiar with the word preregistration. But now that we’re submitting to a psych journal, I think that’s pretty ubiquitous.

> Check the reference style because it is not uniform

Looks good to me? 🤔 Not APA but that’s because of the MDPI template.


Editor comment:

> The readers would be happy to see more convincing results from good real examples from the real world if only the well known statistical methods and packages are used in this study. […] It is contributive if the authors can include real examples in different areas to show the power of the used package and present the insight findings, not just show how to use the packages.

I added the code example of height and weight as we discussed. Think it’s sufficient?


Reviewer 3: I think that we make at least two original contributions: first, by arguing in favour of model-based methods instead of univariate or multivariate ones; second, by proposing a novel method, namely the combination of multiple methods. With the added real-life demo of height and weight, I think we have this covered.

@rempsyc (Member, Author) commented Mar 26, 2023

I thought we didn't need to merge because we would address the reviewers comments and it would get accepted right away. But now that we need to change journal, I think we should squash and merge this one. And then I will create another PR for Collabra to start with a cleaner workspace. Does that make sense?

@bwiernik (Contributor):

> by name nor other info. Like it should be volume 10, issue 2 here, but it's not there. Even the e1359 refers to a completely different article. Anything else we should cite @bwiernik?

Let's just cite the idea of using an outlier-robust distribution like Student T instead of normal, or negative binomial instead of Poisson. For that a good cite is Statistical Rethinking, Chapter 4

@rempsyc (Member, Author) commented Mar 27, 2023

> Statistical Rethinking, Chapter 4

Thanks. Couldn't find anything with keywords "robust" in Chapter 4. However, there is a small mention of it in Chapter 7 (Ulysses' Compass, of the 2020 edition only), I think this must be it:

> One way to both use these extreme observations and reduce their influence is to employ some kind of robust regression. A "robust regression" can mean many different things, but usually it indicates a linear model in which the influence of extreme observations is reduced. A common and useful kind of robust regression is to replace the Gaussian model with a thicker-tailed distribution like Student's t (or "Student-t") distribution.

The negative binomial bit is mentioned in Chapter 11. So I think I'll cite the whole book instead of a specific chapter.

@bwiernik (Contributor):

There is a section in chapter 4 on using student T distributions

@rempsyc (Member, Author) commented Mar 29, 2023

@strengejacke what do you think of the changes? Do you think we can merge this and then I can start a new PR for Collabra?

@strengejacke (Member):

@rempsyc Thanks, your revisions so far look good! Let's merge and open a new PR, right?

@rempsyc (Member, Author) commented Mar 31, 2023

Yes please :)

@strengejacke strengejacke merged commit c09c190 into easystats:main Mar 31, 2023
@strengejacke (Member):

ok, just open a new PR whenever you want to / can continue with preparing the manuscript.
