
check_outliers paper #544

Merged: 44 commits into easystats:main on Mar 31, 2023

Conversation

@rempsyc (Member) commented Jan 18, 2023

Draft of the check_outliers paper (re: https://github.com/orgs/easystats/teams/core-team/discussions/5).

Looking forward to any and all feedback and contributions @easystats/core-team!

PS: reminder to use [skip ci] in your pushes (don't keep forgetting like me!!).

@rempsyc (Member, Author) commented Jan 18, 2023

This is the link to the PDF (to see the current look): https://github.com/easystats/performance/blob/fcec6306987475b250ffb063b1c8e6e2ba3bbf48/papers/Mathematics/paper.pdf

This is the link to the rmarkdown file (the one to edit for any changes or comments): https://github.com/easystats/performance/blob/fcec6306987475b250ffb063b1c8e6e2ba3bbf48/papers/Mathematics/paper.Rmd

@mattansb (Member):

I've added some edits to reflect my own POV on the topic - specifically, that outlier detection tools are only part of the picture: a tool to be used by humans to "flag" suspect outliers, while the final decision should align with domain knowledge. Happy to smooth those points out some more (:

I also know very little about the maths of these methods - but did you know you can just select-copy formulas from Wikipedia and they paste as latex??

@codecov-commenter commented Jan 22, 2023

Codecov Report

Merging #544 (02b4445) into main (9793e40) will not change coverage.
The diff coverage is n/a.

❗ Current head 02b4445 differs from pull request most recent head f84823d. Consider uploading reports for the commit f84823d to get more accurate results


@@           Coverage Diff           @@
##             main     #544   +/-   ##
=======================================
  Coverage   48.32%   48.32%           
=======================================
  Files          84       84           
  Lines        5507     5507           
=======================================
  Hits         2661     2661           
  Misses       2846     2846           


@rempsyc (Member, Author) commented Jan 23, 2023

@mattansb @IndrajeetPatil thanks for your changes!

> I also know very little about the maths of these methods - but did you know you can just select-copy formulas from Wikipedia and they paste as latex??

Cool! Didn't know!

> Should there be equations? The title says "accessible" introduction to outlier detection methods, but if we don't do a good job of explaining these equations, we might compromise on accessibility.

Very good point. I'm not a fan of maths myself but just tried adding some because the journal is called Mathematics. @DominiqueMakowski WDYT? Should we remove all equations or explain them at length?

@rempsyc rempsyc marked this pull request as draft January 25, 2023 22:26
@rempsyc (Member, Author) commented Feb 5, 2023

@easystats/core-team I have integrated the earlier comments for the outliers paper. A reminder that there are less than 3 weeks before the submission deadline, so now is a great time for contributions from anyone who hasn't contributed yet. I would like to keep the last week before submission to finalize any remaining details (e.g., writing the abstract if no one else has).

Here are a few "easy" things you can still contribute to (beyond expertise-based feedback, of course):

1. Writing the abstract
2. I feel like one weakness of the paper is that we rely almost exclusively on the two key Leys/Lakens papers. Perhaps adding a bit of citation variety would make the paper stronger.
3. Chip in on whether to keep or remove the formulas
4. Flesh out a proper paragraph based on @mattansb's comment on model-based outliers below:

> Something something... what is leverage... why we should care if a few observations have (relatively) strong leverage (answer: they are suspect of biasing our estimates!).

@rempsyc (Member, Author) commented Feb 5, 2023

@bwiernik, it would also be great if you could go over the paper if you have the time. Here are a few lingering stats questions beyond my competence level:

1. I know you prefer model-based outlier detection methods over multivariate methods (and I now do as well). Do we have any references to back this, or is it just common sense/opinion? How would you write a justification for this in the paper?
2. I write that the model-based influence plot represents the Cook's distance. Is that wrong? Is it just leverage/influential observations? Then how should I describe it in the caption? What is the relationship to Cook's distance if that was the selected method?
3. For z-score methods, what's our justification for using a threshold of 3.09 instead of 3, as suggested by Leys et al. (2019)? Anything we can cite to support this discrepancy or justify it? Or should we change it to align with recommendations?
4. Leys et al. (2019) write,

> "Do not use the mean or variance as indicators but the MAD for univariate outliers, with a cutoff of 3 (for more information see Leys et al. 2013), or the MCD75 (breakdown point = 0.25) (or the MCD50 (breakdown point = 0.5) if you suspect the presence of more than 25% of outlying values)."

We use a threshold of stats::qchisq(p = 1 - 0.001, df = ncol(x)) (18.467 in the example in the paper). So how do these thresholds compare? Should we align our threshold with this recommendation, or if not, how do we justify ours?
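For context, here is a stdlib-only Python sketch of how that chi-squared cutoff behaves (not part of the paper's R code; the df = 4 below is an assumption, chosen because it reproduces the 18.467 mentioned above). R's stats::qchisq can be approximated with the Wilson-Hilferty transformation:

```python
from statistics import NormalDist

def approx_qchisq(p, df):
    """Wilson-Hilferty approximation to the chi-squared quantile function."""
    z = NormalDist().inv_cdf(p)  # standard normal quantile for p
    return df * (1 - 2 / (9 * df) + z * (2 / (9 * df)) ** 0.5) ** 3

# df = 4 is assumed here; the approximation lands within ~1.5% of the
# exact qchisq(0.999, 4) = 18.467 used in the paper's example.
print(round(approx_qchisq(0.999, 4), 2))
```

The approximation is only for intuition; the exact value comes from R's qchisq (or scipy.stats.chi2.ppf in Python).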

@bwiernik (Contributor) commented Feb 5, 2023

I'll take a look tonight

@bwiernik (Contributor) commented Feb 5, 2023

For 3, it's the critical value for p < .001 (i.e., qnorm(.999)). That is consistent with all of our other thresholds. That also seems to be what the "3" in Leys et al. is an approximation of. I suggest being consistent across indices and using the critical value for .999 (so 3.09023…)
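As a quick sanity check (a Python sketch using only the standard library, equivalent to the R call qnorm(.999) mentioned above), the critical value works out as described:

```python
from statistics import NormalDist

# One-sided critical value for p < .001, i.e. the 99.9th percentile
# of the standard normal distribution (R's qnorm(.999)).
z_crit = NormalDist().inv_cdf(0.999)
print(round(z_crit, 5))  # → 3.09023
```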

@rempsyc (Member, Author) commented Feb 7, 2023

For 1, seems like Leys (the people we keep citing) actually explicitly recommends robust Mahalanobis (MCD) over Cook's distance in one of their papers I have missed:

> Cook's distance and leverage methods
>
> Among outlier detection methods, Cook's distance and leverage are less common than the basic Mahalanobis distance, but still used. Cook's distance estimates the variations in regression coefficients after removing each observation, one by one (Cook, 1977). Therefore, as soon as there is more than one outlying value, the remaining outliers influence the estimators. As for the leverage method, it provides the same information as the Mahalanobis distance (Cohen et al., 2003): It is based on the study of residuals and their distance from the mean vector (e.g. Thode, 2002), which are computed using mean and variance, still polluted by outliers. This is why we recommend to use robust procedures to estimate the position μ and the scatter matrix Σ. The aim of robust methods is to estimate the location μ and the scatter matrix Σ even though the data has been contaminated. We introduce in the remaining of the text the robust method called Minimum Covariance Determinant.

https://doi.org/10.1016/j.jesp.2017.09.011
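For reference (a standard textbook definition, not taken from the quoted paper), the Cook's distance being discussed is, for observation $i$:

```latex
D_i = \frac{\sum_{j=1}^{n} \left( \hat{y}_j - \hat{y}_{j(i)} \right)^2}{p \, \hat{\sigma}^2}
```

where $\hat{y}_{j(i)}$ is the fitted value for observation $j$ when the model is refit without observation $i$, $p$ is the number of model parameters, and $\hat{\sigma}^2$ the mean squared error. This makes Leys et al.'s objection concrete: each $\hat{y}_{j(i)}$ still comes from a fit contaminated by any remaining outliers.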

So... do we have robust model-based outlier detection methods? In the paper I was going to suggest always prioritizing model-based methods, but given the above, what does that change?

The paper below, based on a simulation comparing Cook to Mahalanobis, also does not recommend Cook:

http://article.sapub.org/10.5923.j.ajms.20150501.06.html

@bwiernik (Contributor) commented Feb 9, 2023

I pretty strongly disagree with them. If there are a bunch of outliers you're concerned about then you should use a robust regression method. The idea of results-agnostic outlier evaluation is pretty silly to me.

The overarching philosophy I think is good to communicate is that unusual events happen, and we shouldn't arbitrarily discard them. We should expect to see some "outliers" in a decently sized sample. Approaches like Mahalanobis D + heuristic tend to flag too many "outliers". If you are concerned about outliers, use a robust model like T regression or quantile regression.
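bwiernik's point can be illustrated with a toy sketch (Python, stdlib only; Huber M-estimation is used here as a simple stand-in for the Student-t or quantile regressions he mentions, and none of this is the easystats implementation): a single gross outlier drags the ordinary least-squares slope, while downweighting large residuals recovers the underlying trend without discarding the point.

```python
from statistics import median

def fit_wls(x, y, w):
    """Closed-form weighted least squares for a line y = a + b*x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b  # intercept, slope

def huber_line(x, y, k=1.345, iters=50):
    """Iteratively reweighted least squares with Huber weights."""
    w = [1.0] * len(x)
    for _ in range(iters):
        a, b = fit_wls(x, y, w)
        r = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        m = median(r)
        # Robust residual scale: 1.4826 * MAD (floored to avoid division by zero)
        s = max(1.4826 * median([abs(ri - m) for ri in r]), 1e-9)
        # Weight 1 for small standardized residuals, shrinking for large ones
        w = [1.0 if abs(ri / s) <= k else k / abs(ri / s) for ri in r]
    return fit_wls(x, y, w)

x = list(range(10))
y = [2 * xi + 0.1 * ((-1) ** xi) for xi in x]  # true slope 2, tiny noise
y[9] = 100                                     # one gross outlier

a_ols, b_ols = fit_wls(x, y, [1.0] * len(x))
a_rob, b_rob = huber_line(x, y)
print(round(b_ols, 2), round(b_rob, 2))  # OLS slope is inflated; robust ≈ 2
```

The outlier is never deleted; its weight just shrinks toward zero as its standardized residual grows, which is the "use a robust model instead of discarding" philosophy in miniature.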

@mattansb (Member) commented Feb 9, 2023

💯% agree with @bwiernik (which was also the point I was trying to get across in the suggestions/edits I made).

@rempsyc (Member, Author) commented Feb 9, 2023

Ok. Good. Very good actually because this disagreement could make for an even more interesting, meaningful, and unique point of contribution.

@mattansb (Member) replied with a GIF (LetThemGIF).

@rempsyc (Member, Author) commented Feb 11, 2023

Still from Leys et al. (2018): in the intro, they're actually calling Cook and leverage (and so, indirectly, easystats) questionable methods 🤔

> A survey made in the same journals as those used by Leys et al. (2013), namely the Journal of Personality and Social Psychology (JPSP) and Psychological Science (PS), revealed that few researchers seem to mind about multivariate outliers. [...] From these 24 papers, nine used the basic Mahalanobis distance, five used another criterion (leverage using Student-t residuals or Cook's distance), and ten did not provide any information about the detection strategy. [...] This means that for over 97.5% of this type of multivariate analyses, either researchers did not search for multivariate outliers or they did not report any information about it. The 16 [15*?] other teams looked for multivariate outliers, but either with a questionable method or without providing information about the method.

Ok, the war is on


@rempsyc (Member, Author) commented Feb 12, 2023

Ok, I have made some new changes to the paper. I am by no means an expert on statistics, and even less so on statistical outlier detection methods. So your feedback is important to make sure I'm not inadvertently writing bs :-)

@easystats/core-team this is also the last week to make contributions to the paper, as next weekend I would like to wrap it up so we have a few days to review the final version before submission. Thanks!

@rempsyc feel free to accept/reject ;)
Apologies the changelog looks a bit like a mess 🙈
@rempsyc (Member, Author) commented Feb 20, 2023

We haven't posted it as a preprint though. Should I have done so? It kind of is here on GitHub 😬 But maybe it's not too late to formally do it now if you think that's relevant/useful?

@rempsyc rempsyc removed the High priority 🏃 This issue should be addressed soon label Feb 25, 2023
@rempsyc (Member, Author) commented Feb 25, 2023

Manuscript update: under review.


@IndrajeetPatil (Member):

Now that we have submitted this paper, shouldn't we (squash and) merge the PR?

@rempsyc (Member, Author) commented Mar 19, 2023

Reviewer 1:

  • In this paper, the authors have shown how to search for outliers using the check_outliers() function of the {performance} package while following current good practices. This contribution will help researchers engage in good search practices while providing an outlier detection experience. They cover univariate, multivariate, and model-based statistical outlier detection methods, their recommended threshold, standard output, and plotting methods. The present paper represents the subject of check the outliers, An accessible introduction to identifying statistical outliers in R with easystats. The study design and methods appropriate for the research question. The results presented clearly and accurately. The authors logically explain the findings. I highly recommend the publication of this paper in Mathematics Journal.

@rempsyc (Member, Author) commented Mar 19, 2023

Reviewer 2:

 The authors propose the illustration of an R library for detecting outliers. The paper is quite clear, however, the following remarks are provided to improve the article

  • The word accessible in the title should be removed.
  • The authors should clarify whether the functions implemented in the library were implemented only for continuous variables or other types of variables are also considered.
  • The author's reference to the Normal distribution on page 2, line 31 is unclear. They should clarify or delete this sentence.
  • The reference to the mathematical score is unclear on page 2, line 36. I think this reference should be removed.
  • A critical review on the analysis of outliers is proposed in this article which must be cited

Riani, M., & Atkinson, A. C. (2020). Robust regression methods in machine learning. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 10(2), e1359.

  • Page 3, lines 101-102: It is unclear what the authors mean when referring to BCA, a technique used in non-parametric bootstrapping, but the paper does not mention this.
  • It is necessary to change the x-axis labels of all the figures as the numbers overlap; they should be written in a smaller font or otherwise eliminate some of them since they are not readable. I am talking about the figures referring to the histogram of the outliers.
  • Page seven, line 146, it is better to write association instead of relationship.
  • Page 7, line 154, the authors should specify what they mean by compatible regression models.
  • Page 7, code of the first chunk, the authors use the iqr method, but this is not discussed above. They must add details of all the methods they show in the output.
  • Page 9, line 247, it is difficult to follow the discussion for those unfamiliar with the R datawizard package. What does it do? How does the proposed function tie in with those in the package?
  • Page 10, line 272, what does " Registration " mean?
  • Check the reference style because it is not uniform

@rempsyc (Member, Author) commented Mar 19, 2023

Reviewer 3:

  • The paper outlines techniques for processing data with inhomogeneities, such as outliers. However, the style of the article does not correspond to scientific research. The work is educational and auxiliary in nature and is more in line with educational publications, including popular scientific ones.

@rempsyc (Member, Author) commented Mar 20, 2023

> shouldn't we (squash and) merge the PR?

I think we can wait until we get the final accepted version, after integrating all the reviewer feedback, etc.

@rempsyc (Member, Author) commented Mar 25, 2023

Editor comments:

  • Less convinced as a scientific research is the major concern. I cannot find a solid contribution for the statistical methodologies, new packages, or sound applications in this study. Almost all the obtained conclusions are known. The readers would be happy to see more convincing results from good real examples from the real world if only the well known statistical methods and packages are used in this study. The tutorial-like results using the examples in R are too weak for an academic paper. It is contributive if the authors can include real examples in different areas to show the power of the used package and present the insight findings, not just show how to use the packages. Then, resubmit the revision as a new paper for review.

@rempsyc (Member, Author) commented Mar 26, 2023

Reviewer 1: All good.


Reviewer 2:

> A critical review on the analysis of outliers is proposed in this article which must be cited:
> Riani, M., & Atkinson, A. C. (2020). Robust regression methods in machine learning. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 10(2), e1359.

This reference does not seem to exist, neither by name nor other info. Like it should be volume 10, issue 2 here, but it's not there. Even the e1359 refers to a completely different article. Anything else we should cite @bwiernik?

> Page 3, lines 101-102: It is unclear what the authors mean when referring to BCA, a technique used in non-parametric bootstrapping, but the paper does not mention this.

We say those methods are documented in the help page, so I think we can ignore this? I don’t think we’re going to explain BCA here…

> It is necessary to change the x-axis labels of all the figures as the numbers overlap; they should be written in a smaller font or otherwise eliminate some of them since they are not readable. I am talking about the figures referring to the histogram of the outliers.

WIP easystats/see#262

> Page 10, line 272, what does "Registration" mean?

Ok I know this was a mathematical journal so reviewers may not be familiar with the word preregistration. But now that we’re submitting to a psych journal, I think that’s pretty ubiquitous.

> Check the reference style because it is not uniform

Looks good to me? 🤔 Not APA but that’s because of the MDPI template.


Editor comment:

> The readers would be happy to see more convincing results from good real examples from the real world if only the well known statistical methods and packages are used in this study. […] It is contributive if the authors can include real examples in different areas to show the power of the used package and present the insight findings, not just show how to use the packages.

I added the code example of height and weight as we discussed. Think it’s sufficient?


Reviewer 3: I think that we make at least two original contributions: first, by arguing in favour of model-based methods instead of univariate or multivariate ones; second, by proposing a novel method, namely the combination of multiple methods. With the added real-life demo of height and weight, I think we have this covered.

@rempsyc (Member, Author) commented Mar 26, 2023

I thought we didn't need to merge because we would address the reviewers comments and it would get accepted right away. But now that we need to change journal, I think we should squash and merge this one. And then I will create another PR for Collabra to start with a cleaner workspace. Does that make sense?

@bwiernik (Contributor):

> by name nor other info. Like it should be volume 10, issue 2 here, but it's not there. Even the e1359 refers to a completely different article. Anything else we should cite @bwiernik?

Let's just cite the idea of using an outlier-robust distribution like Student T instead of normal, or negative binomial instead of Poisson. For that a good cite is Statistical Rethinking, Chapter 4

@rempsyc (Member, Author) commented Mar 27, 2023

> Statistical Rethinking, Chapter 4

Thanks. Couldn't find anything with keywords "robust" in Chapter 4. However, there is a small mention of it in Chapter 7 (Ulysses' Compass, of the 2020 edition only), I think this must be it:

> One way to both use these extreme observations and reduce their influence is to employ some kind of robust regression. A "robust regression" can mean many different things, but usually it indicates a linear model in which the influence of extreme observations is reduced. A common and useful kind of robust regression is to replace the Gaussian model with a thicker-tailed distribution like Student's t (or "Student-t") distribution.

The negative binomial bit is mentioned in Chapter 11. So I think I'll cite the whole book instead of a specific chapter.

@bwiernik (Contributor):

There is a section in chapter 4 on using student T distributions

@rempsyc (Member, Author) commented Mar 29, 2023

@strengejacke what do you think of the changes? Do you think we can merge this and then I can start a new PR for Collabra?

@strengejacke (Member):

@rempsyc Thanks, your revisions so far look good! Let's merge and open a new PR, right?

@rempsyc (Member, Author) commented Mar 31, 2023

Yes please :)

@strengejacke strengejacke merged commit c09c190 into easystats:main Mar 31, 2023
@strengejacke (Member):

ok, just open a new PR whenever you want to / can continue with preparing the manuscript.
