check_outliers paper #544
Conversation
This is the link to the PDF (to see the current look): https://github.com/easystats/performance/blob/fcec6306987475b250ffb063b1c8e6e2ba3bbf48/papers/Mathematics/paper.pdf
This is the link to the R Markdown file (the one to edit for any changes or comments): https://github.com/easystats/performance/blob/fcec6306987475b250ffb063b1c8e6e2ba3bbf48/papers/Mathematics/paper.Rmd |
I've added some edits to reflect my own POV on the topic - specifically, that outlier detection tools are only part of the picture: a tool to be used by humans to "flag" suspect outliers, but the final decision should align with domain knowledge. Happy to smooth those points out some more (: I also know very little about the maths of these methods - but did you know you can just select-copy formulas from Wikipedia and they paste as LaTeX?? |
Codecov Report

@@ Coverage Diff @@
##             main     #544   +/-   ##
=======================================
  Coverage   48.32%   48.32%
  Files          84       84
  Lines        5507     5507
  Hits         2661     2661
  Misses       2846     2846 |
@mattansb @IndrajeetPatil thanks for your changes!
Cool! Didn't know!
Very good point. I'm not a fan of maths myself, but I tried adding some because the journal is called Mathematics. @DominiqueMakowski WDYT? Should we remove all equations or explain them at length? |
@easystats/core-team I have integrated the earlier comments for the outliers paper. A reminder that there are less than 3 weeks before the submission deadline, so now is a great time to contribute for anyone who hasn't yet. I would like to keep the last week before submission to finalize all the last details that have not been addressed yet (e.g., writing the abstract if no one else has). Here are a few "easy" things it is still possible to contribute to (beyond expertise-based feedback, of course):
|
@bwiernik, it would also be great if you could go over the paper if you have the time. Here are a few lingering stats questions beyond my competence level for you:
We use a threshold of |
I'll take a look tonight |
For 3, it's the critical value for p < .001 (ie, |
For 1, it seems that Leys (the authors we keep citing) actually explicitly recommend robust Mahalanobis (MCD) over Cook's distance in one of their papers I had missed:
https://doi.org/10.1016/j.jesp.2017.09.011 So... do we have robust model-based outlier detection methods? In the paper I was going to suggest always prioritizing model-based methods, but given the above, does that still hold? The paper below, based on a simulation comparing Cook's distance to Mahalanobis, also does not recommend Cook's distance: |
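For reference, the robust (MCD-based) Mahalanobis flagging that Leys et al. recommend is already exposed through `check_outliers()`. A minimal sketch on simulated data (the data and the planted outlier are illustrative; exact print output may differ across performance versions):

```r
# Sketch: robust Mahalanobis distance via the Minimum Covariance
# Determinant (MCD) estimator, using performance::check_outliers().
# The data here are simulated for illustration only.
library(performance)

set.seed(123)
d <- data.frame(x = rnorm(100), y = rnorm(100))
d[1, ] <- c(5, -5)  # plant one obvious multivariate outlier

# "mcd" computes Mahalanobis distances from an MCD-based (robust)
# location and covariance estimate instead of the classical ones,
# so the outlier itself cannot mask its own distance.
out <- check_outliers(d, method = "mcd")
out  # prints which observations were flagged
```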
I pretty strongly disagree with them. If there are a bunch of outliers you're concerned about, then you should use a robust regression method. The idea of results-agnostic outlier evaluation is pretty silly to me. The overarching philosophy I think is good to communicate is that unusual events happen, and we shouldn't arbitrarily discard them. We should expect to see some "outliers" in a decently sized sample. Approaches like Mahalanobis D + a heuristic threshold tend to flag too many "outliers". If you are concerned about outliers, use a robust model like Student-t regression or quantile regression. |
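A minimal sketch of that recommendation, assuming the quantreg and MASS packages (both on CRAN); the data are simulated with a true slope of 2 and one gross outlier, so the robust fits can be compared against OLS:

```r
# Sketch: fit robust models instead of discarding flagged outliers.
library(quantreg)  # quantile regression
library(MASS)      # robust (M-estimation) regression

set.seed(42)
d <- data.frame(x = rnorm(100))
d$y <- 2 * d$x + rnorm(100)
d$y[1] <- 20  # one gross outlier

fit_ols <- lm(y ~ x, data = d)             # OLS: pulled toward the outlier
fit_med <- rq(y ~ x, tau = 0.5, data = d)  # median (quantile) regression
fit_rob <- rlm(y ~ x, data = d)            # Huber M-estimator

# Compare slopes: the robust fits stay close to the true value of 2
# without any observation being deleted.
coef(fit_ols); coef(fit_med); coef(fit_rob)
```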
💯% agree with @bwiernik (which was also the point I was trying to get across in the suggestions/edits I made). |
Ok. Good. Very good actually because this disagreement could make for an even more interesting, meaningful, and unique point of contribution. |
Still in Leys et al. (2018), in the intro, they're actually calling Cook's distance and leverage (so, indirectly, easystats) questionable methods 🤔
Ok, the war is on |
Ok, I have made some new changes to the paper. I am by no means an expert on statistics, and even less so on statistical outlier detection methods. So your feedback is important to make sure I'm not inadvertently writing bs :-) @easystats/core-team this is also the last week to make contributions to the paper, as next weekend I would like to wrap it up so we have a few days to review the final version before submission. Thanks! |
@rempsyc feel free to accept/reject ;)
Apologies the changelog looks a bit like a mess 🙈
We haven't posted it as a preprint though. Should I have done so? It kind of is here on GitHub 😬 But maybe it's not too late to formally do it now if you think that's relevant/useful? |
Now that we have submitted this paper, shouldn't we (squash and) merge the PR? |
Reviewer 1:
|
Reviewer 2:
|
Reviewer 3:
|
I think we can wait until we get the final accepted version, after integrating all the reviewer feedback, etc. |
Editor comments:
|
Reviewer 1: All good. Reviewer 2:
This reference does not seem to exist, neither by name nor by its other details. For instance, it should be in volume 10, issue 2 here, but it's not there. Even the e1359 identifier refers to a completely different article. Is there anything else we should cite, @bwiernik?
We say those methods are documented in the help page, so I think we can ignore this? I don’t think we’re going to explain BCA here…
Ok, I know the previous journal was a mathematics journal, so reviewers may not have been familiar with the word preregistration. But now that we're submitting to a psych journal, I think that's pretty ubiquitous.
Looks good to me? 🤔 Not APA but that’s because of the MDPI template. Editor comment:
I added the code example of height and weight as we discussed. Do you think it's sufficient? Reviewer 3: I think that we make at least two original contributions: first, by arguing in favour of model-based methods instead of univariate or multivariate ones; second, by proposing a novel method: the combination of multiple methods. With the added real-life demo of height and weight, I think we have this covered. |
I thought we didn't need to merge because we would address the reviewers' comments and it would get accepted right away. But now that we need to change journals, I think we should squash and merge this one. Then I will create another PR for Collabra to start with a cleaner workspace. Does that make sense? |
Let's just cite the idea of using an outlier-robust distribution, like Student-t instead of normal, or negative binomial instead of Poisson. For that, a good cite is Statistical Rethinking, Chapter 4 |
Thanks. I couldn't find anything with the keyword "robust" in Chapter 4. However, there is a small mention of it in Chapter 7 (Ulysses' Compass; in the 2020 edition only), so I think this must be it:
The negative binomial bit is mentioned in Chapter 11. So I think I'll cite the whole book instead of a specific chapter. |
There is a section in Chapter 4 on using Student-t distributions |
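The two robust-likelihood swaps discussed above can be sketched in a few lines. This assumes the brms package (my choice for illustration, not something the thread specifies; any package supporting Student-t or negative-binomial likelihoods would do), and `d` is a hypothetical data frame:

```r
# Sketch: outlier-robust likelihoods instead of outlier deletion.
# Assumes the brms package; 'd' is a hypothetical data frame with
# columns y, x, and count.
library(brms)

# Student-t instead of Gaussian: the heavy tails mean extreme
# observations pull the regression line far less.
fit_robust <- brm(y ~ x, data = d, family = student())

# Negative binomial instead of Poisson for overdispersed counts.
fit_counts <- brm(count ~ x, data = d, family = negbinomial())
```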
@strengejacke what do you think of the changes? Do you think we can merge this and then I can start a new PR for Collabra? |
@rempsyc Thanks, your revisions so far look good! Let's merge and open a new PR, right? |
Yes please :) |
ok, just open a new PR whenever you want to / can continue with preparing the manuscript. |
Draft of the check_outliers paper (re: https://github.com/orgs/easystats/teams/core-team/discussions/5). Looking forward to any and all feedback and contributions @easystats/core-team!
PS: reminder to use [skip ci] in your pushes (don't keep forgetting like me!!).