Run HitWindows only once #10

Open
kbeutel opened this issue Mar 12, 2024 · 1 comment
Comments

@kbeutel
Contributor

kbeutel commented Mar 12, 2024

From Nathan: we want to run HitWindows only once, after intersecting the variants of interest with the ones that we have in the data set. This isn't usually an issue, because GSP is typically run to create a PRS and independent variants have already been selected.

@kbeutel
Contributor Author

kbeutel commented Mar 12, 2024

Original issue write up from Ava Kelley (20 Jul 2018):

GeneScorePipeline currently takes a two-step approach to running HitWindows on SNPs:

1. When loading meta files, if there are >1000 meta SNPs, HitWindows is run on these SNPs before continuing.
2. After cross-filtering to a dataset containing only the SNPs shared by a study and the meta SNPs remaining after Step 1, HitWindows is run again on the cross-filtered SNPs.

This approach has a few concerning holes:

1. The Step 1 HitWindows can choose an index SNP that is not in a data file even when a secondary SNP in the window is in the data file. That window is then lost from the analysis, because the Step 2 HitWindows only sees matching SNPs that survived the Step 1 filter.
2. When there are >1000 meta SNPs and the Step 1 HitWindows runs, the second round in Step 2 should never do anything: having been filtered to the index SNPs from the first HitWindows, all of the SNPs are already separated by at least the extension threshold.
3. If the Step 1 HitWindows was skipped because there were not more than 1000 meta SNPs, the Step 2 HitWindows can split a window incorrectly: a SNP needed to extend the window may be missing from the data and therefore already dropped by cross-filtering. (E.g., if your window extension size is 100k and you have three SNPs below the threshold that are each 70k apart, but the middle of the three is missing from your data, the current second HitWindows will just see two SNPs that are 140k apart and split this window into two windows.)

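
The window-splitting case in the last point can be reproduced with a small sketch. This is not the HitWindows implementation: `group_into_windows` is a hypothetical, simplified stand-in that only chains SNPs whose gap to the previous SNP is within the extension size, and the positions are made up to match the 70k/140k example.

```python
def group_into_windows(positions, extension):
    """Chain sorted positions into windows (simplified stand-in, not HitWindows).

    A new window starts whenever the gap to the previous position
    is larger than `extension`.
    """
    windows = []
    for pos in sorted(positions):
        if windows and pos - windows[-1][-1] <= extension:
            windows[-1].append(pos)
        else:
            windows.append([pos])
    return windows


EXTENSION = 100_000  # window extension size from the example

# Three meta SNPs, each 70k apart (illustrative positions).
meta_positions = [1_000_000, 1_070_000, 1_140_000]
# The middle SNP is missing from the data file, so cross-filtering drops it.
data_positions = [1_000_000, 1_140_000]

full = group_into_windows(meta_positions, EXTENSION)
filtered = group_into_windows(data_positions, EXTENSION)

print(len(full), len(filtered))  # prints: 1 2
```

Windowing the full meta SNP list gives one window, while windowing after cross-filtering gives two, because the remaining SNPs are 140k apart.
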
A "correct" approach to selecting SNPs would be to run HitWindows just once for each meta file and hold on to every SNP that was used to create each window, not just the index SNP (this is not how HitWindows currently operates). Then, when cross-filtering to a data file, select from each of those windows the lowest p-value SNP that is in the data.
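
A minimal sketch of that idea, under stated assumptions: `Snp`, `build_windows`, and `select_per_window` are hypothetical names, the windowing is a simplified gap-based chaining rather than the real HitWindows logic, and the SNP IDs, positions, and p-values are invented for illustration.

```python
from typing import NamedTuple


class Snp(NamedTuple):
    snp_id: str
    chrom: str
    pos: int
    pvalue: float


def build_windows(snps, extension):
    """Run the (simplified) windowing once, keeping every member SNP."""
    windows = []
    for snp in sorted(snps, key=lambda s: (s.chrom, s.pos)):
        last = windows[-1][-1] if windows else None
        if (last is not None and last.chrom == snp.chrom
                and snp.pos - last.pos <= extension):
            windows[-1].append(snp)
        else:
            windows.append([snp])
    return windows


def select_per_window(windows, data_ids):
    """From each window, pick the lowest p-value SNP present in the data."""
    selected = []
    for window in windows:
        present = [s for s in window if s.snp_id in data_ids]
        if present:  # windows with no SNP in the data are dropped
            selected.append(min(present, key=lambda s: s.pvalue))
    return selected


meta_snps = [
    Snp("rs1", "1", 1_000_000, 1e-6),
    Snp("rs2", "1", 1_070_000, 1e-9),  # lowest p-value, but not in the data
    Snp("rs3", "1", 1_140_000, 1e-7),
    Snp("rs4", "1", 2_000_000, 1e-8),
]
data_ids = {"rs1", "rs3", "rs4"}

windows = build_windows(meta_snps, extension=100_000)
selected = select_per_window(windows, data_ids)
print([s.snp_id for s in selected])  # prints: ['rs3', 'rs4']
```

Because the windows are built from the full meta file, the first window is neither lost when its lowest p-value SNP (`rs2`) is absent from the data nor split by the missing middle SNP; the best remaining SNP (`rs3`) is used instead.
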
