Filtering out samples in Viridian v0.4 dataset #204
Comments
I'll take a look at the breakdown of samples filtered out at varying values of the maximum N threshold. I went with 800 Ns initially because that is roughly the length of two amplicons (e.g. if the terminal amplicons drop out). Maybe it's tossing out too many samples.
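A minimal sketch of such a breakdown, assuming the alignments are available as plain strings keyed by sample ID. The helper names `count_ns` and `retained_by_threshold`, the toy sequences, and the thresholds are all hypothetical and are not part of sc2ts:

```python
import numpy as np


def count_ns(sequence: str) -> int:
    """Count N characters in an aligned sequence; gap characters are not counted."""
    return sequence.upper().count("N")


def retained_by_threshold(n_counts, thresholds):
    """Map each threshold to the number of samples with at most that many Ns."""
    n_counts = np.asarray(n_counts)
    return {int(t): int(np.sum(n_counts <= t)) for t in thresholds}


# Toy alignments standing in for real consensus sequences (hypothetical data).
alignments = {
    "sample_a": "ACGT" * 5,
    "sample_b": "ACGTNNNN" + "ACGT" * 3,
    "sample_c": "N" * 12 + "ACGT--AC",
}
n_counts = [count_ns(seq) for seq in alignments.values()]
breakdown = retained_by_threshold(n_counts, thresholds=[0, 4, 10, 20])
```

On real data, the thresholds could be something like a range around 800 to see how sharply the number of retained samples changes.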
I've been filtering out the samples before importing the alignments. It is probably better to implement a simple filter based on the number of Ns, so that samples are filtered out during inference.
Agreed - let's keep as much of the filtering and data pre-processing logic within sc2ts as we can
Filtering samples by the imported sequence alignments would involve grabbing the alignment from the alignment store and then processing it in
Or maybe keep a boolean array that tracks which samples pass the filters, and then use it to subset the genotype matrix before passing it to HMM matching.
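The boolean-array idea could look roughly like the sketch below. It assumes a genotype matrix with one row per sample and one column per site, and a hypothetical `MISSING` value of -1 for N calls; the actual layout and encoding used in sc2ts may differ:

```python
import numpy as np

MISSING = -1  # hypothetical encoding of an N call

# Rows are samples, columns are sites (assumed layout, toy data).
genotypes = np.array(
    [
        [0, 1, 0, 2, 3],
        [0, MISSING, MISSING, 2, 3],
        [MISSING, MISSING, MISSING, MISSING, 3],
    ]
)
sample_ids = np.array(["s1", "s2", "s3"])
max_missing = 2

# One boolean per sample: True if the sample passes the filter.
num_missing = np.sum(genotypes == MISSING, axis=1)
passes_filter = num_missing <= max_missing

# Subset both the matrix and the IDs with the same mask so they stay aligned.
filtered_genotypes = genotypes[passes_filter]
filtered_ids = sample_ids[passes_filter]
```

Additional filters can be combined into the same mask with `&`, which keeps a single record of why each sample was kept or dropped.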
That's OK I think - we can easily break
Also, there are a number of entries in the metadata file that do not have full-precision dates. I have been filtering them out before importing the metadata. It would be better to do this within sc2ts as well.
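A date-precision check might be sketched as follows, assuming dates are stored as strings and that "full precision" means a complete year-month-day value. The function name and the `strain` and `date` field names are hypothetical:

```python
from datetime import datetime


def has_full_precision_date(date_string) -> bool:
    """Return True only if the value parses as a complete year-month-day date."""
    try:
        datetime.strptime(date_string, "%Y-%m-%d")
    except (ValueError, TypeError):
        # ValueError: partial or malformed date; TypeError: missing (non-string) value.
        return False
    return True


# Toy metadata records (hypothetical field names and values).
metadata = [
    {"strain": "s1", "date": "2021-03-05"},
    {"strain": "s2", "date": "2021-03"},
    {"strain": "s3", "date": "2020"},
]
kept = [row["strain"] for row in metadata if has_full_precision_date(row["date"])]
```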
It seems that both of these filters (and any other filter on the metadata and alignments) can be done in
Hmm, actually, about the full-precision dates from metadata, I don't think
Before doing runs, I have been filtering out samples in the Viridian dataset based on two criteria: (1) having full-precision collection dates, and (2) having at most 800 Ns (excluding gaps) in the aligned consensus sequence (i.e. disregarding insertions). A better way is to exclude problematic sites before filtering by the maximum N criterion.
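As a rough illustration of excluding problematic sites before applying the maximum N criterion, the sketch below counts Ns only at positions outside a given exclusion set. The function name, the 0-based coordinates, and the example site list are assumptions for illustration only:

```python
def count_ns_excluding_sites(sequence: str, problematic_sites) -> int:
    """Count Ns at 0-based positions that are not in problematic_sites.

    Gap characters are never counted, since only 'N' is matched.
    """
    excluded = set(problematic_sites)
    return sum(
        1
        for position, base in enumerate(sequence.upper())
        if base == "N" and position not in excluded
    )


# Toy aligned sequence: Ns at positions 0, 1, 6 and 7, and a gap at position 8.
sequence = "NNACGTNN-A"
n_all_sites = count_ns_excluding_sites(sequence, problematic_sites=[])
n_after_masking = count_ns_excluding_sites(sequence, problematic_sites=[0, 1])

max_ns = 3  # stand-in for the 800 N threshold on a full-length genome
passes = n_after_masking <= max_ns
```

With this ordering, a sample is not penalised for Ns at sites that would be excluded anyway.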