Filtering out samples in Viridian v0.4 dataset #204

szhan · 2024-07-29T13:58:11Z

Before doing runs, I have been filtering out samples in the Viridian dataset based on two criteria: (1) having a full-precision collection date, and (2) having at most 800 Ns (excluding gaps) in the aligned consensus sequence (i.e. disregarding insertions). A better approach would be to exclude problematic sites first, and only then apply the maximum N criterion.
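A minimal sketch of the two criteria, for illustration only. The function names, the `excluded_sites` parameter, and the assumption that a full-precision date looks like `YYYY-MM-DD` are hypothetical and are not taken from sc2ts:

```python
import re

# Assumption: a full-precision collection date is written as YYYY-MM-DD.
FULL_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def has_full_precision_date(date_str):
    """Return True only if the string has year, month and day."""
    return FULL_DATE.fullmatch(date_str) is not None


def count_ns(aligned_seq, excluded_sites=frozenset()):
    """Count N characters, ignoring gaps and any excluded (0-based) sites."""
    return sum(
        1
        for pos, base in enumerate(aligned_seq)
        if base in "Nn" and pos not in excluded_sites
    )


def passes_filters(date_str, aligned_seq, max_ns=800, excluded_sites=frozenset()):
    """Apply both criteria: full-precision date and at most max_ns Ns."""
    return (
        has_full_precision_date(date_str)
        and count_ns(aligned_seq, excluded_sites) <= max_ns
    )
```

Passing the problematic sites as `excluded_sites` means Ns at those positions do not count towards the threshold, which is the ordering suggested above.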

szhan · 2024-07-30T07:43:00Z

I'll take a look at the breakdown of samples filtered out at varying values of the maximum N threshold. I went with 800 Ns initially, because it is roughly two amplicons (e.g. if the terminal amplicons drop out). Maybe it's tossing out too many samples.
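One way to get that breakdown, sketched under the assumption that the number of Ns per sample has already been computed (the function name and the example thresholds are hypothetical):

```python
def dropped_per_threshold(n_counts, thresholds):
    """Map each maximum-N threshold to the number of samples it would drop.

    A sample is dropped when its N count is strictly greater than the threshold.
    """
    return {t: sum(1 for n in n_counts if n > t) for t in thresholds}


# Toy N counts for five samples, not real Viridian data.
example_counts = [0, 350, 800, 801, 5000]
breakdown = dropped_per_threshold(example_counts, [400, 800, 1600])
```

With these toy counts, a threshold of 400 drops three samples, 800 drops two, and 1600 drops one.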

szhan · 2024-07-31T10:39:39Z

I've been filtering out the samples before importing the alignments. It would probably be better to implement a simple filter, based on the number of Ns, that excludes samples during inference.

jeromekelleher · 2024-07-31T10:53:19Z

Agreed - let's keep as much of the filtering and data pre-processing logic within sc2ts as we can

szhan · 2024-07-31T14:13:43Z

Filtering samples by the imported sequence alignments would involve grabbing the alignment from the alignment store and then processing it in preprocess_and_match_alignments. I think this would require an additional pass over the Sample objects, because the genotype matrix that goes into HMM matching is preset.

szhan · 2024-07-31T14:16:41Z

Or maybe keep a boolean array to keep track of which samples pass filters, and then use it to subset the genotype matrix before input to HMM matching.
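A rough sketch of the boolean-array idea using NumPy. The matrix layout (one row per sample, one column per site) and the use of -1 to encode a missing call are assumptions made for illustration; the layout used inside sc2ts may differ:

```python
import numpy as np

MISSING = -1  # assumption: missing calls (Ns) are encoded as -1


def subset_by_missingness(genotypes, max_missing):
    """Return the rows that pass the filter, plus the boolean mask used."""
    missing_per_sample = np.sum(genotypes == MISSING, axis=1)
    keep = missing_per_sample <= max_missing
    return genotypes[keep], keep


# Toy matrix: 3 samples x 4 sites.
G = np.array(
    [
        [0, 1, MISSING, 0],
        [MISSING, MISSING, MISSING, 0],
        [1, 0, 0, 0],
    ]
)
G_kept, keep = subset_by_missingness(G, max_missing=1)
```

Keeping the mask alongside the subset matrix makes it possible to map the rows that remain back to the original samples.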

jeromekelleher · 2024-07-31T14:55:54Z

> Filtering samples by the imported sequence alignments would involve grabbing the alignment from alignment store and then processing it in preprocess_and_match_alignments. This would require an additional pass over the Sample objects I think, because the genotype matrix which goes into HMM matching is preset.

That's OK I think - we can easily break preprocess_and_match_alignments into steps, or add some complexity where we only pass alignments that meet QC requirements on to the matching step.

szhan · 2024-08-01T10:10:06Z

Also, there are a number of entries in the metadata file that do not have full-precision dates. I have been filtering them out before importing the metadata. It would be better if this, too, were done within sc2ts.

szhan · 2024-08-01T10:16:10Z

It seems that both of these filters (and any other filter on the metadata and alignments) can be done in preprocess_and_match_alignments. Alternatively, we could refactor it into preprocess_samples and match_alignments, and implement the filters in preprocess_samples.
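To make the proposed split concrete, here is a hypothetical sketch. The names preprocess_samples and match_alignments come from the comment above, but the signatures, the `SampleRecord` class, and the `match_fn` callback are invented stand-ins and do not reflect the actual sc2ts code:

```python
import re
from dataclasses import dataclass

FULL_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")  # assumption: YYYY-MM-DD


@dataclass
class SampleRecord:
    """Hypothetical stand-in, not the sc2ts Sample class."""

    strain: str
    date: str
    alignment: str


def preprocess_samples(records, max_ns=800):
    """Step 1: keep only records that pass the date and N-count filters."""
    passed = []
    for rec in records:
        if FULL_DATE.fullmatch(rec.date) is None:
            continue  # incomplete collection date
        if rec.alignment.upper().count("N") > max_ns:
            continue  # too many Ns
        passed.append(rec)
    return passed


def match_alignments(records, match_fn):
    """Step 2: run the matching step only on records that passed QC."""
    return {rec.strain: match_fn(rec) for rec in records}


records = [
    SampleRecord("A", "2021-03-15", "ACNGT"),
    SampleRecord("B", "2021-03", "ACGTA"),
    SampleRecord("C", "2021-04-01", "NNNNA"),
]
kept = preprocess_samples(records, max_ns=2)
# A placeholder callback stands in for the real HMM matching.
results = match_alignments(kept, lambda rec: len(rec.alignment))
```

In this toy example only sample A reaches the matching step: B has an incomplete date and C exceeds the N threshold.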

szhan · 2024-08-01T10:23:36Z

Hmm, actually, regarding the full-precision dates in the metadata: I don't think get retrieves entries by comparing parsed dates. It just retrieves entries by comparing the dates as strings. So, I don't think it needs to be modified.
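A small illustration of why string comparison can be enough, assuming the lookup is an exact match on the date string. The dictionary index below is a hypothetical stand-in and says nothing about how the metadata store is actually implemented:

```python
from collections import defaultdict


def index_by_date(rows):
    """Group strain names by their raw date string."""
    by_date = defaultdict(list)
    for row in rows:
        by_date[row["date"]].append(row["strain"])
    return by_date


rows = [
    {"strain": "A", "date": "2021-03-15"},
    {"strain": "B", "date": "2021-03"},  # incomplete date
    {"strain": "C", "date": "2021-03-15"},
]
by_date = index_by_date(rows)

# An exact-match lookup for a full date never returns the incomplete entry.
on_day = by_date.get("2021-03-15", [])

# ISO-formatted date strings also sort chronologically as plain strings.
ordered = sorted(["2021-03-15", "2020-12-01", "2021-01-02"])
```

Under that assumption, a query for a specific day simply never matches an entry such as `2021-03`, so incomplete dates are skipped without any date parsing.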
