Test of duplicates removal by preprocess.py #44
micronuria started this conversation in General
Dear all, a potential concern about duplicate removal by the preprocess.py script arose when reviewing the MAR databases.
I have run some simple tests with the script and here are the results. Duplicates are removed in all these cases:
When preprocessing a folder with duplicate files
Sequences within the same fasta file:
- preprocessing sequences with identical amino acid composition but different labels
- preprocessing sequences that differ (aa composition) but have the same label
Duplicates are removed in the final output. The sequence that appears first in the fasta file is kept.
Sequences in different fasta files:
- preprocessing sequences with identical amino acid composition but different labels
- preprocessing sequences that differ (aa composition) but have the same label
Duplicates are removed; the sequence that appears in the first file listed is kept.
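For anyone who wants to reproduce the behaviour described above without running the script, here is a minimal sketch of a "first occurrence wins" rule. It is not the actual preprocess.py code: `parse_fasta` and `deduplicate` are hypothetical helpers, and I am assuming duplicates are detected by exact label match or exact sequence match, with files processed in the order they are listed.

```python
def parse_fasta(text):
    """Yield (label, sequence) pairs from FASTA-formatted text."""
    label, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if label is not None:
                yield label, "".join(chunks)
            label, chunks = line[1:], []
        elif label is not None:
            chunks.append(line)
    if label is not None:
        yield label, "".join(chunks)


def deduplicate(files):
    """Keep a record only if neither its label nor its sequence was kept before.

    `files` is an ordered list of FASTA texts (hypothetical input format);
    earlier files and earlier records take precedence.
    """
    seen_labels, seen_seqs, kept = set(), set(), []
    for text in files:
        for label, seq in parse_fasta(text):
            if label in seen_labels or seq in seen_seqs:
                continue  # duplicate label or duplicate sequence: drop it
            seen_labels.add(label)
            seen_seqs.add(seq)
            kept.append((label, seq))
    return kept
```

Calling `deduplicate` with two small FASTA strings that share a label or a sequence should keep only the record that appears first, which matches what I observed in the tests.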
Let me know if you need further tests.