-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pyfastx sometimes gives an erroneously shifted subsequence from hg19 reference #11
Comments
Hi Tuomo, The position you used to slice sequence is not correct. In bed file, the start position is 0-based, and the end position is 1-based. This means that (60757997, 60758117) in bed file is corresponding to subsequence from 60757998 to 60758117. In pyfastx, Sequence object is just like a Python string. The sliced position is 0-based and not include the end position. So, the sliced position genome["chr20"][60757997:60758118] is corresponding to subsequence from 60757998 to 60758118. You can just use genome['chr20'][60757997:60758117] and genome['chr20'][60757999:60758119] to get right sequence. |
Hi and thank you for a quick reply! I see that I was maybe a bit unclear in explaining the source of my confusion so I will try again. So I want to fetch the following 120bp long sequences: seq1:
seq2:
I know that I can get the exact same sequence seq1 with pyfastx with:
This is the same 120bp long sequence than my seq1. Now my seq2 is 2bp upstream of seq1, so I should get that exact same sequence by adding 2bp to end and start coordinates right? But instead of getting seq2, I get a sequence that is 3bp upstream of seq1:
Actually, I cannot find any coordinate that would return me a sequence that would start with CGGCGTG... as the seq2 I have defined above. It seems that for some reason pyfastx skips the C that is on 3rd position of seq1:
To me it seems that sometimes increasing the start coordinate by 1 shifts the sequence sometimes by 0bp, sometimes by 1bp and sometimes by 2bp. Also the last of these sequences is 121bp long and the others 120bp long even though the difference between the start and end coordinates is always the same. So to me this is weird:
Also I don't understand why changing the end position changes the start position of the sequence:
Hopefully I was able to better elaborate my points of confusion. Best regards, |
Thank you for reporting issue. It is a strange bug. Could you send me your Python version and operating system. I have test it on ubuntu with Python 3.6.9 and it woks fine. |
Hi! So the OS:
and Python:
And I am using the standard hg19 reference genome that can be downloaded from http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.fa.gz Best regards, |
Hi Tuomo, I am so sorry. I did my best to test your code on Ubuntu and Windows with Python3.6 and Python3.7. I really didn't meet the bug you described. But I just released a new version. In this version, I have fixed error start postion for reading sequence caused by fseek when processing large file. I am not sure whether this version can solve your problem. But you might try it. By the way, which tool you have used to install pyfastx, pip or conda? I recommended you try the precompiled version in pypi. If it still doesn't work, please let me know. Thank you for that! |
Hi!
I have tried for a few days now to understand why I sometimes get different subsequences than using 'bedtools getfasta' from hg19 reference. I was wondering if you could elaborate the indexing used by pyfastx in case I am making some error. I demonstrate this by fetching these two subsequences from chromosome 20:
seq1: start=60757997, end=60758117
seq2: start=60757999, end=60758119
both sequences are 120bp long and seq2 is shifted downstream by 2bp from seq1, meaning they should overlap with 118bp.
Using bedtools getfasta, I get the expected result:
However, when trying to fetch the same sequences with pyfastx, I get a seq2 that is shifted by 3bp instead of 2bp. The commands issued in ipython:
So seq1 is identical to what bedtools returns but for some reason seq2 starts from different position even though the start position is defined to be the same as in bedtools example.
I also replicated the test with third tool, seqkit, and it agrees with bedtools:
I have also deleted both the .fai and .fxi index files and rebuilt them but the issue persists. Can you explain why I get the different sequence than with bedtools? And why sometimes I get the same sequence as with other tools?
Let me know if I was unclear or some information is missing from my report.
Best regards,
Tuomo
The text was updated successfully, but these errors were encountered: