How to create custom databases #11
Comments
Hi, thanks for using StrainScan! You can now install the new version of StrainScan (v1.0.13) through Bioconda; we have updated it to match the latest GitHub version. The first problem could be the result of a Python version conflict. To avoid that error, try creating an environment with Python 3.7.3. The second problem looks a little strange; we haven't seen this error before. To help debug it, would you mind sending me the input genomes that trigger the error? Then I will investigate the potential problem and solution. Thanks! |
I got the same error when I used ~2000 genomes to build a customized database. Could the problem be that the number of genomes is too high? |
Hi, it's hard to say. Have you tried using StrainScan to build a database with a subset of your 2000 genomes (e.g. 100 genomes)? If that works, the error may be caused by the large number of genomes. Otherwise, something else is likely responsible. In my experience, though, a large number of genomes usually leads to memory issues rather than an "index" error. I need to check. Sorry for the inconvenience. |
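The subset test suggested above could be set up with a short script like the following. This is only a sketch: the function name `make_genome_subset`, the directory layout, and the choice of a random sample are assumptions for illustration, not part of StrainScan.

```python
import random
import shutil
from pathlib import Path


def make_genome_subset(src_dir, dst_dir, n=100, seed=0):
    """Copy a random sample of n genome files from src_dir into dst_dir.

    Hypothetical helper (not part of StrainScan). Returns the copied file names.
    """
    src = Path(src_dir)
    dst = Path(dst_dir)
    # Consider only regular files; sort first so the sample is reproducible.
    genomes = sorted(p for p in src.iterdir() if p.is_file())
    rng = random.Random(seed)
    chosen = rng.sample(genomes, min(n, len(genomes)))
    dst.mkdir(parents=True, exist_ok=True)
    for path in chosen:
        shutil.copy2(path, dst / path.name)
    return [p.name for p in chosen]
```

After calling, for example, `make_genome_subset("/path/fasta", "/path/fasta_subset", n=100)` (both paths are placeholders), the build command from this thread can be pointed at the subset directory with `-i` and at a fresh output directory with `-o`.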
Hey, thanks for your reply! |
Hi, sorry for the late reply. Can you send me your |
Dear Ray, I have attached the "hclsMap_95.txt" file. Thanks for checking! |
Hi, I tested the code (line 46, index error part) with the provided file ( |
This is the error log I got |
I see... It seems that this error is caused by |
hey, I have attached the file. Thanks |
Dear Ray, Thank you for your prompt help with troubleshooting. My genome names are in "*__bin.number.fna" format, and the pipeline seems to cut off whatever comes after __bin, so some MAGs ended up with the same strain name, which caused the issue mentioned above. The database is working now. Nice pipeline! |
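A quick pre-check for this kind of name collision might look like the sketch below. The truncation rule (keep everything up to and including `__bin`, drop the rest) is an assumption taken from the description above, not StrainScan's confirmed behavior, and the function names and example file names are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def truncated_name(filename, marker="__bin"):
    """Apply the ASSUMED truncation rule: drop everything after the marker."""
    idx = filename.find(marker)
    if idx == -1:
        # No marker present: fall back to the name without its last extension.
        return Path(filename).stem
    return filename[: idx + len(marker)]


def find_name_collisions(filenames, marker="__bin"):
    """Group file names by truncated name and return groups with duplicates."""
    groups = defaultdict(list)
    for name in filenames:
        groups[truncated_name(name, marker)].append(name)
    return {key: names for key, names in groups.items() if len(names) > 1}


# Hypothetical file names in the "*__bin.number.fna" style.
example = ["sampleA__bin.1.fna", "sampleA__bin.2.fna", "sampleB__bin.7.fna"]
print(find_name_collisions(example))
```

Any group reported here would need its files renamed (for instance, replacing `__bin.` with a separator that survives truncation) before building the database.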
Good to hear that! If there are any further problems, please let me know! Again, thanks for using StrainScan! |
Dear Ray, I would like to hear your opinion about the analysis. Since I have ~2000 genomes, it can end up with many clusters. So I decided to define the clusters first and build a customized database. |
"Can I compare the relative abundance of defined clusters in a metagenomic sample?" Yes. You can get a cluster's abundance by summing the predicted depths (Predicted_Depth (Ab*cls_depth)) of the strains identified in that cluster; the sum is the total abundance of the cluster in the sample. For example, if the cluster totals in a sample are 1.995 for C1 and 1.658, 0.95, and 0.40 for the other identified clusters, then you can estimate the relative abundance of C1 using: Relative abundance C1 = 1.995/(1.995+1.658+0.95+0.40). In addition, if you use a custom cluster file, problems can sometimes arise from the varying similarity within the defined clusters. (But of course you can try.) |
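The calculation described above can be sketched in a few lines. The input here is a plain list of (cluster, predicted depth) pairs; how those values are read from StrainScan's output is not shown, and the cluster labels C2 to C4 are assumed for illustration (only C1 is named in the reply).

```python
from collections import defaultdict


def cluster_relative_abundance(strain_rows):
    """Sum predicted depth per cluster, then normalize by the grand total.

    strain_rows: iterable of (cluster_id, predicted_depth) pairs,
    one pair per identified strain.
    """
    totals = defaultdict(float)
    for cluster_id, depth in strain_rows:
        totals[cluster_id] += depth
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {cluster: depth / grand_total for cluster, depth in totals.items()}


# Depth values from the reply above; labels C2..C4 are assumptions.
rows = [("C1", 1.995), ("C2", 1.658), ("C3", 0.95), ("C4", 0.40)]
relative = cluster_relative_abundance(rows)
print(round(relative["C1"], 3))  # 1.995 / 5.003, approximately 0.399
```

If a cluster contains several identified strains, each strain contributes its own pair and the function adds them up before normalizing.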
Hey Ray, Thanks for the detailed explanation. |
Dear Ray, One more question popped up. I ended up with C146, C154, ... C2225, which are not continuous cluster numbers. Did the pipeline filter out C1, C2, C3, etc.? And how? Another thing is that I used ~2400 genomes and got 2200 clusters according to the database. Does that make sense? Thanks! |
Hi, For the first question: if you found non-continuous IDs in the log, don't panic. This is normal and results from the intra-cluster k-mer matrix construction step, which runs only for clusters with more than one strain. If you check the file "hclsMap_95_recls.txt", you should find that all strains are included in these clusters and that all clusters are listed with continuous numbers. For the second question: it could be. This can happen if your input strain genomes are highly divergent, meaning their k-mer similarity is not very high. |
Dear Ray,
Thanks for your help again!
|
Hi!
I would like to use your tool with my own reference database.
I installed using option 2 after reading through #10
However, when I try to run the StrainScan_build.py command with your test data, I get the following error:
~/StrainScan$ python StrainScan_build.py -i Test_genomes -o DB_Small
When I try to use it with my own genomes, I get the following different error:
python StrainScan_build.py -i /path/fasta/ -o /StrainScan_DB
Any ideas?