How to create custom databases #11

NamenloseHeldin · 2023-10-06T12:27:25Z

Hi!
I would like to use your tool with my own reference database.

I installed using option 2 after reading through #10

git clone https://github.com/liaoherui/StrainScan.git
cd StrainScan

conda env create -f environment_candidate.yaml
conda activate strainscan

chmod 755 library/jellyfish-linux
chmod 755 library/dashing_s128

However, when I try to run the StrainScan_build.py command with your test data, I get the following error:
~/StrainScan$ python StrainScan_build.py -i Test_genomes -o DB_Small

2023-10-06 12:14:33,705 - Constructing matrix with dashing (jaccard index)
2023-10-06 12:14:33,822 - Hierarchical clustering
2023-10-06 12:14:34,117 - Constructing the tree
Traceback (most recent call last):
  File "StrainScan_build.py", line 162, in <module>
    main()
  File "StrainScan_build.py", line 127, in main
    31, params])
  File "/home/StrainScan/library/Build_tree.py", line 385, in build_tree
    kmer_index = extract_kmers(fna_mapping[i.identifier], fna_path, ksize, kmer_index_dict, kmer_index, Lv, spec, tree_dir, alpha_ratio, i.identifier)
  File "/home/StrainScan/library/Build_tree.py", line 115, in extract_kmers
    reverse = seqpy.revcomp(forward)
AttributeError: module 'seqpy' has no attribute 'revcomp'

When I try to use it with my own genomes, I get the following different error:
python StrainScan_build.py -i /path/fasta/ -o /StrainScan_DB

2023-10-06 12:23:53,950 - Constructing matrix with dashing (jaccard index)
2023-10-06 12:23:54,539 - Hierarchical clustering
Traceback (most recent call last):
  File "StrainScan_build.py", line 162, in <module>
    main()
  File "StrainScan_build.py", line 116, in main
    cls_file, cls_res)
  File "/home/StrainScan/library/select_rep.py", line 46, in pick_rep
    final_res[ele[2]]=int(ele[0])
IndexError: list index out of range

Any ideas?

The text was updated successfully, but these errors were encountered:

liaoherui · 2023-10-07T02:10:42Z

Hi, thanks for using StrainScan!

Currently, you can use Bioconda to install the new version of StrainScan (v1.0.13). We have updated it to the latest GitHub version.
conda install -c bioconda strainscan

The first problem could be a result of conflict python version. To avoid that error, you can try to create an environment whose python version is 3.7.3.

For the second problem, it looks a little strange. We haven't met this error before. Thus, to debug, would you mind sending me the input genomes that make the error happen? Then, I will investigate the potential problem and solution. Thanks!

dugala239 · 2024-05-10T11:10:05Z

Hi, thanks for using StrainScan!

Currently, you can use Bioconda to install the new version of StrainScan (v1.0.13). We have updated it to the latest GitHub version. conda install -c bioconda strainscan

The first problem could be a result of conflict python version. To avoid that error, you can try to create an environment whose python version is 3.7.3.

For the second problem, it looks a little strange. We haven't met this error before. Thus, to debug, would you mind sending me the input genomes that make the error happen? Then, I will investigate the potential problem and solution. Thanks!

I have the same error when I used ~2000 genomes to build a customized database. Is it a problem when the genome number is too high?

liaoherui · 2024-05-11T01:35:41Z

Hi,

It's hard to say. Have you tried using StrainScan to build a database with a subset of your 2000 genomes? (e.g. with 100 genomes)? If it works, then it may be a result of large number genomes. Otherwise, it may be caused by other reasons.

However, according to my experience, the large number of genomes usually lead to memory issue rather than "index" error. Need to check. Sorry for the inconvenience.

dugala239 · 2024-05-11T08:32:58Z

Hey,

Thanks for your reply! !
A subset of genomes worked. I am using genome & MAGs for the database building. Do you think the poor quality of MAGs can cause this "index" problem? I did not encounter memory problems using 80 CPU and 600GB memory.

liaoherui · 2024-05-13T05:57:39Z

Hi,

Sorry for late reply. Can you send me your hclsMap_95.txt file in the output folder (usually, it's under "Outdir/Cluster_Result")? Then, I can quickly check the possible reason for this problem. Thanks!

dugala239 · 2024-05-13T09:32:12Z

Hi,

Sorry for late reply. Can you send me your hclsMap_95.txt file in the output folder (usually, it's under "Outdir/Cluster_Result")? Then, I can quickly check the possible reason for this problem. Thanks!

Dear Ray,

I have attached the "hclsMap_95.txt" file. Thanks for checking!

hclsMap_95.txt

liaoherui · 2024-05-13T10:42:13Z

Hi,

I tested the code (line 46, index error part) with the provided file (hclsMap_95.txt) and it worked well. This is a little strange. May I ask for the complete error log info you got? Thanks a lot for your time and assistance!

dugala239 · 2024-05-13T10:54:52Z

Hi,

I tested the code (line 46, index error part) with the provided file (hclsMap_95.txt) and it worked well. This is a little strange. May I ask for the complete error log info you got? Thanks a lot for your time and assistance!

This is the error log I got
"2024-05-13 11:58:13,080 - Constructing matrix with dashing (jaccard index)
2024-05-13 11:58:51,360 - Hierarchical clustering
2024-05-13 11:59:00,113 - Constructing the tree
Traceback (most recent call last):
File "/lisc/user/ye/.conda/envs/strainscan/bin/strainscan_build", line 10, in
sys.exit(main())
File "/lisc/user/ye/.conda/envs/strainscan/lib/python3.7/site-packages/StrainScan/StrainScan_build.py", line 128, in main
31, params])
File "/lisc/user/ye/.conda/envs/strainscan/lib/python3.7/site-packages/StrainScan/library/Build_tree.py", line 281, in build_tree
for i in temp[2].split(","):
IndexError: list index out of range
strainscan_build-1505346.err (END)"

liaoherui · 2024-05-13T11:01:12Z

I see... It seems that this error is caused by hclsMap_95_recls.txt (the same folder with hclsMap_95.txt) rather than hclsMap_95.txt. Can you send me that file, too? Thanks!

dugala239 · 2024-05-13T12:21:10Z

I see... It seems that this error is caused by hclsMap_95_recls.txt (the same folder with hclsMap_95.txt) rather than hclsMap_95.txt. Can you send me that file, too? Thanks!

hey, I have attached the file. Thanks
hclsMap_95_recls.txt
hclsMap_95_Rep.txt

liaoherui · 2024-05-13T12:44:11Z

I checked these files and found there are some genomes with the same strain ID in the input (see below). Please make sure all input genomes have unique ID/Name and rerun the program to see whether it works. Thanks!

dugala239 · 2024-05-13T19:58:11Z

I checked these files and found there are some genomes with the same strain ID in the input (see below). Please make sure all input genomes have unique ID/Name and rerun the program to see whether it works. Thanks!

Dear Ray,

Thank you for your immediate assistance with troubleshooting.

My genome names are in "*__bin.number.fna" format and the pipeline seems to cut off whatever is after __bin, so some MAGs ended up with the same strain name causing the issue as mentioned above. The database is working at the moment. Nice pipeline!

liaoherui · 2024-05-14T01:11:10Z

Good to hear that! If there are any further problems, please let me know! Again, thanks for using StrainScan!

dugala239 · 2024-05-16T08:33:15Z

Good to hear that! If there are any further problems, please let me know! Again, thanks for using StrainScan!

Dear Ray,

I would like to hear your opinion about the analysis. Since I have ~2000 genomes, it can end up with many clusters. So I decided to define the clusters first and build a customized database.
Can I compare the relative abundance of defined clusters in a metagenomic sample? The Strainscan output sums up the abundance of all strains to 1. Does it mean that I can only compare the composition of all clusters in a sample, but not the relative abundance.

liaoherui · 2024-05-16T13:21:40Z

"Can I compare the relative abundance of defined clusters in a metagenomic sample?"

Yes. You can get the cluster relative abundance by adding the summed frequencies (Predicted_Depth (Ab*cls_depth)) of identified strains for each cluster as the total abundance of the cluster in the sample.

For example, you got something like
ID Strain_Name Cluster_ID Relative_Abundance Predicted_Depth (Enet) Predicted_Depth (Ab*cls_depth) Coverage Coverd/Total_kmr
1 Strain_A C1 0.3984296647659111 87.85996285435535 1.995290772579857 0.8032051656635306 54981/68452
2 Strain_B C2 0.33126055378500013 6.515928000998808 1.658915449167767 0.5102015852823173 20855/40876
3 Strain_C C3 0.18988646589720443 3.7350856478774794 0.9509299802390313 0.4822947720497346 13266/27506
4 Strain_D C4 0.08042331555188444 1.5819346063093358 0.40275088330893394 0.5901400469521574 14580/24706

Then, you can estimate the relative abundance of C1 using: Relative abundance C1 = 1.995/(1.995+1.658+0.95+0.40).

In addition, when you decide to use the custom cluster file, then sometimes, there could be problems caused by the diverse similarity in the defined cluster. (But of course you can try.)

dugala239 · 2024-05-16T17:20:39Z

"Can I compare the relative abundance of defined clusters in a metagenomic sample?"

Yes. You can get the cluster relative abundance by adding the summed frequencies (Predicted_Depth (Ab*cls_depth)) of identified strains for each cluster as the total abundance of the cluster in the sample.

For example, you got something like ID Strain_Name Cluster_ID Relative_Abundance Predicted_Depth (Enet) Predicted_Depth (Ab*cls_depth) Coverage Coverd/Total_kmr 1 Strain_A C1 0.3984296647659111 87.85996285435535 1.995290772579857 0.8032051656635306 54981/68452 2 Strain_B C2 0.33126055378500013 6.515928000998808 1.658915449167767 0.5102015852823173 20855/40876 3 Strain_C C3 0.18988646589720443 3.7350856478774794 0.9509299802390313 0.4822947720497346 13266/27506 4 Strain_D C4 0.08042331555188444 1.5819346063093358 0.40275088330893394 0.5901400469521574 14580/24706

Then, you can estimate the relative abundance of C1 using: Relative abundance C1 = 1.995/(1.995+1.658+0.95+0.40).

In addition, when you decide to use the custom cluster file, then sometimes, there could be problems caused by the diverse similarity in the defined cluster. (But of course you can try.)

Hey Ray,

Thanks for the detailed explanation.

dugala239 · 2024-05-19T20:55:57Z

Dear Ray,

One more question popped up. I ended up with C146, C154, ... C2225, which are not continuous cluster numbers. Did the pipeline filter out C1, C2, C3 etc.? And how?

Another thing is that I used ~2400 genomes and got 2200 clusters according to the the database. Does it make sense?

Thanks!
hclsMap_95_recls.txt

liaoherui · 2024-05-20T02:07:25Z

Hi,

For the first question, if you found the non-continuous ID in the log, then don't panic. It's normal and a result of intra-cluster (only cluster with more than one strain will have this step) k-mer matrix construction step. If you check the file "hclsMap_95_recls.txt", you may find all strains are included in these clusters and all clusters are also included with continuous number.

For the second question, it could be. This could happen if your input strains genomes are highly divergent, which means their k-mer similarity is not very high.

dugala239 · 2024-05-20T07:58:11Z

Dear Ray, Thanks for your help again!

…

On Mon, May 20, 2024 at 3:07 AM Ray ***@***.***> wrote: Hi, For the first question, if you found the non-continuous ID in the log, then don't panic. It's normal and a result of intra-cluster (only cluster with more than one strain will have this step) k-mer matrix construction step. If you check the file "hclsMap_95_recls.txt", you may find all strains are included in these clusters and all clusters are also included with continuous number. For the second question, it could be. This could happen if your input strains genomes are highly divergent, which means their k-mer similarity is not very high. — Reply to this email directly, view it on GitHub <#11 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/A4XEWANJH6LLK622L3XKNKDZDFLHFAVCNFSM6AAAAAA5VX25HSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMJZGU2DAOBRHA> . You are receiving this because you commented.Message ID: ***@***.***>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to create custom databases #11

How to create custom databases #11

NamenloseHeldin commented Oct 6, 2023

liaoherui commented Oct 7, 2023

dugala239 commented May 10, 2024

liaoherui commented May 11, 2024

dugala239 commented May 11, 2024

liaoherui commented May 13, 2024

dugala239 commented May 13, 2024

liaoherui commented May 13, 2024

dugala239 commented May 13, 2024

liaoherui commented May 13, 2024

dugala239 commented May 13, 2024

liaoherui commented May 13, 2024 •

edited

Loading

dugala239 commented May 13, 2024

liaoherui commented May 14, 2024

dugala239 commented May 16, 2024

liaoherui commented May 16, 2024

dugala239 commented May 16, 2024

dugala239 commented May 19, 2024 •

edited

Loading

liaoherui commented May 20, 2024

dugala239 commented May 20, 2024 via email

How to create custom databases #11

How to create custom databases #11

Comments

NamenloseHeldin commented Oct 6, 2023

liaoherui commented Oct 7, 2023

dugala239 commented May 10, 2024

liaoherui commented May 11, 2024

dugala239 commented May 11, 2024

liaoherui commented May 13, 2024

dugala239 commented May 13, 2024

liaoherui commented May 13, 2024

dugala239 commented May 13, 2024

liaoherui commented May 13, 2024

dugala239 commented May 13, 2024

liaoherui commented May 13, 2024 • edited Loading

dugala239 commented May 13, 2024

liaoherui commented May 14, 2024

dugala239 commented May 16, 2024

liaoherui commented May 16, 2024

dugala239 commented May 16, 2024

dugala239 commented May 19, 2024 • edited Loading

liaoherui commented May 20, 2024

dugala239 commented May 20, 2024 via email

liaoherui commented May 13, 2024 •

edited

Loading

dugala239 commented May 19, 2024 •

edited

Loading