Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Output GFA missing unitigs in S lines #73

Open
atabeerk opened this issue Oct 12, 2023 · 13 comments
Open

Output GFA missing unitigs in S lines #73

atabeerk opened this issue Oct 12, 2023 · 13 comments
Labels
bug Something isn't working

Comments

@atabeerk
Copy link
Collaborator

In the output GFA file (strainy_final.gfa), some unitigs in L lines do not have corresponding S lines. This may be due to attempting to remove the unitigs at some point (and removing their S lines) but forgetting to remove the L lines in which these unitigs are used.

The attached file is the output of the mock ONT dataset. Some unitigs that have that issue: edge_1291_139, edge_956_40, edge_874_33, edge_3054_s1_3041692, edge_3024_11380, edge_1553_1030193, edge_2864_1000769

@atabeerk atabeerk added the bug Something isn't working label Oct 12, 2023
@jianshu93
Copy link

@atabeerk Hi, I have the same problem and the Bandage warning me that the format is not correct. See attached warning. How should I solve it?

Thanks,
Jianshu

IMG_3343

@atabeerk
Copy link
Collaborator Author

atabeerk commented Nov 4, 2024

Hi @jianshu93, thanks for reaching out. We will look into this.

Ataberk

@jianshu93
Copy link

@atabeerk,

Thanks for the quick response. The code is well-written, I can run it without any problems and produce expected output. Just the format (feel like a small bug). Let me know if you want my data to reproduce the error.

best,

Jianshu

@atabeerk
Copy link
Collaborator Author

atabeerk commented Nov 4, 2024

@jianshu93, if you can share

  1. input files,
  2. the command that you use to run strainy, and
  3. strain_contigs.gfa file that produces the error you mention

that would be very helpful. Feel free to attach the files to this issue or send an email to [email protected] if that is what you prefer.

Ataberk

@jianshu93
Copy link

Hi @atabeerk, I shared with you the reads, metaFlye assembly graph and strainy graph output. I followed exact the same scripts as you suggested:
flye --pacbio-hifi m84137_240709_192956_s1.hifi_reads.bc2076--bc2076.bam.fastq.gz -o metaflye -t 30 --meta --no-alt-contigs --keep-haplotypes -I 0
./strainy.py --gfa_ref assembly_graph.gfa --fastq m84137_240709_192956_s1.hifi_reads.bc2076--bc2076.bam.fastq.gz --mode hifi -t 30 --output strainy_out

I've shared the input and output files with you via goole drive, let me know if you cannot access them.

Best,
Jianshu

@atabeerk
Copy link
Collaborator Author

atabeerk commented Nov 4, 2024

Hi @jianshu93, I got the files. I will keep you updated.

Best,
Ataberk

@jianshu93
Copy link

hi @atabeerk,

Is there any updates for this error. I just needs a GFA plot that can be visualized in Bandage for a companion's purposes.

Thanks,
Jianshu

@atabeerk
Copy link
Collaborator Author

atabeerk commented Dec 5, 2024

hi @jianshu93,
Apologies for the late reply. I just checked and I can successfully view the strain_contigs.gfa file you shared with both Bandage and BandageNG. Can you confirm the file you shared with me is the file you
get the error with.

Screenshot 2024-12-05 at 3 01 28 PM

@jianshu93
Copy link

Oh, I did not realize that , not sure what happened. I can confirm that it works. But it seems not so many new strain genomes were created. Do you have an idea why since a simple LJA/metaFlye assembly, we can have 5 complete/circular genomes.

Thanks,
Jianshu

@atabeerk
Copy link
Collaborator Author

Hi @jianshu93,
Let me run it and check the results myself, I will get back to you.

@atabeerk
Copy link
Collaborator Author

Hi @jianshu93,

I was able to run strainy on my end got the results. The statistics (multiplicity_stats.txt) file show the multiplicities and looks like there is very little sequence that requires phasing. Still, some regions were phased but they were overall a small percentage of the unitigs in the graph.
You can look into reference_unitig_info_table.csv to see the set of phased unitigs of the original graph. Some of the unitigs will not be processed (will not be considered for phasing) if their coverage is too low or too high. You can set these by --min-unitig-coverage or --min-unitig-coverage.

Let me know if you have more questions.

Best,
Ataberk

@jianshu93
Copy link

Hi @atabeerk,

Many thanks for confirming it. So those hairpin problem (several big assembly in the plot) is not due to strain heterogeneity right but the coverage is too high. How can I resolve those hairpin if not via phasing.

Thanks,
Jianshu

@atabeerk
Copy link
Collaborator Author

Hi @jianshu93,

We were looking at the flye assembly and there are regions with heterogeneity. This may indicate that flye was able to phase some of these regions already. When we input a graph that is already partially phased (such as this one) it is not surprising that strainy produces a disconnected output. This is one possible reason.

We are also aware that strainy may create disconnected graphs even with completely unphased input assemblies and we are working on a fix that should be available in the next update.

One way to alleviate these issues on this data with flye+strainy is to change flye parameters (e.g., sensitivity) so that the assembly produced by flye is simpler and more contiguous. Strainy may be able to work better with that kind of input.

Best,
Ataberk

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants