Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sam/bam/cram output options to nucmer #5867

Merged
merged 30 commits into from
Mar 18, 2024

Conversation

fubar2
Copy link
Member

@fubar2 fubar2 commented Mar 15, 2024

FOR CONTRIBUTOR:

  • - I have read the CONTRIBUTING.md document and this tool is appropriate for the tools-iuc repo.
  • - License permits unrestricted use (educational + commercial)
  • - This PR updates an existing tool or tool collection

Mummer is old and github untouched for >4 years but still seems popular and has a potential use for self-homology mapping in the VGP. Nucmer is the nucleotide mapper. The default output is "delta" format but bam/cram alignment formats would be more useful downstream. This PR adds output conversion to sambam/cram as an option.

Nucmer now has --sam-long and --sam-short output options.
Both produce sam headers with spaces instead of tabs so samtools complains that the header cannot be read unless they are fixed.

Update: Removed all the sam and the sam-short based outputs and added a routine samtools calmd step to prevent downstream Mummer sam format errors.

Cannot get gnuplot tests to pass CI here so unless there are strong objections or an easy fix, will remove all the plotting functions and encourage downstream visualisation of the bam/cram with JBrowse

@fubar2
Copy link
Member Author

fubar2 commented Mar 15, 2024

oh dear.
nucmer is fine but need to fix the rest of the broken xml and tests.

@fubar2
Copy link
Member Author

fubar2 commented Mar 15, 2024

Conda gnuplot is complaining about missing libgl in the CI tests.

Plots are not worth keeping unless there's a simple fix - adding https://anaconda.org/conda-forge/mesa-libgl-devel-cos7-x86_64 - this is surely not ideal - better solutions welcomed if it does work. Will remove all plots if that doesn't fix the CI tests.

@fubar2
Copy link
Member Author

fubar2 commented Mar 15, 2024

Can anyone suggest an easy and working fix for this libGL issue with gnuplot?
Cannot find any other IUC tools with gnuplot as a dependency.
Perhaps this is why.
Otherwise, does anybody object to the plots being removed here?
They are crude compared to using JBrowse for example.

@fubar2 fubar2 requested a review from bgruening March 15, 2024 06:32
@bgruening
Copy link
Member

I know how to fix it. Can you remove the libgl dep please again. I need 1-2 once I get up ;-)

@fubar2
Copy link
Member Author

fubar2 commented Mar 15, 2024

@bgruening - removed so you can do the needful...
Thanks!

tools/mummer4/nucmer.xml Outdated Show resolved Hide resolved
tools/mummer4/nucmer.xml Outdated Show resolved Hide resolved
Copy link
Member

@bgruening bgruening left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fubar2 the output is now BAM correct? Should we also change the wording in the wrapper and talk always from BAM?

Also what is the difference between short and long? Should we offer only long?

@fubar2
Copy link
Member Author

fubar2 commented Mar 15, 2024

@fubar2 the output is now BAM correct?

or cram.

Also what is the difference between short and long? Should we offer only long?

The original output format of nucmer, the delta format, contains only the minimum information necessary to quickly recreate the alignment. It contains the name of the matching sequences, the length of the match, number of errors and positions of indels. An important addition in MUMmer4 is option to produce output in SAM format, one of the most widely used formats for alignments of NGS data [[10](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005944#pcbi.1005944.ref010)]. This also allows the MUMmer4 output to be used in any of the numerous tools that require SAM files as input. Nucmer4 supports two different options for SAM format output. With --sam-short, nucmer4 reports only the name of the matching sequence, length, and CIGAR string (which reports the indel positions). With --sam-long, it additionally reports the MD string (which specifies the mismatching positions), the sequence and, if applicable, the quality values of the matching sequence. The long format is more expensive to compute and it generates larger output files, but this option allows nucmer4 to match the behavior of other aligners such as Bowtie2 or BWA.

@fubar2
Copy link
Member Author

fubar2 commented Mar 15, 2024

@bgruening: I'd leave both but up to you.
Have added that text to the help.

Copy link
Member

@bgruening bgruening left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an empty SAM file in this PR tools/mummer4/test-data/out.sam

tools/mummer4/nucmer.xml Outdated Show resolved Hide resolved
tools/mummer4/nucmer.xml Outdated Show resolved Hide resolved
fubar2 and others added 2 commits March 15, 2024 22:19
Co-authored-by: Björn Grüning <[email protected]>
Co-authored-by: Björn Grüning <[email protected]>
@wm75
Copy link
Contributor

wm75 commented Mar 15, 2024

with --sam-long, it additionally reports the MD string (which specifies the mismatching positions), the sequence and, if applicable, the quality values of the matching sequence. The long format is more expensive to compute and ...

@fubar2 can you run the long format output through samtools calmd please to see if there's a warning about existing MD tags that got changed?
I wouldn't trust a tool that doesn't even get the sam header right to calculate MD correctly.
If everything is fine, then maybe offer only the long version, otherwise short cause you'd have to run it through samtools calmd anyway.

@fubar2
Copy link
Member Author

fubar2 commented Mar 15, 2024

@fubar2 can you run the long format output through samtools calmd please to see if there's a warning about existing MD tags that got changed?

The samtools calmd tool returns a file of the same size, with no visible warning on both short and long sam outputs
@wm75: @bgruening: I have no expertise for making this decision so do not have useful opinions to offer in this debate.
If it makes sense to run calmd routinely, or to get rid of the short form outputs please just tell me what you want or feel free to repair....

@fubar2
Copy link
Member Author

fubar2 commented Mar 15, 2024

Testing with a larger input: calmd does report and fix some NM and MD errors in the sam long format so will add it as a prophylactic step.

The fatter format is about 1/3 bigger but will remove the short-sam output options since the resulting bam/cram will be missing things that may cause mischief downstream?

[bam_fillmd1] different NM for read 'sacCer3_gold_BK006938.2': 202 -> 1706
[bam_fillmd1] different MD for read 'sacCer3_gold_BK006938.2': '33^a23a4t1c0t2c4c5c4c1^ct2t1t2c4g16c17g18t1c8c7^t2g17c4t
...

added samtools calmd to fix broken MD and NM headers - thanks, mummer.
@fubar2
Copy link
Member Author

fubar2 commented Mar 16, 2024

@bgruening @wm75
Now with calmd run to fix Mummer sam format and an executive decision to remove all short-sam.
Added a test for sam output too.

@fubar2
Copy link
Member Author

fubar2 commented Mar 16, 2024

and added the required output file for the sam test so passing CI now...

A self-homology on SacCer JBrowse2 confirms the cram/bam give identical looking tracks - calmd reports no errors with either.

image

Copy link
Member

@bgruening bgruening left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vers nice Ross, I only have three small comments/questions. Lets wait for @wm75 and then get it in.

tools/mummer4/nucmer.xml Outdated Show resolved Hide resolved
tools/mummer4/nucmer.xml Outdated Show resolved Hide resolved
tools/mummer4/nucmer.xml Outdated Show resolved Hide resolved
@fubar2
Copy link
Member Author

fubar2 commented Mar 17, 2024

For a VGP amphioxus assembly self-homology, nucmer is currently using 43GB of virtual ram on my test server so it may need 48GB or more to be reliable - they have even bigger genomes than the ones I'm testing and they want both parental chromosomes so twice the genome size inputs.

…B to process, so removed one file copy operation

by appending the tail output to the prepared head for the fixed sam to save 10 minutes in a 3 hour job on my test server...
@fubar2
Copy link
Member Author

fubar2 commented Mar 17, 2024

@wm75 @bgruening: saved a few percent run time wasted in a copy, but noticing that samtools sort on the 5GB output I'm seeing takes > 5% (30 minutes of about 600 on my test server) - is it a good investment for all future jobs despite both having indexes?

Comment on lines 67 to 75
&& samtools dict reference.fa > outsamhead
&& tail -n +3 outsam.sam >> outsamhead
&& samtools sort -o outsamheadsort outsamhead
&& samtools calmd outsamheadsort reference.fa > outsamcalm
#if $outform.out_format == 'bam-long':
&& samtools view -b -o outsam outsamcalm
#else if $outform.out_format == 'cram-long':
&& samtools view -C -o outsam -T reference.fa outsamcalm
#end if
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
&& samtools dict reference.fa > outsamhead
&& tail -n +3 outsam.sam >> outsamhead
&& samtools sort -o outsamheadsort outsamhead
&& samtools calmd outsamheadsort reference.fa > outsamcalm
#if $outform.out_format == 'bam-long':
&& samtools view -b -o outsam outsamcalm
#else if $outform.out_format == 'cram-long':
&& samtools view -C -o outsam -T reference.fa outsamcalm
#end if
&& samtools view -T reference.fa -u -x ^ outsam.sam | samtools sort -@ \${GALAXY_SLOTS:-1} -T "\${TMPDIR:-.}" |
#if $outform.out_format == 'bam-long':
-u | samtools calmd -b --threads {GALAXY_SLOTS:-1} - reference.fa > outsamcalm
#else if $outform.out_format == 'cram-long':
-O cram --reference reference.fa -o outsamcalm
#end if

what about this instead?
untested, but likely working.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • the samtools view command adds the required header and removes the MD and NM tags calculated by nucmer
  • samtools sort gets called with multiple threads and with the admin supplied tempdir for efficiency
  • samtools calmd is required only for bam; cram format autocalculates MD/NM on decoding (http://www.htslib.org/doc/samtools-calmd.html)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise, this looks good to me :-)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will test and get this suggestion merged later today - thanks @wm75

Copy link
Member Author

@fubar2 fubar2 Mar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wm75:

the samtools view command adds the required header and removes the MD and NM tags calculated by nucmer

I seem to remember this is why I was manually excising the header - samtools view barfs on the broken input so won't remove it.

[main_samview] fail to read the header from "outsam.sam".

It seems there's no choice but to manually remove it. Will continue to experiment...

Copy link
Member Author

@fubar2 fubar2 Mar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works I think - thanks @wm75 .

            #else:
                && samtools dict reference.fa > outsamhead
                && tail -n +3 outsam.sam >> outsamhead
                && samtools view -T reference.fa -u outsamhead | samtools sort -@ \${GALAXY_SLOTS:-1} -u -T "\${TMPDIR:-.}" |
                #if $outform.out_format == 'bam-long':
                    samtools calmd -b --threads {GALAXY_SLOTS:-1} - reference.fa > outsam
                #else if $outform.out_format == 'cram-long':
                    samtools view -C --reference reference.fa -o outsam -
                #end if
            #end if

Once the initial 2 rows are removed and a proper header inserted, it uses pipes for all steps.
Frustrating that we won't see this bogus sam and other warts fixed upstream unless we can find someone willing to take it on as a project.

Copy link
Member Author

@fubar2 fubar2 Mar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for -x ^ since we just made decent headers!!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Frustrating that we won't see this bogus sam and other warts fixed

A pity, yes, but at least you've come up with an acceptable workaround 👍

You should either drop the samtools dict or the samtools view though to avoid redundancy.
samtools view should work after the header is removed by tail and the intention behind -x ^ was to suppress the warnings from samtools calmd.

If you don't need MD/NM for vgp, you could also just drop samtools calmd and produce BAM without these tags. Users who need them and cannot go with cram could still run samtools calmd as a separate step of their WF.

Any of these options are fine with me, just pick what seems most practical for you.

Copy link
Member Author

@fubar2 fubar2 Mar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should either drop the samtools dict or the samtools view though to avoid redundancy.

Oh. Yes. That was silly. Will drop that redundant step - thanks @wm75

If you don't need MD/NM for vgp,

VGP won't be the only user - but no, they won't be interested in those.
However if these could potentially cause problems - and they are wrong from nucmer - safer to emit these fixed perhaps. Arguably being a good citizen in terms of repairing broken nucmer sam before unleashing it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wm75: How about we go with:

#else:
&& samtools dict reference.fa > outsamhead
&& tail -n +3 outsam.sam >> outsamhead
&& samtools sort  -@ \${GALAXY_SLOTS:-1} -T "\${TMPDIR:-.}" outsamhead |
#if $outform.out_format == 'bam-long':
    samtools calmd -b --threads {GALAXY_SLOTS:-1} - reference.fa > outsam
#else if $outform.out_format == 'cram-long':
    samtools view -C --reference reference.fa -o outsam -
#end if

Copy link
Contributor

@wm75 wm75 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's an option, too :-)

@bgruening bgruening merged commit 026db72 into galaxyproject:main Mar 18, 2024
14 checks passed
@fubar2
Copy link
Member Author

fubar2 commented Mar 18, 2024

Thanks for all your help, @bgruening and @wm75.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants