-
Notifications
You must be signed in to change notification settings - Fork 441
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add sam/bam/cram output options to nucmer #5867
Conversation
Fixed many parameters with both name and argument making planemo lint upset.
These need to be replaced for any conversions
oh dear. |
Conda gnuplot is complaining about missing libgl in the CI tests. Plots are not worth keeping unless there's a simple fix - adding https://anaconda.org/conda-forge/mesa-libgl-devel-cos7-x86_64 - this is surely not ideal - better solutions welcomed if it does work. Will remove all plots if that doesn't fix the CI tests. |
Can anyone suggest an easy and working fix for this libGL issue with gnuplot? |
I know how to fix it. Can you remove the libgl dep please again. I need 1-2 once I get up ;-) |
@bgruening - removed so you can do the needful... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fubar2 the output is now BAM correct? Should we also change the wording in the wrapper and talk always from BAM?
Also what is the difference between short and long? Should we offer only long?
or cram.
|
@bgruening: I'd leave both but up to you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an empty SAM file in this PR tools/mummer4/test-data/out.sam
Co-authored-by: Björn Grüning <[email protected]>
Co-authored-by: Björn Grüning <[email protected]>
@fubar2 can you run the long format output through samtools calmd please to see if there's a warning about existing MD tags that got changed? |
The samtools calmd tool returns a file of the same size, with no visible warning on both short and long sam outputs |
Testing with a larger input: calmd does report and fix some NM and MD errors in the sam long format so will add it as a prophylactic step. The fatter format is about 1/3 bigger but will remove the short-sam output options since the resulting bam/cram will be missing things that may cause mischief downstream?
|
added samtools calmd to fix broken MD and NM headers - thanks, mummer.
@bgruening @wm75 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Vers nice Ross, I only have three small comments/questions. Lets wait for @wm75 and then get it in.
For a VGP amphioxus assembly self-homology, nucmer is currently using 43GB of virtual ram on my test server so it may need 48GB or more to be reliable - they have even bigger genomes than the ones I'm testing and they want both parental chromosomes so twice the genome size inputs. |
…B to process, so removed one file copy operation by appending the tail output to the prepared head for the fixed sam to save 10 minutes in a 3 hour job on my test server...
@wm75 @bgruening: saved a few percent run time wasted in a copy, but noticing that |
tools/mummer4/nucmer.xml
Outdated
&& samtools dict reference.fa > outsamhead | ||
&& tail -n +3 outsam.sam >> outsamhead | ||
&& samtools sort -o outsamheadsort outsamhead | ||
&& samtools calmd outsamheadsort reference.fa > outsamcalm | ||
#if $outform.out_format == 'bam-long': | ||
&& samtools view -b -o outsam outsamcalm | ||
#else if $outform.out_format == 'cram-long': | ||
&& samtools view -C -o outsam -T reference.fa outsamcalm | ||
#end if |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
&& samtools dict reference.fa > outsamhead | |
&& tail -n +3 outsam.sam >> outsamhead | |
&& samtools sort -o outsamheadsort outsamhead | |
&& samtools calmd outsamheadsort reference.fa > outsamcalm | |
#if $outform.out_format == 'bam-long': | |
&& samtools view -b -o outsam outsamcalm | |
#else if $outform.out_format == 'cram-long': | |
&& samtools view -C -o outsam -T reference.fa outsamcalm | |
#end if | |
&& samtools view -T reference.fa -u -x ^ outsam.sam | samtools sort -@ \${GALAXY_SLOTS:-1} -T "\${TMPDIR:-.}" | | |
#if $outform.out_format == 'bam-long': | |
-u | samtools calmd -b --threads {GALAXY_SLOTS:-1} - reference.fa > outsamcalm | |
#else if $outform.out_format == 'cram-long': | |
-O cram --reference reference.fa -o outsamcalm | |
#end if |
what about this instead?
untested, but likely working.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- the samtools view command adds the required header and removes the MD and NM tags calculated by nucmer
- samtools sort gets called with multiple threads and with the admin supplied tempdir for efficiency
- samtools calmd is required only for bam; cram format autocalculates MD/NM on decoding (http://www.htslib.org/doc/samtools-calmd.html)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise, this looks good to me :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will test and get this suggestion merged later today - thanks @wm75
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the samtools view command adds the required header and removes the MD and NM tags calculated by nucmer
I seem to remember this is why I was manually excising the header - samtools view barfs on the broken input so won't remove it.
[main_samview] fail to read the header from "outsam.sam".
It seems there's no choice but to manually remove it. Will continue to experiment...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works I think - thanks @wm75 .
#else:
&& samtools dict reference.fa > outsamhead
&& tail -n +3 outsam.sam >> outsamhead
&& samtools view -T reference.fa -u outsamhead | samtools sort -@ \${GALAXY_SLOTS:-1} -u -T "\${TMPDIR:-.}" |
#if $outform.out_format == 'bam-long':
samtools calmd -b --threads {GALAXY_SLOTS:-1} - reference.fa > outsam
#else if $outform.out_format == 'cram-long':
samtools view -C --reference reference.fa -o outsam -
#end if
#end if
Once the initial 2 rows are removed and a proper header inserted, it uses pipes for all steps.
Frustrating that we won't see this bogus sam and other warts fixed upstream unless we can find someone willing to take it on as a project.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for -x ^
since we just made decent headers!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Frustrating that we won't see this bogus sam and other warts fixed
A pity, yes, but at least you've come up with an acceptable workaround 👍
You should either drop the samtools dict
or the samtools view
though to avoid redundancy.
samtools view
should work after the header is removed by tail
and the intention behind -x ^
was to suppress the warnings from samtools calmd.
If you don't need MD/NM for vgp, you could also just drop samtools calmd and produce BAM without these tags. Users who need them and cannot go with cram could still run samtools calmd as a separate step of their WF.
Any of these options are fine with me, just pick what seems most practical for you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should either drop the samtools dict or the samtools view though to avoid redundancy.
Oh. Yes. That was silly. Will drop that redundant step - thanks @wm75
If you don't need MD/NM for vgp,
VGP won't be the only user - but no, they won't be interested in those.
However if these could potentially cause problems - and they are wrong from nucmer - safer to emit these fixed perhaps. Arguably being a good citizen in terms of repairing broken nucmer sam before unleashing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wm75: How about we go with:
#else:
&& samtools dict reference.fa > outsamhead
&& tail -n +3 outsam.sam >> outsamhead
&& samtools sort -@ \${GALAXY_SLOTS:-1} -T "\${TMPDIR:-.}" outsamhead |
#if $outform.out_format == 'bam-long':
samtools calmd -b --threads {GALAXY_SLOTS:-1} - reference.fa > outsam
#else if $outform.out_format == 'cram-long':
samtools view -C --reference reference.fa -o outsam -
#end if
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's an option, too :-)
Thanks for all your help, @bgruening and @wm75. |
FOR CONTRIBUTOR:
Mummer is old and github untouched for >4 years but still seems popular and has a potential use for self-homology mapping in the VGP. Nucmer is the nucleotide mapper. The default output is "delta" format but bam/cram alignment formats would be more useful downstream. This PR adds output conversion to
sambam/cram as an option.Nucmer now has
--sam-long
and--sam-short
output options.Both produce sam headers with spaces instead of tabs so samtools complains that the header cannot be read unless they are fixed.
Update: Removed all the sam and the sam-short based outputs and added a routine samtools calmd step to prevent downstream Mummer sam format errors.
Cannot get gnuplot tests to pass CI here so unless there are strong objections or an easy fix, will remove all the plotting functions and encourage downstream visualisation of the bam/cram with JBrowse