too many alignments with v0.12.6 #227

subwaystation · 2024-03-01T20:56:05Z

Hi developers :)

running

wfmash \
    scerevisiae8.fasta.gz \
    scerevisiae8.fasta.gz \
    --threads 4 \
    -n 7 -s 5000 -p 90.0 -X -l 25000 -k 19 -H 0.001 -2 30 > scerevisiae8.fasta.gz.paf

It seems I am getting too many alignments?

I would have expected to only get the diagonal output in an all-vs-all setting, right? Or am I missing something?

Best,
Simon


AndreaGuarracino · 2024-03-01T21:00:16Z

Replace -n 7 with -n 1 and you will be happy! The mapper has changed (and is changing) a lot, but not the documentation around the internet...

AndreaGuarracino · 2024-03-01T21:00:57Z

And also -Y # would help

subwaystation · 2024-03-01T21:25:01Z

LOL.

subwaystation · 2024-03-01T21:26:53Z

So PanSN-spec mandatory? Why -n 1? You make nf-core/pangenome unhappy.

subwaystation · 2024-03-01T21:32:02Z

wfmash scerevisiae8.fasta.gz scerevisiae8.fasta.gz --threads 16  -s 5000 -p 90.0  -X  -l 25000 -k 19 -H 0.001   -2 30 -Y'#' -n 1 > scerevisiae8.fasta.gz.ERIK.paf

./git/wfmash/scripts/paf2dotplot png large scerevisiae8.fasta.gz.ERIK.paf

I can't see any difference. What am I missing?

wfmash --version
v0.12.6-0-g7205bf7

ekg · 2024-03-01T22:07:02Z

Might be good to remove -X. It could be conflicting.

subwaystation · 2024-03-01T22:15:25Z

this didn't do the trick

subwaystation · 2024-03-01T22:16:27Z

The input sequences are all just SC888#ChrI and not SC888#1#ChrI. But that should not be the issue here.

Could you please send an example command line where I can just get the diagonal?
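For the naming question, here is a minimal sketch of how a delimiter-based prefix could be derived from sequence names. It assumes the prefix is everything before the last "#", which is an assumption about the grouping rule rather than something confirmed in this thread. The helper names and the SC888#ChrII entry are hypothetical.

```python
from collections import defaultdict


def assembly_prefix(name: str, delim: str = "#") -> str:
    """Return the part of a sequence name before the last delimiter.

    Assumption: the prefix runs up to the LAST occurrence of the delimiter.
    Names without the delimiter are treated as their own prefix.
    """
    head, sep, _ = name.rpartition(delim)
    return head if sep else name


def group_by_prefix(names, delim="#"):
    """Group sequence names that share the same prefix."""
    groups = defaultdict(list)
    for name in names:
        groups[assembly_prefix(name, delim)].append(name)
    return dict(groups)


# Two-field names (SC888#ChrI) and three-field PanSN names (SC888#1#ChrI)
# both yield a usable prefix under this rule, just a different one.
names = ["SC888#ChrI", "SC888#ChrII", "SC888#1#ChrI"]
groups = group_by_prefix(names)
for prefix, members in groups.items():
    print(prefix, members)
```

Under this rule, sequences from the same genome share a prefix in both naming schemes, which fits the expectation that two-field names are not the problem here.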

bkille · 2024-03-02T02:02:56Z

> So PanSN-spec mandatory? Why -n 1? You make nf-core/pangenome unhappy.

Hi @subwaystation, we're updating wfmash w/ the improvements from MashMap3.

One of these improvements requires that filtering is applied to each pair of assemblies independently. Consider the extreme/contrived scenario where SGDref has 7 copies of a particular region, where each copy is 100% identical to a single region in genome SK1. In the old version of wfmash with -n 7, you might see that region in genome SK1 mapped to each of the 7 copies in genome SGDref, kicking out all of the mappings to the other genomes. To fix this, MashMap3 applies the -n parameter (as well as other filtering parameters) to each pair of assemblies independently.

In a similar fashion, MashMap3 also filters secondary mappings based on the best hit. Again, it's important that it can do this independently across each target assembly. For example, if our query was a copy of SGDref, its "top hit" for every region would be 100%, but we shouldn't consider that the "top hit" for each target assembly.
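The contrived SGDref/SK1 scenario above can be sketched as a toy comparison between filtering across all targets and filtering per (query, target assembly) pair. This is only an illustration of the idea, not MashMap3's implementation. The dictionary layout, the function names, and the genomeA/genomeB targets are hypothetical, and the assembly is assumed to be the text before the last "#".

```python
from collections import defaultdict


def assembly_of(target_name, delim="#"):
    """Assumed grouping rule: assembly = text before the last delimiter."""
    head, sep, _ = target_name.rpartition(delim)
    return head if sep else target_name


def top_n(hits, n):
    """Keep the n highest-identity hits (ties keep input order)."""
    return sorted(hits, key=lambda h: h["identity"], reverse=True)[:n]


def filter_across_all_targets(hits, n):
    """Old behaviour as described above: one top-n list per query region."""
    by_query = defaultdict(list)
    for h in hits:
        by_query[h["query"]].append(h)
    return [h for group in by_query.values() for h in top_n(group, n)]


def filter_per_target_assembly(hits, n):
    """Per-pair behaviour: one top-n list per (query region, target assembly)."""
    by_pair = defaultdict(list)
    for h in hits:
        by_pair[(h["query"], assembly_of(h["target"]))].append(h)
    return [h for group in by_pair.values() for h in top_n(group, n)]


query = "SK1#ChrI:0-5000"
# Seven identical copies in SGDref, plus one good hit in two other genomes.
hits = [
    {"query": query, "target": "SGDref#ChrI", "tstart": i * 10000, "identity": 1.0}
    for i in range(7)
]
hits += [
    {"query": query, "target": "genomeA#ChrI", "tstart": 0, "identity": 0.98},
    {"query": query, "target": "genomeB#ChrI", "tstart": 0, "identity": 0.97},
]

old_style = filter_across_all_targets(hits, n=7)
new_style = filter_per_target_assembly(hits, n=1)
print({assembly_of(h["target"]) for h in old_style})
print({assembly_of(h["target"]) for h in new_style})
```

With n=7 across all targets, the seven SGDref copies fill every slot and the other genomes drop out. With n=1 per pair, each target assembly keeps its own best hit.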

bkille · 2024-03-02T04:26:04Z

> The input sequences are all just SC888#ChrI and not SC888#1#ChrI. But that should not be the issue here.
>
> Could you please send an example command line where I can just get the diagonal?

Also, the organization of the plotting script doesn't seem to be clustering the mappings very well. Here's what happens if I plot the alignments w/ pafplot. There aren't any contig names unfortunately, but at least you can see that things are lining up a bit more like expected.

>$ wfmash ~/Data/pangenomes/scerevisiae8.fa -m --threads 16  -s 5000 -p 90.0  -X  -l 25000 -k 19 -H 0.001   -2 30 -Y'#' -n 1 > scerevisiae8.fasta.gz.ERIK.paf
[wfmash] Warning: Detected single file all-vs-all mapping with no other options. Consider adding -L, --lower-triangular for efficiency.
[mashmap] Skipping self mappings for single file all-vs-all mapping.
[mashmap] MashMap v3.1.1
[mashmap] Reference = [/home/Users/blk6/Data/pangenomes/scerevisiae8.fa]
[mashmap] Query = [/home/Users/blk6/Data/pangenomes/scerevisiae8.fa]
[mashmap] Kmer size = 19
[mashmap] Sketch size = 298
[mashmap] Segment length = 5000 (read split allowed)
[mashmap] Block length min = 25000
[mashmap] Chaining gap max = 20000
[mashmap] Mappings per segment = 1
[mashmap] Percentage identity threshold = 90%
[mashmap] Skip self mappings
[mashmap] Skipping sequences containing the same prefix based on the delimiter "#"
[mashmap] Hypergeometric filter w/ delta = 0.3 and confidence 0.999
[mashmap] Mapping output file = /dev/stdout
[mashmap] Filter mode = 1 (1 = map, 2 = one-to-one, 3 = none)
[mashmap] Execution threads  = 16
[mashmap::skch::Sketch::build] minmer windows picked from reference = 11171389
[mashmap::skch::Sketch::index] unique minmers = 997130
[mashmap::skch::Sketch::computeFreqHist] Frequency histogram of minmer interval points = (2, 117167) ... (11144, 1)
[mashmap::skch::Sketch::computeFreqHist] With threshold 0.001%, ignore minmers occurring >= 3208 times during lookup.
[wfmash::map] time spent computing the reference index: 6.29692 sec
[mashmap::skch::Map::mapQuery] mapped 100.00% @ 1.37e+07 bp/s elapsed: 00:00:00:07 remain: 00:00:00:00
[mashmap::skch::Map::mapQuery] count of mapped reads = 136, reads qualified for mapping = 136, total input reads = 136, total input bp = 96255507
[wfmash::map] time spent mapping the query: 7.06e+00 sec
[wfmash::map] mapping results saved in: /dev/stdout

>$ pafplot scerevisiae8.fasta.gz.ERIK.paf --png scerevisiae8-all.png --size 2000

subwaystation · 2024-03-02T06:03:31Z

It seems I just used the wrong tool for plotting. Above is the plot, including base-pair-level alignments.
From my observations so far, changing -n for this data set didn't change the mappings according to the pafplot.
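A plot can hide small differences, so one way to check this observation directly is to compare the mapping coordinates in two PAF files. The sketch below relies only on the standard PAF column order (query name, length, start, end, strand, target name, length, start, end). The function names and the two toy records are made up for illustration.

```python
def mapping_keys(paf_lines):
    """Collect (query, qstart, qend, strand, target, tstart, tend) per record."""
    keys = set()
    for line in paf_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        f = line.split("\t")
        keys.add((f[0], int(f[2]), int(f[3]), f[4], f[5], int(f[7]), int(f[8])))
    return keys


def compare(paf_a, paf_b):
    """Count mappings shared between two runs and unique to each."""
    a, b = mapping_keys(paf_a), mapping_keys(paf_b)
    return {"shared": len(a & b), "only_a": len(a - b), "only_b": len(b - a)}


# Toy records standing in for the output of two runs (values are invented).
rec1 = "SK1#ChrI\t230000\t0\t25000\t+\tSGDref#ChrI\t230218\t100\t25100\t24000\t25000\t60"
rec2 = "SK1#ChrII\t810000\t5000\t40000\t-\tSGDref#ChrII\t813184\t5200\t40200\t34000\t35000\t60"
run_n1 = [rec1, rec2]
run_n7 = [rec2, rec1]  # same mappings, different order

result = compare(run_n1, run_n7)
print(result)
```

With real data, open file handles can be passed in place of the lists, for example `compare(open(path_a), open(path_b))`.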

subwaystation · 2024-03-02T06:08:38Z

Thanks @bkille

bkille · 2024-03-02T06:09:21Z

It's not too surprising that the results are the same. The -n flag only specifies the maximum number of matches for a segment. Unless a region has an alternative mapping of decent ANI and length, there will likely be only one mapping regardless of the maximum number specified.
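The point that -n is an upper bound can be shown with a toy filter. The cutoffs mirror -p 90 and -l 25000 from the command in this thread, but the way they are applied here is a deliberate simplification and not wfmash's actual logic. The function name and candidate values are hypothetical.

```python
def keep_up_to_n(candidates, n, min_identity=0.90, min_length=25000):
    """Drop candidates below the cutoffs, then keep at most n of the rest."""
    passing = [
        c for c in candidates
        if c["identity"] >= min_identity and c["length"] >= min_length
    ]
    return sorted(passing, key=lambda c: c["identity"], reverse=True)[:n]


# One query segment with a single convincing mapping and two weak alternatives.
candidates = [
    {"target": "SGDref#ChrI", "identity": 0.99, "length": 40000},
    {"target": "SGDref#ChrIV", "identity": 0.82, "length": 30000},   # identity too low
    {"target": "SGDref#ChrXII", "identity": 0.95, "length": 6000},   # too short
]

with_n1 = keep_up_to_n(candidates, n=1)
with_n7 = keep_up_to_n(candidates, n=7)
print(len(with_n1), len(with_n7))
```

Both calls return the same single mapping, because only one candidate survives the cutoffs. Raising n only matters when several alternatives pass.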

subwaystation · 2024-03-02T06:11:29Z

aah got it!

Assuming there were more mappings in the PAF, do you know how this would affect pangenome graph construction with seqwish?

subwaystation · 2024-03-02T06:11:58Z

Or would this be resolved by wfmash during the alignment step?

subwaystation · 2024-03-02T06:20:08Z

@ekg Would the lower triangle mode be sufficient as input for pggb?

baozg · 2024-03-04T13:54:50Z

Follow-up question, @bkille

I ran the same sequences once with names without "#" and once with names following PanSN. Why is the running time so different?

Name: Chr1 (this alignment only has centromere / rDNA mappings in the final PAF)

wfmash -s 10000 -p 80 -n 1 -k 19 -H 0.001 -Y # -t 36 --hg-filter-ani-diff 30 Col-CC.fa Rabacal-1.fa
5101.41s user 14.62s system 1369% cpu 373.69s total 3126988Kb max memory
--
wfmash -s 10000 -p 90 -n 1 -k 19 -H 0.001 -Y # -t 36 --hg-filter-ani-diff 30 Col-CC.fa Rabacal-1.fa
1700.41s user 13.18s system 1140% cpu 150.23s total 3043676Kb max memory
--
wfmash -s 10000 -p 95 -n 1 -k 19 -H 0.001 -Y # -t 36 --hg-filter-ani-diff 30 Col-CC.fa Rabacal-1.fa
103.47s user 11.58s system 417% cpu 27.54s total 1458372Kb max memory

Name: Col-CC#1#Chr1, Rabacal-1#1#Chr1

wfmash -a -s 10000 -p 80 -n 1 -k 19 -H 0.001 -Y # -t 36 --hg-filter-ani-diff 30 Col-CC.panSN.fa.gz Rabacal-1.panSN.fa.gz
2987.04s user 13.63s system 772% cpu 388.25s total 3944900Kb max memory
--
wfmash -a -s 10000 -p 90 -n 1 -k 19 -H 0.001 -Y # -t 36 --hg-filter-ani-diff 30 Col-CC.panSN.fa.gz Rabacal-1.panSN.fa.gz
1897.45s user 13.90s system 1092% cpu 175.01s total 4115444Kb max memory
--
wfmash -a -s 10000 -p 95 -n 1 -k 19 -H 0.001 -Y # -t 36 --hg-filter-ani-diff 30 Col-CC.panSN.fa.gz Rabacal-1.panSN.fa.gz
625.62s user 12.68s system 901% cpu 70.81s total 3093592Kb max memory

subwaystation · 2024-03-06T09:05:49Z

@baozg As far as I have just figured out, you need to add --lower-triangular to your command line. You still get the all-vs-all relationship, but just from one side. Sufficient!
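Why one side is sufficient can be seen by counting pairs. The sketch below is purely combinatorial, with hypothetical genome names. Whether wfmash applies the triangle per genome or per sequence, and which half it keeps, is not stated in this thread.

```python
from itertools import combinations, permutations

# Eight hypothetical genomes, matching the size of the yeast data set.
genomes = [f"genome{i}" for i in range(8)]

# Full all-vs-all without self mappings: every ordered (query, target) pair.
all_vs_all = list(permutations(genomes, 2))

# One triangle only: each unordered pair appears exactly once.
one_triangle = list(combinations(genomes, 2))

# Every pair of genomes present in the full set is still represented.
pairs_full = {frozenset(p) for p in all_vs_all}
pairs_triangle = {frozenset(p) for p in one_triangle}

print(len(all_vs_all), len(one_triangle), pairs_full == pairs_triangle)
```

The triangle holds half as many comparisons while still relating every genome to every other genome once.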

subwaystation · 2024-03-06T09:21:22Z

But such things are not documented... :P @bkille @AndreaGuarracino

ASLeonard · 2024-03-06T16:19:52Z

The changes are at least verbose in the actual pggb.sh script in the pggb repo. Looking through the blame, --lower-triangle was added 6 months ago in pangenome/pggb@1478789 and the same for the change in meaning for the mapping parameter -n pangenome/pggb@481ec23.

Caveat emptor when building from latest source rather than pinned releases I guess

subwaystation mentioned this issue Mar 6, 2024

fix wfmash v0.12.6 parameters and other bugs nf-core/pangenome#183

Merged
