Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

too many alignments with v0.12.6 #227

Open
subwaystation opened this issue Mar 1, 2024 · 20 comments
Open

too many alignments with v0.12.6 #227

subwaystation opened this issue Mar 1, 2024 · 20 comments

Comments

@subwaystation
Copy link
Contributor

Hi developers :)

running

wfmash \
    scerevisiae8.fasta.gz \
    scerevisiae8.fasta.gz \
     \
    --threads 4 \
     \
    -n 7 -s 5000 -p 90.0  -X  -l 25000 -k 19 -H 0.001   -2 30 > scerevisiae8.fasta.gz.paf    

It seems I am getting too many alignments?

out

I would have expected to only get the diagonal output in an all-vs-all setting, right? Or am I missing something?

Best,
Simon

@AndreaGuarracino
Copy link
Member

Replace -n 7 with -n 1 and you will be happy! The mapper has changed (and is changing) a lot, but not the documentation around the internet...

@AndreaGuarracino
Copy link
Member

And also -Y # would help

@subwaystation
Copy link
Contributor Author

LOL.

@subwaystation
Copy link
Contributor Author

So PanSN-spec mandatory? Why -n 1? You make nf-core/pangenome unhappy.

@subwaystation
Copy link
Contributor Author

subwaystation commented Mar 1, 2024

wfmash scerevisiae8.fasta.gz scerevisiae8.fasta.gz --threads 16  -s 5000 -p 90.0  -X  -l 25000 -k 19 -H 0.001   -2 30 -Y'#' -n 1 > scerevisiae8.fasta.gz.ERIK.paf

./git/wfmash/scripts/paf2dotplot png large scerevisiae8.fasta.gz.ERIK.paf

out

I can't see any difference. What am I missing?

wfmash --version
v0.12.6-0-g7205bf7

@ekg
Copy link
Collaborator

ekg commented Mar 1, 2024

Might be good to remove -X. It could be conflicting.

@subwaystation
Copy link
Contributor Author

this didn't do the trick

@subwaystation
Copy link
Contributor Author

The input sequences are all just SC888#ChrI and not SC888#1#ChrI. But that should not be the issue here.

Could you please send an example command line where I can just get the diagonal?

@bkille
Copy link
Contributor

bkille commented Mar 2, 2024

So PanSN-spec mandatory? Why -n 1? You make nf-core/pangenome unhappy.

Hi @subwaystation, we're updating wfmash w/ the improvements from MashMap3.

One of these improvements requires that filtering is applied to each pair of assemblies independently. Consider the extreme/contrived scenario where SGDref has 7 copies of a particular region, where each copy is 100% identical to a single region in genome SK1. In the old version of wfmash with -n 7, you might see that region in genome SK1 mapped to each of the 7 copies in genome SGDref, kicking out all of the mappings to the other genomes. To fix this, MashMap3 applies the -n parameter (as well as other filtering parameters) to each pair of assemblies independently.

In a similar fashion, MashMap3 also filters secondary mappings based on the best hit. Again, its important that it can do this independently across each target assembly. For example, if our query was a copy of SGDref, its "top hit" for every region would be 100%, but we shouldn't consider that the "top hit" for each target assembly.

@bkille
Copy link
Contributor

bkille commented Mar 2, 2024

The input sequences are all just SC888#ChrI and not SC888#1#ChrI. But that should not be the issue here.

Could you please send an example command line where I can just get the diagonal?

Also, the organization of the plotting script doesn't seem to be clustering the mappings very well. Here's what happens if I plot the alignments w/ pafplot. There aren't any contig names unfortunately, but at least you can see that things are lining up a bit more like expected.

scerevisiae8-all

>$ wfmash ~/Data/pangenomes/scerevisiae8.fa -m --threads 16  -s 5000 -p 90.0  -X  -l 25000 -k 19 -H 0.001   -2 30 -Y'#' -n 1 > scerevisiae8.fasta.gz.ERIK.paf
[wfmash] Warning: Detected single file all-vs-all mapping with no other options. Consider adding -L, --lower-triangular for efficiency.
[mashmap] Skipping self mappings for single file all-vs-all mapping.
[mashmap] MashMap v3.1.1
[mashmap] Reference = [/home/Users/blk6/Data/pangenomes/scerevisiae8.fa]
[mashmap] Query = [/home/Users/blk6/Data/pangenomes/scerevisiae8.fa]
[mashmap] Kmer size = 19
[mashmap] Sketch size = 298
[mashmap] Segment length = 5000 (read split allowed)
[mashmap] Block length min = 25000
[mashmap] Chaining gap max = 20000
[mashmap] Mappings per segment = 1
[mashmap] Percentage identity threshold = 90%
[mashmap] Skip self mappings
[mashmap] Skipping sequences containing the same prefix based on the delimiter "#"
[mashmap] Hypergeometric filter w/ delta = 0.3 and confidence 0.999
[mashmap] Mapping output file = /dev/stdout
[mashmap] Filter mode = 1 (1 = map, 2 = one-to-one, 3 = none)
[mashmap] Execution threads  = 16
[mashmap::skch::Sketch::build] minmer windows picked from reference = 11171389
[mashmap::skch::Sketch::index] unique minmers = 997130
[mashmap::skch::Sketch::computeFreqHist] Frequency histogram of minmer interval points = (2, 117167) ... (11144, 1)
[mashmap::skch::Sketch::computeFreqHist] With threshold 0.001%, ignore minmers occurring >= 3208 times during lookup.
[wfmash::map] time spent computing the reference index: 6.29692 sec
[mashmap::skch::Map::mapQuery] mapped 100.00% @ 1.37e+07 bp/s elapsed: 00:00:00:07 remain: 00:00:00:00
[mashmap::skch::Map::mapQuery] count of mapped reads = 136, reads qualified for mapping = 136, total input reads = 136, total input bp = 96255507
[wfmash::map] time spent mapping the query: 7.06e+00 sec
[wfmash::map] mapping results saved in: /dev/stdout

>$ pafplot scerevisiae8.fasta.gz.ERIK.paf --png scerevisiae8-all.png --size 2000

@subwaystation
Copy link
Contributor Author

scerevisiae8 fasta gz paf

It seems I just used the wrong tool for plotting. Above the plot including base pair level alignments.
From my observations so far, changing -n for this data set didn't change the mappings according to the pafplot.

@subwaystation
Copy link
Contributor Author

Thanks @bkille

@bkille
Copy link
Contributor

bkille commented Mar 2, 2024

That's not too surprising that the results are the same. The -n flag only specifies the maximum number of matches for a segment. Unless a region has an alternative mapping of decent ANI and length, there will likely be only one mapping regardless of the maximum number of specified.

@subwaystation
Copy link
Contributor Author

aah got it!

assuming there would be more mappings in the PAF, do you know how this would affect the pangenome graph construction with seqwish?

@subwaystation
Copy link
Contributor Author

or would this be resolved with wfmash during the alignment step?

@subwaystation
Copy link
Contributor Author

@ekg Would the lower triangle mode be sufficient as input for pggb?

@baozg
Copy link

baozg commented Mar 4, 2024

Following question, @bkille

I use the sequence without "#" and follow panSN, why the running time is so different?

Name: Chr1, this alignment only have centromere / rDNA mapping in the final paf

wfmash -s 10000 -p 80 -n 1 -k 19 -H 0.001 -Y # -t 36 --hg-filter-ani-diff 30 Col-CC.fa Rabacal-1.fa
5101.41s user 14.62s system 1369% cpu 373.69s total 3126988Kb max memory
--
wfmash -s 10000 -p 90 -n 1 -k 19 -H 0.001 -Y # -t 36 --hg-filter-ani-diff 30 Col-CC.fa Rabacal-1.fa
1700.41s user 13.18s system 1140% cpu 150.23s total 3043676Kb max memory
--
wfmash -s 10000 -p 95 -n 1 -k 19 -H 0.001 -Y # -t 36 --hg-filter-ani-diff 30 Col-CC.fa Rabacal-1.fa
103.47s user 11.58s system 417% cpu 27.54s total 1458372Kb max memory

Name: Col-CC#1#Chr1, Rabacal-1#1#Chr1

wfmash -a -s 10000 -p 80 -n 1 -k 19 -H 0.001 -Y # -t 36 --hg-filter-ani-diff 30 Col-CC.panSN.fa.gz Rabacal-1.panSN.fa.gz
2987.04s user 13.63s system 772% cpu 388.25s total 3944900Kb max memory
--
wfmash -a -s 10000 -p 90 -n 1 -k 19 -H 0.001 -Y # -t 36 --hg-filter-ani-diff 30 Col-CC.panSN.fa.gz Rabacal-1.panSN.fa.gz
1897.45s user 13.90s system 1092% cpu 175.01s total 4115444Kb max memory
--
wfmash -a -s 10000 -p 95 -n 1 -k 19 -H 0.001 -Y # -t 36 --hg-filter-ani-diff 30 Col-CC.panSN.fa.gz Rabacal-1.panSN.fa.gz
625.62s user 12.68s system 901% cpu 70.81s total 3093592Kb max memory

@subwaystation
Copy link
Contributor Author

subwaystation commented Mar 6, 2024

@baozg As far as I figured just right now, you need to add --lower-triangle to your command line. You still get the all-vs-all relationship, but just from one side. Sufficient!

@subwaystation
Copy link
Contributor Author

But such things are not documented... :P @bkille @AndreaGuarracino

@ASLeonard
Copy link
Contributor

The changes are at least verbose in the actual pggb.sh script in the pggb repo. Looking through the blame, --lower-triangle was added 6 months ago in pangenome/pggb@1478789 and the same for the change in meaning for the mapping parameter -n pangenome/pggb@481ec23.

Caveat emptor when building from latest source rather than pinned releases I guess

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants