Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jumble parameter for tree space search should be user configurable #101

Open
wsdewitt opened this issue Jun 15, 2022 · 6 comments
Open

Jumble parameter for tree space search should be user configurable #101

wsdewitt opened this issue Jun 15, 2022 · 6 comments

Comments

@wsdewitt
Copy link
Collaborator

The "Jumble" option in dnapars determines the number of random starts to try for tree space search, described as follows in the phylip docs:

The J (Jumble) option. In most of the tree construction programs (except for the "...penny" programs and Clique), the exact details of the search of different trees depend on the order of input of species. In these programs J option enables you to tell the program to use a random number generator to choose the input order of species. This option is toggled on and off by selecting option J in the menu. The program will then prompt you for a "seed" for the random number generator. The seed should be an integer between 1 and $2^{32}-3$ (which is $4,294,967,293$), and should be of form $4n+1$, which means that it must give a remainder of 1 when divided by 4. This can be judged by looking at the last two digits of the number (for example, in the upper limit given above, the last two digits are 93, which is of form $4n+1$. Each different seed leads to a different sequence of addition of species. By simply changing the random number seed and re-running the programs one can look for other, and better trees. If the seed entered is not odd, the program will not proceed, but will prompt for another seed.

This is currently fixed at 10 in the mkconfig program:

print("J")
# random seed for tree search
print("1")
print("10")

This should be made a parameter (default 10) in the command line interface.

@willdumm
Copy link
Contributor

willdumm commented Jul 12, 2022

@wsdewitt From what I can tell, changing the jumble parameter doesn't actually change the trees that dnapars finds. For example, using the sample data in example, doing

mkconfig --jumble 100 deduplicated.phylip dnapars > dnapars.cfg
dnapars < dnapars.cfg > dnapars.log

and doing

mkconfig --jumble 10 deduplicated.phylip dnapars > dnapars.cfg
dnapars < dnapars.cfg > dnapars.log

both result in 263 trees found by dnapars, in about the same amount of time. I suspect that dnapars is just ignoring the jumble parameter and only evaluating a single (?) input sequence order.

However, if I change the random seed passed to the jumble parameter in dnapars.cfg, I do get different output trees. For example, changing the config file output by the above command to mkconfig:

/home/wdumm/gctree/gctree/example/deduplicated.phylip
J
1
10
O
1
4
5
.
Y

to this config file:

/home/wdumm/gctree/gctree/example/deduplicated.phylip
J
5
10
O
1
4
5
.
Y

results in 137 (presumably different from the other 263) trees from dnapars. The number following J is the random seed, for shuffling the input sequence order, which must be of the form $4n + 1 $.

@willdumm willdumm reopened this Jul 12, 2022
@wsdewitt
Copy link
Collaborator Author

I am seeing larger parsimony forests with more jumbles, using the following commands.

deduplicate example/150228_Clone_3-8.fasta --root GL > deduplicated.phylip
mkconfig --jumble 10 deduplicated.phylip dnapars > dnapars.cfg
dnapars < dnapars.cfg > dnapars.log
head -5 outfile

DNA parsimony algorithm, version 3.697


    42 trees in all found
mkconfig --jumble 100 --quick deduplicated.phylip dnapars > dnapars.cfg
rm outfile outtree
dnapars < dnapars.cfg > dnapars.log
head -5 outfile

DNA parsimony algorithm, version 3.697


   287 trees in all found

Note these use the "quick" option. I haven't had time to try without the quick option yet.

@wsdewitt
Copy link
Collaborator Author

Ok I retried without --quick and found 263 trees with 1 jumble and with 2 jumbles (you found the same number using 10 or 100 jumbles). I'm not sure if this expected (we could ask Joe), or maybe it's just a feature of this data set that jumbling doesn't help the non-quick tree search find any more trees.

@willdumm
Copy link
Contributor

The funny thing is that changing the random seed for shuffling the sequences does change the trees that dnapars finds. You'd think that would be the same as shuffling multiple times starting with one seed.

@wsdewitt
Copy link
Collaborator Author

Good point. Ok maybe we should ask Joe. Do you agree @matsen?

@matsen
Copy link
Contributor

matsen commented Jul 15, 2022

Y'all are much deeper in this than me, so I don't have much to say, but I'm sure that Joe would be happy to hear from you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants