
Please describe in detail the differences between the webserver and local installation, and how to achieve similar results on both #176

Open
lukasz-kozlowski opened this issue Nov 25, 2024 · 7 comments


@lukasz-kozlowski

After running the same cases both locally and remotely via the webserver, I noticed differences in the results. This is primarily due to the MSA option, which is understandable, as MSAs add more information and generally improve the outcomes.

I have reviewed the README at this link, but I still have several questions:
a) Are the results presented in the preprint or whitepapers based on the MSA version, or the version without MSA?
b) What exact database is used on the webserver backend? Specifically, I’d like to know the database name, the MSA tool utilized, the parameters applied, and any other relevant details.

Ideally, it would be incredibly helpful if you could share scripts for the pipeline you are using.

At the moment, replicating your results is quite challenging. Most local users will likely run the method only with use_esm_embeddings=True, which leads to suboptimal results.

@wukevin
Contributor

wukevin commented Nov 27, 2024

The chai-1 model is non-deterministic, so we expect a degree of variance in the results, particularly in the case of MSAs that you have pointed out.

> a) Are the results presented in the preprint or whitepapers based on the MSA version, or the version without MSA?

Our preprint describes results for running the model with MSAs in conjunction with language model embeddings, and without them in "single sequence mode" which only uses protein language model embeddings. Unless a result is specifically labeled as single-sequence, it is using MSAs. Let me know if you have a question about a specific result.

> b) What exact database is used on the webserver backend? Specifically, I’d like to know the database name, the MSA tool utilized, the parameters applied, and any other relevant details.

The webserver uses the MSA setup described in the methods section of our preprint. We run jackhmmer with the following flags to query UniProt, UniRef90, and MGNify, using the datasource-specific --seq_limit described in our paper.

-N 1 -E 0.0001 --incE 0.0001 --F1 0.0005 --F2 0.00005 --F3 0.0000005
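For readers trying to reproduce this locally, here is a minimal sketch of how the flags above could be assembled into one jackhmmer call per database. It is not the maintainers' script. The database paths, file names, and the `build_jackhmmer_cmd` helper are hypothetical, and `--seq_limit` is only added when a value is supplied, since (as discussed later in this thread) that flag is absent from stock HMMER 3.4 and its per-database values are given in the preprint, not here.

```python
import shlex

# Flags quoted verbatim from the comment above.
BASE_FLAGS = [
    "-N", "1",
    "-E", "0.0001",
    "--incE", "0.0001",
    "--F1", "0.0005",
    "--F2", "0.00005",
    "--F3", "0.0000005",
]


def build_jackhmmer_cmd(query_fasta, database_fasta, output_msa,
                        seq_limit=None, cpus=8):
    """Return the argv list for one jackhmmer search (hypothetical helper).

    seq_limit is only passed through when set, because --seq_limit exists
    only in a patched jackhmmer build, not in the official release.
    """
    cmd = ["jackhmmer", "--cpu", str(cpus), *BASE_FLAGS]
    if seq_limit is not None:
        cmd += ["--seq_limit", str(seq_limit)]
    # -A saves the alignment of hits; stock jackhmmer writes Stockholm format.
    cmd += ["-A", output_msa, query_fasta, database_fasta]
    return cmd


# Hypothetical local paths: one independent search per database.
DATABASES = {
    "uniprot": "/path/to/uniprot.fasta",
    "uniref90": "/path/to/uniref90.fasta",
    "mgnify": "/path/to/mgnify.fasta",
}

if __name__ == "__main__":
    for name, path in DATABASES.items():
        cmd = build_jackhmmer_cmd("query.fasta", path, f"{name}.sto")
        # Print rather than execute, so the command can be inspected first.
        print(shlex.join(cmd))
```

Each printed command can be run as-is with an unpatched HMMER; the searches are independent, so the three outputs stay as separate files.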

> Ideally, it would be incredibly helpful if you could share scripts for the pipeline you are using.

We don't have any special scripts for preprocessing/postprocessing the MSAs returned by the above call; most of the MSA code that isn't available under this repo is just passing stuff around the server backend to keep things manageable, so it shouldn't impact the results and probably isn't very helpful unless you're setting up your own MSA server.

If you are looking for an easy way to run MSAs without setting everything up locally, we have also added integration with the mmseqs server; see our updated README for an example. Note, however, that we did not use mmseqs to generate results for our paper, nor do we use it on the server; it is provided mostly as a convenience for prototyping and experimentation.

@komatsuna-san

@wukevin
In my environment (HMMER ver. 3.4), the --seq_limit option for jackhmmer does not seem to exist. How are you utilizing --seq_limit in your execution environment?

Additionally, the preprint mentions using three databases: UniRef90, UniProt, and MGnify. However, the table explaining the --seq_limit values also lists reduced BFD. Is it correct to understand that reduced BFD was also used in the processing described in the preprint and on your public server?

@lukasz-kozlowski
Author

This is exactly the point: running the pipeline always involves executing specific commands, but there are countless combinations of different options, programs, and their versions. Therefore, exact commands (e.g., a bash script) are essential to replicate what you did on the backend or for the paper. Without this, it’s merely guessing. For example, what thresholds were used? Did you stick to defaults? Note that default thresholds may differ across different versions of MSA programs (see comment above).

You stated, "We use jackhmmer with the following flags to query UniProt, UniRef90, and MGNify using the datasource-specific --seq_limit described in our paper." So, do you actually query each database iteratively and end up with three separate MSA results? How do you merge those MSA files?

If not, how do you combine UniProt, UniRef90, and MGNify databases before running jackhmmer? Do you cluster the merged database beforehand?

Combining unclustered UniProt/MGNify with clustered UniRef90 into one run is biologically problematic. The inconsistencies could lead to biased, suboptimal, or misleading results.

Finally, was the 'reduced BFD' used anywhere? If so, why was the reduced version chosen instead of the full BFD?

@wukevin
Contributor

wukevin commented Dec 3, 2024

> @wukevin In my environment (HMMER ver. 3.4), the --seq_limit option for jackhmmer does not seem to exist. How are you utilizing --seq_limit in your execution environment?

I'm not sure what you mean by the --seq_limit option being missing; here's our internal jackhmmer:

>>> jackhmmer -h
# jackhmmer :: iteratively search a protein sequence against a protein database
# HMMER 3.4 (Aug 2023); http://hmmer.org/
# Copyright (C) 2023 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Usage: jackhmmer [-options] <seqfile> <seqdb>
...
  --seq_limit <n> : maximum number of output sequences  (n>=0)
...

Perhaps you can do a clean install of hmmer or reach out to the maintainers for more help on that front.

> Additionally, the preprint mentions using three databases: UniRef90, UniProt, and MG-nify. However, in the table explaining the --seq_limit values, reduced BFD is also listed. Is it correct to understand that reduced BFD was also used in the processing described in the preprint and on your opened server?

We do not search BFD on the server, but it is used during training.

@wukevin
Contributor

wukevin commented Dec 3, 2024

> This is exactly the point: running the pipeline always involves executing specific commands, but there are countless combinations of different options, programs, and their versions. Therefore, exact commands (e.g., a bash script) are essential to replicate what you did on the backend or for the paper. Without this, it’s merely guessing. For example, what thresholds were used? Did you stick to defaults? Note that default thresholds may differ across different versions of MSA programs (see comment above).

We understand the importance of reproducibility, which is why we have detailed the exact command used for jackhmmer. As stated in our preprint, we used version 3.4.

> You stated, "We use jackhmmer with the following flags to query UniProt, UniRef90, and MGNify using the datasource-specific --seq_limit described in our paper." So, do you actually query each database iteratively and end up with three separate MSA results? How do you merge those MSA files?

Yes, we end up with separate MSA results, which are then combined following the procedure outlined in the AF3 supplement.
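To make "separate results, then combined" concrete, here is a deliberately simplified sketch of stacking per-database A3M hits while dropping exact duplicate rows. It is not the AF3 supplement's procedure and not code from this repo; the `read_a3m` and `merge_msas` helpers are hypothetical, and the deduplication rule (exact match on the aligned row, first occurrence wins) is an assumption made for illustration only.

```python
def read_a3m(text):
    """Parse A3M/FASTA-style text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def merge_msas(per_database):
    """Stack hits from several databases, keeping the first copy of each row.

    per_database maps a source name to its (header, sequence) records, each
    list assumed to start with the query. Returns (source, header, sequence)
    tuples so that the origin of every row is preserved.
    """
    merged = []
    seen = set()
    for source, records in per_database.items():
        for header, seq in records:
            if seq in seen:
                continue  # exact duplicate row, including repeated queries
            seen.add(seq)
            merged.append((source, header, seq))
    return merged


if __name__ == "__main__":
    uniref = read_a3m(">query\nMKV\n>hit1\nMK-\n")
    mgnify = read_a3m(">query\nMKV\n>hit2\nM-V\n>hit3\nMK-\n")
    for row in merge_msas({"uniref90": uniref, "mgnify": mgnify}):
        print(row)
```

Keeping the source name on every row matters here because the databases are searched separately; whatever the real combination step does, it can only use per-database information if that label survives the merge.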

> Finally, was the 'reduced BFD' used anywhere? If so, why was the reduced version chosen instead of the full BFD?

We use reduced BFD during training; we use the reduced version in particular as it is focused on the representative sequences.

@komatsuna-san

komatsuna-san commented Dec 4, 2024

@wukevin

> I'm not sure what you mean by the --seq_limit option being missing; here's our internal jackhmmer: Perhaps you can do a clean install of hmmer or reach out to the maintainers for more help on that front.

Thank you for your response. I came across the following issue in the HMMER GitHub repository.
EddyRivasLab/hmmer#330
From this, it appears that the official HMMER (jackhmmer) does not include the --seq_limit option.

On the other hand, in another issue, there is a proposal on how to patch jackhmmer to enable the --seq_limit option. Additionally, there is mention of a patch to output .a3m files instead of Stockholm format from jackhmmer.
EddyRivasLab/hmmer#323
Could it be that the jackhmmer installed in your environment is not the native version but one that has been patched as described above?

> Yes, we end up with separate MSA results, which are then combined following the procedure outlined in the AF3 supplement.

Is it correct to understand that this can be achieved by placing the separate MSA results in any directory and running chai_lab/data/parsing/msas/aligned_pqt.py on that directory?

@jackdent
Contributor

jackdent commented Dec 4, 2024

Yes, we applied the patches in this thread to add the seq_limit flag to hmmer and limit the depth of the MSA. The seq_limit flag is not strictly necessary: it's an optimization that helps reduce data processing time in some instances. It does not impact results.

We also applied the patch mentioned here to write the outputs in the a3m format.
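For anyone unsure whether their local build carries the patch, one option is to scan the output of `jackhmmer -h` for the flag, as in the help listing quoted earlier in this thread. The sketch below is an illustration only; both helper names are hypothetical, and it assumes the option name appears as the first token of its help line, as it does in that listing.

```python
import shutil
import subprocess


def help_mentions_flag(help_text, flag="--seq_limit"):
    """Return True if a help line starts with the given option name."""
    for line in help_text.splitlines():
        tokens = line.split()
        if tokens and tokens[0] == flag:
            return True
    return False


def local_jackhmmer_has_flag(flag="--seq_limit", binary="jackhmmer"):
    """Check the jackhmmer found on PATH; returns None if it is not installed."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run([path, "-h"], capture_output=True, text=True)
    # Some tools print help to stderr, so both streams are scanned.
    return help_mentions_flag(result.stdout + result.stderr, flag)


if __name__ == "__main__":
    print(local_jackhmmer_has_flag())
```

A result of `False` on an official HMMER 3.4 install would be consistent with the explanation above: the flag comes from the patch, not from the upstream release.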
