Please describe in detail the differences between the webserver and local installation, and how to achieve similar results on both #176
The chai-1 model is non-deterministic, so we expect a degree of variance in the results, particularly in the MSA-related case you have pointed out.
Our preprint describes results for running the model with MSAs in conjunction with language model embeddings, and without them in "single sequence mode", which only uses protein language model embeddings. Unless a result is specifically labeled as single-sequence, it uses MSAs. Let me know if you have a question about a specific result.
The webserver uses the MSA setup described in our preprint's methods section. We use jackhmmer with the following flags to query UniProt, UniRef90, and MGNify, using the datasource-specific --seq_limit described in our paper.
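As an illustration of the shape of such a per-database call only (the threshold values below are AlphaFold-style stand-ins rather than the settings from the preprint, and the database paths are placeholders):

```python
import subprocess
from pathlib import Path

def run_jackhmmer(query_fasta: Path, database: Path, out_sto: Path, n_cpu: int = 8) -> None:
    """Query one database with jackhmmer and save the alignment in Stockholm format.

    Flag values are illustrative (AlphaFold-style defaults), not the exact
    settings used for the Chai-1 preprint or webserver.
    """
    cmd = [
        "jackhmmer",
        "-o", "/dev/null",     # suppress the human-readable report
        "-A", str(out_sto),    # write the resulting alignment here
        "--noali",             # omit alignments from the main output
        "--incE", "0.0001",    # inclusion E-value threshold (illustrative)
        "-E", "0.0001",        # reporting E-value threshold (illustrative)
        "--cpu", str(n_cpu),
        "-N", "1",             # a single search iteration (illustrative)
        str(query_fasta),
        str(database),
    ]
    subprocess.run(cmd, check=True)

# One search per data source, as described in the preprint's methods section.
for db in ["uniprot.fasta", "uniref90.fasta", "mgnify.fasta"]:  # placeholder paths
    run_jackhmmer(Path("query.fasta"), Path(db), Path(db).with_suffix(".sto"))
```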
We don't have any special scripts for preprocessing/postprocessing the MSAs returned by the above call; most of the MSA code that isn't in this repo just passes data around the server backend to keep things manageable, so it shouldn't affect the results and probably isn't very helpful unless you're setting up your own MSA server. If you're looking for an easy way to run MSAs without setting everything up locally, we have also added integration with the mmseqs server; see our updated README for an example. Note, however, that we did not use mmseqs to generate results for our paper, nor do we use it on the server; it is provided mostly as a convenience for prototyping and experimentation.
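Roughly, that README-style mmseqs path looks like the following; the exact run_inference signature varies between releases, so treat the parameter names (in particular use_msa_server) as an approximation of the documented example rather than a guaranteed API:

```python
from pathlib import Path
from chai_lab.chai1 import run_inference

# Sketch of the mmseqs convenience path. Flag names (use_msa_server,
# use_esm_embeddings) follow the README example; the exact run_inference
# signature may differ between chai_lab releases.
candidates = run_inference(
    fasta_file=Path("example.fasta"),
    output_dir=Path("./outputs"),
    use_esm_embeddings=True,
    use_msa_server=True,  # fetch MSAs from the public mmseqs2 server
    seed=42,
)
```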
@wukevin Additionally, the preprint mentions using three databases: UniProt, UniRef90, and MGNify.
This is exactly the point: running the pipeline always involves executing specific commands, but there are countless combinations of different options, programs, and their versions. Therefore, exact commands (e.g., a bash script) are essential to replicate what you did on the backend or for the paper. Without this, it's merely guessing. For example, what thresholds were used? Did you stick to defaults? Note that default thresholds may differ across different versions of MSA programs (see comment above).

You stated, "We use jackhmmer with the following flags to query UniProt, UniRef90, and MGNify using the datasource-specific --seq_limit described in our paper." So, do you actually query each database iteratively and end up with three separate MSA results? How do you merge those MSA files? If not, how do you combine UniProt, UniRef90, and MGNify databases before running jackhmmer? Do you cluster the merged database beforehand? Combining unclustered UniProt/MGNify with clustered UniRef90 into one run is biologically problematic. The inconsistencies could lead to biased, suboptimal, or misleading results.

Finally, was the 'reduced BFD' used anywhere? If so, why was the reduced version chosen instead of the full BFD?
I'm not sure what you mean by the
Perhaps you can do a clean install of
We do not search BFD on the server, but it is used during training.
We understand the importance of reproducibility, which is why we have detailed the exact command used for MSA generation above.
Yes, we end up with separate MSA results, which are then combined following the procedure outlined in the AF3 supplement.
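As a purely illustrative sketch of what combining separate per-database MSAs can look like (a naive concatenation with deduplication; this is not the AF3-supplement procedure, which also specifies ordering and pairing details):

```python
from pathlib import Path

def combine_a3m(msa_paths: list[Path], out_path: Path) -> None:
    """Naively merge per-database .a3m MSAs: keep the query once, then append
    hits from each file, dropping exact duplicate sequences. Illustration only;
    this is not the merging procedure described in the AF3 supplement."""
    merged: list[str] = []
    seen: set[str] = set()
    for idx, path in enumerate(msa_paths):
        lines = [l for l in path.read_text().splitlines() if l.strip()]
        # assumes simple alternating header/sequence lines (no line wrapping)
        entries = list(zip(lines[0::2], lines[1::2]))
        start = 0 if idx == 0 else 1  # keep the query only from the first file
        for header, seq in entries[start:]:
            if seq in seen:
                continue
            seen.add(seq)
            merged.extend([header, seq])
    out_path.write_text("\n".join(merged) + "\n")

combine_a3m(
    [Path("uniprot.a3m"), Path("uniref90.a3m"), Path("mgnify.a3m")],  # placeholder paths
    Path("combined.a3m"),
)
```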
We use reduced BFD during training; we chose the reduced version in particular because it focuses on representative sequences.
Thank you for your response. I came across the following issue in this repository; on the other hand, in another issue, there is a proposal on how to patch the inference code.
Is it correct to understand that this can be achieved by placing the separate MSA results in any directory and running inference from there?
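For example, I would expect something along the following lines to work; the msa_directory argument name is my assumption based on the patches linked above, and the expected MSA file format inside that directory is not shown here:

```python
from pathlib import Path
from chai_lab.chai1 import run_inference

# Assumption: the patched run_inference accepts an msa_directory argument
# pointing at precomputed per-target MSA files. The argument name and the
# expected file format inside that directory are guesses based on the
# patches referenced above.
candidates = run_inference(
    fasta_file=Path("example.fasta"),
    output_dir=Path("./outputs"),
    use_esm_embeddings=True,
    msa_directory=Path("./precomputed_msas"),
)
```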
Yes, we applied the patches in this thread to add that support. We also applied the patch mentioned here to write the outputs in the desired format.
After running the same cases both locally and remotely via the webserver, I noticed differences in the results. This is primarily due to the MSA option, which is understandable, as MSAs add more information and generally improve the outcomes.
I have reviewed the README, but I still have several questions:
a) Are the results presented in the preprint or whitepapers based on the MSA version, or the version without MSA?
b) What exact database is used on the webserver backend? Specifically, I’d like to know the database name, the MSA tool utilized, the parameters applied, and any other relevant details.
Ideally, it would be incredibly helpful if you could share scripts for the pipeline you are using.
At the moment, replicating your results is quite challenging. Most local users will likely run the method only with use_esm_embeddings=True, which leads to suboptimal results.
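For reference, a typical minimal local run looks something like this (single-sequence mode; parameter names other than use_esm_embeddings follow the README example and may differ between versions):

```python
from pathlib import Path
from chai_lab.chai1 import run_inference

# Typical local run without MSAs: only ESM embeddings are used ("single
# sequence mode"), which generally gives weaker results than the webserver.
candidates = run_inference(
    fasta_file=Path("example.fasta"),
    output_dir=Path("./outputs"),
    use_esm_embeddings=True,
    num_trunk_recycles=3,
    num_diffn_timesteps=200,
    seed=42,
)
```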