
Feature request: support larger crop sizes (>2048 tokens) #153

Open
arogozhnikov opened this issue Nov 16, 2024 · 2 comments
Labels: enhancement, user support

Comments

arogozhnikov (Contributor) commented Nov 16, 2024

Chai-1 is limited to 2048 tokens (token = canonical amino acid or atom), and the main reason for this limit is high memory consumption.

We have received several requests to support larger crop sizes, but doing so requires significant engineering investment, though some scenarios are much simpler than others.

If you're critically bound by this limitation, please leave the following information:

  1. your use case (specifically, why it demands a larger crop size), the maximum crop size you need, and approximately how many inference runs (1? 10? 10,000?)
  2. your current single-node hardware setup: GPU model, GPU memory, number of GPUs per node, and the interconnect between GPUs within a node.

An example: 1. a fake subunit to integrate into viral capsids, ~4000 AAs, comparing 10-100 designs; 2. 8x A100 80GB with NVLink. (Plot twist: you likely don't need to model the full capsid to handle such a scenario, and that would be far faster, cheaper, and simpler.)

arogozhnikov added the enhancement and user support labels and pinned this issue on Nov 16, 2024
lukasz-kozlowski commented

In my case, I am currently working with over 500 sequences, each exceeding 2048 amino acids. Most are close to this limit, but a few extend to 5,000-7,000 amino acids. Of course, this is just my specific use case, and there will always be others with different requirements.

Support for a token limit closer to that of AF3 would be greatly appreciated.

I would suggest approaching this issue differently. Consider transitioning to H100 cards (and, in the future, others such as Blackwells). Conduct engineering tests to determine the maximum token limit that CHAI-1 can handle effectively.
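[Editor's note: a minimal sketch of what such an engineering test could look like, not from the original thread. It assumes a hypothetical `run_chai1(n_tokens)` wrapper that performs one inference on a synthetic input of the given token count, and that the installed PyTorch raises `torch.cuda.OutOfMemoryError` on OOM, as versions >= 1.13 do.]

```python
import torch

def find_max_tokens(run_chai1, lo: int = 512, hi: int = 8192) -> int | None:
    """Binary-search the largest token count that completes without OOM.

    `run_chai1(n_tokens)` is a hypothetical wrapper that runs one Chai-1
    inference on a synthetic input of `n_tokens` tokens.
    """
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        try:
            run_chai1(mid)
            best = mid       # fits: probe larger inputs
            lo = mid + 1
        except torch.cuda.OutOfMemoryError:
            hi = mid - 1     # out of memory: probe smaller inputs
        finally:
            torch.cuda.empty_cache()  # release cached blocks between probes
    return best
```

Running such a probe once per GPU model would produce exactly the kind of README table suggested below.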

Additionally, it might be helpful to include memory consumption guidelines in the README file. For example:

  • A100 40GB: maximum token count ~1500
  • A100 80GB: maximum token count ~2048
    ... and so on.

A quick workaround would also be to advise users to sort the sequences in their batch by length and run them until GPU memory becomes insufficient, along the lines of the sketch below.
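[Editor's note: a minimal sketch of that workaround, not from the original thread. `predict_one(path)` is a hypothetical stand-in for whatever single-run inference entry point you use, and the early return assumes that once one sequence OOMs, every longer sequence will too.]

```python
import torch
from pathlib import Path

def fasta_length(path: Path) -> int:
    # Total residue count: sum of the lengths of non-header lines.
    return sum(len(line.strip())
               for line in path.read_text().splitlines()
               if line and not line.startswith(">"))

def run_shortest_first(fasta_paths: list[Path], predict_one) -> list[Path]:
    """Run predictions shortest-first; return the inputs skipped after OOM.

    `predict_one(path)` is a hypothetical wrapper around one Chai-1 run.
    """
    ordered = sorted(fasta_paths, key=fasta_length)
    for i, fasta in enumerate(ordered):
        try:
            predict_one(fasta)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            return ordered[i:]  # everything remaining is at least as long
    return []
```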

amelie-iska commented Nov 29, 2024

Use case: Modeling entire hemichannels or gap junctions, and small ligands bound to them
Hardware: 4xA100 (80 GB) node
Number of Residues: ~2600 for a hemichannel + ligands, ~5200 for a gap junction + ligands

Ultimately, being able to model the open state vs. the closed state, potentially using some contact constraints, would be VERY useful to me, and I will need to do this with many kinds of gap junctions in a medium-throughput workflow (maybe 100-500 predictions per connexin type). Anything that passes the simpler connexin-ligand validation step then goes into the hemichannel-6x-ligand validation step or the gap-junction-6x-ligand validation step.
