
Feature request: support larger crop sizes (>2048 tokens) #153

Open
arogozhnikov opened this issue Nov 16, 2024 · 2 comments
Labels: enhancement, user support

Comments

arogozhnikov (Contributor) commented Nov 16, 2024

Chai-1 is limited to 2048 tokens (token = canonical amino acid or atom), and the main reason for this limit is high memory consumption.

We have received several requests to support larger crop sizes, but doing so requires significant engineering investment, though some scenarios are much simpler than others.

If you're critically bound by this limitation, please leave the following information:

  1. your use case (specifically, why it demands a larger crop size), the maximum crop size you need, and approximately how many inference runs (1? 10? 10,000?)
  2. your current single-node hardware setup: GPU model, GPU memory, number of GPUs per node, and the interconnect between GPUs within a node.

An example: 1. a fake subunit to integrate into viral capsids, ~4000 AAs, comparing 10-100 designs; 2. 8x A100 80GB with NVLink. (Plot twist: you likely don't need to model the full capsid to handle such a scenario, and that would be far faster, cheaper, and simpler.)

arogozhnikov added the enhancement and user support labels and pinned this issue on Nov 16, 2024
lukasz-kozlowski commented

In my case, I am currently working with over 500 sequences, each exceeding 2048 amino acids. Most are close to this limit, but a few extend to 5,000-7,000 amino acids. Of course, this is just my specific use case, and there will always be others with different requirements.

Support for a token limit closer to that of AF3 would be greatly appreciated.

I would suggest approaching this issue differently. Consider transitioning to H100 cards (and, in the future, others such as Blackwells). Conduct engineering tests to determine the maximum token limit that CHAI-1 can handle effectively.
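[Editor's note: a minimal sketch of what such an engineering test could look like, not from the original thread. It assumes a hypothetical `run_chai1(n_tokens)` wrapper that performs one inference on a synthetic input of the given token count, and that the installed PyTorch raises `torch.cuda.OutOfMemoryError` on OOM, as versions >= 1.13 do.]

```python
import torch

def find_max_tokens(run_chai1, lo: int = 512, hi: int = 8192) -> int | None:
    """Binary-search the largest token count that completes without OOM.

    `run_chai1(n_tokens)` is a hypothetical wrapper that runs one Chai-1
    inference on a synthetic input of `n_tokens` tokens.
    """
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        try:
            run_chai1(mid)
            best = mid       # fits: probe larger inputs
            lo = mid + 1
        except torch.cuda.OutOfMemoryError:
            hi = mid - 1     # out of memory: probe smaller inputs
        finally:
            torch.cuda.empty_cache()  # release cached blocks between probes
    return best
```

Running such a probe once per GPU model would produce exactly the kind of README table suggested below.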

Additionally, it might be helpful to include memory consumption guidelines in the README file. For example:

  • A100 40GB: maximum token count ~1500
  • A100 80GB: maximum token count ~2048
    ... and so on.

A quick workaround would also be to advise users to sort the sequences in their batch by length and run them until GPU memory becomes insufficient, along the lines of the sketch below.
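[Editor's note: a minimal sketch of that workaround, not from the original thread. `predict_one(path)` is a hypothetical stand-in for whatever single-run inference entry point you use, and the early return assumes that once one sequence OOMs, every longer sequence will too.]

```python
import torch
from pathlib import Path

def fasta_length(path: Path) -> int:
    # Total residue count: sum of the lengths of non-header lines.
    return sum(len(line.strip())
               for line in path.read_text().splitlines()
               if line and not line.startswith(">"))

def run_shortest_first(fasta_paths: list[Path], predict_one) -> list[Path]:
    """Run predictions shortest-first; return the inputs skipped after OOM.

    `predict_one(path)` is a hypothetical wrapper around one Chai-1 run.
    """
    ordered = sorted(fasta_paths, key=fasta_length)
    for i, fasta in enumerate(ordered):
        try:
            predict_one(fasta)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            return ordered[i:]  # everything remaining is at least as long
    return []
```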

amelie-iska commented Nov 29, 2024

Use case: Modeling entire hemichannels or gap junctions, and small ligands bound to them
Hardware: 4xA100 (80 GB) node
Number of Residues: ~2600 for a hemichannel + ligands, ~5200 for a gap junction + ligands

Ultimately, being able to model the open state vs. the closed state, potentially using some contact constraints, would be VERY useful to me, and I will need to do this with many kinds of gap junctions in a medium-throughput workflow (maybe 100-500 predictions per connexin type). Anything that passes the simpler connexin-ligand validation step then goes into the hemichannel-6x-ligand validation step or the gap-junction-6x-ligand validation step.
