Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable clinvar messages to be ingested from kafka out of order #775

Open
theferrit32 opened this issue May 31, 2023 · 0 comments
Open

Enable clinvar messages to be ingested from kafka out of order #775

theferrit32 opened this issue May 31, 2023 · 0 comments
Assignees
Labels
clinvar Clinvar data exchange and reporting enhancement New feature or request

Comments

@theferrit32
Copy link
Contributor

Recent changes to the variation-normalization deployment to use nginx load balancing enables more performant parallel request handling.

In the vrs_analysis module, I enabled parallelism that outputs finished inputs out of order, as soon as completed, in order to maximize parallelism while not blowing up memory usage. In the snapshot module, we now take advantage of the nginx load balancing to improve the parallelism of each batch of kafka records, but if one event takes a long time to finish, the snapshot-populate function cannot proceed to processing the next batch of kafka messages. The greedy parallelism only happens within each batch of kafka messages.

One way to speed up the snapshot-populate function therefore would be to enable it to start processing messages in the next batches of kafka messages before it has finished the current one. In order to control RAM usage this will also require outputting finished events out of order by using claypoole/upmap. But we do have some constraints around what has to be processed before other things. At the moment, clinical_assertions need their variations, traits, and trait_sets to exist in the local database beforehand because they get embedded/linked to a specific version during the add-data-for-clinical-assertion transformer function.

There are two possible solutions to enable out of order processing:

  1. do not do any lookups to other objects in the local db during ingest/transform. This will push that logic into the query side, the snapshot function and any other function that wants to get a full clinical_assertion document will need to do additional queries for the individual subcomponents
  2. process out-of-order only within a data type. As in, the order of processing variations doesn't matter, but variations need to all be processed before clinical_assertions. We can use a lazy-seq partitioning on entity_type to do this. Where each partition is processed with claypoole/upmap, but the list of partitions is processed one at a time.

I think (2) is a preferable option because it doesn't require changing any code in the transformers, and this logic ideally could be applied to other spots we ingest messages which can have some disordering but do need some amount of order.

Genegraph unordered parallel ingest

@theferrit32 theferrit32 self-assigned this Jun 1, 2023
@theferrit32 theferrit32 added enhancement New feature or request clinvar Clinvar data exchange and reporting labels Jun 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clinvar Clinvar data exchange and reporting enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant