Ensure concurrency safety when running multiple instances #54

Open
GianlucaFicarelli opened this issue Nov 28, 2024 · 1 comment

GianlucaFicarelli commented Nov 28, 2024

At the moment we deploy only one instance of the service, but in the future we may want to run multiple instances to scale horizontally.

Before doing that, we should ensure that the service still works correctly when several instances run at the same time.

In particular:

  1. Alembic migration: it runs when the container starts, before uvicorn is launched. If multiple containers start at the same time, they can race to apply the same migration, which could cause it to fail. Possible solutions:
    1. Ensure that the container running the migration acquires a lock. The other containers wait, then skip the migration because the db is already up to date.
    2. Run the migration as a step of the CI pipeline that performs the deployment.
    3. Use another leader election mechanism.
    4. Run the migration manually (I would avoid that if possible).
  2. Tasks: each container runs some background tasks, so with multiple instances they would run concurrently:
    1. queue_consumers (they consume messages from the oneshot, longrun, and storage queues)
    2. job_chargers (they run periodically and charge the user for running or finished oneshot, longrun, and storage jobs that have not been charged yet)
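
Option 1.1 above (take a lock, let the other containers wait, then skip the migration) could be sketched roughly as follows. This is only an illustration, not necessarily what the service does: it assumes PostgreSQL and a DB-API connection that is not in autocommit mode and uses `%s` placeholders (as psycopg2 does). `MIGRATION_LOCK_KEY`, `current_revision`, `target_revision` and `upgrade` are hypothetical names introduced for the example, not part of Alembic.

```python
from contextlib import closing

# Hypothetical lock key: any 64-bit integer that all instances agree on.
MIGRATION_LOCK_KEY = 54


def migrate_with_advisory_lock(conn, current_revision, target_revision, upgrade):
    """Run upgrade() at most once across instances that start concurrently.

    conn: DB-API connection to PostgreSQL with an open transaction (assumption).
    current_revision: callable(conn) returning the revision stored in the db.
    target_revision: the revision expected by the deployed code.
    upgrade: callable(conn) applying the migration inside the same transaction.
    Returns True if this instance migrated, False if the migration was skipped.
    """
    try:
        with closing(conn.cursor()) as cur:
            # Blocks while another transaction holds the same key; the lock is
            # released automatically when this transaction commits or rolls back.
            cur.execute("SELECT pg_advisory_xact_lock(%s)", (MIGRATION_LOCK_KEY,))
        # Check the revision only after the lock is held: a waiting instance
        # then sees the result of the instance that migrated first.
        if current_revision(conn) == target_revision:
            conn.commit()
            return False
        upgrade(conn)
        conn.commit()
        return True
    except Exception:
        conn.rollback()
        raise
```

Because the lock is transaction-level, a container that crashes in the middle of the migration cannot leave it held: the rollback releases it and the next waiting container takes over.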

(migrated from https://bbpteam.epfl.ch/project/issues/browse/NSETM-2332)

@GianlucaFicarelli GianlucaFicarelli self-assigned this Nov 28, 2024
GianlucaFicarelli commented

  1. Alembic migration, using an exclusive transaction-level advisory lock: Acquire exclusive lock during db migration #50
  2. Job charger tasks, using a lock on the task row in the task_registry table: Prevent charger tasks from running concurrently #49
  3. Queue consumer tasks: a lock shouldn't be needed, since the data are retrieved from the queues, and SQS ensures that the messages are processed in order for each group (if using appropriate message group IDs). See:
    1. From https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/using-messagegroupid-property.html: "MessageGroupId is the tag that specifies that a message belongs to a specific message group. Messages that belong to the same message group are always processed one by one, in a strict order relative to the message group (however, messages that belong to different message groups might be processed out of order)."
    2. From https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/interleaving-multiple-ordered-message-groups.html: "To interleave multiple ordered message groups within a single FIFO queue, use message group ID values (for example, session data for multiple users). In this scenario, multiple consumers can process the queue, but the session data of each user is processed in a FIFO manner. When messages that belong to a particular message group ID are invisible, no other consumer can process messages with the same message group ID."
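
To make point 3 more concrete, here is a minimal sketch of how a producer might set the message group ID when sending to a FIFO queue. It only builds the keyword arguments for the SQS `SendMessage` call, so it runs without AWS access. The function name, the choice of group ID (for example, one group per job) and the body-hash deduplication ID are assumptions made for the example, not something taken from the service.

```python
import hashlib
import json


def fifo_message_kwargs(queue_url, payload, group_id):
    """Build keyword arguments for SendMessage on an SQS FIFO queue.

    group_id is hypothetical: messages sharing it are delivered in order and
    are not handed to a second consumer while one of them is still in flight.
    """
    body = json.dumps(payload, sort_keys=True)
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        "MessageGroupId": str(group_id),
        # Required unless content-based deduplication is enabled on the queue.
        # Hashing the body is an assumption: identical payloads sent within the
        # deduplication interval would be treated as duplicates.
        "MessageDeduplicationId": hashlib.sha256(body.encode("utf-8")).hexdigest(),
    }
```

With boto3 this could be used as `boto3.client("sqs").send_message(**fifo_message_kwargs(url, payload, group_id))`. Using a group ID that identifies the entity whose events must stay ordered lets multiple instances consume the same queue while each group is still processed sequentially.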

So the requirements for 3 are:

 
