Parallelize event processing #705

Draft · tnavatar wants to merge 3 commits into master

Conversation

@tnavatar (Contributor) commented Nov 7, 2022

Updating Genegraph to allow parallel execution of events.

One component of this is changing the way Jena datasets are handled somewhat. Currently there is one singleton persistent Jena dataset shared by the entirety of a Genegraph instance. A better model is used by the RocksDB code in Genegraph, where there are multiple RocksDB instances, each used within a specific context. This code follows the second model and should facilitate moving towards it.

Jena datasets are enclosed in a record, which includes the configuration and state necessary to handle a queue and thread dedicated to processing asynchronous writes to the dataset. The queue is a blocking queue with a configurable size set at dataset initialization and a default of 100. This should allow efficient batching of writes, since the instantiation of a write transaction is somewhat expensive, while allowing back pressure in case database writes are a bottleneck (as is likely).

The main method for modifying the dataset is to issue a sequence of commands, each of which is a map containing the desired command (either replacing a named graph or removing one) as well as an optional promise to be delivered when the command has been committed to the dataset. This should also make it possible for a single event to write multiple named graphs.
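
For illustration, issuing a batch might look something like the sketch below. The exact command keywords (:replace-graph, :remove-graph), key names, and the execute-async entry point are assumptions based on this description and the diff excerpts, not the final API.

(let [done (promise)]
  ;; hypothetical command batch: replace one named graph, remove another,
  ;; and ask to be notified once the whole batch has been committed
  (execute-async dataset
                 [{:command :replace-graph
                   :graph-name "http://example.org/graphs/gene-1"
                   :model updated-model}
                  {:command :remove-graph
                   :graph-name "http://example.org/graphs/stale-data"
                   :promise done}])
  @done) ; blocks until the write thread commits the batch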

   :write-queue-size write-queue-size
   :write-queue (ArrayBlockingQueue. write-queue-size)}))]
(.start (Thread. #(write-loop persistent-dataset)))
Contributor

Concerned about how many threads are going to be started here. Wouldn't a configurable threadpool be more prudent?

Contributor Author

Should be just one thread per dataset. All the thread is doing is looking at the write queue and writing out the effects of the commands there to the dataset.
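
A minimal sketch of that single writer thread, assuming the queue holds the command batches handed to execute-async and that apply-command! is a hypothetical helper that writes one command's effect into the dataset:

(import '(java.util.concurrent TimeUnit)
        '(org.apache.jena.query ReadWrite))

(defn write-loop
  "Single consumer loop for one dataset: drain the write queue and apply
  each batch of commands inside one write transaction."
  [{:keys [dataset run-atom write-queue]}]
  (while @run-atom
    (when-let [commands (.poll write-queue 1000 TimeUnit/MILLISECONDS)]
      (.begin dataset ReadWrite/WRITE)
      (try
        (doseq [command commands]
          (apply-command! dataset command)) ; hypothetical per-command helper
        (.commit dataset)
        (finally (.end dataset)))
      ;; deliver any promises only after the batch has been committed
      (doseq [{:keys [promise]} commands
              :when promise]
        (deliver promise true)))))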

Contributor

So what would be the delineation of a dataset? What datasets are you thinking?

Contributor Author

As it stands now, there might be just one, the one we define in genegraph.database.instance. This would leave the flexibility to have more if needed, much in the same way we have a few rocks-db instances per running Genegraph. Perhaps the data aggregations necessary to put together the data for ClinVar could happen in their own dataset, for instance.

(ns genegraph.database.dataset
  "Namespace for handling operations on persistent Jena datasets.
  Specifically designed around handling asynchronous writes."
  (:require [clojure.spec.alpha :as spec]
Contributor

This doesn't appear to be used currently.

Contributor Author

Right now my plan is to use spec to describe and validate the expected parameters for dataset commands and the parameters for opening a dataset. Have not put that in yet though...
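
A sketch of what those specs could look like, reusing the spec alias from the ns form above; the command names and map keys here are guesses based on the PR description, not the actual schema:

;; hypothetical specs -- key and command names are assumptions
(spec/def ::command #{:replace-graph :remove-graph})
(spec/def ::graph-name string?)
(spec/def ::model (partial instance? org.apache.jena.rdf.model.Model))
(spec/def ::promise (partial instance? clojure.lang.IPending))

(spec/def ::dataset-command
  (spec/keys :req-un [::command ::graph-name]
             :opt-un [::model ::promise]))

(spec/def ::dataset-commands (spec/coll-of ::dataset-command))

;; validate a batch before it goes on the write queue
(spec/valid? ::dataset-commands
             [{:command :remove-graph
               :graph-name "http://example.org/graphs/stale-data"}])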

[org.apache.jena.query.text TextDatasetFactory]))


(defrecord PersistentDataset [dataset
Contributor

I know defrecord doesn't take a doc string, but it would be nice to describe the args here

Contributor Author

Yup, per the above comment, I'd like to use spec for this.


(defn execute-async
"execute command list asynchronously"
[{:keys [run-atom write-queue] :as dataset} commands]
Contributor

dataset not referenced

Contributor Author

Fair, I ought to clean that up.
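
Something like the following would do it; this is only a sketch, and the body shown is one plausible shape based on the blocking-queue description above rather than the actual implementation:

(defn execute-async
  "Execute a sequence of dataset commands asynchronously.
  Blocks if the write queue is full, providing back pressure."
  [{:keys [run-atom write-queue]} commands]
  (when @run-atom
    (.put write-queue commands)))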

@toneillbroad (Contributor) left a comment

I know this is a draft and some of my comments might be picky.
I think you should have both Tom and Kyle weigh in.

@larrybabb (Contributor)

@tnavatar should this be changed so others can "review"? If not, can we move this back to "in progress" or "backlog"?
