Parallelize event processing #705
base: master
Conversation
:write-queue-size write-queue-size
:write-queue (ArrayBlockingQueue.
              write-queue-size)}))]
(.start (Thread. #(write-loop persistent-dataset)))
Concerned about how many threads are going to be started here. Wouldn't a configurable threadpool be more prudent?
Should be just one thread per dataset. All the thread is doing is looking at the write queue and writing out the effects of the commands there to the dataset.
So what would be the delineation of a dataset? What datasets are you thinking?
As it stands now, there might be just one, the one we define in genegraph.database.instance. This would leave the flexibility to have more if needed, much in the same way we have a few rocks-db instances per running Genegraph. Perhaps the data aggregations necessary to put together the data for ClinVar could happen in their own dataset, for instance.
(ns genegraph.database.dataset
  "Namespace for handling operations on persistent Jena datasets.
  Specifically designed around handling asynchronous writes."
  (:require [clojure.spec.alpha :as spec]
This doesn't appear to be used currently.
Right now my plan is to use spec to describe and validate the expected parameters for dataset commands and the parameters for opening a dataset. I haven't put that in yet, though...
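For what it's worth, a minimal sketch of what that spec coverage could look like, assuming the command shape described in the PR summary (the keyword names here are guesses, not the actual implementation):

```clojure
(require '[clojure.spec.alpha :as spec])

;; A dataset command: an operation keyword, the target named graph,
;; and an optional promise delivered when the command is committed.
;; All names below are illustrative assumptions.
(spec/def :dataset/command #{:replace :remove})
(spec/def :dataset/graph-name string?)
(spec/def :dataset/promise any?)
(spec/def :dataset/dataset-command
  (spec/keys :req-un [:dataset/command :dataset/graph-name]
             :opt-un [:dataset/promise]))

(spec/valid? :dataset/dataset-command
             {:command :replace
              :graph-name "http://example.org/graph/g1"})
;; => true
```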
[org.apache.jena.query.text TextDatasetFactory]))
(defrecord PersistentDataset [dataset |
I know defrecord doesn't take a doc string, but it would be nice to describe the args here
Yup, per above comment would like to use spec for this.
(defn execute-async
  "Execute command list asynchronously."
  [{:keys [run-atom write-queue] :as dataset} commands]
dataset not referenced
Fair, I ought to clean that up.
I know this is a draft and some of my comments might be picky.
I think you should have both Tom and Kyle weigh in.
@tnavatar should this be changed so others can "review"? If not, can we move this back to "in progress" or "backlog"?
Updating Genegraph to allow parallel execution of events.
One component of this is changing the way Jena datasets are handled somewhat. Currently there is one singleton persistent Jena dataset shared by the entirety of a Genegraph instance. A better model is used by the RocksDB code in Genegraph, where there are multiple RocksDB instances, each used within a specific context. This code follows the second model and should facilitate moving towards it.
Jena datasets are enclosed in a record, which includes the configuration and state necessary to handle a queue and thread dedicated to processing asynchronous writes to the dataset. The queue is a blocking queue with a configurable size set at dataset initialization and a default of 100. This should allow efficient batching of writes, since the instantiation of a write transaction is somewhat expensive, while allowing back pressure in case database writes are a bottleneck (as is likely).
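A rough sketch of the batching behavior described above, under stated assumptions: the function and key names are illustrative rather than the PR's actual code, and `apply-commands!` is only a placeholder for the real work of opening a single Jena write transaction, applying every command's effect, and committing:

```clojure
(import '[java.util ArrayList]
        '[java.util.concurrent ArrayBlockingQueue TimeUnit])

(defn apply-commands!
  "Placeholder (hypothetical): the real version would open one write
  transaction, apply each command's effect, commit, then deliver any
  attached promises."
  [dataset batch]
  (run! #(some-> (:promise %) (deliver true)) batch))

(defn process-next-batch!
  "Block briefly for one command, then drain whatever else is queued so
  the whole batch shares a single (expensive) write transaction.
  Returns the number of commands applied, 0 on timeout."
  [{:keys [write-queue] :as dataset}]
  (if-let [head (.poll write-queue 100 TimeUnit/MILLISECONDS)]
    (let [batch (ArrayList. [head])]
      (.drainTo write-queue batch) ; grab the rest without blocking
      (apply-commands! dataset batch)
      (count batch))
    0))

(defn write-loop
  "Run until run-atom is flipped to false."
  [{:keys [run-atom] :as dataset}]
  (while @run-atom
    (process-next-batch! dataset)))
```

The loop would then be started on a dedicated thread per dataset, as in the diff: `(.start (Thread. #(write-loop persistent-dataset)))`.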
The main method for modifying the dataset is to issue a sequence of commands, each of which is a map with the desired command (either replacing a named graph or removing one), as well as an optional promise to be delivered when the command has been committed to the dataset. This should also make it straightforward for a single event to write multiple named graphs.
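A data-shape sketch of such a command sequence, assuming key names (`:command`, `:graph-name`, `:promise`) inferred from the description rather than taken from the diff:

```clojure
;; Hypothetical command sequence for one event touching two named graphs.
(def done (promise))

(def commands
  [{:command :replace ; install/overwrite a named graph
    :graph-name "http://example.org/graph/event-1"}
   {:command :remove ; drop a named graph
    :graph-name "http://example.org/graph/stale"
    :promise done}]) ; delivered once the batch is committed

;; One event can thus modify multiple named graphs in a single sequence.
(assert (= 2 (count commands)))
(assert (every? :graph-name commands))
```

After handing `commands` to `execute-async`, a caller that needs write-through semantics can simply deref the promise to block until the commit lands.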