Parallelize event processing #705

Draft · tnavatar wants to merge 3 commits into master

Conversation

@tnavatar (Contributor) commented Nov 7, 2022

Updating Genegraph to allow parallel execution of events.

One component of this is changing the way Jena datasets are handled somewhat. Currently there is one singleton persistent Jena dataset shared by the entirety of a Genegraph instance. A better model is used by the RocksDB code in Genegraph, where there are multiple RocksDB instances, each used within a specific context. This code follows the second model and should facilitate moving towards it.

Jena datasets are enclosed in a record, which includes the configuration and state necessary to handle a queue and thread dedicated to processing asynchronous writes to the dataset. The queue is a blocking queue with a configurable size set at dataset initialization and a default of 100. This should allow efficient batching of writes, since the instantiation of a write transaction is somewhat expensive, while allowing back pressure in case database writes are a bottleneck (as is likely).

The main method for modifying the dataset is to issue a sequence of commands, each of which is a map containing the desired command (either replacing a named graph or removing one) as well as an optional promise to be delivered when the command has been committed to the dataset. This should also make it possible for a single event to write multiple named graphs.
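
For illustration, issuing a batch might look something like the sketch below. The exact command keywords (:replace-graph, :remove-graph), key names, and the execute-async entry point are assumptions based on this description and the diff excerpts, not the final API.

(let [done (promise)]
  ;; hypothetical command batch: replace one named graph, remove another,
  ;; and ask to be notified once the whole batch has been committed
  (execute-async dataset
                 [{:command :replace-graph
                   :graph-name "http://example.org/graphs/gene-1"
                   :model updated-model}
                  {:command :remove-graph
                   :graph-name "http://example.org/graphs/stale-data"
                   :promise done}])
  @done) ; blocks until the write thread commits the batch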

   :write-queue-size write-queue-size
   :write-queue (ArrayBlockingQueue. write-queue-size)}))]
(.start (Thread. #(write-loop persistent-dataset)))
Contributor

Concerned about how many threads are going to be started here. Wouldn't a configurable threadpool be more prudent?

Contributor Author

Should be just one thread per dataset. All the thread is doing is looking at the write queue and writing out the effects of the commands there to the dataset.
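
A minimal sketch of that single writer thread, assuming the queue holds the command batches handed to execute-async and that apply-command! is a hypothetical helper that writes one command's effect into the dataset:

(import '(java.util.concurrent TimeUnit)
        '(org.apache.jena.query ReadWrite))

(defn write-loop
  "Single consumer loop for one dataset: drain the write queue and apply
  each batch of commands inside one write transaction."
  [{:keys [dataset run-atom write-queue]}]
  (while @run-atom
    (when-let [commands (.poll write-queue 1000 TimeUnit/MILLISECONDS)]
      (.begin dataset ReadWrite/WRITE)
      (try
        (doseq [command commands]
          (apply-command! dataset command)) ; hypothetical per-command helper
        (.commit dataset)
        (finally (.end dataset)))
      ;; deliver any promises only after the batch has been committed
      (doseq [{:keys [promise]} commands
              :when promise]
        (deliver promise true)))))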

Contributor

So what would be the delineation of a dataset? What datasets are you thinking?

Contributor Author

As it stands now, there might be just one, the one we define in genegraph.database.instance. This would leave the flexibility to have more if needed, much in the same way we have a few rocks-db instances per running Genegraph. Perhaps the data aggregations necessary to put together the data for ClinVar could happen in their own dataset, for instance.

(ns genegraph.database.dataset
  "Namespace for handling operations on persistent Jena datasets.
  Specifically designed around handling asynchronous writes."
  (:require [clojure.spec.alpha :as spec]
Contributor

This doesn't appear to be used currently.

Contributor Author

Right now my plan is to use spec to describe and validate the expected parameters for dataset commands and the parameters for opening a dataset. Have not put that in yet though...
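
A sketch of what those specs could look like, reusing the spec alias from the ns form above; the command names and map keys here are guesses based on the PR description, not the actual schema:

;; hypothetical specs -- key and command names are assumptions
(spec/def ::command #{:replace-graph :remove-graph})
(spec/def ::graph-name string?)
(spec/def ::model (partial instance? org.apache.jena.rdf.model.Model))
(spec/def ::promise (partial instance? clojure.lang.IPending))

(spec/def ::dataset-command
  (spec/keys :req-un [::command ::graph-name]
             :opt-un [::model ::promise]))

(spec/def ::dataset-commands (spec/coll-of ::dataset-command))

;; validate a batch before it goes on the write queue
(spec/valid? ::dataset-commands
             [{:command :remove-graph
               :graph-name "http://example.org/graphs/stale-data"}])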

[org.apache.jena.query.text TextDatasetFactory]))


(defrecord PersistentDataset [dataset
Contributor

I know defrecord doesn't take a doc string, but it would be nice to describe the args here

Contributor Author

Yup, per the above comment, I'd like to use spec for this.


(defn execute-async
"execute command list asynchronously"
[{:keys [run-atom write-queue] :as dataset} commands]
Contributor

dataset not referenced

Contributor Author

Fair, I ought to clean that up.
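
Something like the following would do it; this is only a sketch, and the body shown is one plausible shape based on the blocking-queue description above rather than the actual implementation:

(defn execute-async
  "Execute a sequence of dataset commands asynchronously.
  Blocks if the write queue is full, providing back pressure."
  [{:keys [run-atom write-queue]} commands]
  (when @run-atom
    (.put write-queue commands)))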

@toneillbroad (Contributor) left a comment

I know this is a draft and some of my comments might be picky.
I think you should have both Tom and Kyle weigh in.

@larrybabb (Contributor)

@tnavatar should this be changed so others can "review"? If not, can we move this back to "in progress" or "backlog"?
