update Broker javadoc

conveyal · Sep 20, 2023 · 9195be0 · 9195be0
1 parent 587bdb5
commit 9195be0
Showing 1 changed file with 29 additions and 24 deletions.
diff --git a/src/main/java/com/conveyal/analysis/components/broker/Broker.java b/src/main/java/com/conveyal/analysis/components/broker/Broker.java
@@ -48,36 +48,41 @@
 /**
  * This class distributes the tasks making up regional jobs to workers.
  * <p>
- * It should aim to draw tasks fairly from all organizations, and fairly from all jobs within each
- * organization, while attempting to respect the transport network affinity of each worker, giving
- * the worker tasks that require the same network it has been using recently.
+ * It respects the declared transport network affinity of each worker, giving the worker tasks that
+ * relate to the same network it has been using recently, and is therefore already loaded in memory.
  * <p>
- * Previously workers long-polled for work, holding lots of connections open. Now they short-poll
- * and sleep for a while if there's no work. This is simpler and allows us to work withing much more
- * standard HTTP frameworks.
+ * In our initial design, workers long-polled for work, holding lots of connections open. This was
+ * soon revised to short-poll and sleep for a while when there's no work. This was found to be
+ * simpler, allowing use of standard HTTP frameworks instead of custom networking code.
  * <p>
- * The fact that workers continuously re-poll for work every 10-30 seconds serves as a signal to the
- * broker that they are still alive and waiting. This also allows the broker to maintain a catalog
- * of active workers.
+ * Issue conveyal/r5#596 arose because we'd only poll when a worker was running low on tasks or
+ * idling. A worker could disappear for a long time, leaving the backend to assume it had shut down
+ * or crashed. Workers were revised to poll more frequently even when they were busy and didn't
+ * need any new tasks to work on, providing a signal to the broker that they are still alive and
+ * functioning. This allows the broker to maintain a more accurate catalog of active workers.
  * <p>
- * Because (at least currently) two organizations never share the same graph, we can get by with
- * pulling tasks cyclically or randomly from all the jobs, and actively shape the number of workers
- * with affinity for each graph by forcing some of them to accept tasks on graphs other than the one
- * they have declared affinity for.
+ * Most methods on this class are synchronized because they can be called from many HTTP handler
+ * threads at once (when many workers are polling at once). We should occasionally evaluate whether
+ * synchronizing all methods to make this threadsafe is a performance issue. If so, fine-grained
+ * locking may be advantageous, but as a rule it is much harder to design, test, and maintain.
  * <p>
- * This could be thought of as "affinity homeostasis". We  will constantly keep track of the ideal
+ * Workers were originally intended to migrate from one network to another to handle subsequent jobs
+ * without waiting for more cloud compute instances to start up. In practice we currently assign
+ * each worker a single network, but the balance of workers assigned to each network and the reuse
+ * of workers could in principle be made more sophisticated. The remainder of the comments below
+ * provide context for how this could be refined or improved.
+ *
+ * Because (at least currently) two organizations never share the same graph, we could get by with
+ * pulling tasks cyclically or randomly from all the jobs, and could actively shape the number of
+ * workers with affinity for each graph by forcing some of them to accept tasks on graphs other than
+ * the one they have declared affinity for. If the pool of workers was allowed to grow very large,
+ * we could aim to draw tasks fairly from all organizations, and fairly from all jobs within each
+ * organization.
+ * <p>
+ * We have described this approach as "affinity homeostasis": constantly keep track of the ideal
  * proportion of workers by graph (based on active jobs), and the true proportion of consumers by
  * graph (based on incoming polling), then we can decide when a worker's graph affinity should be
- * ignored and what it should be forced to.
- * <p>
- * It may also be helpful to mark jobs every time they are skipped in the LRU queue. Each time a job
- * is serviced, it is taken out of the queue and put at its end. Jobs that have not been serviced
- * float to the top.
- * <p>
- * Most methods on this class are synchronized, because they can be called from many HTTP handler
- * threads at once.
- *
- * TODO evaluate whether synchronizing all methods to make this threadsafe is a performance issue.
+ * ignored and what graph it should be forced to.
  */
 public class Broker implements Component {