Refactor `Storm` platform to introduce `Edge`s #728

ttim · 2017-06-05T07:43:09Z

In order to introduce grouped leftJoin, I would like to make the tuples we send between nodes more precise. In particular, I want to send (K, V) tuples as tuples with separate key and value fields.

In this PR I did a small refactoring to introduce an Edge trait, which corresponds to edges in Storm's topology DAG. It describes how to serialize/deserialize data into Fields and how to group data over this edge.

pankajroark

Quick first pass is all about comments. Can you also explain the reason for the change, i.e. how it helps with implementing grouped join?

pankajroark · 2017-06-05T16:09:07Z

summingbird-storm/src/main/scala/com/twitter/summingbird/storm/Edge.scala

+import scala.collection.{ Map => CMap }
+import scala.util.Try
+
+sealed trait Edge[T] {


Doc comments for trait and every public field and method.

pankajroark · 2017-06-05T16:22:33Z

summingbird-storm/src/main/scala/com/twitter/summingbird/storm/BaseBolt.scala

    ackOnEntry: AckOnEntry,
    maxExecutePerSec: MaxExecutePerSecond,
-    decoder: Injection[I, JList[AnyRef]],
-    encoder: Injection[O, JList[AnyRef]],
+    inputEdges: Map[String, Edge[I]],


Add docs for all the params. I know the existing code doesn't have any, but it's a good time to add them. What does the string represent? Why are there multiple incoming edges but only one outgoing edge?

Thanks, the comment looks very useful.

pankajroark · 2017-06-05T16:25:01Z

summingbird-storm/src/main/scala/com/twitter/summingbird/storm/BaseBolt.scala

@@ -192,6 +192,12 @@ case class BaseBolt[I, O](jobID: JobId,
    executor.cleanup
  }

+  def applyGroupings(declarer: BoltDeclarer): Unit = {


Doc comment.

ttim · 2017-06-05T19:02:09Z

Currently, when we send tuples over the wire, we send them in only two formats: as an item (just one value Field) or as something partially aggregated (aggKey and aggValue fields with (Int, Map[K, V]) as the content).

This breaks if we want to implement grouped join, because we need fields we can group by (so the first format is out) but we also don't want things partially aggregated (so the second is out).
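To make the formats concrete, here is a minimal sketch rather than the code from this PR. `Codec`, `Values`, and the use of `List[Any]` in place of `JList[AnyRef]` are assumptions for illustration, and the single-item field name `value` is a guess; `aggKey`, `aggValue`, `key` and `value` follow the wording above.

```scala
import scala.util.Try

object WireFormatSketch {
  // Stand-in for Storm tuple values; the PR itself uses JList[AnyRef].
  type Values = List[Any]

  // Hypothetical, simplified Injection: encoding always works, decoding may fail.
  trait Codec[T] {
    def fields: List[String]
    def encode(t: T): Values
    def decode(values: Values): Try[T]
  }

  // Existing format 1: one opaque item in a single field, so there is nothing to group by.
  def forItem[T]: Codec[T] = new Codec[T] {
    val fields = List("value")
    def encode(t: T): Values = List(t)
    def decode(values: Values): Try[T] = Try {
      values match {
        case List(v) => v.asInstanceOf[T]
        case other => throw new IllegalArgumentException(s"expected 1 field, got ${other.size}")
      }
    }
  }

  // Existing format 2: partially aggregated data as (shard, Map[K, V]).
  def forAggregated[K, V]: Codec[(Int, Map[K, V])] = new Codec[(Int, Map[K, V])] {
    val fields = List("aggKey", "aggValue")
    def encode(t: (Int, Map[K, V])): Values = List(t._1, t._2)
    def decode(values: Values): Try[(Int, Map[K, V])] = Try {
      values match {
        case List(shard: Int, m: Map[_, _]) => (shard, m.asInstanceOf[Map[K, V]])
        case other => throw new IllegalArgumentException(s"unexpected tuple: $other")
      }
    }
  }

  // Format wanted for grouped join: key and value in separate fields, no aggregation.
  def forKeyValue[K, V]: Codec[(K, V)] = new Codec[(K, V)] {
    val fields = List("key", "value")
    def encode(kv: (K, V)): Values = List(kv._1, kv._2)
    def decode(values: Values): Try[(K, V)] = Try {
      values match {
        case List(k, v) => (k.asInstanceOf[K], v.asInstanceOf[V])
        case other => throw new IllegalArgumentException(s"expected 2 fields, got ${other.size}")
      }
    }
  }
}
```

Only the last codec exposes a `key` field without wrapping the value in a map, which is the combination grouped join needs.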

Also, right now we have runtime issues in topologies where flatMap and sumByKey run in parallel branches from the same flatMap node (see #725). This happens for the same reason: the flatMap node would have to emit aggregated and non-aggregated values at the same time.

I can see three ways to fix these issues:

1. Simplest: send all key-value pairs in the aggregated format, even if they are not supposed to be aggregated.
   - Pros: all bolts have a simple contract with one input and one output format; simplest implementation.
   - Cons: each (K, V) pair carries some size overhead (the shardId Int, two tuple objects) and some performance overhead (though I don't think it's really substantial).
2. Medium: every bolt sends only one type of tuple (regardless of downstream), but the receiving side can distinguish formats based on the input component. We need the second part for the following use case: imagine a flatMap that emits to both a sumByKey and another flatMap. In this case the only way to make the topology correct is to emit single-element aggregated (K, V) from the source flatMap, treating them as plain (K, V) on the downstream flatMap node and as aggregated (K, V) on the downstream sumByKey node.
   - Pros: the contract is more or less simple; no size or runtime overhead for cases that work today.
   - Cons: still some overhead in some cases; not that simple to implement. (I'm pursuing this one, which is why I have Map[String, Edge[]] as the input type for Bolt.)
3. Most complex and precise: implement different strategies that depend on both the sending and the receiving side.
   - Pros: smallest possible runtime overhead (especially if we implement a customizable OperationContainer that can both do and skip partial aggregation on the same node, depending on downstream).
   - Cons: involves the biggest change, especially in BaseBolt, to support customized emits based on the target componentId.
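A rough sketch of how the second option could work on the receiving side, under simplifying assumptions: the bolt keeps one decoder per sending component and picks it by the name of the component a tuple came from. The `Edge` trait here is reduced to decoding only, and `KeyValue`, `SingleAggregated`, `receive` and the component names are hypothetical, not taken from the PR.

```scala
object PerSourceEdgeSketch {
  // Stand-in for Storm tuple values; the PR itself uses JList[AnyRef].
  type Values = List[Any]

  // Hypothetical, decode-only version of an edge.
  sealed trait Edge[T] { def decode(values: Values): Option[T] }

  object Edge {
    // The sender emits plain key and value fields.
    final case class KeyValue[K, V]() extends Edge[(K, V)] {
      def decode(values: Values): Option[(K, V)] = values match {
        case List(k, v) => Some((k.asInstanceOf[K], v.asInstanceOf[V]))
        case _ => None
      }
    }

    // The sender emits the aggregated format with exactly one entry; unwrap it to (K, V).
    final case class SingleAggregated[K, V]() extends Edge[(K, V)] {
      def decode(values: Values): Option[(K, V)] = values match {
        case List(_, m: Map[_, _]) if m.size == 1 =>
          val (k, v) = m.head
          Some((k.asInstanceOf[K], v.asInstanceOf[V]))
        case _ => None
      }
    }
  }

  // Look up the edge by the sending component, then decode with it.
  def receive[I](inputEdges: Map[String, Edge[I]], source: String, values: Values): Option[I] =
    inputEdges.get(source).flatMap(_.decode(values))

  // A downstream flatMap fed by two differently formatted senders.
  val inputEdges: Map[String, Edge[(String, Int)]] = Map(
    "plainSender" -> Edge.KeyValue[String, Int](),
    "aggregatingSender" -> Edge.SingleAggregated[String, Int]()
  )
}
```

With this shape the sender never needs to know who consumes its tuples; each consumer interprets the one format the sender uses.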

Which one is the best? What do you think, @johnynek @pankajroark?

johnynek · 2017-06-05T19:03:05Z

summingbird-storm/src/main/scala/com/twitter/summingbird/storm/Edge.scala

+
+sealed trait EdgeGrouping {
+  def apply(declarer: BoltDeclarer, parentName: String): Unit
+}


Can we put this in a separate file, EdgeGrouping.scala?

And can we use the style of nesting the instances in the companion object, so they don't clutter the namespace:

object EdgeGrouping {
  case object Shuffle extends EdgeGrouping
  case object LocalOrShuffle extends EdgeGrouping
}

Done, looks way better!
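For illustration, the nested layout might look roughly like the sketch below. `RecordingDeclarer` is a hypothetical stand-in for Storm's `BoltDeclarer` that only records calls, and the `Fields` case is an assumption based on the `fieldsGrouping` call visible in the diff; the real implementation may differ.

```scala
import scala.collection.mutable.ListBuffer

object EdgeGroupingSketch {
  // Hypothetical stand-in for Storm's BoltDeclarer: it records the groupings requested.
  final class RecordingDeclarer {
    private val buffer = ListBuffer.empty[String]
    def shuffleGrouping(parent: String): Unit = buffer += s"shuffle($parent)"
    def localOrShuffleGrouping(parent: String): Unit = buffer += s"localOrShuffle($parent)"
    def fieldsGrouping(parent: String, fields: List[String]): Unit =
      buffer += s"fields($parent, ${fields.mkString(",")})"
    def calls: List[String] = buffer.toList
  }

  sealed trait EdgeGrouping {
    // Implementors register exactly one grouping on `declarer` for tuples sent by `parentName`.
    def apply(declarer: RecordingDeclarer, parentName: String): Unit
  }

  // Instances live in the companion, so callers write EdgeGrouping.Shuffle and so on.
  object EdgeGrouping {
    case object Shuffle extends EdgeGrouping {
      def apply(declarer: RecordingDeclarer, parentName: String): Unit =
        declarer.shuffleGrouping(parentName)
    }
    case object LocalOrShuffle extends EdgeGrouping {
      def apply(declarer: RecordingDeclarer, parentName: String): Unit =
        declarer.localOrShuffleGrouping(parentName)
    }
    final case class Fields(fields: List[String]) extends EdgeGrouping {
      def apply(declarer: RecordingDeclarer, parentName: String): Unit =
        declarer.fieldsGrouping(parentName, fields)
    }
  }
}
```

Because the trait is sealed and the instances are nested, a `match` over `EdgeGrouping` can be checked for exhaustiveness and the top-level package gains only one name.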

johnynek · 2017-06-05T19:04:00Z

summingbird-storm/src/main/scala/com/twitter/summingbird/storm/Edge.scala

+    declarer.fieldsGrouping(parentName, fields)
+}
+
+case class ItemEdge[T] private (edgeGrouping: EdgeGrouping) extends Edge[T] {


Similarly, can we put the instances of Edge in the object:

object Edge {
  case class Item[T]...
  case class AggregatedKeyValues[K, V]...
}

johnynek · 2017-06-05T19:06:57Z

summingbird-storm/src/main/scala/com/twitter/summingbird/storm/FlatMapBoltProvider.scala

@@ -162,4 +156,15 @@ case class FlatMapBoltProvider(storm: Storm, jobID: JobId, stormDag: Dag[Storm],
      case None => getIntermediateFMBolt[Any, Any].asInstanceOf[BaseBolt[Any, Any]]
    }
  }
+
+  private def inputEdges[Input](): Map[String, Edge[Input]] = {


Scala style is to declare methods with empty parentheses, foo(), only when there is a side effect. These don't have one, so you should write:

def inputEdges[Input]: Map[String, Edge[Input]] =
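A small sketch of the convention, with made-up names (`inputNames`, `recordCall`) that are not part of this PR: a pure accessor is declared and called without parentheses, while a method that changes state keeps them.

```scala
object ParenStyleSketch {
  private var calls = 0

  // Pure: no parentheses, so the call site reads like a value.
  def inputNames: List[String] = List("a", "b")

  // Side-effecting: the empty parentheses signal that calling it does something.
  def recordCall(): Int = {
    calls += 1
    calls
  }
}
```

At the call site this gives `ParenStyleSketch.inputNames` versus `ParenStyleSketch.recordCall()`, so a reader can tell from the syntax alone which call mutates state.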

johnynek · 2017-06-05T19:07:46Z

summingbird-storm/src/main/scala/com/twitter/summingbird/storm/Edge.scala

+    }
+  }
+
+  def forKeyValue[K, V](): Injection[(K, V), JList[AnyRef]] = new Injection[(K, V), JList[AnyRef]] {


just def forKeyValue[K, V]: Injection...

johnynek · 2017-06-05T19:08:03Z

summingbird-storm/src/main/scala/com/twitter/summingbird/storm/Edge.scala

+}
+
+object EdgeInjections {
+  def forItem[T](): Injection[T, JList[AnyRef]] = new Injection[T, JList[AnyRef]] {


just forItem[T]: Injection... no ()

ttim · 2017-06-06T01:53:25Z

@pankajroark added docs and explained motivation above.

I think it makes sense to use the second approach for now, because I realized we need some customization of the input side anyway: for a flatMap node that emits to both a flatMap and a sumByKey, we want shuffle grouping in one case and fields grouping in the other.

codecov-io · 2017-06-06T03:41:47Z

Codecov Report

Merging #728 into develop will increase coverage by 0.02%.
The diff coverage is 93.75%.

@@            Coverage Diff             @@
##           develop    #728      +/-   ##
==========================================
+ Coverage    71.38%   71.4%   +0.02%     
==========================================
  Files          141     142       +1     
  Lines         3491    3504      +13     
  Branches       195     197       +2     
==========================================
+ Hits          2492    2502      +10     
- Misses         999    1002       +3

| Impacted Files | Coverage Δ | |
|---|---|---|
| ...cala/com/twitter/summingbird/storm/Constants.scala | `100% <ø> (ø)` | ⬆️ |
| .../com/twitter/summingbird/storm/StormPlatform.scala | `82.9% <100%> (+0.55%)` | ⬆️ |
| ...witter/summingbird/storm/spout/KeyValueSpout.scala | `78.94% <100%> (+1.16%)` | ⬆️ |
| ...scala/com/twitter/summingbird/storm/EdgeType.scala | `100% <100%> (ø)` | |
| ...a/com/twitter/summingbird/storm/EdgeGrouping.scala | `66.66% <66.66%> (ø)` | |
| ...scala/com/twitter/summingbird/storm/BaseBolt.scala | `77.01% <75%> (-0.1%)` | ⬇️ |
| ...witter/summingbird/storm/FlatMapBoltProvider.scala | `98.55% <92.3%> (-1.45%)` | ⬇️ |
| .../main/scala/com/twitter/summingbird/Producer.scala | `75.75% <0%> (-1.52%)` | ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 731da7f...50ac628.

johnynek

I'm really positive about this change. Just a couple of questions for clarification.

johnynek · 2017-06-06T04:10:20Z

summingbird-storm/src/main/scala/com/twitter/summingbird/storm/FlatMapBoltProvider.scala

-      new SingleItemInjection[ExecutorInput],
-      new SingleItemInjection[ExecutorOutput],
+      inputEdges[ExecutorInput],
+      // Output edge's grouping isn't important for now.


Can you comment on why it does not matter what the output grouping is?

It's explained at the declaration site, but more importantly, I expect that to change in a subsequent review.

johnynek · 2017-06-06T04:11:56Z

summingbird-storm/src/main/scala/com/twitter/summingbird/storm/FlatMapBoltProvider.scala

@@ -162,4 +156,15 @@ case class FlatMapBoltProvider(storm: Storm, jobID: JobId, stormDag: Dag[Storm],
      case None => getIntermediateFMBolt[Any, Any].asInstanceOf[BaseBolt[Any, Any]]
    }
  }
+
+  private def inputEdges[Input]: Map[String, Edge[Input]] = {


Seems like this should be passed in on construction from the nodes it depends on. Is that something we will get to? A win, it seems to me, would be to share the same Edge instances between the output at one level and the input at the next, which could reduce the chance of error there.

Yes, I want to implement this logic the way you describe: use the same Edge instances for input and output, with a double check of edge compatibility at topology construction time (that's why I left shardsCount, for example, in the grouping declaration).

johnynek · 2017-06-06T04:12:38Z

summingbird-storm/src/main/scala/com/twitter/summingbird/storm/StormPlatform.scala

-        new KeyValueInjection[Int, CMap[ExecutorKeyType, ExecutorValueType]],
-        new SingleItemInjection[ExecutorOutputType],
+        inputEdges,
+        // Output edge's grouping isn't important for now.


can you comment why?

Same as above.

pankajroark · 2017-06-07T15:57:26Z

On strategies: the second strategy, the one you're pursuing, does seem like the right one to me. I'm still reviewing the code.

pankajroark

Overall looks like a great refactoring that simplifies code. I really like that we pin down Edges and EdgeGrouping as concrete concepts.

pankajroark · 2017-06-07T15:58:45Z

summingbird-storm/src/main/scala/com/twitter/summingbird/storm/BaseBolt.scala

+  *                     false otherwise.
+  * @param hasDependants does this node have any downstream nodes?
+  * @param ackOnEntry ack tuples in the beginning of processing.
+  * @param maxExecutePerSec limits number of executes per second, will block processing thread after.


Can we add: "Used for rate limiting."

pankajroark · 2017-06-07T15:59:52Z

summingbird-storm/src/main/scala/com/twitter/summingbird/storm/BaseBolt.scala

+  * @param ackOnEntry ack tuples in the beginning of processing.
+  * @param maxExecutePerSec limits number of executes per second, will block processing thread after.
+  * @param inputEdges is a map from name of downstream node to `Edge` from it.
+  * @param outputEdge is an edge from this node. To be precise there are number of output edges,


We should document this design somewhere. The package object is a common place for documenting the design.

Let's leave that for a subsequent PR; there will be a clearer place for it later.

pankajroark · 2017-06-07T16:00:34Z

summingbird-storm/src/main/scala/com/twitter/summingbird/storm/BaseBolt.scala

    ackOnEntry: AckOnEntry,
    maxExecutePerSec: MaxExecutePerSecond,
-    decoder: Injection[I, JList[AnyRef]],
-    encoder: Injection[O, JList[AnyRef]],
+    inputEdges: Map[String, Edge[I]],


Thanks, the comment looks very useful.

pankajroark · 2017-06-07T16:05:13Z

summingbird-storm/src/main/scala/com/twitter/summingbird/storm/Edge.scala

+
+  def forKeyValue[K, V]: Injection[(K, V), JList[AnyRef]] = new Injection[(K, V), JList[AnyRef]] {
+    override def apply(item: (K, V)): JAList[AnyRef] = {
+      val (key, v) = item


nit: Why is key spelled out but value abbreviated to v? Better to be consistent.

pankajroark · 2017-06-07T16:08:07Z

summingbird-storm/src/main/scala/com/twitter/summingbird/storm/EdgeGrouping.scala

+  * This trait is used to represent different grouping strategies in `Storm`.
+  */
+sealed trait EdgeGrouping {
+  def apply(declarer: BoltDeclarer, parentName: String): Unit


Can you comment on what this does? For example, what do implementors of this trait need to do to conform to the protocol?

pankajroark · 2017-06-07T21:02:39Z

lgtm

ttim · 2017-06-07T21:06:01Z

@johnynek are you good with this change? I'm preparing the next review, in which I make this outputEdge thing unnecessary.

johnynek

shipit!

Introduce Edge

27957b9

pankajroark reviewed Jun 5, 2017

johnynek suggested changes Jun 5, 2017

Timur Abishev added 2 commits June 5, 2017 13:50

Address Oscar's comments

e48da88

Add docs

2d02285

johnynek reviewed Jun 6, 2017

pankajroark approved these changes Jun 7, 2017

Timur Abishev added 2 commits June 7, 2017 12:51

Pankaj's comments

cab49ba

Rename Edge to EdgeType

89835ab

Timur Abishev added 2 commits June 7, 2017 14:03

Add comment on EdgeGrouping

f93f02c

Fix compilation error

50ac628

johnynek approved these changes Jun 8, 2017

ttim merged commit 6f7fcdd into twitter:develop Jun 8, 2017

ttim deleted the introduce_edges branch June 8, 2017 05:04
