-
Notifications
You must be signed in to change notification settings - Fork 267
Refactor Storm
platform to introduce Edge
s
#728
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick first pass is all about comments. Can you explain the reason for the change as well, i.e. how it helps with implementing grouped join.
import scala.collection.{ Map => CMap } | ||
import scala.util.Try | ||
|
||
sealed trait Edge[T] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doc comments for trait and every public field and method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
ackOnEntry: AckOnEntry, | ||
maxExecutePerSec: MaxExecutePerSecond, | ||
decoder: Injection[I, JList[AnyRef]], | ||
encoder: Injection[O, JList[AnyRef]], | ||
inputEdges: Map[String, Edge[I]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add docs for all the params, I know existing code doesn't have any, but it's a good time to add them. What does the string represent. Why are there multiple incoming edges but only one outgoing edge?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, the comment looks very useful.
@@ -192,6 +192,12 @@ case class BaseBolt[I, O](jobID: JobId, | |||
executor.cleanup | |||
} | |||
|
|||
def applyGroupings(declarer: BoltDeclarer): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doc comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Currently when we send tuples over the wire we send them only in two formats: as an item (then it's just one This breaks if we want to implement grouped join because we want to have fields we are able to group by (therefore first format is out) but also we don't want to have things partially aggregated (therefore second is out). Also right now we have runtime issues in topologies where we do I can see three ways to fix this issues:
Which one is the best, what do you think @johnynek @pankajroark ? |
|
||
sealed trait EdgeGrouping { | ||
def apply(declarer: BoltDeclarer, parentName: String): Unit | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we put this in a separate file: EdgeGrouping.scala
?
And can we use the style to put inner classes to not clutter the name space:
object EdgeGrouping {
case object Shuffle extends EdgeGrouping
case object LocalOrShuffle extends EdgeGrouping
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, looks way better!
declarer.fieldsGrouping(parentName, fields) | ||
} | ||
|
||
case class ItemEdge[T] private (edgeGrouping: EdgeGrouping) extends Edge[T] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similarly, can we put instances of Edge
in the object:
object Edge {
case class Item[T]...
case class AggregatedKeyValues[K, V]...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -162,4 +156,15 @@ case class FlatMapBoltProvider(storm: Storm, jobID: JobId, stormDag: Dag[Storm], | |||
case None => getIntermediateFMBolt[Any, Any].asInstanceOf[BaseBolt[Any, Any]] | |||
} | |||
} | |||
|
|||
private def inputEdges[Input](): Map[String, Edge[Input]] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scala style is to use unary methods foo():
when there is a side effect. These don't have that so you should do:
def inputEdges[Input]: Map[String, Edge[Input]] =
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
} | ||
} | ||
|
||
def forKeyValue[K, V](): Injection[(K, V), JList[AnyRef]] = new Injection[(K, V), JList[AnyRef]] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just def forKeyValue[K, V]: Injection...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
} | ||
|
||
object EdgeInjections { | ||
def forItem[T](): Injection[T, JList[AnyRef]] = new Injection[T, JList[AnyRef]] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just forItem[T]: Injection...
no ()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@pankajroark added docs and explained motivation above. I think it makes sense to use second approach for now, because I realized we need some customization of input side anyway - for |
Codecov Report
@@ Coverage Diff @@
## develop #728 +/- ##
==========================================
+ Coverage 71.38% 71.4% +0.02%
==========================================
Files 141 142 +1
Lines 3491 3504 +13
Branches 195 197 +2
==========================================
+ Hits 2492 2502 +10
- Misses 999 1002 +3
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm really positive about this change. Just a couple of questions for clarification.
new SingleItemInjection[ExecutorInput], | ||
new SingleItemInjection[ExecutorOutput], | ||
inputEdges[ExecutorInput], | ||
// Output edge's grouping isn't important for now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you comment why it odes not matter what the output grouping is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It explained in declaration site, but what is more important - I expect that to change in subsequent review.
@@ -162,4 +156,15 @@ case class FlatMapBoltProvider(storm: Storm, jobID: JobId, stormDag: Dag[Storm], | |||
case None => getIntermediateFMBolt[Any, Any].asInstanceOf[BaseBolt[Any, Any]] | |||
} | |||
} | |||
|
|||
private def inputEdges[Input]: Map[String, Edge[Input]] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like this should be passed in on construction from the nodes it depends on. Is that something we will get to? A win, seems to me, would be we share the same Edge
instances between the output at one level and then input at the next, and it could reduce the chance for error there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I want to make this logic in a way you said - use same Edge
instances for input and output with double check on edges compatibility at topology construction time (that's why I left shardsCount
for example in grouping declaration).
new KeyValueInjection[Int, CMap[ExecutorKeyType, ExecutorValueType]], | ||
new SingleItemInjection[ExecutorOutputType], | ||
inputEdges, | ||
// Output edge's grouping isn't important for now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you comment why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
On strategies: the second strategy, the one you're pursuing, does seem like the right one to me. I'm still reviewing the code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks like a great refactoring that simplifies code. I really like that we pin down Edges and EdgeGrouping as concrete concepts.
* false otherwise. | ||
* @param hasDependants does this node have any downstream nodes? | ||
* @param ackOnEntry ack tuples in the beginning of processing. | ||
* @param maxExecutePerSec limits number of executes per second, will block processing thread after. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add: "Used for rate limiting."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
* @param ackOnEntry ack tuples in the beginning of processing. | ||
* @param maxExecutePerSec limits number of executes per second, will block processing thread after. | ||
* @param inputEdges is a map from name of downstream node to `Edge` from it. | ||
* @param outputEdge is an edge from this node. To be precise there are number of output edges, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should document this design somewhere. The package object is a common place for documenting the desing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep it until subsequent PR, there will be clearer place for this later.
ackOnEntry: AckOnEntry, | ||
maxExecutePerSec: MaxExecutePerSecond, | ||
decoder: Injection[I, JList[AnyRef]], | ||
encoder: Injection[O, JList[AnyRef]], | ||
inputEdges: Map[String, Edge[I]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, the comment looks very useful.
|
||
def forKeyValue[K, V]: Injection[(K, V), JList[AnyRef]] = new Injection[(K, V), JList[AnyRef]] { | ||
override def apply(item: (K, V)): JAList[AnyRef] = { | ||
val (key, v) = item |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Why key expanded and not value? Better to be consistent.
* This trait is used to represent different grouping strategies in `Storm`. | ||
*/ | ||
sealed trait EdgeGrouping { | ||
def apply(declarer: BoltDeclarer, parentName: String): Unit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you comment what this does? Like what do the implementors of this trait need to do to confirm with the protocol.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
lgtm |
@johnynek are you good with this change? I'm preparing next review where I did this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shipit!
In order to introduce grouped
leftJoin
I would like to make tuples we send between nodes to be more precise. In particular I want to send(K, V)
tuples as tuples withkey
andvalue
fields.In this PR I did a small refactoring to introduce
Edge
trait which correspond to edges inStorm
s topology DAG - it contains a way how to serialize/deserialize data intoFields
and how to group data over this edge.