You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Both DList#groupBy and DList#groupWith differ only in that one of them abuses implicit resolution. One of them must be deleted.
A quick-fix is to delete DList#groupWith where the abuse is manifested, but there may be a more significant improvement available by resolving the underlying tension that compels such an abuse.
The text was updated successfully, but these errors were encountered:
Imho Grouping has been, and is, really useful, but perhaps it should be retired completely. It seems like it tries to do too much (e.g. (secondary) sorting), while not being able to do enough (it's can't use any information from the values, unlike hadoop's Partitioner) ..and its a little awkward to use (you just hope that keys/values in the subsequent parallel do obey your constraints.
So my proposal would be to just jack hadoops interface verbatim. (And use implicits to provide one for all types)
And DList[(K,V)].groupByKey require a Ordering[K] along with a Partioner[K, V]. We could provide an implicit that provides a Partioner[K,V] for all K's and V's. Which would quite elegantly cover the basic uses.
Then to support the crazy-ass stuff, I think we could just have a single function like:
DList[(K,V)] {
defcrazyDo[B]( // none of these implicit, since you really want to manually specify thempartioner: Partioner[K,V],
keyOrder: Ordering[K], // order the keys are presented to the reducergrouping: KeyCollapser[K], // more on this below
):GroupedDList[K, V]
}
Where KeyCollapser might be defined as: trait KeyCollapser[K]{ def shouldCollapse[K](a: K, b: K): Boolean }
which conceptually is the function that slides over the list of Keys, after they have gone to a partition, and after hadoop has sorted them -- decided if they should be compressed into one or not.
(Although, it might be safer to not go with a KeyCollapser, and just require another Ordering).
And GroupedList[K, V] would be a special kind of a DList that only accepts a doFn (or something) where it guarantees that it'll be in the sorted/grouped fasion
And I'd assume the way I propose could result in performance advantages -- as we would only have to tell hadoop to sort the keys if it is actually needed
Both
DList#groupBy
andDList#groupWith
differ only in that one of them abuses implicit resolution. One of them must be deleted.A quick-fix is to delete
DList#groupWith
where the abuse is manifested, but there may be a more significant improvement available by resolving the underlying tension that compels such an abuse.The text was updated successfully, but these errors were encountered: