-
Notifications
You must be signed in to change notification settings - Fork 708
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Graph API for Scalding #1583
base: develop
Are you sure you want to change the base?
Graph API for Scalding #1583
Conversation
How does this work relate to https://github.com/twitter/scalding/blob/b1d99378b25b27fe128cb083e46032c83e9e8a88/scalding-core/src/main/scala/com/twitter/scalding/mathematics/TypedSimilarity.scala, which also includes a simple graph abstraction? It might be informative to see what those algorithms look like implemented in terms of this API (or if that's possible). |
Seems like most of those algorithms could be written in terms of the Graph structure. The nice thing of this new abstraction is working natively with vertices and minimizing data duplication across edges. Collecting neighbors has many useful properties for doing efficient graph calculations. Let me see what I can come up with. |
Added an example of doing cosine similarity with the Graph class. The intersection methods clearly need unit tests but wanted to show an example. |
*/ | ||
package com.twitter.scalding.graph | ||
|
||
case class Edge[T: Ordering, S](source: T, dest: T, attr: S) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit worried about T: Ordering
here. This will have the Ordering
serialized with each edge, sadly. Can we move the T: Ordering
to methods that actually require an ordering?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, never really thought about the overhead of ordering but it could probably be pretty large.
Decided to simply this PR a bit and removed the example vertex similarity code and the neighbor intersection. I can do a followup PR adding the intersection logic back in. It becomes a bit tricky to think through the case of mutuals(directed edges in both directions). |
Also for now I think it might be helpful to keep collectNeighborIds separate from collectNeighbor. Worried mostly about the memory overhead of first collecting the vertices with the attributes then doing the filtering. For very large set of neighbors just directly getting ids could be much more efficient. |
Richard Whitcomb seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
I had been using the GraphX library in Spark and realized that scalding does a much better job with very large graphs with heavy skew. Some of the features in the GraphX library were really nice though and abstracted away some of the complexity of graph work.
This review is a first pass at creating a similar API in scalding. Currently I have only been thinking about directed graphs but undirected graphs should also be supported in a reasonable library. The tests also need further work but I want to give people a chance to comment on the API.