Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Graph API for Scalding #1583

Open
wants to merge 4 commits into
base: develop
Choose a base branch
from
Open

Conversation

richwhitjr
Copy link
Contributor

I had been using the GraphX library in Spark and realized that scalding does a much better job with very large graphs with heavy skew. Some of the features in the GraphX library were really nice though and abstracted away some of the complexity of graph work.

This review is a first pass at creating a similar API in scalding. Currently I have only been thinking about directed graphs but undirected graphs should also be supported in a reasonable library. The tests also need further work but I want to give people a chance to comment on the API.

@avibryant
Copy link
Contributor

How does this work relate to https://github.com/twitter/scalding/blob/b1d99378b25b27fe128cb083e46032c83e9e8a88/scalding-core/src/main/scala/com/twitter/scalding/mathematics/TypedSimilarity.scala, which also includes a simple graph abstraction? It might be informative to see what those algorithms look like implemented in terms of this API (or if that's possible).

@richwhitjr
Copy link
Contributor Author

Seems like most of those algorithms could be written in terms of the Graph structure. The nice thing of this new abstraction is working natively with vertices and minimizing data duplication across edges. Collecting neighbors has many useful properties for doing efficient graph calculations.

Let me see what I can come up with.

@richwhitjr
Copy link
Contributor Author

Added an example of doing cosine similarity with the Graph class. The intersection methods clearly need unit tests but wanted to show an example.

*/
package com.twitter.scalding.graph

case class Edge[T: Ordering, S](source: T, dest: T, attr: S)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit worried about T: Ordering here. This will have the Ordering serialized with each edge, sadly. Can we move the T: Ordering to methods that actually require an ordering?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, never really thought about the overhead of ordering but it could probably be pretty large.

@richwhitjr
Copy link
Contributor Author

Decided to simply this PR a bit and removed the example vertex similarity code and the neighbor intersection. I can do a followup PR adding the intersection logic back in. It becomes a bit tricky to think through the case of mutuals(directed edges in both directions).

@richwhitjr
Copy link
Contributor Author

Also for now I think it might be helpful to keep collectNeighborIds separate from collectNeighbor. Worried mostly about the memory overhead of first collecting the vertices with the attributes then doing the filtering. For very large set of neighbors just directly getting ids could be much more efficient.

@richwhitjr richwhitjr changed the title Experimental Graph API for Scalding Graph API for Scalding Mar 10, 2017
@CLAassistant
Copy link

CLAassistant commented Jul 18, 2019

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Richard Whitcomb seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants