Graph API for Scalding #1583

richwhitjr · 2016-07-24T18:41:36Z

I had been using the GraphX library in Spark and realized that scalding does a much better job with very large graphs with heavy skew. Some of the features in the GraphX library were really nice though and abstracted away some of the complexity of graph work.

This review is a first pass at creating a similar API in scalding. Currently I have only been thinking about directed graphs but undirected graphs should also be supported in a reasonable library. The tests also need further work but I want to give people a chance to comment on the API.

avibryant · 2016-07-25T16:39:31Z

How does this work relate to https://github.com/twitter/scalding/blob/b1d99378b25b27fe128cb083e46032c83e9e8a88/scalding-core/src/main/scala/com/twitter/scalding/mathematics/TypedSimilarity.scala, which also includes a simple graph abstraction? It might be informative to see what those algorithms look like implemented in terms of this API (or if that's possible).

richwhitjr · 2016-07-25T19:01:05Z

It seems like most of those algorithms could be written in terms of the Graph structure. The nice thing about this new abstraction is that it works natively with vertices and minimizes data duplication across edges. Collecting neighbors has many useful properties for doing efficient graph calculations.

Let me see what I can come up with.

richwhitjr · 2016-07-25T20:22:08Z

Added an example of doing cosine similarity with the Graph class. The intersection methods clearly need unit tests, but I wanted to show an example.

johnynek · 2016-08-25T18:13:03Z

scalding-core/src/main/scala/com/twitter/scalding/graph/Edge.scala

+*/
+package com.twitter.scalding.graph
+
+case class Edge[T: Ordering, S](source: T, dest: T, attr: S)


I'm a bit worried about T: Ordering here. This will have the Ordering serialized with each edge, sadly. Can we move the T: Ordering to methods that actually require an ordering?

richwhitjr · 2016-08-25

Sure, I never really thought about the overhead of the Ordering, but it could probably be pretty large.
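For readers following the review thread, here is a minimal sketch of what "move the T: Ordering to methods that actually require an ordering" could look like. It is not the code from this PR: `EdgeOps`, `canonical`, and `sortBySource` are hypothetical names chosen for illustration, and only the `Edge` fields come from the diff above.

```scala
// Hypothetical sketch: Edge no longer has a context bound, so constructing
// an Edge does not require (or capture) an Ordering[T].
case class Edge[T, S](source: T, dest: T, attr: S)

// Hypothetical helper object; the names below are assumptions, not the PR's API.
object EdgeOps {
  // Returns the endpoints with the smaller id first. The Ordering is an
  // implicit parameter of this method only, resolved at the call site.
  def canonical[T, S](e: Edge[T, S])(implicit ord: Ordering[T]): (T, T) =
    if (ord.lteq(e.source, e.dest)) (e.source, e.dest) else (e.dest, e.source)

  // Sorting is another operation that genuinely needs an Ordering,
  // so the context bound lives here rather than on the case class.
  def sortBySource[T: Ordering, S](edges: Seq[Edge[T, S]]): Seq[Edge[T, S]] =
    edges.sortBy(_.source)
}
```

With this shape, code that only builds or passes edges around never needs an `Ordering` in scope; only callers of the ordering-dependent methods do.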

richwhitjr · 2016-08-27T00:38:49Z

Decided to simplify this PR a bit and removed the example vertex similarity code and the neighbor intersection. I can do a follow-up PR adding the intersection logic back in. It becomes a bit tricky to think through the case of mutuals (directed edges in both directions).

richwhitjr · 2016-08-27T00:42:31Z

Also, for now I think it might be helpful to keep collectNeighborIds separate from collectNeighbor. I am mostly worried about the memory overhead of first collecting the vertices with their attributes and then doing the filtering. For a very large set of neighbors, getting the ids directly could be much more efficient.
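To make the trade-off concrete, here is a rough sketch using plain Scala collections rather than Scalding pipes. Everything in it is an assumption for illustration: the `Vertex` type, the `NeighborSketch` object, and both method signatures are hypothetical and are not the signatures in this PR. It only shows why an id-only path can avoid materializing attributes.

```scala
// Hypothetical types for illustration; only the Edge fields mirror the diff.
final case class Vertex[T, A](id: T, attr: A)
final case class Edge[T, S](source: T, dest: T, attr: S)

object NeighborSketch {
  // Id-only path: goes straight from edges to destination ids, so only the
  // ids are held for each source vertex.
  def collectNeighborIds[T, S](edges: Seq[Edge[T, S]]): Map[T, Set[T]] =
    edges.groupBy(_.source).map { case (src, es) => src -> es.map(_.dest).toSet }

  // Full path: looks up each neighbor's attribute and keeps the whole vertex,
  // which costs more memory per neighbor when attributes are large.
  def collectNeighbors[T, A, S](
      edges: Seq[Edge[T, S]],
      vertices: Map[T, A]): Map[T, Seq[Vertex[T, A]]] =
    edges.groupBy(_.source).map { case (src, es) =>
      src -> es.flatMap(e => vertices.get(e.dest).map(a => Vertex(e.dest, a)))
    }
}
```

Deriving ids from the result of `collectNeighbors` would give the same sets, but only after every neighbor's attribute has been collected, which is the overhead the comment above is concerned about.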

CLAassistant · 2019-07-18T15:09:59Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

Richard Whitcomb seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.

Experimental Graph API for Scalding

176d30d

richwhitjr mentioned this pull request Jul 24, 2016

Graph Library for Scalding? #1577

Open

Add missing types to Graph subclasses which is breaking 2.10 build

6670dfa

Add exact cosine example

54a446c

johnynek reviewed Aug 25, 2016

Simplify review, more unit tests, address feedback

d8e7a43

richwhitjr changed the title ~~Experimental Graph API for Scalding~~ Graph API for Scalding Mar 10, 2017
