Use-case: Computing PageRank on PySpark #144

Closed
daveaitel opened this issue Nov 2, 2021 · 18 comments

Comments

@daveaitel

Ran a fairly big PageRank job across my dataset in PySpark... worked super well. Thanks for the library!

@EnricoMi
Collaborator

EnricoMi commented Nov 3, 2021

Thanks for the feedback, this sounds awesome. Can you give some more details on your dataset, Dgraph and Spark cluster?

  • How many triples, nodes, types and predicates do you read in to compute the PageRank?
  • How many Dgraph alpha nodes and Spark nodes do you use?
  • How many CPU cores do your Dgraph alpha and Spark worker nodes have?
  • How long does the reading phase (excluding the PageRank computation) take?

Did you try different configuration values for dgraph.chunkSize or dgraph.partitioner.uidRange.uidsPerPartition?
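
For reference, a minimal sketch of how those options can be passed when reading (the option names are the ones mentioned above; the values are only illustrative placeholders to tune for your own setup):

from gresearch.spark.dgraph.connector import *

# Illustrative values only; tune chunk size and uids per partition for your cluster.
edges = (spark.read
    .option("dgraph.chunkSize", 10000)
    .option("dgraph.partitioner.uidRange.uidsPerPartition", 1000000)
    .dgraph.edges("localhost:9080"))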

@daveaitel
Author

I mean, I read in my whole DB! This is my basic script:

# Launch the pyspark shell with the Dgraph connector and GraphFrames packages:
pyspark --packages uk.co.gresearch.spark:spark-dgraph-connector_2.12:0.7.0-3.1,graphframes:graphframes:0.8.1-spark3.0-s_2.12

from pyspark.sql import DataFrame
from gresearch.spark.dgraph.connector import *

#triples: DataFrame = spark.read.dgraph.triples("localhost:9080")
edges: DataFrame = spark.read.option("dgraph.chunksize", 300).dgraph.edges("localhost:9080")
nodes: DataFrame = spark.read.option("dgraph.chunksize", 300).dgraph.nodes("localhost:9080")

from graphframes import *
nodes2 = nodes.withColumnRenamed("subject", "id")
edges2 = edges.withColumnRenamed("subject", "src").withColumnRenamed("objectUid","dst")
g = GraphFrame(nodes2, edges2)
sc.setCheckpointDir("/tmp")

g.outDegrees.orderBy("outDegree", ascending=False).limit(10).show()
g.inDegrees.orderBy("inDegree", ascending=False).limit(10).show()

g.triangleCount().orderBy("count", ascending=False).limit(10).show()
pr = g.pageRank(resetProbability=0.15, tol=0.01)

#pr.vertices.orderBy("pagerank", ascending=False).limit(100).show(100)
pr.vertices.select("id","pagerank").dropDuplicates().orderBy("pagerank", ascending=False).limit(100).show(100)

pr.vertices.select("id","pagerank").dropDuplicates().orderBy("pagerank", ascending=False).write.json("pagerank.json")

result = g.labelPropagation(maxIter=10)  # FAILS TO COMPLETE

Right now I think I am using the "all in one" Docker image (i.e. one Zero/Alpha). But performance is fine, really... I have 8 cores and 64 GB of RAM.

g.vertices.count()
21418467

g.edges.count()
13125761

I can't remember how long PageRank took, but counting the vertices took a few minutes, if that's any judge. :) I can get better data for you at some point!

Will this version work on the new version of Dgraph that is about to come out?

Thanks!

@daveaitel
Author

edges2 = edges.withColumnRenamed("subject", "src").withColumnRenamed("objectUid","dst") <--- btw, I can't remember if you have this in the documentation, but I think you need to rename the columns to make it work? I could be misremembering.
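
For context (this is GraphFrames' own requirement rather than anything connector-specific): GraphFrames expects the vertex DataFrame to have an id column and the edge DataFrame to have src and dst columns, so a rename along these lines is indeed needed:

# GraphFrames convention: vertices need "id", edges need "src" and "dst".
nodes2 = nodes.withColumnRenamed("subject", "id")
edges2 = (edges
    .withColumnRenamed("subject", "src")
    .withColumnRenamed("objectUid", "dst"))
g = GraphFrame(nodes2, edges2)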

@EnricoMi
Collaborator

EnricoMi commented Nov 4, 2021

Thanks for the insights.

The time that your counts take is pretty much what is needed to transfer the graph from Dgraph over to Spark; there is not much more overhead involved. This is a good way of measuring read speed.

Looks like you are running Dgraph and the Spark app on the same machine, so there are 8 concurrent Spark tasks reading from your single alpha node. From my experience I would expect your CPU to be 100% utilized by the Dgraph alpha in this setup, so you could improve speed by putting Dgraph on a separate machine with at least twice as many CPUs as your Spark job has. But given that the graph reads in minutes, there is not much need to improve read speed.

You are setting dgraph.chunksize to 300, which is very low. Is the default not working for you? I would expect larger values like 10000 to be faster (assuming the Dgraph alpha is not saturating the CPU).

I will look into adding support for GraphFrame in PySpark so that you can reduce this

edges: DataFrame = spark.read.option("dgraph.chunksize", 300).dgraph.edges("localhost:9080")
nodes: DataFrame = spark.read.option("dgraph.chunksize", 300).dgraph.nodes("localhost:9080")

nodes2 = nodes.withColumnRenamed("subject", "id")
edges2 = edges.withColumnRenamed("subject", "src").withColumnRenamed("objectUid","dst")
g = GraphFrame(nodes2, edges2)

to

g = spark.read.option("dgraph.chunksize", 300).dgraph.graphframes("localhost:9080")

@daveaitel
Author

daveaitel commented Nov 4, 2021 via email

@EnricoMi
Collaborator

EnricoMi commented Nov 4, 2021

An excellent use case for #74. Thanks for the valuable insights!

@EnricoMi EnricoMi closed this as completed Nov 4, 2021
@EnricoMi
Collaborator

EnricoMi commented Nov 4, 2021

Will this version work on the new version of DGraph that is about to come out?

I forgot to answer this bit: We will see once it comes out. Earlier releases introduced breaking changes. Watch this space.

@EnricoMi
Collaborator

EnricoMi commented Nov 4, 2021

I can confirm that v21.09.0 will not work with ≤ 0.7.0 releases of the connector as there is a breaking change in their /state endpoint. Anyway, there will be a new release of the connector once this is sorted out.

@daveaitel
Author

daveaitel commented Nov 4, 2021 via email

@EnricoMi EnricoMi changed the title from "FWIW" to "Use-case: Computing PageRank on PySpark" Nov 5, 2021
@EnricoMi EnricoMi pinned this issue Nov 5, 2021
@daveaitel
Author

daveaitel commented Nov 5, 2021 via email

@EnricoMi
Collaborator

This is not implemented but should be doable for a subset of DQL queries. Ingesting a partitioning will be interesting, though.

There are quite a few filter pushdowns implemented, so maybe you can use those to filter down to a subgraph on the Dgraph side.
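
As a sketch of that idea (assuming filters on the predicate column of the triples source are among the implemented pushdowns; the predicate name below is just a placeholder):

from gresearch.spark.dgraph.connector import *

# Keep only triples with a given predicate; such filters may be pushed down to Dgraph,
# so only the matching subgraph is transferred to Spark. "follows" is a placeholder.
triples = spark.read.dgraph.triples("localhost:9080")
subgraph = triples.where(triples.predicate == "follows")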

@EnricoMi
Collaborator

EnricoMi commented Dec 2, 2021

Hey @daveaitel, Dgraph v21.12.0 has just been released. I have prepared a SNAPSHOT release of the connector for you to try against that Dgraph: 0.8.0-3.1-20211202.210227-1. You may need to add this URL to your list of repositories: https://oss.sonatype.org/content/repositories/snapshots.
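
For example, launching the shell against that snapshot could look roughly like this (--repositories is the standard spark-submit/pyspark way to add an extra resolver; the coordinates are the ones given above):

pyspark \
  --repositories https://oss.sonatype.org/content/repositories/snapshots \
  --packages uk.co.gresearch.spark:spark-dgraph-connector_2.12:0.8.0-3.1-20211202.210227-1,graphframes:graphframes:0.8.1-spark3.0-s_2.12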

@daveaitel
Author

daveaitel commented Dec 3, 2021 via email

@daveaitel
Author

daveaitel commented Dec 7, 2021 via email

@daveaitel
Author

daveaitel commented Dec 7, 2021 via email

@EnricoMi
Collaborator

EnricoMi commented Dec 7, 2021

Yes, spark-dgraph-connector_2.12:0.8.0-3.1-SNAPSHOT should pick up the right snapshot.

I presume that PageRank ran against the latest v21.12.0 Dgraph. So that works then. Great!

@EnricoMi
Collaborator

@daveaitel I have released the Dgraph v21.12.0 support in 0.8.0.
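
Assuming the released artifact follows the same naming pattern as the 0.7.0-3.1 coordinate used earlier in this thread, the --packages line would presumably become:

pyspark --packages uk.co.gresearch.spark:spark-dgraph-connector_2.12:0.8.0-3.1,graphframes:graphframes:0.8.1-spark3.0-s_2.12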

@daveaitel
Author

daveaitel commented Jan 22, 2022 via email
