Use-case: Computing PageRank on PySpark #144
Thanks for the feedback, this sounds awesome. Can you give some more details on your dataset, Dgraph and Spark cluster?
Did you try different configuration values for `dgraph.chunksize`?
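(For reference, a minimal sketch of varying that option on read; `10000` is just an illustrative value, and the `localhost:9080` target mirrors the script below:)

```python
from gresearch.spark.dgraph.connector import *

# Larger chunks mean fewer round trips to the Dgraph alpha, at the cost of
# bigger gRPC messages; this assumes an alpha listening on localhost:9080.
edges = spark.read.option("dgraph.chunksize", 10000).dgraph.edges("localhost:9080")
nodes = spark.read.option("dgraph.chunksize", 10000).dgraph.nodes("localhost:9080")
```
|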
I mean, I read in my whole DB! This is my basic script:

```python
#new
from gresearch.spark.dgraph.connector import *
#triples: DataFrame = spark.read.dgraph.triples("localhost:9080")
from graphframes import *

# (reads of nodes/edges and construction of the GraphFrame g left out here;
#  they appear in full later in this thread)

g.outDegrees.orderBy("outDegree", ascending=False).limit(10).show()
g.triangleCount().orderBy("count", ascending=False).limit(10).show()

#pr.vertices.orderBy("pagerank", ascending=False).limit(100).show(100)
pr.vertices.select("id", "pagerank").dropDuplicates().orderBy("pagerank", ascending=False).write.json("pagerank.json")

result = g.labelPropagation(maxIter=10)  # FAILS TO COMPLETE
```

Right now I think I am using the "all in one" Docker image (aka one Zero/Alpha). But performance is fine, really... I have 8 cores and 64G of RAM.
I can't remember how long pagerank took, but the counting of vertices took a few minutes, if that's any guide. :) I can get better data for you at some point! Will this version work on the new version of Dgraph that is about to come out? Thanks! |
Btw., I can't remember if you have this in the documentation, but I think you need to rename the columns to make it work? I could be misremembering:

```python
edges2 = edges.withColumnRenamed("subject", "src").withColumnRenamed("objectUid", "dst")
```
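(For context: the renames are required by GraphFrames rather than by the connector. `GraphFrame(v, e)` expects the vertex DataFrame to have an `id` column and the edge DataFrame to have `src` and `dst` columns. A minimal sketch, reusing the `subject`/`objectUid` column names from this thread:)

```python
from graphframes import GraphFrame

# GraphFrames' fixed column-name contract: vertices need "id", edges need "src"/"dst"
nodes2 = nodes.withColumnRenamed("subject", "id")
edges2 = edges.withColumnRenamed("subject", "src").withColumnRenamed("objectUid", "dst")
g = GraphFrame(nodes2, edges2)
```
|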
Thanks for the insights. The time that your counts take is pretty much what is needed to transfer the graph from Dgraph over to Spark; there is not much more overhead involved. This is a good way of measuring read speed. Looks like you are running Dgraph and the Spark app on the same machine, so there are 8 concurrent Spark tasks reading from your single alpha node. From my experience I would expect your CPU to be 100% utilized by Dgraph alpha in this setup. So you could improve speed by putting Dgraph on a separate machine with at least twice as many CPUs as your Spark job has. But given that your graph reads in minutes, there is not much need to improve read speed.

You are setting `dgraph.chunksize` to 300, which is very low. Is the default not working for you? I would expect larger values like 10000 to be faster (assuming the Dgraph alpha is not saturating the CPU).

I will look into adding support for GraphFrame in PySpark so that you can reduce this

```python
edges: DataFrame = spark.read.option("dgraph.chunksize", 300).dgraph.edges("localhost:9080")
nodes: DataFrame = spark.read.option("dgraph.chunksize", 300).dgraph.nodes("localhost:9080")

nodes2 = nodes.withColumnRenamed("subject", "id")
edges2 = edges.withColumnRenamed("subject", "src").withColumnRenamed("objectUid", "dst")

g = GraphFrame(nodes2, edges2)
```

to

```python
g = spark.read.option("dgraph.chunksize", 300).dgraph.graphframes("localhost:9080")
```
|
When I had the default chunksize I got an error (gRPC call too big!). I also have my Spark memory configuration set to 35G in spark-defaults.conf:

```
spark.driver.memory 35g
spark.executor.memory 35g
```

(There is a lot of data in this DB and without that it would run out of memory.)

Of course, for running Spark and Dgraph on the same box I also had to reset the default ports for Spark, in spark-env.sh:

```
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
SPARK_MASTER_PORT=8001
SPARK_MASTER_WEBUI_PORT=8081
```

Hopefully other people will see how great (and FREE) this pipeline is!

Thanks!

-dave
|
An excellent use case for #74. Thanks for the valuable insights! |
I forgot to answer whether this will work with the new Dgraph version: we will see once it comes out. Earlier releases introduced breaking changes. Watch this space. |
I can confirm that v21.09.0 will not work with ≤ 0.7.0 releases of the connector as there is a breaking change in their /state endpoint. Anyway, there will be a new release of the connector once this is sorted out. |
Awesome, thanks so much. I DO want to upgrade to the new release (which is supposed to be much faster for some things) at some point. Appreciate your continued support of this project!

-dave
|
Also, is there any way to pass in a DQL query that then filters down the graphframe to a subgraph?

-dave
|
This is not implemented but should be doable for a subset of DQL queries. Ingesting a partitioning will be interesting, though. There are quite a few filter pushdowns implemented, so maybe you can use those to filter down to a subgraph on the Dgraph side.
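(For illustration, a sketch of carving out a subgraph on the Spark side; the `predicate` column name on the edge DataFrame and whether a given filter actually gets pushed down are assumptions here, and `"follows"` is a made-up predicate name:)

```python
from pyspark.sql import functions as F
from graphframes import GraphFrame

# Filter edges by predicate; DataFrame filters like this are candidates
# for the connector's filter pushdowns.
sub_edges = edges2.where(F.col("predicate") == "follows")

# Keep only vertices that still occur in the filtered edge set.
ids = (sub_edges.select(F.col("src").alias("id"))
       .union(sub_edges.select(F.col("dst").alias("id")))
       .distinct())
sub_nodes = nodes2.join(ids, "id")

sub_g = GraphFrame(sub_nodes, sub_edges)
```
|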
Hey @daveaitel, Dgraph v21.12.0 has just been released. I have prepared a SNAPSHOT release of the connector for you to try against that Dgraph: 0.8.0-3.1-20211202.210227-1. You may need to add this url to your list of repositories: https://oss.sonatype.org/content/repositories/snapshots.
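(For reference, a sketch of one way to pull in that snapshot, combining Spark's `--repositories` flag with the coordinates dave reports using further down; check the exact coordinates against your setup:)

```bash
pyspark \
  --repositories https://oss.sonatype.org/content/repositories/snapshots \
  --packages uk.co.gresearch.spark:spark-dgraph-connector_2.12:0.8.0-3.1-SNAPSHOT,graphframes:graphframes:0.8.1-spark3.0-s_2.12
```
|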
Awesome! I currently am trying to re-implement clustering algorithms for part of this project, but then I'll do the PageRank demo for the other project I have that uses Dgraph (and is on the latest branch :)
|
The string I ended up using was `0.8.0-3.1-SNAPSHOT`, which I found by searching on this page: https://oss.sonatype.org/#nexus-search;quick~spark-dgraph-connector

Hopefully that was the right version. I am currently running this command:

```
# I added your repo to the spark/conf/spark-defaults.conf file
# With master branch:
pyspark --packages uk.co.gresearch.spark:spark-dgraph-connector_2.12:0.8.0-3.1-SNAPSHOT,graphframes:graphframes:0.8.1-spark3.0-s_2.12
```

Inside the pySpark shell (some lines left out; using chunksize of 300, btw):

```python
>>> nodes2 = nodes.withColumnRenamed("subject", "id")  # wish we didn't need this :)
>>> edges2 = edges.withColumnRenamed("subject", "src").withColumnRenamed("objectUid", "dst")  # Also this? I'm sure there is a good reason though! :)
>>> g = GraphFrame(nodes2, edges2)
>>> sc.setCheckpointDir("/tmp")
>>> pr = g.pageRank(resetProbability=0.15, tol=0.01)
[Stage 3:> (0 + 4) / 4][Stage 4:> (0 + 0) / 4][Stage 5:> (0 + 4) / 4]
```

It's not done yet but it looks good so far!
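(Side note: GraphFrames' pageRank also accepts a fixed iteration count instead of a convergence tolerance; with `tol` it runs until convergence, which is harder to budget on a big graph. A sketch, with `20` as an arbitrary illustrative value:)

```python
# Fixed number of iterations instead of running until tol-convergence
pr = g.pageRank(resetProbability=0.15, maxIter=20)
```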
Thanks and great work!
-dave
|
Just for performance metrics: it somehow used up all 64G of my RAM during my first run, I had to reboot and try again, and it seems to have completed. Dgraph itself is using 7 CPUs and Java is using about half a CPU (screenshot omitted). PageRank results look "real" to me (i.e. they make sense in my dataset). It completes in something like half an hour on this box, on my dataset of roughly 3M nodes.
-dave
|
Yes, I presume that PageRank ran against the latest v21.12.0 Dgraph. So that works then. Great! |
@daveaitel I have released the Dgraph 21.12.0 support in 0.8.0. |
Awesome!! I'm actually hoping they fix this crash bug that's been riddling me, so that I can restart my inputs. :(
|
Ran a fairly big pagerank job across my data set in pySpark...worked super well. Thanks for the library!