-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Spark] Parallel stats collection within each partition #2203
Conversation
Is this only improving the scenario where user recompute stats for existing files? I'm asking because generating stats takes considerable amount of time when inserting data, in my experiments around 20% of overhead in comparison when stats.collect is disabled. |
AFAIK the method |
@@ -27,6 +28,7 @@ import org.apache.spark.sql.delta.actions.AddFile | |||
import org.apache.spark.sql.delta.sources.DeltaSQLConf | |||
import org.apache.spark.sql.delta.stats.DeltaStatistics._ | |||
import org.apache.spark.sql.delta.util.{DeltaFileOperations, JsonUtils} | |||
import org.apache.spark.sql.delta.util.threads.DeltaThreadPool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The package name seems to be wrong. Are you missing a change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, I am depending on this PR #2154 being merged first. I didn't want to include the changes of it in this PR, as it would make the code review harder. But once the other PR is merged, this should match up.
@@ -137,6 +189,21 @@ object StatsCollectionUtils | |||
} | |||
} | |||
|
|||
object ParallelFetchPool { | |||
val NUM_THREADS_PER_CORE = 10 | |||
val MAX_THREADS = 1024 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thread count is a bit high. Have you seen any issues with your testing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't see any issues when trying it out on a 4 worker cluster. I think you could run into some throttling issues if the cluster you use is very large and you have to collect stats for DVs for many files. But then you can always increase the number of files per partition for stats collection, that should reduce throttling issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm.
682266f
to
f96a232
Compare
f96a232
to
1bb6352
Compare
Which Delta project/connector is this regarding?
Description
How was this patch tested?
Does this PR introduce any user-facing changes?
No