
[Spark] Parallel stats collection within each partition #2203

Closed

Conversation

@fred-db (Contributor) commented Oct 18, 2023

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

  • Collect stats in parallel within each partition instead of sequentially, to reduce the time spent idle waiting on network requests while fetching file statuses or parquet footers from cloud storage (see the sketch after this list).
  • Use a global thread pool on each executor to perform the parallel stats collection.
  • Add code to partition the dataset before collecting stats, to increase the achievable throughput.
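As a rough illustration of the per-partition parallelism described above, here is a minimal sketch, assuming a shared executor-side thread pool and a placeholder per-file function (StatsPoolSketch and computeStatsForFile are hypothetical names, not the PR's API):

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Minimal sketch, not the PR's implementation: fan out per-file work (file status
// and parquet footer fetches) inside a partition instead of running it sequentially.
object StatsPoolSketch {
  // One pool shared within the executor JVM; the size here is a placeholder.
  private lazy val pool = Executors.newFixedThreadPool(32)
  private implicit lazy val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)

  def collectStats(filePaths: Seq[String])(computeStatsForFile: String => String): Seq[String] = {
    // Submit one future per file so the task is not blocked on each round trip in turn.
    val futures = filePaths.map(path => Future(computeStatsForFile(path)))
    Await.result(Future.sequence(futures), Duration.Inf)
  }
}
```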

How was this patch tested?

  • Existing UTs should cover the correct collection of statistics.

Does this PR introduce any user-facing changes?

No

@felipepessoto (Contributor):

Is this only improving the scenario where a user recomputes stats for existing files?

I'm asking because generating stats takes a considerable amount of time when inserting data; in my experiments, around 20% overhead compared to when stats.collect is disabled.

@vkorukanti (Collaborator):

> Is this only improving the scenario where a user recomputes stats for existing files?
>
> I'm asking because generating stats takes a considerable amount of time when inserting data; in my experiments, around 20% overhead compared to when stats.collect is disabled.

AFAIK the method computeStats is used only when updating/deleting a table with deletion vectors (DVs). It is a must to recompute stats for files that don't have stats when the AddFile contains a DV.
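To make that condition concrete, a hedged sketch (field names follow the AddFile action imported in the diff below; the helper itself is hypothetical, not the PR's code):

```scala
import org.apache.spark.sql.delta.actions.AddFile

// Hypothetical helper: pick out AddFile actions that carry a deletion vector but
// have no persisted stats, i.e. the files whose stats must be recomputed as
// described above.
def filesNeedingStatsRecompute(addFiles: Seq[AddFile]): Seq[AddFile] =
  addFiles.filter(f => f.deletionVector != null && f.stats == null)
```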

@@ -27,6 +28,7 @@ import org.apache.spark.sql.delta.actions.AddFile
import org.apache.spark.sql.delta.sources.DeltaSQLConf
import org.apache.spark.sql.delta.stats.DeltaStatistics._
import org.apache.spark.sql.delta.util.{DeltaFileOperations, JsonUtils}
import org.apache.spark.sql.delta.util.threads.DeltaThreadPool
vkorukanti (Collaborator):

The package name seems to be wrong. Are you missing a change?

fred-db (Contributor, Author):

Ah yes, I am depending on PR #2154 being merged first. I didn't want to include its changes in this PR, as that would make the code review harder. Once the other PR is merged, this should match up.

@@ -137,6 +189,21 @@ object StatsCollectionUtils
}
}

object ParallelFetchPool {
val NUM_THREADS_PER_CORE = 10
val MAX_THREADS = 1024
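A hedged sketch of how these two constants would typically be combined to size a per-executor pool (the PR's actual DeltaThreadPool construction is not shown in this excerpt, so the sizing rule below is an assumption):

```scala
object ParallelFetchPoolSizingSketch {
  // Constants mirrored from the diff above.
  val NUM_THREADS_PER_CORE = 10
  val MAX_THREADS = 1024

  // Assumed sizing rule: scale with the executor's available cores, capped at MAX_THREADS.
  def poolSize: Int =
    math.min(Runtime.getRuntime.availableProcessors() * NUM_THREADS_PER_CORE, MAX_THREADS)
}
```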
vkorukanti (Collaborator):

The thread count is a bit high. Have you seen any issues with your testing?

fred-db (Contributor, Author):

I didn't see any issues when trying it out on a 4-worker cluster. I think you could run into throttling issues if the cluster you use is very large and you have to collect stats for DVs for many files. But then you can always increase the number of files per partition for stats collection, which should reduce the throttling.
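A minimal sketch of the knob being described, assuming stats collection runs over a DataFrame of file entries and that filesPerPartition is a hypothetical parameter (the PR's actual config name is not shown here):

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical sketch, not the PR's code: bound how many files land in each partition
// so the per-partition parallel fetches do not overwhelm the cloud store.
def repartitionForStatsCollection(files: DataFrame, filesPerPartition: Int): DataFrame = {
  val numPartitions = math.max(1, math.ceil(files.count().toDouble / filesPerPartition).toInt)
  files.repartition(numPartitions)
}
```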

@vkorukanti changed the title from "Parallel stats collection within each partition" to "[Spark] Parallel stats collection within each partition" on Oct 20, 2023
@vkorukanti (Collaborator) left a comment:

lgtm.

@fred-db force-pushed the multi-threaded-stats-collection branch from f96a232 to 1bb6352 on October 27, 2023 at 08:45