Delta log getting too big, resulting in spark job failures while writing. #779

Open · nnani opened this issue Sep 9, 2021 · 6 comments
Labels: need author feedback (Issue is waiting for the author to respond), question (Questions on how to use Delta Lake)

Comments


nnani commented Sep 9, 2021

Hello,

We have been using the Delta library for more than 2 years now on an HDI cluster. Recently we came across a few cases where the Spark job starts failing when trying to append data to an existing partitioned table. It fails with a java.lang.OutOfMemoryError: Java heap space at ...... error.
Delta library used - version 0.5
Table partitioned on 3 columns.

We tried querying this Delta table through Jupyter, and with no filters applied it fails with the same error.

After searching for this issue, it looks like the Delta library is trying to read and store the list of Parquet files that need to be scanned into an array, but it fails to do so.

When we try with a huge (10 GB) driver memory, the Spark job goes through. However, we cannot afford to allocate that much driver memory due to the number of jobs and infrastructure limitations.

Based on this, we have the questions below. It would be great if you could help answer them.

  1. We find almost 1K JSON files under the _delta_log folder and many checkpoint files of around 90 MB each. When do these files get deleted from the folder, and what triggers this deletion?
  2. We have HDFS backed by Azure Blob storage. One of the Delta tables has almost 1.7 million blobs (files) but a latest checkpoint of only 35 MB, while another Delta table has the same number of blobs and a checkpoint of around 90 MB. How is this possible? The table structure is exactly the same.
  3. When Delta writes in any mode (overwrite / append), does it read the complete table first before writing? If yes, is this by design or done for a specific purpose? We see the Spark job goes through when reading, but fails every time when writing.
  4. When a Delta table is read from Spark, does it really need 10 GB to read a 90 MB Parquet file? Is anything else happening behind the scenes?
  5. What is the maximum size a checkpoint file can grow to?
  6. Is there a good way to compact the complete table? Even compaction is failing due to OOM issues; it seems to try to read all the data and then fails.

Note - We have already vacuumed all the data for these tables.

@dennyglee added the question label on Oct 11, 2021
@dennyglee (Contributor) commented

Hi @nnani - there are a number of questions here and it may be worth pinging us in the Delta Users Slack.

  1. If you have a lot of files in _delta_log, you can reduce the number by more aggressively removing them via VACUUM with the delta.logRetentionDuration property (see the sketch after this list). More information in the Delta documentation > Table Utility Commands.
  2. In this case, by any chance are you overwriting the table? If so, the number of actual files for the table would stay the same (hence the same 35 MB checkpoint size).
  3. Delta will overwrite or append based on your specification - i.e., the mode you pass to df.write.format("delta").mode("..."), which is either append or overwrite. More information in the Delta documentation > Write to a table.
  4. When Delta is read from Spark, the 10 GB may have to do with the fact that it needs to read such a large transaction log. Resolving (1) should reduce the size.
  5. The checkpoint file has no set maximum per se - a checkpoint is written every 10th commit to consolidate the transaction log so that Spark can read it faster.
  6. If possible, please vacuum both the log and the data.
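
A minimal sketch of (1), assuming Spark 3.x with Delta Lake 0.7+ (where table properties can be set on path-based tables) and an illustrative path /mnt/data/events that is not from this issue:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder.appName("delta-log-retention-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Keep less transaction-log history; commit JSON files older than the retention
# window are cleaned up automatically when the next checkpoint is written.
spark.sql("""
    ALTER TABLE delta.`/mnt/data/events`
    SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 7 days')
""")

# Remove data files that are no longer referenced by the table
# (168 hours = the default 7-day retention).
DeltaTable.forPath(spark, "/mnt/data/events").vacuum(168)
```

Note that setting delta.logRetentionDuration as a table property requires Delta 0.7+, as discussed further down in this thread; it is not available on 0.5.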

@dennyglee added the need author feedback label on Oct 11, 2021

kikalyan commented Nov 2, 2021

@dennyglee Is the delta.logRetentionDuration property supported in version 0.6.1?
If so, can you share an example? I tried the ALTER TABLE ... SET TBLPROPERTIES command, passing the Delta table path, and it throws a "table not found" error.

@dennyglee (Contributor) commented

We started supporting delta.logRetentionDuration in Delta 0.7 per https://docs.delta.io/0.7.0/delta-batch.html#data-retention. HTH!
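
For anyone on 0.7+ who hits the same "table not found" error when passing a raw path, a minimal sketch of the path-based identifier syntax from that doc page (the path and interval are illustrative):

```python
# `spark` is a SparkSession configured with the Delta SQL extension and catalog,
# as in the earlier sketch. Wrapping the path in delta.`...` makes Spark resolve
# it as a path-based Delta table instead of looking the name up in the metastore.
spark.sql("""
    ALTER TABLE delta.`/mnt/data/events`
    SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 30 days')
""")
```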

@chengshaoli commented

The checkpoint.parquet file of one of my Delta tables has reached 118 MB, which causes my Spark program to process each batch slowly. The job that merges the transaction log takes about 1 minute each time.

  1. Is this a normal phenomenon? Is it expected for checkpoint.parquet to get this big?
  2. In addition, I tried to increase the parallelism for processing this file, but the Spark parameter setting did not take effect (see the sketch below).

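Not from this thread, but possibly relevant to (2): in OSS Delta, the parallelism used when the table state is reconstructed from the checkpoint and log is controlled by the spark.databricks.delta.snapshotPartitions setting. Whether it helps here depends on the Delta version and on where the time is actually spent, so treat the sketch below (path and value are illustrative) as an assumption to verify:

```python
# Hypothetical illustration: raise the number of partitions Delta uses when it
# rebuilds the table snapshot from _delta_log. Set it before the table is first
# loaded in the session; an already-computed snapshot will not be recomputed.
spark.conf.set("spark.databricks.delta.snapshotPartitions", "200")

df = spark.read.format("delta").load("/mnt/data/events")  # illustrative path
```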

@Bennyelg commented

bump

@machielg commented

I have a Delta Lake table with hundreds of checkpoint files created per minute. The _delta_log folder has grown to over 8 terabytes and 3 million files, while the table itself is about 1 terabyte. The table is now beyond vacuuming, because the driver crashes, probably due to the vast number of checkpoint files.
